linux-kernel.vger.kernel.org archive mirror
* Unlock-lock questions and the Linux Kernel Memory Model
       [not found] <4118cdbe-c396-08b9-a3e3-a0a6491b82fa@nvidia.com>
@ 2017-11-27 21:16 ` Alan Stern
  2017-11-27 23:28   ` Daniel Lustig
                     ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Alan Stern @ 2017-11-27 21:16 UTC (permalink / raw)
  To: Paul E. McKenney, Andrea Parri, Luc Maranget, Jade Alglave,
	Boqun Feng, Nicholas Piggin, Peter Zijlstra, Will Deacon,
	David Howells, Daniel Lustig, Palmer Dabbelt
  Cc: Kernel development list

This is essentially a repeat of an email I sent out before the
Thanksgiving holiday, the assumption being that lack of any responses
was caused by the holiday break.  (And this time the message is CC'ed
to LKML, so there will be a public record of it.)

A few people have said they believe the Linux Kernel Memory Model
should make unlock followed by lock (of the same variable) act as a
write memory barrier.  In other words, they want the memory model to
forbid the following litmus test:


C unlock-lock-write-ordering-1

{}

P0(int *x, int *y, spinlock_t *s)
{
	spin_lock(s);
	WRITE_ONCE(*x, 1);
	spin_unlock(s);
	spin_lock(s);
	WRITE_ONCE(*y, 1);
	spin_unlock(s);
}

P1(int *x, int *y)
{
	r1 = READ_ONCE(*y);
	smp_rmb();
	r2 = READ_ONCE(*x);
}

exists (1:r1=1 /\ 1:r2=0)


Judging from this, it seems likely that they would want the memory
model to forbid the next two litmus tests as well (since there's never
any guarantee that some other CPU won't interpose a critical section
between a given unlock and a following lock):


C unlock-lock-write-ordering-2

{}

P0(int *x, spinlock_t *s)
{
	spin_lock(s);
	WRITE_ONCE(*x, 1);
	spin_unlock(s);
}

P1(int *x, int *y, spinlock_t *s)
{
	spin_lock(s);
	r1 = READ_ONCE(*x);
	WRITE_ONCE(*y, 1);
	spin_unlock(s);
}

P2(int *x, int *y)
{
	r2 = READ_ONCE(*y);
	smp_rmb();
	r3 = READ_ONCE(*x);
}

exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)


C unlock-lock-write-ordering-3

{}

P0(int *x, int *y, int *z, spinlock_t *s)
{
	spin_lock(s);
	WRITE_ONCE(*x, 1);
	spin_unlock(s);
	spin_lock(s);
	r1 = READ_ONCE(*z);
	WRITE_ONCE(*y, 1);
	spin_unlock(s);
}

P1(int *x, int *z, spinlock_t *s)
{
	spin_lock(s);
	r2 = READ_ONCE(*x);
	WRITE_ONCE(*z, 1);
	spin_unlock(s);
}

P2(int *x, int *y)
{
	r3 = READ_ONCE(*y);
	smp_rmb();
	r4 = READ_ONCE(*x);
}

exists (0:r1=1 /\ 1:r2=1 /\ 2:r3=1 /\ 2:r4=0)


The general justification is that all the architectures currently 
supported by the kernel do forbid these tests, and there is code in the 
kernel that depends on this behavior.  Thus the model should forbid 
them.  (Currently the model allows them, on the principle that ordering 
induced by a lock is visible only to CPUs that take the lock.)

On the other hand, whether RISC-V would forbid these tests is not
clear.  Basically, it comes down to using an RCsc versus an RCpc
approach for the locking primitives.

Given that spin_lock() and spin_unlock() derive many of their
properties from acting as an acquire and a release respectively, we can
ask if the memory model should forbid the analogous release-acquire
litmus test:


C rel-acq-write-ordering-3

{}

P0(int *x, int *s, int *y)
{
	WRITE_ONCE(*x, 1);
	smp_store_release(s, 1);
	r1 = smp_load_acquire(s);
	WRITE_ONCE(*y, 1);
}

P1(int *x, int *y)
{
	r2 = READ_ONCE(*y);
	smp_rmb();
	r3 = READ_ONCE(*x);
}

exists (1:r2=1 /\ 1:r3=0)


For that matter, what if we take the initial write to be the 
store-release itself rather than something coming before it?


C rel-acq-write-ordering-1

{}

P0(int *s, int *y)
{
	smp_store_release(s, 1);
	r1 = smp_load_acquire(s);
	WRITE_ONCE(*y, 1);
}

P1(int *s, int *y)
{
	r2 = READ_ONCE(*y);
	smp_rmb();
	r3 = READ_ONCE(*s);
}

exists (1:r2=1 /\ 1:r3=0)


And going to extremes, what if the load-acquire reads from a different 
variable, not the one written by the store-release?  This would be 
like unlocking s and then locking t:


C rel-acq-write-ordering-2

{}

P0(int *s, int *t, int *y)
{
	smp_store_release(s, 1);
	r1 = smp_load_acquire(t);
	WRITE_ONCE(*y, 1);
}

P1(int *s, int *y)
{
	r2 = READ_ONCE(*y);
	smp_rmb();
	r3 = READ_ONCE(*s);
}

exists (1:r2=1 /\ 1:r3=0)


The architectures currently supported by the kernel all forbid this 
last test, but whether we want the model to forbid it is a lot more 
questionable.

I (and others!) would like to know people's opinions on these matters.

Alan Stern

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-27 21:16 ` Unlock-lock questions and the Linux Kernel Memory Model Alan Stern
@ 2017-11-27 23:28   ` Daniel Lustig
  2017-11-28  9:44     ` Peter Zijlstra
  2017-11-28  9:58   ` Peter Zijlstra
  2017-11-29 19:04   ` Daniel Lustig
  2 siblings, 1 reply; 30+ messages in thread
From: Daniel Lustig @ 2017-11-27 23:28 UTC (permalink / raw)
  To: Alan Stern, Paul E. McKenney, Andrea Parri, Luc Maranget,
	Jade Alglave, Boqun Feng, Nicholas Piggin, Peter Zijlstra,
	Will Deacon, David Howells, Palmer Dabbelt
  Cc: Kernel development list

On 11/27/2017 1:16 PM, Alan Stern wrote:
> C rel-acq-write-ordering-3
> 
> {}
> 
> P0(int *x, int *s, int *y)
> {
> 	WRITE_ONCE(*x, 1);
> 	smp_store_release(s, 1);
> 	r1 = smp_load_acquire(s);
> 	WRITE_ONCE(*y, 1);
> }
> 
> P1(int *x, int *y)
> {
> 	r2 = READ_ONCE(*y);
> 	smp_rmb();
> 	r3 = READ_ONCE(*x);
> }
> 
> exists (1:r2=1 /\ 1:r3=0)
> 
<snip>
> 
> And going to extremes...

Sorry if I'm missing something obvious, but before going to extremes...
what about this one?

"SB+rel-acq" (or please rename if you have a different scheme)

{}

P0(int *x, int *s, int *y)
{
	WRITE_ONCE(*x, 1);
	smp_store_release(s, 1);
	r1 = smp_load_acquire(s);
	r2 = READ_ONCE(*y);
}

P1(int *x, int *y, int *s)
{
	WRITE_ONCE(*y, 1);
	smp_store_release(s, 2);
	r3 = smp_load_acquire(s);
	r4 = READ_ONCE(*x);
}

exists (0:r2=0 /\ 1:r4=0)

If smp_store_release() and smp_load_acquire() map to normal TSO loads
and stores on x86, then this test can't be forbidden, can it?

Similar question for the other tests, but this is probably the
easiest one to analyze.

Dan


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-27 23:28   ` Daniel Lustig
@ 2017-11-28  9:44     ` Peter Zijlstra
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2017-11-28  9:44 UTC (permalink / raw)
  To: Daniel Lustig
  Cc: Alan Stern, Paul E. McKenney, Andrea Parri, Luc Maranget,
	Jade Alglave, Boqun Feng, Nicholas Piggin, Will Deacon,
	David Howells, Palmer Dabbelt, Kernel development list

On Mon, Nov 27, 2017 at 03:28:03PM -0800, Daniel Lustig wrote:
> On 11/27/2017 1:16 PM, Alan Stern wrote:
> > C rel-acq-write-ordering-3
> > 
> > {}
> > 
> > P0(int *x, int *s, int *y)
> > {
> > 	WRITE_ONCE(*x, 1);
> > 	smp_store_release(s, 1);
> > 	r1 = smp_load_acquire(s);
> > 	WRITE_ONCE(*y, 1);
> > }
> > 
> > P1(int *x, int *y)
> > {
> > 	r2 = READ_ONCE(*y);
> > 	smp_rmb();
> > 	r3 = READ_ONCE(*x);
> > }
> > 
> > exists (1:r2=1 /\ 1:r3=0)
> > 
> <snip>
> > 
> > And going to extremes...
> 
> Sorry if I'm missing something obvious, but before going to extremes...
> what about this one?
> 
> "SB+rel-acq" (or please rename if you have a different scheme)
> 
> {}
> 
> P0(int *x, int *s, int *y)
> {
> 	WRITE_ONCE(*x, 1);
> 	smp_store_release(s, 1);
> 	r1 = smp_load_acquire(s);
> 	r2 = READ_ONCE(*y);
> }

Yes, this one doesn't work on TSO and Power.

Ideally it would work for locks though, but that would mean mandating
RCsc lock implementations and currently Power is holding out on that.


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-27 21:16 ` Unlock-lock questions and the Linux Kernel Memory Model Alan Stern
  2017-11-27 23:28   ` Daniel Lustig
@ 2017-11-28  9:58   ` Peter Zijlstra
  2017-11-29 19:04   ` Daniel Lustig
  2 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2017-11-28  9:58 UTC (permalink / raw)
  To: Alan Stern
  Cc: Paul E. McKenney, Andrea Parri, Luc Maranget, Jade Alglave,
	Boqun Feng, Nicholas Piggin, Will Deacon, David Howells,
	Daniel Lustig, Palmer Dabbelt, Kernel development list

On Mon, Nov 27, 2017 at 04:16:47PM -0500, Alan Stern wrote:
> This is essentially a repeat of an email I sent out before the
> Thanksgiving holiday, the assumption being that lack of any responses
> was caused by the holiday break.  (And this time the message is CC'ed
> to LKML, so there will be a public record of it.)

Right, thanks! No turkey days on this side of the pond, but it still got
lost in the noise, sorry about that.

> A few people have said they believe the Linux Kernel Memory Model
> should make unlock followed by lock (of the same variable) act as a
> write memory barrier.  In other words, they want the memory model to
> forbid the following litmus test:
> 
> 
> C unlock-lock-write-ordering-1
> 
> {}
> 
> P0(int *x, int *y, spinlock_t *s)
> {
> 	spin_lock(s);
> 	WRITE_ONCE(*x, 1);
> 	spin_unlock(s);
> 	spin_lock(s);
> 	WRITE_ONCE(*y, 1);
> 	spin_unlock(s);
> }
> 
> P1(int *x, int *y)
> {
> 	r1 = READ_ONCE(*y);
> 	smp_rmb();
> 	r2 = READ_ONCE(*x);
> }
> 
> exists (1:r1=1 /\ 1:r2=0)

ACK, that is the pattern I was using.

> Judging from this, it seems likely that they would want the memory
> model to forbid the next two litmus tests as well (since there's never
> any guarantee that some other CPU won't interpose a critical section
> between a given unlock and a following lock):
> 
> 
> C unlock-lock-write-ordering-2
> 
> {}
> 
> P0(int *x, spinlock_t *s)
> {
> 	spin_lock(s);
> 	WRITE_ONCE(*x, 1);
> 	spin_unlock(s);
> }
> 
> P1(int *x, int *y, spinlock_t *s)
> {
> 	spin_lock(s);
> 	r1 = READ_ONCE(*x);
> 	WRITE_ONCE(*y, 1);
> 	spin_unlock(s);
> }
> 
> P2(int *x, int *y)
> {
> 	r2 = READ_ONCE(*y);
> 	smp_rmb();
> 	r3 = READ_ONCE(*x);
> }
> 
> exists (1:r1=1 /\ 2:r2=1 /\ 2:r3=0)

Agreed, I would indeed also expect that to 'work'.

> C unlock-lock-write-ordering-3
> 
> {}
> 
> P0(int *x, int *y, int *z, spinlock_t *s)
> {
> 	spin_lock(s);
> 	WRITE_ONCE(*x, 1);
> 	spin_unlock(s);
> 	spin_lock(s);
> 	r1 = READ_ONCE(*z);
> 	WRITE_ONCE(*y, 1);
> 	spin_unlock(s);
> }
> 
> P1(int *x, int *z, spinlock_t *s)
> {
> 	spin_lock(s);
> 	r2 = READ_ONCE(*x);
> 	WRITE_ONCE(*z, 1);
> 	spin_unlock(s);
> }
> 
> P2(int *x, int *y)
> {
> 	r3 = READ_ONCE(*y);
> 	smp_rmb();
> 	r4 = READ_ONCE(*x);
> }
> 
> exists (0:r1=1 /\ 1:r2=1 /\ 2:r3=1 /\ 2:r4=0)

And this is the same except with one more link in the chain, right? I
would've put it in P3 or somesuch but no matter.

> The general justification is that all the architectures currently
> supported by the kernel do forbid these tests, and there is code in
> the kernel that depends on this behavior.  Thus the model should
> forbid them.  (Currently the model allows them, on the principle that
> ordering induced by a lock is visible only to CPUs that take the
> lock.)
> 
> On the other hand, whether RISC-V would forbid these tests is not
> clear.  Basically, it comes down to using an RCsc versus an RCpc
> approach for the locking primitives.

Not entirely, as Power is currently RCpc but still forbids these.

> Given that spin_lock() and spin_unlock() derive many of their
> properties from acting as an acquire and a release respectively, we can
> ask if the memory model should forbid the analogous release-acquire
> litmus test:

> C rel-acq-write-ordering-3
> 
> {}
> 
> P0(int *x, int *s, int *y)
> {
> 	WRITE_ONCE(*x, 1);
> 	smp_store_release(s, 1);
> 	r1 = smp_load_acquire(s);
> 	WRITE_ONCE(*y, 1);
> }
> 
> P1(int *x, int *y)
> {
> 	r2 = READ_ONCE(*y);
> 	smp_rmb();
> 	r3 = READ_ONCE(*x);
> }
> 
> exists (1:r2=1 /\ 1:r3=0)

Ideally yes, consistency is good.

> For that matter, what if we take the initial write to be the 
> store-release itself rather than something coming before it?
> 
> 
> C rel-acq-write-ordering-1
> 
> {}
> 
> P0(int *s, int *y)
> {
> 	smp_store_release(s, 1);
> 	r1 = smp_load_acquire(s);
> 	WRITE_ONCE(*y, 1);
> }
> 
> P1(int *s, int *y)
> {
> 	r2 = READ_ONCE(*y);
> 	smp_rmb();
> 	r3 = READ_ONCE(*s);
> }
> 
> exists (1:r2=1 /\ 1:r3=0)

I expect it would take extra work to make this one fail while the
previous one works, no?

That is, given the things so far, it would only be consistent for this
to also work.

> And going to extremes, what if the load-acquire reads from a different 
> variable, not the one written by the store-release?  This would be 
> like unlocking s and then locking t:
> 
> 
> C rel-acq-write-ordering-2
> 
> {}
> 
> P0(int *s, int *t, int *y)
> {
> 	smp_store_release(s, 1);
> 	r1 = smp_load_acquire(t);
> 	WRITE_ONCE(*y, 1);
> }
> 
> P1(int *s, int *y)
> {
> 	r2 = READ_ONCE(*y);
> 	smp_rmb();
> 	r3 = READ_ONCE(*s);
> }
> 
> exists (1:r2=1 /\ 1:r3=0)
> 
> 
> The architectures currently supported by the kernel all forbid this 
> last test, but whether we want the model to forbid it is a lot more 
> questionable.
> 
> I (and others!) would like to know people's opinions on these matters.

This one is interesting... Yes it will work on current hardware, but I'm
not sure I'd expect this. I'll ponder this a wee bit more.


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-27 21:16 ` Unlock-lock questions and the Linux Kernel Memory Model Alan Stern
  2017-11-27 23:28   ` Daniel Lustig
  2017-11-28  9:58   ` Peter Zijlstra
@ 2017-11-29 19:04   ` Daniel Lustig
  2017-11-29 19:33     ` Paul E. McKenney
                       ` (3 more replies)
  2 siblings, 4 replies; 30+ messages in thread
From: Daniel Lustig @ 2017-11-29 19:04 UTC (permalink / raw)
  To: Alan Stern, Paul E. McKenney, Andrea Parri, Luc Maranget,
	Jade Alglave, Boqun Feng, Nicholas Piggin, Peter Zijlstra,
	Will Deacon, David Howells, Palmer Dabbelt
  Cc: Kernel development list

On 11/27/2017 1:16 PM, Alan Stern wrote:
> This is essentially a repeat of an email I sent out before the
> Thanksgiving holiday, the assumption being that lack of any responses
> was caused by the holiday break.  (And this time the message is CC'ed
> to LKML, so there will be a public record of it.)
> 
> A few people have said they believe the Linux Kernel Memory Model
> should make unlock followed by lock (of the same variable) act as a
> write memory barrier.  In other words, they want the memory model to
> forbid the following litmus test:
>
<snip>
> 
> I (and others!) would like to know people's opinions on these matters.
> 
> Alan Stern

While we're here, let me ask about another test which isn't directly
about unlock/lock but which is still somewhat related to this
discussion:

"MP+wmb+xchg-acq" (or some such)

{}

P0(int *x, int *y)
{
        WRITE_ONCE(*x, 1);
        smp_wmb();
        WRITE_ONCE(*y, 1);
}

P1(int *x, int *y)
{
        r1 = atomic_xchg_relaxed(y, 2);
        r2 = smp_load_acquire(y);
        r3 = READ_ONCE(*x);
}

exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)

C/C++ would call the atomic_xchg_relaxed part of a release sequence
and hence would forbid this outcome.

x86 and Power would forbid this.  ARM forbids this via a special-case
rule in the memory model, ordering atomics with later load-acquires.

RISC-V, however, wouldn't forbid this by default using RCpc or RCsc
atomics for smp_load_acquire().  It's an "fri; rfi" type of pattern,
because xchg doesn't have an inherent internal data dependency.

If the Linux memory model is going to forbid this outcome, then
RISC-V would either need to use fences instead, or maybe we'd need to
add a special rule to our memory model similarly.  This is one detail
where RISC-V is still actively deciding what to do.

Have you all thought about this test before?  Any idea which way you
are leaning regarding the outcome above?

Thanks,
Dan


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-29 19:04   ` Daniel Lustig
@ 2017-11-29 19:33     ` Paul E. McKenney
  2017-11-29 19:44     ` Alan Stern
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2017-11-29 19:33 UTC (permalink / raw)
  To: Daniel Lustig
  Cc: Alan Stern, Andrea Parri, Luc Maranget, Jade Alglave, Boqun Feng,
	Nicholas Piggin, Peter Zijlstra, Will Deacon, David Howells,
	Palmer Dabbelt, Kernel development list

On Wed, Nov 29, 2017 at 11:04:53AM -0800, Daniel Lustig wrote:
> On 11/27/2017 1:16 PM, Alan Stern wrote:
> > This is essentially a repeat of an email I sent out before the
> > Thanksgiving holiday, the assumption being that lack of any responses
> > was caused by the holiday break.  (And this time the message is CC'ed
> > to LKML, so there will be a public record of it.)
> > 
> > A few people have said they believe the Linux Kernel Memory Model
> > should make unlock followed by lock (of the same variable) act as a
> > write memory barrier.  In other words, they want the memory model to
> > forbid the following litmus test:
> >
> <snip>
> > 
> > I (and others!) would like to know people's opinions on these matters.
> > 
> > Alan Stern
> 
> While we're here, let me ask about another test which isn't directly
> about unlock/lock but which is still somewhat related to this
> discussion:
> 
> "MP+wmb+xchg-acq" (or some such)

If you make the above be "C MP+wmb+xchg-acq", then this is currently
allowed by the current version of the Linux kernel memory model.
Also by the hardware model, interestingly enough.

						Thanx, Paul

> {}
> 
> P0(int *x, int *y)
> {
>         WRITE_ONCE(*x, 1);
>         smp_wmb();
>         WRITE_ONCE(*y, 1);
> }
> 
> P1(int *x, int *y)
> {
>         r1 = atomic_xchg_relaxed(y, 2);
>         r2 = smp_load_acquire(y);
>         r3 = READ_ONCE(*x);
> }
> 
> exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)
> 
> C/C++ would call the atomic_xchg_relaxed part of a release sequence
> and hence would forbid this outcome.
> 
> x86 and Power would forbid this.  ARM forbids this via a special-case
> rule in the memory model, ordering atomics with later load-acquires.
> 
> RISC-V, however, wouldn't forbid this by default using RCpc or RCsc
> atomics for smp_load_acquire().  It's an "fri; rfi" type of pattern,
> because xchg doesn't have an inherent internal data dependency.
> 
> If the Linux memory model is going to forbid this outcome, then
> RISC-V would either need to use fences instead, or maybe we'd need to
> add a special rule to our memory model similarly.  This is one detail
> where RISC-V is still actively deciding what to do.
> 
> Have you all thought about this test before?  Any idea which way you
> are leaning regarding the outcome above?
> 
> Thanks,
> Dan
> 


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-29 19:04   ` Daniel Lustig
  2017-11-29 19:33     ` Paul E. McKenney
@ 2017-11-29 19:44     ` Alan Stern
  2017-11-30  8:55       ` Boqun Feng
  2017-11-29 19:46     ` Peter Zijlstra
  2017-11-29 19:58     ` Peter Zijlstra
  3 siblings, 1 reply; 30+ messages in thread
From: Alan Stern @ 2017-11-29 19:44 UTC (permalink / raw)
  To: Daniel Lustig
  Cc: Paul E. McKenney, Andrea Parri, Luc Maranget, Jade Alglave,
	Boqun Feng, Nicholas Piggin, Peter Zijlstra, Will Deacon,
	David Howells, Palmer Dabbelt, Kernel development list

On Wed, 29 Nov 2017, Daniel Lustig wrote:

> While we're here, let me ask about another test which isn't directly
> about unlock/lock but which is still somewhat related to this
> discussion:
> 
> "MP+wmb+xchg-acq" (or some such)
> 
> {}
> 
> P0(int *x, int *y)
> {
>         WRITE_ONCE(*x, 1);
>         smp_wmb();
>         WRITE_ONCE(*y, 1);
> }
> 
> P1(int *x, int *y)
> {
>         r1 = atomic_xchg_relaxed(y, 2);
>         r2 = smp_load_acquire(y);
>         r3 = READ_ONCE(*x);
> }
> 
> exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)
> 
> C/C++ would call the atomic_xchg_relaxed part of a release sequence
> and hence would forbid this outcome.
> 
> x86 and Power would forbid this.  ARM forbids this via a special-case
> rule in the memory model, ordering atomics with later load-acquires.
> 
> RISC-V, however, wouldn't forbid this by default using RCpc or RCsc
> atomics for smp_load_acquire().  It's an "fri; rfi" type of pattern,
> because xchg doesn't have an inherent internal data dependency.
> 
> If the Linux memory model is going to forbid this outcome, then
> RISC-V would either need to use fences instead, or maybe we'd need to
> add a special rule to our memory model similarly.  This is one detail
> where RISC-V is still actively deciding what to do.
> 
> Have you all thought about this test before?  Any idea which way you
> are leaning regarding the outcome above?

Good questions.  Currently the LKMM allows this, and I think it should
because xchg doesn't have a dependency from its read to its write.

On the other hand, herd isn't careful enough in the way it implements 
internal dependencies for RMW operations.  If we change 
atomic_xchg_relaxed(y, 2) to atomic_inc(y) and remove r1 from the test:

C MP+wmb+inc-acq

{}

P0(int *x, int *y)
{
        WRITE_ONCE(*x, 1);
        smp_wmb();
        WRITE_ONCE(*y, 1);
}

P1(int *x, int *y)
{
        atomic_inc(y);
        r2 = smp_load_acquire(y);
        r3 = READ_ONCE(*x);
}

exists (1:r2=2 /\ 1:r3=0)

then the test _should_ be forbidden, but it isn't -- herd doesn't
realize that all atomic RMW operations other than xchg must have a
dependency (either data or control) between their internal read and
write.

(Although the smp_load_acquire is allowed to execute before the write 
part of the atomic_inc, it cannot execute before the read part.  I 
think a similar argument applies even on ARM.)

Luc, consider this a bug report.  :-)

Alan


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-29 19:04   ` Daniel Lustig
  2017-11-29 19:33     ` Paul E. McKenney
  2017-11-29 19:44     ` Alan Stern
@ 2017-11-29 19:46     ` Peter Zijlstra
  2017-11-29 19:53       ` Alan Stern
  2017-11-30 10:02       ` Will Deacon
  2017-11-29 19:58     ` Peter Zijlstra
  3 siblings, 2 replies; 30+ messages in thread
From: Peter Zijlstra @ 2017-11-29 19:46 UTC (permalink / raw)
  To: Daniel Lustig
  Cc: Alan Stern, Paul E. McKenney, Andrea Parri, Luc Maranget,
	Jade Alglave, Boqun Feng, Nicholas Piggin, Will Deacon,
	David Howells, Palmer Dabbelt, Kernel development list

On Wed, Nov 29, 2017 at 11:04:53AM -0800, Daniel Lustig wrote:

> While we're here, let me ask about another test which isn't directly
> about unlock/lock but which is still somewhat related to this
> discussion:
> 
> "MP+wmb+xchg-acq" (or some such)
> 
> {}
> 
> P0(int *x, int *y)
> {
>         WRITE_ONCE(*x, 1);
>         smp_wmb();
>         WRITE_ONCE(*y, 1);
> }
> 
> P1(int *x, int *y)
> {
>         r1 = atomic_xchg_relaxed(y, 2);
>         r2 = smp_load_acquire(y);
>         r3 = READ_ONCE(*x);
> }
> 
> exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)
> 
> C/C++ would call the atomic_xchg_relaxed part of a release sequence
> and hence would forbid this outcome.

That's just weird. Either it's _relaxed, or it's _release. Making _relaxed
mean _release is just daft.

> x86 and Power would forbid this.  ARM forbids this via a special-case
> rule in the memory model, ordering atomics with later load-acquires.

Curious, I did not know about that rule. I would've thought ARM would in
fact allow it.

> RISC-V, however, wouldn't forbid this by default using RCpc or RCsc
> atomics for smp_load_acquire().  It's an "fri; rfi" type of pattern,
> because xchg doesn't have an inherent internal data dependency.
> 
> If the Linux memory model is going to forbid this outcome, then
> RISC-V would either need to use fences instead, or maybe we'd need to
> add a special rule to our memory model similarly.  This is one detail
> where RISC-V is still actively deciding what to do.
> 
> Have you all thought about this test before?  Any idea which way you
> are leaning regarding the outcome above?

FWIW I would expect the reorder to be allowed.


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-29 19:46     ` Peter Zijlstra
@ 2017-11-29 19:53       ` Alan Stern
  2017-11-29 20:42         ` Paul E. McKenney
  2017-11-30 10:02       ` Will Deacon
  1 sibling, 1 reply; 30+ messages in thread
From: Alan Stern @ 2017-11-29 19:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Lustig, Paul E. McKenney, Andrea Parri, Luc Maranget,
	Jade Alglave, Boqun Feng, Nicholas Piggin, Will Deacon,
	David Howells, Palmer Dabbelt, Kernel development list

On Wed, 29 Nov 2017, Peter Zijlstra wrote:

> On Wed, Nov 29, 2017 at 11:04:53AM -0800, Daniel Lustig wrote:
> 
> > While we're here, let me ask about another test which isn't directly
> > about unlock/lock but which is still somewhat related to this
> > discussion:
> > 
> > "MP+wmb+xchg-acq" (or some such)
> > 
> > {}
> > 
> > P0(int *x, int *y)
> > {
> >         WRITE_ONCE(*x, 1);
> >         smp_wmb();
> >         WRITE_ONCE(*y, 1);
> > }
> > 
> > P1(int *x, int *y)
> > {
> >         r1 = atomic_xchg_relaxed(y, 2);
> >         r2 = smp_load_acquire(y);
> >         r3 = READ_ONCE(*x);
> > }
> > 
> > exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)
> > 
> > C/C++ would call the atomic_xchg_relaxed part of a release sequence
> > and hence would forbid this outcome.
> 
> That's just weird. Either it's _relaxed, or it's _release. Making _relaxed
> mean _release is just daft.

The C11 memory model specifically allows atomic operations to be 
interspersed within a release sequence.  But it doesn't say why.

Alan


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-29 19:04   ` Daniel Lustig
                       ` (2 preceding siblings ...)
  2017-11-29 19:46     ` Peter Zijlstra
@ 2017-11-29 19:58     ` Peter Zijlstra
  3 siblings, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2017-11-29 19:58 UTC (permalink / raw)
  To: Daniel Lustig
  Cc: Alan Stern, Paul E. McKenney, Andrea Parri, Luc Maranget,
	Jade Alglave, Boqun Feng, Nicholas Piggin, Will Deacon,
	David Howells, Palmer Dabbelt, Kernel development list

On Wed, Nov 29, 2017 at 11:04:53AM -0800, Daniel Lustig wrote:
> "MP+wmb+xchg-acq" (or some such)
> 
> {}
> 
> P0(int *x, int *y)
> {
>         WRITE_ONCE(*x, 1);
>         smp_wmb();
>         WRITE_ONCE(*y, 1);
> }
> 
> P1(int *x, int *y)
> {
>         r1 = atomic_xchg_relaxed(y, 2);
>         r2 = smp_load_acquire(y);

Oh, I wasn't careful enough reading; these are both y.

And then Alan raises a good point in that RmW have dependencies.

Much tricker than I initially thought.

>         r3 = READ_ONCE(*x);
> }
> 
> exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-29 19:53       ` Alan Stern
@ 2017-11-29 20:42         ` Paul E. McKenney
  2017-11-29 22:18           ` Daniel Lustig
  0 siblings, 1 reply; 30+ messages in thread
From: Paul E. McKenney @ 2017-11-29 20:42 UTC (permalink / raw)
  To: Alan Stern
  Cc: Peter Zijlstra, Daniel Lustig, Andrea Parri, Luc Maranget,
	Jade Alglave, Boqun Feng, Nicholas Piggin, Will Deacon,
	David Howells, Palmer Dabbelt, Kernel development list

On Wed, Nov 29, 2017 at 02:53:06PM -0500, Alan Stern wrote:
> On Wed, 29 Nov 2017, Peter Zijlstra wrote:
> 
> > On Wed, Nov 29, 2017 at 11:04:53AM -0800, Daniel Lustig wrote:
> > 
> > > While we're here, let me ask about another test which isn't directly
> > > about unlock/lock but which is still somewhat related to this
> > > discussion:
> > > 
> > > "MP+wmb+xchg-acq" (or some such)
> > > 
> > > {}
> > > 
> > > P0(int *x, int *y)
> > > {
> > >         WRITE_ONCE(*x, 1);
> > >         smp_wmb();
> > >         WRITE_ONCE(*y, 1);
> > > }
> > > 
> > > P1(int *x, int *y)
> > > {
> > >         r1 = atomic_xchg_relaxed(y, 2);
> > >         r2 = smp_load_acquire(y);
> > >         r3 = READ_ONCE(*x);
> > > }
> > > 
> > > exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)
> > > 
> > > C/C++ would call the atomic_xchg_relaxed part of a release sequence
> > > and hence would forbid this outcome.
> > 
> > That's just weird. Either it's _relaxed, or it's _release. Making _relaxed
> > mean _release is just daft.
> 
> The C11 memory model specifically allows atomic operations to be 
> interspersed within a release sequence.  But it doesn't say why.

The use case put forward within the committee is for atomic quantities
with mode bits.  The most frequent has the atomic quantity having
lock-like properties, in which case you don't want to lose the ordering
effects of the lock handoff just because a mode bit got set or cleared.
Some claim to actually use something like this, but details have not
been forthcoming.

I confess to being a bit skeptical.  If the mode changes are infrequent,
the update could just as well be ordered.

That said, Daniel, the C++ memory model really does require that the
above litmus test be forbidden, my denigration of it notwithstanding.

							Thanx, Paul


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-29 20:42         ` Paul E. McKenney
@ 2017-11-29 22:18           ` Daniel Lustig
  2017-11-29 22:59             ` Paul E. McKenney
  2017-11-30 15:20             ` Alan Stern
  0 siblings, 2 replies; 30+ messages in thread
From: Daniel Lustig @ 2017-11-29 22:18 UTC (permalink / raw)
  To: paulmck, Alan Stern
  Cc: Peter Zijlstra, Andrea Parri, Luc Maranget, Jade Alglave,
	Boqun Feng, Nicholas Piggin, Will Deacon, David Howells,
	Palmer Dabbelt, Kernel development list

On 11/29/2017 12:42 PM, Paul E. McKenney wrote:
> On Wed, Nov 29, 2017 at 02:53:06PM -0500, Alan Stern wrote:
>> On Wed, 29 Nov 2017, Peter Zijlstra wrote:
>>
>>> On Wed, Nov 29, 2017 at 11:04:53AM -0800, Daniel Lustig wrote:
>>>
>>>> While we're here, let me ask about another test which isn't directly
>>>> about unlock/lock but which is still somewhat related to this
>>>> discussion:
>>>>
>>>> "MP+wmb+xchg-acq" (or some such)
>>>>
>>>> {}
>>>>
>>>> P0(int *x, int *y)
>>>> {
>>>>         WRITE_ONCE(*x, 1);
>>>>         smp_wmb();
>>>>         WRITE_ONCE(*y, 1);
>>>> }
>>>>
>>>> P1(int *x, int *y)
>>>> {
>>>>         r1 = atomic_xchg_relaxed(y, 2);
>>>>         r2 = smp_load_acquire(y);
>>>>         r3 = READ_ONCE(*x);
>>>> }
>>>>
>>>> exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)
>>>>
>>>> C/C++ would call the atomic_xchg_relaxed part of a release sequence
>>>> and hence would forbid this outcome.
>>>
>>> That's just weird. Either it's _relaxed, or it's _release. Making _relaxed
>>> mean _release is just daft.
>>
>> The C11 memory model specifically allows atomic operations to be 
>> interspersed within a release sequence.  But it doesn't say why.
> 
> The use case put forward within the committee is for atomic quantities
> with mode bits.  The most frequent has the atomic quantity having
> lock-like properties, in which case you don't want to lose the ordering
> effects of the lock handoff just because a mode bit got set or cleared.
> Some claim to actually use something like this, but details have not
> been forthcoming.
> 
> I confess to being a bit skeptical.  If the mode changes are infrequent,
> the update could just as well be ordered.

Aren't reference counting implementations which use memory_order_relaxed
for incrementing the count another important use case?  Specifically,
the synchronization between a memory_order_release decrement and the
eventual memory_order_acquire/consume free shouldn't be interrupted by
other (relaxed) increments and (release-only) decrements that happen in
between.  At least that's my understanding of this use case.  I wasn't
there when the C/C++ committee decided this.

> That said, Daniel, the C++ memory model really does require that the
> above litmus test be forbidden, my denigration of it notwithstanding.

Yes I agree, that's why I'm curious what the Linux memory model has
in mind here :)

Dan

> 							Thanx, Paul
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-29 22:18           ` Daniel Lustig
@ 2017-11-29 22:59             ` Paul E. McKenney
  2017-11-30 15:20             ` Alan Stern
  1 sibling, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2017-11-29 22:59 UTC (permalink / raw)
  To: Daniel Lustig
  Cc: Alan Stern, Peter Zijlstra, Andrea Parri, Luc Maranget,
	Jade Alglave, Boqun Feng, Nicholas Piggin, Will Deacon,
	David Howells, Palmer Dabbelt, Kernel development list

On Wed, Nov 29, 2017 at 02:18:48PM -0800, Daniel Lustig wrote:
> On 11/29/2017 12:42 PM, Paul E. McKenney wrote:
> > On Wed, Nov 29, 2017 at 02:53:06PM -0500, Alan Stern wrote:
> >> On Wed, 29 Nov 2017, Peter Zijlstra wrote:
> >>
> >>> On Wed, Nov 29, 2017 at 11:04:53AM -0800, Daniel Lustig wrote:
> >>>
> >>>> While we're here, let me ask about another test which isn't directly
> >>>> about unlock/lock but which is still somewhat related to this
> >>>> discussion:
> >>>>
> >>>> "MP+wmb+xchg-acq" (or some such)
> >>>>
> >>>> {}
> >>>>
> >>>> P0(int *x, int *y)
> >>>> {
> >>>>         WRITE_ONCE(*x, 1);
> >>>>         smp_wmb();
> >>>>         WRITE_ONCE(*y, 1);
> >>>> }
> >>>>
> >>>> P1(int *x, int *y)
> >>>> {
> >>>>         r1 = atomic_xchg_relaxed(y, 2);
> >>>>         r2 = smp_load_acquire(y);
> >>>>         r3 = READ_ONCE(*x);
> >>>> }
> >>>>
> >>>> exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)
> >>>>
> >>>> C/C++ would call the atomic_xchg_relaxed part of a release sequence
> >>>> and hence would forbid this outcome.
> >>>
> >>> That's just weird. Either it's _relaxed, or it's _release. Making _relaxed
> >>> mean _release is just daft.
> >>
> >> The C11 memory model specifically allows atomic operations to be 
> >> interspersed within a release sequence.  But it doesn't say why.
> > 
> > The use case put forward within the committee is for atomic quantities
> > with mode bits.  The most frequent has the atomic quantity having
> > lock-like properties, in which case you don't want to lose the ordering
> > effects of the lock handoff just because a mode bit got set or cleared.
> > Some claim to actually use something like this, but details have not
> > been forthcoming.
> > 
> > I confess to being a bit skeptical.  If the mode changes are infrequent,
> > the update could just as well be ordered.
> 
> Aren't reference counting implementations which use memory_order_relaxed
> for incrementing the count another important use case?  Specifically,
> the synchronization between a memory_order_release decrement and the
> eventual memory_order_acquire/consume free shouldn't be interrupted by
> other (relaxed) increments and (release-only) decrements that happen in
> between.  At least that's my understanding of this use case.  I wasn't
> there when the C/C++ committee decided this.

Well, C++ release sequences will likely soon not order memory_order_consume
loads: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0735r0.html

So we were hoping to avoid release sequences entirely.  But if someone
in the kernel really is using them, we will have to model them, though
only those interacting with acquire loads.

> > That said, Daniel, the C++ memory model really does require that the
> > above litmus test be forbidden, my denigration of it notwithstanding.
> 
> Yes I agree, that's why I'm curious what the Linux memory model has
> in mind here :)

Read P0735R0 (the above URL) and then tell me with a straight face that
you would not also have been tempted.  ;-)

							Thanx, Paul


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-29 19:44     ` Alan Stern
@ 2017-11-30  8:55       ` Boqun Feng
  2017-11-30  9:15         ` Peter Zijlstra
  2017-11-30 15:46         ` Alan Stern
  0 siblings, 2 replies; 30+ messages in thread
From: Boqun Feng @ 2017-11-30  8:55 UTC (permalink / raw)
  To: Alan Stern
  Cc: Daniel Lustig, Paul E. McKenney, Andrea Parri, Luc Maranget,
	Jade Alglave, Nicholas Piggin, Peter Zijlstra, Will Deacon,
	David Howells, Palmer Dabbelt, Kernel development list


On Wed, Nov 29, 2017 at 02:44:37PM -0500, Alan Stern wrote:
> On Wed, 29 Nov 2017, Daniel Lustig wrote:
> 
> > While we're here, let me ask about another test which isn't directly
> > about unlock/lock but which is still somewhat related to this
> > discussion:
> > 
> > "MP+wmb+xchg-acq" (or some such)
> > 
> > {}
> > 
> > P0(int *x, int *y)
> > {
> >         WRITE_ONCE(*x, 1);
> >         smp_wmb();
> >         WRITE_ONCE(*y, 1);
> > }
> > 
> > P1(int *x, int *y)
> > {
> >         r1 = atomic_xchg_relaxed(y, 2);
> >         r2 = smp_load_acquire(y);
> >         r3 = READ_ONCE(*x);
> > }
> > 
> > exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)
> > 
> > C/C++ would call the atomic_xchg_relaxed part of a release sequence
> > and hence would forbid this outcome.
> > 
> > x86 and Power would forbid this.  ARM forbids this via a special-case
> > rule in the memory model, ordering atomics with later load-acquires.
> > 
> > RISC-V, however, wouldn't forbid this by default using RCpc or RCsc
> > atomics for smp_load_acquire().  It's an "fri; rfi" type of pattern,
> > because xchg doesn't have an inherent internal data dependency.
> > 
> > If the Linux memory model is going to forbid this outcome, then
> > RISC-V would either need to use fences instead, or maybe we'd need to
> > add a special rule to our memory model similarly.  This is one detail
> > where RISC-V is still actively deciding what to do.
> > 
> > Have you all thought about this test before?  Any idea which way you
> > are leaning regarding the outcome above?
> 
> Good questions.  Currently the LKMM allows this, and I think it should
> because xchg doesn't have a dependency from its read to its write.
> 
> On the other hand, herd isn't careful enough in the way it implements 
> internal dependencies for RMW operations.  If we change 
> atomic_xchg_relaxed(y, 2) to atomic_inc(y) and remove r1 from the test:
> 
> C MP+wmb+inc-acq
> 
> {}
> 
> P0(int *x, int *y)
> {
>         WRITE_ONCE(*x, 1);
>         smp_wmb();
>         WRITE_ONCE(*y, 1);
> }
> 
> P1(int *x, int *y)
> {
>         atomic_inc(y);
>         r2 = smp_load_acquire(y);
>         r3 = READ_ONCE(*x);
> }
> 
> exists (1:r2=2 /\ 1:r3=0)
> 
> then the test _should_ be forbidden, but it isn't -- herd doesn't
> realize that all atomic RMW operations other than xchg must have a
> dependency (either data or control) between their internal read and
> write.
> 
> (Although the smp_load_acquire is allowed to execute before the write 
> part of the atomic_inc, it cannot execute before the read part.  I 
> think a similar argument applies even on ARM.)
> 

But in the case of AMOs, which send the addition request directly to the
memory controller, there wouldn't be any read part, or even a write part,
of the atomic_inc() executed by the CPU.  Would this be allowed then?

Regards,
Boqun

> Luc, consider this a bug report.  :-)
> 
> Alan
> 



* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-30  8:55       ` Boqun Feng
@ 2017-11-30  9:15         ` Peter Zijlstra
  2017-11-30 15:46         ` Alan Stern
  1 sibling, 0 replies; 30+ messages in thread
From: Peter Zijlstra @ 2017-11-30  9:15 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Alan Stern, Daniel Lustig, Paul E. McKenney, Andrea Parri,
	Luc Maranget, Jade Alglave, Nicholas Piggin, Will Deacon,
	David Howells, Palmer Dabbelt, Kernel development list

On Thu, Nov 30, 2017 at 04:55:09PM +0800, Boqun Feng wrote:
> But in the case of AMOs, which send the addition request directly to the
> memory controller, there wouldn't be any read part, or even a write part,
> of the atomic_inc() executed by the CPU.  Would this be allowed then?

Personally I bloody hate AMOs that don't respect the normal way of
things.


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-29 19:46     ` Peter Zijlstra
  2017-11-29 19:53       ` Alan Stern
@ 2017-11-30 10:02       ` Will Deacon
  1 sibling, 0 replies; 30+ messages in thread
From: Will Deacon @ 2017-11-30 10:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Daniel Lustig, Alan Stern, Paul E. McKenney, Andrea Parri,
	Luc Maranget, Jade Alglave, Boqun Feng, Nicholas Piggin,
	David Howells, Palmer Dabbelt, Kernel development list

On Wed, Nov 29, 2017 at 08:46:02PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 29, 2017 at 11:04:53AM -0800, Daniel Lustig wrote:
> 
> > While we're here, let me ask about another test which isn't directly
> > about unlock/lock but which is still somewhat related to this
> > discussion:
> > 
> > "MP+wmb+xchg-acq" (or some such)
> > 
> > {}
> > 
> > P0(int *x, int *y)
> > {
> >         WRITE_ONCE(*x, 1);
> >         smp_wmb();
> >         WRITE_ONCE(*y, 1);
> > }
> > 
> > P1(int *x, int *y)
> > {
> >         r1 = atomic_xchg_relaxed(y, 2);
> >         r2 = smp_load_acquire(y);
> >         r3 = READ_ONCE(*x);
> > }
> > 
> > exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)
> > 
> > C/C++ would call the atomic_xchg_relaxed part of a release sequence
> > and hence would forbid this outcome.
> 
> That's just weird. Either it's _relaxed, or it's _release. Making _relaxed
> mean _release is just daft.

I don't think it's actually that weird. If, for example, the write to *y in
P0 was part of an UNLOCK operation and the load_acquire of y in P1 was a
LOCK operation, then the xchg could just be setting a waiting bit somewhere
else in the lock word. C/C++ also requires order here if the xchg is
done on its own thread.

Will


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-29 22:18           ` Daniel Lustig
  2017-11-29 22:59             ` Paul E. McKenney
@ 2017-11-30 15:20             ` Alan Stern
  2017-11-30 16:14               ` Paul E. McKenney
  1 sibling, 1 reply; 30+ messages in thread
From: Alan Stern @ 2017-11-30 15:20 UTC (permalink / raw)
  To: Daniel Lustig
  Cc: paulmck, Peter Zijlstra, Andrea Parri, Luc Maranget,
	Jade Alglave, Boqun Feng, Nicholas Piggin, Will Deacon,
	David Howells, Palmer Dabbelt, Kernel development list

On Wed, 29 Nov 2017, Daniel Lustig wrote:

> On 11/29/2017 12:42 PM, Paul E. McKenney wrote:
> > On Wed, Nov 29, 2017 at 02:53:06PM -0500, Alan Stern wrote:
> >> On Wed, 29 Nov 2017, Peter Zijlstra wrote:
> >>
> >>> On Wed, Nov 29, 2017 at 11:04:53AM -0800, Daniel Lustig wrote:
> >>>
> >>>> While we're here, let me ask about another test which isn't directly
> >>>> about unlock/lock but which is still somewhat related to this
> >>>> discussion:
> >>>>
> >>>> "MP+wmb+xchg-acq" (or some such)
> >>>>
> >>>> {}
> >>>>
> >>>> P0(int *x, int *y)
> >>>> {
> >>>>         WRITE_ONCE(*x, 1);
> >>>>         smp_wmb();
> >>>>         WRITE_ONCE(*y, 1);
> >>>> }
> >>>>
> >>>> P1(int *x, int *y)
> >>>> {
> >>>>         r1 = atomic_xchg_relaxed(y, 2);
> >>>>         r2 = smp_load_acquire(y);
> >>>>         r3 = READ_ONCE(*x);
> >>>> }
> >>>>
> >>>> exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)
> >>>>
> >>>> C/C++ would call the atomic_xchg_relaxed part of a release sequence
> >>>> and hence would forbid this outcome.
> >>>
> >>> That's just weird. Either it's _relaxed, or it's _release. Making _relaxed
> >>> mean _release is just daft.
> >>
> >> The C11 memory model specifically allows atomic operations to be 
> >> interspersed within a release sequence.  But it doesn't say why.
> > 
> > The use case put forward within the committee is for atomic quantities
> > with mode bits.  The most frequent has the atomic quantity having
> > lock-like properties, in which case you don't want to lose the ordering
> > effects of the lock handoff just because a mode bit got set or cleared.
> > Some claim to actually use something like this, but details have not
> > been forthcoming.
> > 
> > I confess to being a bit skeptical.  If the mode changes are infrequent,
> > the update could just as well be ordered.
> 
> Aren't reference counting implementations which use memory_order_relaxed
> for incrementing the count another important use case?  Specifically,
> the synchronization between a memory_order_release decrement and the
> eventual memory_order_acquire/consume free shouldn't be interrupted by
> other (relaxed) increments and (release-only) decrements that happen in
> between.  At least that's my understanding of this use case.  I wasn't
> there when the C/C++ committee decided this.
> 
> > That said, Daniel, the C++ memory model really does require that the
> > above litmus test be forbidden, my denigration of it notwithstanding.
> 
> Yes I agree, that's why I'm curious what the Linux memory model has
> in mind here :)

Bear in mind that the litmus test above uses xchg, not increment or 
decrement.  This makes a difference as far as the LKMM is concerned, 
even if not for C/C++.

(Also, technically speaking, the litmus test doesn't have any release 
operations, so no release sequence...)

Alan Stern


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-30  8:55       ` Boqun Feng
  2017-11-30  9:15         ` Peter Zijlstra
@ 2017-11-30 15:46         ` Alan Stern
  2017-12-01  2:46           ` Boqun Feng
  1 sibling, 1 reply; 30+ messages in thread
From: Alan Stern @ 2017-11-30 15:46 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Daniel Lustig, Paul E. McKenney, Andrea Parri, Luc Maranget,
	Jade Alglave, Nicholas Piggin, Peter Zijlstra, Will Deacon,
	David Howells, Palmer Dabbelt, Kernel development list

On Thu, 30 Nov 2017, Boqun Feng wrote:

> On Wed, Nov 29, 2017 at 02:44:37PM -0500, Alan Stern wrote:
> > On Wed, 29 Nov 2017, Daniel Lustig wrote:
> > 
> > > While we're here, let me ask about another test which isn't directly
> > > about unlock/lock but which is still somewhat related to this
> > > discussion:
> > > 
> > > "MP+wmb+xchg-acq" (or some such)
> > > 
> > > {}
> > > 
> > > P0(int *x, int *y)
> > > {
> > >         WRITE_ONCE(*x, 1);
> > >         smp_wmb();
> > >         WRITE_ONCE(*y, 1);
> > > }
> > > 
> > > P1(int *x, int *y)
> > > {
> > >         r1 = atomic_xchg_relaxed(y, 2);
> > >         r2 = smp_load_acquire(y);
> > >         r3 = READ_ONCE(*x);
> > > }
> > > 
> > > exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)
> > > 
> > > C/C++ would call the atomic_xchg_relaxed part of a release sequence
> > > and hence would forbid this outcome.
> > > 
> > > x86 and Power would forbid this.  ARM forbids this via a special-case
> > > rule in the memory model, ordering atomics with later load-acquires.
> > > 
> > > RISC-V, however, wouldn't forbid this by default using RCpc or RCsc
> > > atomics for smp_load_acquire().  It's an "fri; rfi" type of pattern,
> > > because xchg doesn't have an inherent internal data dependency.
> > > 
> > > If the Linux memory model is going to forbid this outcome, then
> > > RISC-V would either need to use fences instead, or maybe we'd need to
> > > add a special rule to our memory model similarly.  This is one detail
> > > where RISC-V is still actively deciding what to do.
> > > 
> > > Have you all thought about this test before?  Any idea which way you
> > > are leaning regarding the outcome above?
> > 
> > Good questions.  Currently the LKMM allows this, and I think it should
> > because xchg doesn't have a dependency from its read to its write.
> > 
> > On the other hand, herd isn't careful enough in the way it implements 
> > internal dependencies for RMW operations.  If we change 
> > atomic_xchg_relaxed(y, 2) to atomic_inc(y) and remove r1 from the test:
> > 
> > C MP+wmb+inc-acq
> > 
> > {}
> > 
> > P0(int *x, int *y)
> > {
> >         WRITE_ONCE(*x, 1);
> >         smp_wmb();
> >         WRITE_ONCE(*y, 1);
> > }
> > 
> > P1(int *x, int *y)
> > {
> >         atomic_inc(y);
> >         r2 = smp_load_acquire(y);
> >         r3 = READ_ONCE(*x);
> > }
> > 
> > exists (1:r2=2 /\ 1:r3=0)
> > 
> > then the test _should_ be forbidden, but it isn't -- herd doesn't
> > realize that all atomic RMW operations other than xchg must have a
> > dependency (either data or control) between their internal read and
> > write.
> > 
> > (Although the smp_load_acquire is allowed to execute before the write 
> > part of the atomic_inc, it cannot execute before the read part.  I 
> > think a similar argument applies even on ARM.)
> > 
> 
> But in the case of AMOs, which send the addition request directly to the
> memory controller, there wouldn't be any read part, or even a write part,
> of the atomic_inc() executed by the CPU.  Would this be allowed then?

Firstly, sending the addition request to the memory controller _is_ a
write operation.

Secondly, even though the CPU hardware might not execute a read 
operation during an AMO, the LKMM and herd nevertheless represent the 
atomic update as a specially-annotated read event followed by a write 
event.

In an other-multicopy-atomic system, P0's write to y must become
visible to P1 before P1 executes the smp_load_acquire, because the
write was visible to the memory controller when the controller carried
out the AMO, and the write becomes visible to the memory controller and
to P1 at the same time (by other-multicopy-atomicity).  That's why I
said the test would be forbidden on ARM.

But even on a non-other-multicopy-atomic system, there has to be some 
synchronization between the memory controller and P1's CPU.  Otherwise, 
how could the system guarantee that P1's smp_load_acquire would see the 
post-increment value of y?  It seems reasonable to assume that this 
synchronization would also cause P1 to see x=1.

Alan Stern


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-30 15:20             ` Alan Stern
@ 2017-11-30 16:14               ` Paul E. McKenney
  2017-11-30 16:25                 ` Peter Zijlstra
  2017-11-30 16:41                 ` Will Deacon
  0 siblings, 2 replies; 30+ messages in thread
From: Paul E. McKenney @ 2017-11-30 16:14 UTC (permalink / raw)
  To: Alan Stern
  Cc: Daniel Lustig, Peter Zijlstra, Andrea Parri, Luc Maranget,
	Jade Alglave, Boqun Feng, Nicholas Piggin, Will Deacon,
	David Howells, Palmer Dabbelt, Kernel development list

On Thu, Nov 30, 2017 at 10:20:02AM -0500, Alan Stern wrote:
> On Wed, 29 Nov 2017, Daniel Lustig wrote:
> 
> > On 11/29/2017 12:42 PM, Paul E. McKenney wrote:
> > > On Wed, Nov 29, 2017 at 02:53:06PM -0500, Alan Stern wrote:
> > >> On Wed, 29 Nov 2017, Peter Zijlstra wrote:
> > >>
> > >>> On Wed, Nov 29, 2017 at 11:04:53AM -0800, Daniel Lustig wrote:
> > >>>
> > >>>> While we're here, let me ask about another test which isn't directly
> > >>>> about unlock/lock but which is still somewhat related to this
> > >>>> discussion:
> > >>>>
> > >>>> "MP+wmb+xchg-acq" (or some such)
> > >>>>
> > >>>> {}
> > >>>>
> > >>>> P0(int *x, int *y)
> > >>>> {
> > >>>>         WRITE_ONCE(*x, 1);
> > >>>>         smp_wmb();
> > >>>>         WRITE_ONCE(*y, 1);
> > >>>> }
> > >>>>
> > >>>> P1(int *x, int *y)
> > >>>> {
> > >>>>         r1 = atomic_xchg_relaxed(y, 2);
> > >>>>         r2 = smp_load_acquire(y);
> > >>>>         r3 = READ_ONCE(*x);
> > >>>> }
> > >>>>
> > >>>> exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)
> > >>>>
> > >>>> C/C++ would call the atomic_xchg_relaxed part of a release sequence
> > >>>> and hence would forbid this outcome.
> > >>>
> > >>> That's just weird. Either it's _relaxed, or it's _release. Making _relaxed
> > >>> mean _release is just daft.
> > >>
> > >> The C11 memory model specifically allows atomic operations to be 
> > >> interspersed within a release sequence.  But it doesn't say why.
> > > 
> > > The use case put forward within the committee is for atomic quantities
> > > with mode bits.  The most frequent has the atomic quantity having
> > > lock-like properties, in which case you don't want to lose the ordering
> > > effects of the lock handoff just because a mode bit got set or cleared.
> > > Some claim to actually use something like this, but details have not
> > > been forthcoming.
> > > 
> > > I confess to being a bit skeptical.  If the mode changes are infrequent,
> > > the update could just as well be ordered.
> > 
> > Aren't reference counting implementations which use memory_order_relaxed
> > for incrementing the count another important use case?  Specifically,
> > the synchronization between a memory_order_release decrement and the
> > eventual memory_order_acquire/consume free shouldn't be interrupted by
> > other (relaxed) increments and (release-only) decrements that happen in
> > between.  At least that's my understanding of this use case.  I wasn't
> > there when the C/C++ committee decided this.
> > 
> > > That said, Daniel, the C++ memory model really does require that the
> > > above litmus test be forbidden, my denigration of it notwithstanding.
> > 
> > Yes I agree, that's why I'm curious what the Linux memory model has
> > in mind here :)
> 
> Bear in mind that the litmus test above uses xchg, not increment or 
> decrement.  This makes a difference as far as the LKMM is concerned, 
> even if not for C/C++.

Finally remembering this discussion...  Yes, xchg is special.  ;-)

Will, are there plans to bring this sort of thing before the standards
committee?

> (Also, technically speaking, the litmus test doesn't have any release 
> operations, so no release sequence...)

True!  But if you translated it into C11, you would probably turn the
smp_wmb() followed by the write into a store-release, which would get you
a release sequence.

							Thanx, Paul


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-30 16:14               ` Paul E. McKenney
@ 2017-11-30 16:25                 ` Peter Zijlstra
  2017-11-30 16:39                   ` Paul E. McKenney
  2017-11-30 16:41                 ` Will Deacon
  1 sibling, 1 reply; 30+ messages in thread
From: Peter Zijlstra @ 2017-11-30 16:25 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alan Stern, Daniel Lustig, Andrea Parri, Luc Maranget,
	Jade Alglave, Boqun Feng, Nicholas Piggin, Will Deacon,
	David Howells, Palmer Dabbelt, Kernel development list

On Thu, Nov 30, 2017 at 08:14:01AM -0800, Paul E. McKenney wrote:
> > (Also, technically speaking, the litmus test doesn't have any release 
> > operations, so no release sequence...)
> 
> > True!  But if you translated it into C11, you would probably turn the
> > smp_wmb() followed by the write into a store-release, which would get you
> > a release sequence.

	smp_wmb()
	WRITE_ONCE(*y, 1);

does not a RELEASE make.


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-30 16:25                 ` Peter Zijlstra
@ 2017-11-30 16:39                   ` Paul E. McKenney
  0 siblings, 0 replies; 30+ messages in thread
From: Paul E. McKenney @ 2017-11-30 16:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alan Stern, Daniel Lustig, Andrea Parri, Luc Maranget,
	Jade Alglave, Boqun Feng, Nicholas Piggin, Will Deacon,
	David Howells, Palmer Dabbelt, Kernel development list

On Thu, Nov 30, 2017 at 05:25:01PM +0100, Peter Zijlstra wrote:
> On Thu, Nov 30, 2017 at 08:14:01AM -0800, Paul E. McKenney wrote:
> > > (Also, technically speaking, the litmus test doesn't have any release 
> > > operations, so no release sequence...)
> > 
> > > True!  But if you translated it into C11, you would probably turn the
> > > smp_wmb() followed by the write into a store-release, which would get you
> > > a release sequence.
> 
> 	smp_wmb()
> 	WRITE_ONCE(*y, 1);
> 
> does not a RELEASE make.

Agreed, but it also does not C11 make.  There is no pure write barrier
in C11.

							Thanx, Paul


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-30 16:14               ` Paul E. McKenney
  2017-11-30 16:25                 ` Peter Zijlstra
@ 2017-11-30 16:41                 ` Will Deacon
  2017-11-30 16:54                   ` Paul E. McKenney
  1 sibling, 1 reply; 30+ messages in thread
From: Will Deacon @ 2017-11-30 16:41 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alan Stern, Daniel Lustig, Peter Zijlstra, Andrea Parri,
	Luc Maranget, Jade Alglave, Boqun Feng, Nicholas Piggin,
	David Howells, Palmer Dabbelt, Kernel development list

On Thu, Nov 30, 2017 at 08:14:01AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 30, 2017 at 10:20:02AM -0500, Alan Stern wrote:
> > On Wed, 29 Nov 2017, Daniel Lustig wrote:
> > 
> > > On 11/29/2017 12:42 PM, Paul E. McKenney wrote:
> > > > On Wed, Nov 29, 2017 at 02:53:06PM -0500, Alan Stern wrote:
> > > >> On Wed, 29 Nov 2017, Peter Zijlstra wrote:
> > > >>
> > > >>> On Wed, Nov 29, 2017 at 11:04:53AM -0800, Daniel Lustig wrote:
> > > >>>
> > > >>>> While we're here, let me ask about another test which isn't directly
> > > >>>> about unlock/lock but which is still somewhat related to this
> > > >>>> discussion:
> > > >>>>
> > > >>>> "MP+wmb+xchg-acq" (or some such)
> > > >>>>
> > > >>>> {}
> > > >>>>
> > > >>>> P0(int *x, int *y)
> > > >>>> {
> > > >>>>         WRITE_ONCE(*x, 1);
> > > >>>>         smp_wmb();
> > > >>>>         WRITE_ONCE(*y, 1);
> > > >>>> }
> > > >>>>
> > > >>>> P1(int *x, int *y)
> > > >>>> {
> > > >>>>         r1 = atomic_xchg_relaxed(y, 2);
> > > >>>>         r2 = smp_load_acquire(y);
> > > >>>>         r3 = READ_ONCE(*x);
> > > >>>> }
> > > >>>>
> > > >>>> exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)
> > > >>>>
> > > >>>> C/C++ would call the atomic_xchg_relaxed part of a release sequence
> > > >>>> and hence would forbid this outcome.
> > > >>>
> > > >>> That's just weird. Either it's _relaxed, or it's _release. Making _relaxed
> > > >>> mean _release is just daft.
> > > >>
> > > >> The C11 memory model specifically allows atomic operations to be 
> > > >> interspersed within a release sequence.  But it doesn't say why.
> > > > 
> > > > The use case put forward within the committee is for atomic quantities
> > > > with mode bits.  The most frequent has the atomic quantity having
> > > > lock-like properties, in which case you don't want to lose the ordering
> > > > effects of the lock handoff just because a mode bit got set or cleared.
> > > > Some claim to actually use something like this, but details have not
> > > > been forthcoming.
> > > > 
> > > > I confess to being a bit skeptical.  If the mode changes are infrequent,
> > > > the update could just as well be ordered.
> > > 
> > > Aren't reference counting implementations which use memory_order_relaxed
> > > for incrementing the count another important use case?  Specifically,
> > > the synchronization between a memory_order_release decrement and the
> > > eventual memory_order_acquire/consume free shouldn't be interrupted by
> > > other (relaxed) increments and (release-only) decrements that happen in
> > > between.  At least that's my understanding of this use case.  I wasn't
> > > there when the C/C++ committee decided this.
> > > 
> > > > That said, Daniel, the C++ memory model really does require that the
> > > > above litmus test be forbidden, my denigration of it notwithstanding.
> > > 
> > > Yes I agree, that's why I'm curious what the Linux memory model has
> > > in mind here :)
> > 
> > Bear in mind that the litmus test above uses xchg, not increment or 
> > decrement.  This makes a difference as far as the LKMM is concerned, 
> > even if not for C/C++.
> 
> Finally remembering this discussion...  Yes, xchg is special.  ;-)
> 
> Will, are there plans to bring this sort of thing before the standards
> committee?

We discussed it, but rejected it mainly because of concerns that there could
be RmW operations that don't necessarily have an order-inducing dependency
in all scenarios. I think the case that was batted around was a saturating
add implemented using cmpxchg.

Will


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-30 16:41                 ` Will Deacon
@ 2017-11-30 16:54                   ` Paul E. McKenney
  2017-11-30 17:04                     ` Will Deacon
  2017-11-30 17:56                     ` Alan Stern
  0 siblings, 2 replies; 30+ messages in thread
From: Paul E. McKenney @ 2017-11-30 16:54 UTC (permalink / raw)
  To: Will Deacon
  Cc: Alan Stern, Daniel Lustig, Peter Zijlstra, Andrea Parri,
	Luc Maranget, Jade Alglave, Boqun Feng, Nicholas Piggin,
	David Howells, Palmer Dabbelt, Kernel development list

On Thu, Nov 30, 2017 at 04:41:05PM +0000, Will Deacon wrote:
> On Thu, Nov 30, 2017 at 08:14:01AM -0800, Paul E. McKenney wrote:
> > On Thu, Nov 30, 2017 at 10:20:02AM -0500, Alan Stern wrote:
> > > On Wed, 29 Nov 2017, Daniel Lustig wrote:
> > > 
> > > > On 11/29/2017 12:42 PM, Paul E. McKenney wrote:
> > > > > On Wed, Nov 29, 2017 at 02:53:06PM -0500, Alan Stern wrote:
> > > > >> On Wed, 29 Nov 2017, Peter Zijlstra wrote:
> > > > >>
> > > > >>> On Wed, Nov 29, 2017 at 11:04:53AM -0800, Daniel Lustig wrote:
> > > > >>>
> > > > >>>> While we're here, let me ask about another test which isn't directly
> > > > >>>> about unlock/lock but which is still somewhat related to this
> > > > >>>> discussion:
> > > > >>>>
> > > > >>>> "MP+wmb+xchg-acq" (or some such)
> > > > >>>>
> > > > >>>> {}
> > > > >>>>
> > > > >>>> P0(int *x, int *y)
> > > > >>>> {
> > > > >>>>         WRITE_ONCE(*x, 1);
> > > > >>>>         smp_wmb();
> > > > >>>>         WRITE_ONCE(*y, 1);
> > > > >>>> }
> > > > >>>>
> > > > >>>> P1(int *x, int *y)
> > > > >>>> {
> > > > >>>>         r1 = atomic_xchg_relaxed(y, 2);
> > > > >>>>         r2 = smp_load_acquire(y);
> > > > >>>>         r3 = READ_ONCE(*x);
> > > > >>>> }
> > > > >>>>
> > > > >>>> exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)
> > > > >>>>
> > > > >>>> C/C++ would call the atomic_xchg_relaxed part of a release sequence
> > > > >>>> and hence would forbid this outcome.
> > > > >>>
> > > > >>> That's just weird. Either it's _relaxed, or it's _release. Making _relaxed
> > > > >>> mean _release is just daft.
> > > > >>
> > > > >> The C11 memory model specifically allows atomic operations to be 
> > > > >> interspersed within a release sequence.  But it doesn't say why.
> > > > > 
> > > > > The use case put forward within the committee is for atomic quantities
> > > > > with mode bits.  The most frequent has the atomic quantity having
> > > > > lock-like properties, in which case you don't want to lose the ordering
> > > > > effects of the lock handoff just because a mode bit got set or cleared.
> > > > > Some claim to actually use something like this, but details have not
> > > > > been forthcoming.
> > > > > 
> > > > > I confess to being a bit skeptical.  If the mode changes are infrequent,
> > > > > the update could just as well be ordered.
> > > > 
> > > > Aren't reference counting implementations which use memory_order_relaxed
> > > > for incrementing the count another important use case?  Specifically,
> > > > the synchronization between a memory_order_release decrement and the
> > > > eventual memory_order_acquire/consume free shouldn't be interrupted by
> > > > other (relaxed) increments and (release-only) decrements that happen in
> > > > between.  At least that's my understanding of this use case.  I wasn't
> > > > there when the C/C++ committee decided this.
> > > > 
> > > > > That said, Daniel, the C++ memory model really does require that the
> > > > > above litmus test be forbidden, my denigration of it notwithstanding.
> > > > 
> > > > Yes I agree, that's why I'm curious what the Linux memory model has
> > > > in mind here :)
> > > 
> > > Bear in mind that the litmus test above uses xchg, not increment or 
> > > decrement.  This makes a difference as far as the LKMM is concerned, 
> > > even if not for C/C++.
> > 
> > Finally remembering this discussion...  Yes, xchg is special.  ;-)
> > 
> > Will, are there plans to bring this sort of thing before the standards
> > committee?
> 
> We discussed it, but rejected it mainly because of concerns that there could
> be RmW operations that don't necessarily have an order-inducing dependency
> in all scenarios. I think the case that was batted around was a saturating
> add implemented using cmpxchg.

Ah, I do remember now, during the Toronto meeting, correct?

So should we consider making LKMM make xchg act in a manner similar to
the other atomics, or would you prefer that we keep the current special
behavior?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-30 16:54                   ` Paul E. McKenney
@ 2017-11-30 17:04                     ` Will Deacon
  2017-11-30 17:56                     ` Alan Stern
  1 sibling, 0 replies; 30+ messages in thread
From: Will Deacon @ 2017-11-30 17:04 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Alan Stern, Daniel Lustig, Peter Zijlstra, Andrea Parri,
	Luc Maranget, Jade Alglave, Boqun Feng, Nicholas Piggin,
	David Howells, Palmer Dabbelt, Kernel development list

On Thu, Nov 30, 2017 at 08:54:35AM -0800, Paul E. McKenney wrote:
> On Thu, Nov 30, 2017 at 04:41:05PM +0000, Will Deacon wrote:
> > On Thu, Nov 30, 2017 at 08:14:01AM -0800, Paul E. McKenney wrote:
> > > On Thu, Nov 30, 2017 at 10:20:02AM -0500, Alan Stern wrote:
> > > > On Wed, 29 Nov 2017, Daniel Lustig wrote:
> > > > 
> > > > > On 11/29/2017 12:42 PM, Paul E. McKenney wrote:
> > > > > > On Wed, Nov 29, 2017 at 02:53:06PM -0500, Alan Stern wrote:
> > > > > >> On Wed, 29 Nov 2017, Peter Zijlstra wrote:
> > > > > >>
> > > > > >>> On Wed, Nov 29, 2017 at 11:04:53AM -0800, Daniel Lustig wrote:
> > > > > >>>
> > > > > >>>> While we're here, let me ask about another test which isn't directly
> > > > > >>>> about unlock/lock but which is still somewhat related to this
> > > > > >>>> discussion:
> > > > > >>>>
> > > > > >>>> "MP+wmb+xchg-acq" (or some such)
> > > > > >>>>
> > > > > >>>> {}
> > > > > >>>>
> > > > > >>>> P0(int *x, int *y)
> > > > > >>>> {
> > > > > >>>>         WRITE_ONCE(*x, 1);
> > > > > >>>>         smp_wmb();
> > > > > >>>>         WRITE_ONCE(*y, 1);
> > > > > >>>> }
> > > > > >>>>
> > > > > >>>> P1(int *x, int *y)
> > > > > >>>> {
> > > > > >>>>         r1 = atomic_xchg_relaxed(y, 2);
> > > > > >>>>         r2 = smp_load_acquire(y);
> > > > > >>>>         r3 = READ_ONCE(*x);
> > > > > >>>> }
> > > > > >>>>
> > > > > >>>> exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)
> > > > > >>>>
> > > > > >>>> C/C++ would call the atomic_xchg_relaxed part of a release sequence
> > > > > >>>> and hence would forbid this outcome.
> > > > > >>>
> > > > > >>> That's just weird. Either its _relaxed, or its _release. Making _relaxed
> > > > > >>> mean _release is just daft.
> > > > > >>
> > > > > >> The C11 memory model specifically allows atomic operations to be 
> > > > > >> interspersed within a release sequence.  But it doesn't say why.
> > > > > > 
> > > > > > The use case put forward within the committee is for atomic quantities
> > > > > > with mode bits.  The most frequent has the atomic quantity having
> > > > > > lock-like properties, in which case you don't want to lose the ordering
> > > > > > effects of the lock handoff just because a mode bit got set or cleared.
> > > > > > Some claim to actually use something like this, but details have not
> > > > > > been forthcoming.
> > > > > > 
> > > > > > I confess to being a bit skeptical.  If the mode changes are infrequent,
> > > > > > the update could just as well be ordered.
> > > > > 
> > > > > Aren't reference counting implementations which use memory_order_relaxed
> > > > > for incrementing the count another important use case?  Specifically,
> > > > > the synchronization between a memory_order_release decrement and the
> > > > > eventual memory_order_acquire/consume free shouldn't be interrupted by
> > > > > other (relaxed) increments and (release-only) decrements that happen in
> > > > > between.  At least that's my understanding of this use case.  I wasn't
> > > > > there when the C/C++ committee decided this.
> > > > > 
> > > > > > That said, Daniel, the C++ memory model really does require that the
> > > > > > above litmus test be forbidden, my denigration of it notwithstanding.
> > > > > 
> > > > > Yes I agree, that's why I'm curious what the Linux memory model has
> > > > > in mind here :)
> > > > 
> > > > Bear in mind that the litmus test above uses xchg, not increment or 
> > > > decrement.  This makes a difference as far as the LKMM is concerned, 
> > > > even if not for C/C++.
> > > 
> > > Finally remembering this discussion...  Yes, xchg is special.  ;-)
> > > 
> > > Will, are there plans to bring this sort of thing before the standards
> > > committee?
> > 
> > We discussed it, but rejected it mainly because of concerns that there could
> > be RmW operations that don't necessarily have an order-inducing dependency
> > in all scenarios. I think the case that was batted around was a saturating
> > add implemented using cmpxchg.
> 
> Ah, I do remember now, during the Toronto meeting, correct?
> 
> So should we consider making LKMM make xchg act in a manner similar to
> the other atomics, or would you prefer that we keep the current special
> behavior?

It's certainly simpler to treat all of the atomics the same, and I've
bent arm64 into shape for C/C++ so xchg (SWP) does the right thing.

As discussed, it's only ordered when feeding into an acquire, not a
READ_ONCE()/rcu_dereference().

Will


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-30 16:54                   ` Paul E. McKenney
  2017-11-30 17:04                     ` Will Deacon
@ 2017-11-30 17:56                     ` Alan Stern
  1 sibling, 0 replies; 30+ messages in thread
From: Alan Stern @ 2017-11-30 17:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Will Deacon, Daniel Lustig, Peter Zijlstra, Andrea Parri,
	Luc Maranget, Jade Alglave, Boqun Feng, Nicholas Piggin,
	David Howells, Palmer Dabbelt, Kernel development list

On Thu, 30 Nov 2017, Paul E. McKenney wrote:

> > > Will, are there plans to bring this sort of thing before the standards
> > > committee?
> > 
> > We discussed it, but rejected it mainly because of concerns that there could
> > be RmW operations that don't necessarily have an order-inducing dependency
> > in all scenarios. I think the case that was batted around was a saturating
> > add implemented using cmpxchg.
> 
> Ah, I do remember now, during the Toronto meeting, correct?
> 
> So should we consider making LKMM make xchg act in a manner similar to
> the other atomics, or would you prefer that we keep the current special
> behavior?

In fact, right now herd doesn't implement the dependencies.  So
"current special behavior" is ambiguous -- there's what herd currently
does, and there's what one would expect it to do.

(Incidentally, this should all be considered part of herd, rather than
part of the LKMM.  Although herd can simulate the memory model, they 
aren't the same thing.)

Also, as Will says, some atomic operations have more than one possible
implementation, with differing internal dependencies.

Alan


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-11-30 15:46         ` Alan Stern
@ 2017-12-01  2:46           ` Boqun Feng
  2017-12-01 15:32             ` Alan Stern
  0 siblings, 1 reply; 30+ messages in thread
From: Boqun Feng @ 2017-12-01  2:46 UTC (permalink / raw)
  To: Alan Stern
  Cc: Daniel Lustig, Paul E. McKenney, Andrea Parri, Luc Maranget,
	Jade Alglave, Nicholas Piggin, Peter Zijlstra, Will Deacon,
	David Howells, Palmer Dabbelt, Kernel development list


On Thu, Nov 30, 2017 at 10:46:22AM -0500, Alan Stern wrote:
> On Thu, 30 Nov 2017, Boqun Feng wrote:
> 
> > On Wed, Nov 29, 2017 at 02:44:37PM -0500, Alan Stern wrote:
> > > On Wed, 29 Nov 2017, Daniel Lustig wrote:
> > > 
> > > > While we're here, let me ask about another test which isn't directly
> > > > about unlock/lock but which is still somewhat related to this
> > > > discussion:
> > > > 
> > > > "MP+wmb+xchg-acq" (or some such)
> > > > 
> > > > {}
> > > > 
> > > > P0(int *x, int *y)
> > > > {
> > > >         WRITE_ONCE(*x, 1);
> > > >         smp_wmb();
> > > >         WRITE_ONCE(*y, 1);
> > > > }
> > > > 
> > > > P1(int *x, int *y)
> > > > {
> > > >         r1 = atomic_xchg_relaxed(y, 2);
> > > >         r2 = smp_load_acquire(y);
> > > >         r3 = READ_ONCE(*x);
> > > > }
> > > > 
> > > > exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0)
> > > > 
> > > > C/C++ would call the atomic_xchg_relaxed part of a release sequence
> > > > and hence would forbid this outcome.
> > > > 
> > > > x86 and Power would forbid this.  ARM forbids this via a special-case
> > > > rule in the memory model, ordering atomics with later load-acquires.
> > > > 
> > > > RISC-V, however, wouldn't forbid this by default using RCpc or RCsc
> > > > atomics for smp_load_acquire().  It's an "fri; rfi" type of pattern,
> > > > because xchg doesn't have an inherent internal data dependency.
> > > > 
> > > > If the Linux memory model is going to forbid this outcome, then
> > > > RISC-V would either need to use fences instead, or maybe we'd need to
> > > > add a special rule to our memory model similarly.  This is one detail
> > > > where RISC-V is still actively deciding what to do.
> > > > 
> > > > Have you all thought about this test before?  Any idea which way you
> > > > are leaning regarding the outcome above?
> > > 
> > > Good questions.  Currently the LKMM allows this, and I think it should
> > > because xchg doesn't have a dependency from its read to its write.
> > > 
> > > On the other hand, herd isn't careful enough in the way it implements 
> > > internal dependencies for RMW operations.  If we change 
> > > atomic_xchg_relaxed(y, 2) to atomic_inc(y) and remove r1 from the test:
> > > 
> > > C MP+wmb+inc-acq
> > > 
> > > {}
> > > 
> > > P0(int *x, int *y)
> > > {
> > >         WRITE_ONCE(*x, 1);
> > >         smp_wmb();
> > >         WRITE_ONCE(*y, 1);
> > > }
> > > 
> > > P1(int *x, int *y)
> > > {
> > >         atomic_inc(y);
> > >         r2 = smp_load_acquire(y);
> > >         r3 = READ_ONCE(*x);
> > > }
> > > 
> > > exists (1:r2=2 /\ 1:r3=0)
> > > 
> > > then the test _should_ be forbidden, but it isn't -- herd doesn't
> > > realize that all atomic RMW operations other than xchg must have a
> > > dependency (either data or control) between their internal read and
> > > write.
> > > 
> > > (Although the smp_load_acquire is allowed to execute before the write 
> > > part of the atomic_inc, it cannot execute before the read part.  I 
> > > think a similar argument applies even on ARM.)
> > > 
> > 
> > But in case of AMOs, which directly send the addition request to memory
> > controller, so there wouldn't be any read part or even write part of the
> > atomic_inc() executed by CPU. Would this be allowed then?
> 
> Firstly, sending the addition request to the memory controller _is_ a
> write operation.
> 
> Secondly, even though the CPU hardware might not execute a read 
> operation during an AMO, the LKMM and herd nevertheless represent the 
> atomic update as a specially-annotated read event followed by a write 
> event.
> 

Ah, right! From the point of view of the model, there are read events
and write events for the atomics.

> In an other-multicopy-atomic system, P0's write to y must become
> visible to P1 before P1 executes the smp_load_acquire, because the
> write was visible to the memory controller when the controller carried
> out the AMO, and the write becomes visible to the memory controller and
> to P1 at the same time (by other-multicopy-atomicity).  That's why I
> said the test would be forbidden on ARM.
> 

Agreed.

> But even on a non-other-multicopy-atomic system, there has to be some 
> synchronization between the memory controller and P1's CPU.  Otherwise, 
> how could the system guarantee that P1's smp_load_acquire would see the 
> post-increment value of y?  It seems reasonable to assume that this 
> synchronization would also cause P1 to see x=1.
> 

I agree with you on the "reasonable" part ;-) So basically, the memory
controller can only perform the write of the AMO after P0's second write
has propagated to the memory controller (and because of the wmb(), P0's
first write must already have propagated to the memory controller, too),
so it makes sense that when the write of the AMO propagates from the
memory controller to P1, P0's first write has also propagated to P1.
IOW, the write of the AMO on the memory controller acts at least like a
release.

However, part of me is still a little paranoid, because to my
understanding, the point of an AMO is to execute atomic operations as
fast as possible.  So maybe an AMO has some fast path for the memory
controller to forward the write back to the CPU that issued the AMO, in
which case the assumption above would become unreasonable ;-)

With that in mind, I think it would be better if herd provided the type
annotations for the read and write parts of the atomics, and we handled
them inside the LKMM's cat and bell files, rather than letting herd
provide the internal dependency by default.

Regards,
Boqun

> Alan Stern
> 



* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-12-01  2:46           ` Boqun Feng
@ 2017-12-01 15:32             ` Alan Stern
  2017-12-01 16:17               ` Daniel Lustig
  0 siblings, 1 reply; 30+ messages in thread
From: Alan Stern @ 2017-12-01 15:32 UTC (permalink / raw)
  To: Boqun Feng
  Cc: Daniel Lustig, Paul E. McKenney, Andrea Parri, Luc Maranget,
	Jade Alglave, Nicholas Piggin, Peter Zijlstra, Will Deacon,
	David Howells, Palmer Dabbelt, Kernel development list

On Fri, 1 Dec 2017, Boqun Feng wrote:

> > > But in case of AMOs, which directly send the addition request to memory
> > > controller, so there wouldn't be any read part or even write part of the
> > > atomic_inc() executed by CPU. Would this be allowed then?
> > 
> > Firstly, sending the addition request to the memory controller _is_ a
> > write operation.
> > 
> > Secondly, even though the CPU hardware might not execute a read 
> > operation during an AMO, the LKMM and herd nevertheless represent the 
> > atomic update as a specially-annotated read event followed by a write 
> > event.
> > 
> 
> Ah, right! From the point of view of the model, there are read events
> and write events for the atomics.
> 
> > In an other-multicopy-atomic system, P0's write to y must become
> > visible to P1 before P1 executes the smp_load_acquire, because the
> > write was visible to the memory controller when the controller carried
> > out the AMO, and the write becomes visible to the memory controller and
> > to P1 at the same time (by other-multicopy-atomicity).  That's why I
> > said the test would be forbidden on ARM.
> > 
> 
> Agreed.
> 
> > But even on a non-other-multicopy-atomic system, there has to be some 
> > synchronization between the memory controller and P1's CPU.  Otherwise, 
> > how could the system guarantee that P1's smp_load_acquire would see the 
> > post-increment value of y?  It seems reasonable to assume that this 
> > synchronization would also cause P1 to see x=1.
> > 
> 
> I agree with you the "reasonable" part ;-) So basically, memory
> controller could only do the write of AMO until P0's second write
> propagated to the memory controller(and because of the wmb(), P0's first
> write must be already propagated to the memory controller, too), so it
> makes sense when the write of AMO propagated from memory controller to
> P1, P0's first write is also propagted to P1. IOW, the write of AMO on
> memory controller acts at least like a release.
> 
> However, some part of myself is still a little paranoid, because to my
> understanding, the point of AMO is to get atomic operations executing
> as fast as possible, so maybe, AMO has some fast path for the memory
> controller to forward a write to the CPU that issues the AMO, in that
> way, it will become unreasonable ;-)

It's true that a hardware design in the future might behave differently 
from current hardware.  If that ever happens, we will need to rethink 
the situation.  Maybe the designers will change their hardware to make 
it match the memory model.  Or maybe the memory model will change.

And it's certainly possible to write a litmus test which emulates this 
situation:

C MP+wmb+emulated-amo-acq

{}

P0(int *x, int *y)
{
	WRITE_ONCE(*x, 1);
	smp_wmb();
	WRITE_ONCE(*y, 1);
}

P1(int *x, int *y, int *u, int *v)
{
	WRITE_ONCE(*u, 1);
	r1 = READ_ONCE(*v);
	smp_rmb();
	r2 = smp_load_acquire(y);
	r3 = READ_ONCE(*x);
}

P2(int *y, int *u, int *v)
{
	r4 = READ_ONCE(*u);
	if (r4 != 0) {
		atomic_inc(y);
		smp_wmb();
		WRITE_ONCE(*v, 1);
	}
}

exists (1:r1=1 /\ 1:r2=2 /\ 1:r3=0 /\ 2:r4=1)

Here P1 tells P2 to perform the atomic increment by setting u to 1, and
P2 tells P1 that the increment is finished by setting v to 1.  This
test is allowed by the LKMM, because the wmb in P2 is not A-cumulative.
On the other hand, store-release is A-cumulative -- the test would be
forbidden if P2 did "smp_store_release(v, 1)" rather than "smp_wmb() ;
WRITE_ONCE(*v, 1)".

> With that in mind, I think it's better if herd could provide the type
> annotations of atomics for the read and write parts, and we handle it
> inside the LKMM's cats and bells, rather than letting herd provide the
> internal dependency by default.

herd already does provide this information via the rmw relation.

Alan


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-12-01 15:32             ` Alan Stern
@ 2017-12-01 16:17               ` Daniel Lustig
  2017-12-01 16:24                 ` Will Deacon
  2017-12-01 17:18                 ` Alan Stern
  0 siblings, 2 replies; 30+ messages in thread
From: Daniel Lustig @ 2017-12-01 16:17 UTC (permalink / raw)
  To: Alan Stern, Boqun Feng
  Cc: Paul E. McKenney, Andrea Parri, Luc Maranget, Jade Alglave,
	Nicholas Piggin, Peter Zijlstra, Will Deacon, David Howells,
	Palmer Dabbelt, Kernel development list

On 12/1/2017 7:32 AM, Alan Stern wrote:
> On Fri, 1 Dec 2017, Boqun Feng wrote:
>>> But even on a non-other-multicopy-atomic system, there has to be some 
>>> synchronization between the memory controller and P1's CPU.  Otherwise, 
>>> how could the system guarantee that P1's smp_load_acquire would see the 
>>> post-increment value of y?  It seems reasonable to assume that this 
>>> synchronization would also cause P1 to see x=1.
>>>
>>
>> I agree with you the "reasonable" part ;-) So basically, memory
>> controller could only do the write of AMO until P0's second write
>> propagated to the memory controller(and because of the wmb(), P0's first
>> write must be already propagated to the memory controller, too), so it
>> makes sense when the write of AMO propagated from memory controller to
>> P1, P0's first write is also propagted to P1. IOW, the write of AMO on
>> memory controller acts at least like a release.
>>
>> However, some part of myself is still a little paranoid, because to my
>> understanding, the point of AMO is to get atomic operations executing
>> as fast as possible, so maybe, AMO has some fast path for the memory
>> controller to forward a write to the CPU that issues the AMO, in that
>> way, it will become unreasonable ;-)
> 
> It's true that a hardware design in the future might behave differently 
> from current hardware.  If that ever happens, we will need to rethink 
> the situation.  Maybe the designers will change their hardware to make 
> it match the memory model.  Or maybe the memory model will change.

Do you mean all of the above in the context of increment etc, as opposed
to swap?  ARM hardware in the wild is already documented as forwarding
SWP values to subsequent loads early, even past control dependencies.
Paul sent this link earlier in the thread.

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0735r0.html

The reason swap is special is because its store value is available to be
forwarded even before the AMO goes out to the memory controller or
wherever else it gets its load value from.

Also, the case I described is an acquire rather than a control
dependency, but it's similar enough that it doesn't seem completely
unrealistic to think hardware might try to do this.

Dan


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-12-01 16:17               ` Daniel Lustig
@ 2017-12-01 16:24                 ` Will Deacon
  2017-12-01 17:18                 ` Alan Stern
  1 sibling, 0 replies; 30+ messages in thread
From: Will Deacon @ 2017-12-01 16:24 UTC (permalink / raw)
  To: Daniel Lustig
  Cc: Alan Stern, Boqun Feng, Paul E. McKenney, Andrea Parri,
	Luc Maranget, Jade Alglave, Nicholas Piggin, Peter Zijlstra,
	David Howells, Palmer Dabbelt, Kernel development list

On Fri, Dec 01, 2017 at 08:17:04AM -0800, Daniel Lustig wrote:
> On 12/1/2017 7:32 AM, Alan Stern wrote:
> > On Fri, 1 Dec 2017, Boqun Feng wrote:
> >>> But even on a non-other-multicopy-atomic system, there has to be some 
> >>> synchronization between the memory controller and P1's CPU.  Otherwise, 
> >>> how could the system guarantee that P1's smp_load_acquire would see the 
> >>> post-increment value of y?  It seems reasonable to assume that this 
> >>> synchronization would also cause P1 to see x=1.
> >>>
> >>
> >> I agree with you the "reasonable" part ;-) So basically, memory
> >> controller could only do the write of AMO until P0's second write
> >> propagated to the memory controller(and because of the wmb(), P0's first
> >> write must be already propagated to the memory controller, too), so it
> >> makes sense when the write of AMO propagated from memory controller to
> >> P1, P0's first write is also propagted to P1. IOW, the write of AMO on
> >> memory controller acts at least like a release.
> >>
> >> However, some part of myself is still a little paranoid, because to my
> >> understanding, the point of AMO is to get atomic operations executing
> >> as fast as possible, so maybe, AMO has some fast path for the memory
> >> controller to forward a write to the CPU that issues the AMO, in that
> >> way, it will become unreasonable ;-)
> > 
> > It's true that a hardware design in the future might behave differently 
> > from current hardware.  If that ever happens, we will need to rethink 
> > the situation.  Maybe the designers will change their hardware to make 
> > it match the memory model.  Or maybe the memory model will change.
> 
> Do you mean all of the above in the context of increment etc, as opposed
> to swap?  ARM hardware in the wild is already documented as forwarding
> SWP values to subsequent loads early, even past control dependencies.
> Paul sent this link earlier in the thread.
> 
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0735r0.html
> 
> The reason swap is special is because its store value is available to be
> forwarded even before the AMO goes out to the memory controller or
> wherever else it gets its load value from.
> 
> Also, the case I described is an acquire rather than a control
> dependency, but it's similar enough that it doesn't seem completely
> unrealistic to think hardware might try to do this.

To be clear: we don't forward from a SWP to a load with Acquire semantics,
so the distinction is an important one.

Will


* Re: Unlock-lock questions and the Linux Kernel Memory Model
  2017-12-01 16:17               ` Daniel Lustig
  2017-12-01 16:24                 ` Will Deacon
@ 2017-12-01 17:18                 ` Alan Stern
  1 sibling, 0 replies; 30+ messages in thread
From: Alan Stern @ 2017-12-01 17:18 UTC (permalink / raw)
  To: Daniel Lustig
  Cc: Boqun Feng, Paul E. McKenney, Andrea Parri, Luc Maranget,
	Jade Alglave, Nicholas Piggin, Peter Zijlstra, Will Deacon,
	David Howells, Palmer Dabbelt, Kernel development list

On Fri, 1 Dec 2017, Daniel Lustig wrote:

> On 12/1/2017 7:32 AM, Alan Stern wrote:
> > On Fri, 1 Dec 2017, Boqun Feng wrote:
> >>> But even on a non-other-multicopy-atomic system, there has to be some 
> >>> synchronization between the memory controller and P1's CPU.  Otherwise, 
> >>> how could the system guarantee that P1's smp_load_acquire would see the 
> >>> post-increment value of y?  It seems reasonable to assume that this 
> >>> synchronization would also cause P1 to see x=1.
> >>>
> >>
> >> I agree with you the "reasonable" part ;-) So basically, memory
> >> controller could only do the write of AMO until P0's second write
> >> propagated to the memory controller(and because of the wmb(), P0's first
> >> write must be already propagated to the memory controller, too), so it
> >> makes sense when the write of AMO propagated from memory controller to
> >> P1, P0's first write is also propagted to P1. IOW, the write of AMO on
> >> memory controller acts at least like a release.
> >>
> >> However, some part of myself is still a little paranoid, because to my
> >> understanding, the point of AMO is to get atomic operations executing
> >> as fast as possible, so maybe, AMO has some fast path for the memory
> >> controller to forward a write to the CPU that issues the AMO, in that
> >> way, it will become unreasonable ;-)
> > 
> > It's true that a hardware design in the future might behave differently 
> > from current hardware.  If that ever happens, we will need to rethink 
> > the situation.  Maybe the designers will change their hardware to make 
> > it match the memory model.  Or maybe the memory model will change.
> 
> Do you mean all of the above in the context of increment etc, as opposed
> to swap?  ARM hardware in the wild is already documented as forwarding
> SWP values to subsequent loads early, even past control dependencies.
> Paul sent this link earlier in the thread.
> 
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0735r0.html
> 
> The reason swap is special is because its store value is available to be
> forwarded even before the AMO goes out to the memory controller or
> wherever else it gets its load value from.

I believe the current intention for herd is as follows:

	xchg() and similar RMW operations do not generate an internal
	dependency;

	cmpxchg() and similar RMW operations generate an internal 
	control dependency;

	atomic_add() and similar RMW operations generate an internal 
	data dependency.

If herd adds support for saturating operations, they will generate at 
least a data dependency and maybe also a control dependency.

Alan

> Also, the case I described is an acquire rather than a control
> dependency, but it's similar enough that it doesn't seem completely
> unrealistic to think hardware might try to do this.
> 
> Dan


end of thread, other threads:[~2017-12-01 17:18 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <4118cdbe-c396-08b9-a3e3-a0a6491b82fa@nvidia.com>
2017-11-27 21:16 ` Unlock-lock questions and the Linux Kernel Memory Model Alan Stern
2017-11-27 23:28   ` Daniel Lustig
2017-11-28  9:44     ` Peter Zijlstra
2017-11-28  9:58   ` Peter Zijlstra
2017-11-29 19:04   ` Daniel Lustig
2017-11-29 19:33     ` Paul E. McKenney
2017-11-29 19:44     ` Alan Stern
2017-11-30  8:55       ` Boqun Feng
2017-11-30  9:15         ` Peter Zijlstra
2017-11-30 15:46         ` Alan Stern
2017-12-01  2:46           ` Boqun Feng
2017-12-01 15:32             ` Alan Stern
2017-12-01 16:17               ` Daniel Lustig
2017-12-01 16:24                 ` Will Deacon
2017-12-01 17:18                 ` Alan Stern
2017-11-29 19:46     ` Peter Zijlstra
2017-11-29 19:53       ` Alan Stern
2017-11-29 20:42         ` Paul E. McKenney
2017-11-29 22:18           ` Daniel Lustig
2017-11-29 22:59             ` Paul E. McKenney
2017-11-30 15:20             ` Alan Stern
2017-11-30 16:14               ` Paul E. McKenney
2017-11-30 16:25                 ` Peter Zijlstra
2017-11-30 16:39                   ` Paul E. McKenney
2017-11-30 16:41                 ` Will Deacon
2017-11-30 16:54                   ` Paul E. McKenney
2017-11-30 17:04                     ` Will Deacon
2017-11-30 17:56                     ` Alan Stern
2017-11-30 10:02       ` Will Deacon
2017-11-29 19:58     ` Peter Zijlstra
