linux-kernel.vger.kernel.org archive mirror
* perf events ring buffer memory barrier on powerpc
@ 2013-10-22 23:54 Michael Neuling
  2013-10-23  7:39 ` Victor Kaplansky
  2013-10-23 14:19 ` Frederic Weisbecker
  0 siblings, 2 replies; 120+ messages in thread
From: Michael Neuling @ 2013-10-22 23:54 UTC (permalink / raw)
  To: Frederic Weisbecker, benh, anton, linux-kernel, Linux PPC dev,
	Victor Kaplansky, Mathieu Desnoyers, michael

Frederic,

In the perf ring buffer code we have this in perf_output_get_handle():

	if (!local_dec_and_test(&rb->nest))
		goto out;

	/*
	 * Publish the known good head. Rely on the full barrier implied
	 * by atomic_dec_and_test() order the rb->head read and this
	 * write.
	 */
	rb->user_page->data_head = head;

The comment says atomic_dec_and_test() but the code is
local_dec_and_test().

On powerpc, local_dec_and_test() doesn't have a memory barrier but
atomic_dec_and_test() does.  Is the comment wrong, or is
local_dec_and_test() supposed to imply a memory barrier too and we have
it wrongly implemented on powerpc?

My guess is that local_dec_and_test() is correct but we need to add an
explicit memory barrier like below:

(Kudos to Victor Kaplansky for finding this)

Mikey

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index cd55144..95768c6 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -87,10 +87,10 @@ again:
 		goto out;
 
 	/*
-	 * Publish the known good head. Rely on the full barrier implied
-	 * by atomic_dec_and_test() order the rb->head read and this
-	 * write.
+	 * Publish the known good head. We need a memory barrier to order the
+	 * order the rb->head read and this write.
 	 */
+	smp_mb ();
 	rb->user_page->data_head = head;
 
 	/*


* Re: perf events ring buffer memory barrier on powerpc
  2013-10-22 23:54 perf events ring buffer memory barrier on powerpc Michael Neuling
@ 2013-10-23  7:39 ` Victor Kaplansky
  2013-10-23 14:19 ` Frederic Weisbecker
  1 sibling, 0 replies; 120+ messages in thread
From: Victor Kaplansky @ 2013-10-23  7:39 UTC (permalink / raw)
  To: Michael Neuling
  Cc: anton, benh, Frederic Weisbecker, linux-kernel, Linux PPC dev,
	Mathieu Desnoyers, michael

See below.

Michael Neuling <mikey@neuling.org> wrote on 10/23/2013 02:54:54 AM:

>
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index cd55144..95768c6 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -87,10 +87,10 @@ again:
>        goto out;
>
>     /*
> -    * Publish the known good head. Rely on the full barrier implied
> -    * by atomic_dec_and_test() order the rb->head read and this
> -    * write.
> +    * Publish the known good head. We need a memory barrier to order the
> +    * order the rb->head read and this write.
>      */
> +   smp_mb ();
>     rb->user_page->data_head = head;
>
>     /*

1. As far as I understand, smp_mb() is superfluous in this case; smp_wmb()
   should be enough.  (Same goes for the space between the function name and
   the open parenthesis :-) )

2. Again, as far as I understand from ./Documentation/atomic_ops.txt, it is a
   mistake for architecture-independent code to rely on memory barriers implied
   by atomic operations, all the more so by "local" operations.

3. The solution above is sub-optimal on architectures where a memory barrier
   is already part of the "local" operation, since we would execute two
   consecutive barriers.  So maybe it would be better to use
   smp_mb__after_atomic_dec().

4. I'm not sure, but I think there is another, unrelated potential problem in
   perf_output_put_handle() - the write to "data_head":

kernel/events/ring_buffer.c:

 77         /*
 78          * Publish the known good head. Rely on the full barrier implied
 79          * by atomic_dec_and_test() order the rb->head read and this
 80          * write.
 81          */
 82         rb->user_page->data_head = head;

As data_head is 64-bit wide, the update should be done with atomic64_set().

Regards,
-- Victor



* Re: perf events ring buffer memory barrier on powerpc
  2013-10-22 23:54 perf events ring buffer memory barrier on powerpc Michael Neuling
  2013-10-23  7:39 ` Victor Kaplansky
@ 2013-10-23 14:19 ` Frederic Weisbecker
  2013-10-23 14:25   ` Frederic Weisbecker
  2013-10-25 17:37   ` Peter Zijlstra
  1 sibling, 2 replies; 120+ messages in thread
From: Frederic Weisbecker @ 2013-10-23 14:19 UTC (permalink / raw)
  To: Michael Neuling, Peter Zijlstra
  Cc: benh, anton, linux-kernel, Linux PPC dev, Victor Kaplansky,
	Mathieu Desnoyers, michael

On Wed, Oct 23, 2013 at 10:54:54AM +1100, Michael Neuling wrote:
> Frederic,
> 
> In the perf ring buffer code we have this in perf_output_get_handle():
> 
> 	if (!local_dec_and_test(&rb->nest))
> 		goto out;
> 
> 	/*
> 	 * Publish the known good head. Rely on the full barrier implied
> 	 * by atomic_dec_and_test() order the rb->head read and this
> 	 * write.
> 	 */
> 	rb->user_page->data_head = head;
> 
> The comment says atomic_dec_and_test() but the code is
> local_dec_and_test().
> 
> On powerpc, local_dec_and_test() doesn't have a memory barrier but
> atomic_dec_and_test() does.  Is the comment wrong, or is
> local_dec_and_test() suppose to imply a memory barrier too and we have
> it wrongly implemented in powerpc?
> 
> My guess is that local_dec_and_test() is correct but we to add an
> explicit memory barrier like below:
> 
> (Kudos to Victor Kaplansky for finding this)
> 
> Mikey
> 
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index cd55144..95768c6 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -87,10 +87,10 @@ again:
>  		goto out;
>  
>  	/*
> -	 * Publish the known good head. Rely on the full barrier implied
> -	 * by atomic_dec_and_test() order the rb->head read and this
> -	 * write.
> +	 * Publish the known good head. We need a memory barrier to order the
> +	 * order the rb->head read and this write.
>  	 */
> +	smp_mb ();
>  	rb->user_page->data_head = head;
>  
>  	/*


I'm adding Peter in Cc since he wrote that code.
I agree that local_dec_and_test() doesn't need to imply an smp barrier.
All it has to provide is atomicity against local concurrent operations
(interrupts, preemption, ...).
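
To illustrate, a minimal sketch of what the local_t ops do and do not promise
(the names here are made up for the example, nothing to do with the perf code):

	#include <linux/types.h>
	#include <linux/percpu.h>
	#include <asm/local.h>

	static DEFINE_PER_CPU(local_t, nest);

	/* caller is assumed to run with preemption disabled */
	static void nest_enter(void)
	{
		/* atomic against IRQ/NMI nesting on this CPU only */
		local_inc(this_cpu_ptr(&nest));
	}

	static bool nest_exit(void)
	{
		/*
		 * True on the outermost exit.  No SMP ordering is implied:
		 * another CPU may observe the surrounding stores in any
		 * order unless we add an explicit smp_*mb().
		 */
		return local_dec_and_test(this_cpu_ptr(&nest));
	}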

Now I'm a bit confused about this barrier.

I think we want this ordering:

    Kernel                             User

   READ rb->user_page->data_tail       READ rb->user_page->data_head
   smp_mb()                            smp_mb()
   WRITE rb data                       READ rb  data
   smp_mb()                            smp_mb()
   rb->user_page->data_head            WRITE rb->user_page->data_tail

So yeah, we want a barrier between publishing the data and the user-visible
data_head.  But this ordering concerns a wider layout than just rb->head and
rb->user_page->data_head.

And BTW I can see an smp_rmb() after we read rb->user_page->data_tail. This is
probably the first kernel barrier in my example above (not sure if rmb() alone
is enough though).
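
For reference, that existing code in perf_output_begin() currently reads:

	tail = ACCESS_ONCE(rb->user_page->data_tail);
	smp_rmb();
	offset = head = local_read(&rb->head);
	head += size;
	if (unlikely(!perf_output_space(rb, tail, offset, head)))
		goto fail;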


* Re: perf events ring buffer memory barrier on powerpc
  2013-10-23 14:19 ` Frederic Weisbecker
@ 2013-10-23 14:25   ` Frederic Weisbecker
  2013-10-25 17:37   ` Peter Zijlstra
  1 sibling, 0 replies; 120+ messages in thread
From: Frederic Weisbecker @ 2013-10-23 14:25 UTC (permalink / raw)
  To: Michael Neuling, Peter Zijlstra
  Cc: Benjamin Herrenschmidt, Anton Blanchard, LKML, Linux PPC dev,
	Victor Kaplansky, Mathieu Desnoyers, michael

2013/10/23 Frederic Weisbecker <fweisbec@gmail.com>:
> On Wed, Oct 23, 2013 at 10:54:54AM +1100, Michael Neuling wrote:
>> Frederic,
>>
>> In the perf ring buffer code we have this in perf_output_get_handle():
>>
>>       if (!local_dec_and_test(&rb->nest))
>>               goto out;
>>
>>       /*
>>        * Publish the known good head. Rely on the full barrier implied
>>        * by atomic_dec_and_test() order the rb->head read and this
>>        * write.
>>        */
>>       rb->user_page->data_head = head;
>>
>> The comment says atomic_dec_and_test() but the code is
>> local_dec_and_test().
>>
>> On powerpc, local_dec_and_test() doesn't have a memory barrier but
>> atomic_dec_and_test() does.  Is the comment wrong, or is
>> local_dec_and_test() suppose to imply a memory barrier too and we have
>> it wrongly implemented in powerpc?
>>
>> My guess is that local_dec_and_test() is correct but we to add an
>> explicit memory barrier like below:
>>
>> (Kudos to Victor Kaplansky for finding this)
>>
>> Mikey
>>
>> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
>> index cd55144..95768c6 100644
>> --- a/kernel/events/ring_buffer.c
>> +++ b/kernel/events/ring_buffer.c
>> @@ -87,10 +87,10 @@ again:
>>               goto out;
>>
>>       /*
>> -      * Publish the known good head. Rely on the full barrier implied
>> -      * by atomic_dec_and_test() order the rb->head read and this
>> -      * write.
>> +      * Publish the known good head. We need a memory barrier to order the
>> +      * order the rb->head read and this write.
>>        */
>> +     smp_mb ();
>>       rb->user_page->data_head = head;
>>
>>       /*
>
>
> I'm adding Peter in Cc since he wrote that code.
> I agree that local_dec_and_test() doesn't need to imply an smp barrier.
> All it has to provide as a guarantee is the atomicity against local concurrent
> operations (interrupts, preemption, ...).
>
> Now I'm a bit confused about this barrier.
>
> I think we want this ordering:
>
>     Kernel                             User
>
>    READ rb->user_page->data_tail       READ rb->user_page->data_head
>    smp_mb()                            smp_mb()
>    WRITE rb data                       READ rb  data
>    smp_mb()                            smp_mb()
>    rb->user_page->data_head            WRITE rb->user_page->data_tail
      ^^ I meant a write above for data_head.


* Re: perf events ring buffer memory barrier on powerpc
  2013-10-23 14:19 ` Frederic Weisbecker
  2013-10-23 14:25   ` Frederic Weisbecker
@ 2013-10-25 17:37   ` Peter Zijlstra
  2013-10-25 20:31     ` Michael Neuling
                       ` (3 more replies)
  1 sibling, 4 replies; 120+ messages in thread
From: Peter Zijlstra @ 2013-10-25 17:37 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Michael Neuling, benh, anton, linux-kernel, Linux PPC dev,
	Victor Kaplansky, Mathieu Desnoyers, michael

On Wed, Oct 23, 2013 at 03:19:51PM +0100, Frederic Weisbecker wrote:
> On Wed, Oct 23, 2013 at 10:54:54AM +1100, Michael Neuling wrote:
> > Frederic,
> > 
> > The comment says atomic_dec_and_test() but the code is
> > local_dec_and_test().
> > 
> > On powerpc, local_dec_and_test() doesn't have a memory barrier but
> > atomic_dec_and_test() does.  Is the comment wrong, or is
> > local_dec_and_test() suppose to imply a memory barrier too and we have
> > it wrongly implemented in powerpc?

My bad; I converted from atomic to local without actually thinking, it
seems. Your implementation of the local primitives is fine.

> > diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> > index cd55144..95768c6 100644
> > --- a/kernel/events/ring_buffer.c
> > +++ b/kernel/events/ring_buffer.c
> > @@ -87,10 +87,10 @@ again:
> >  		goto out;
> >  
> >  	/*
> > -	 * Publish the known good head. Rely on the full barrier implied
> > -	 * by atomic_dec_and_test() order the rb->head read and this
> > -	 * write.
> > +	 * Publish the known good head. We need a memory barrier to order the
> > +	 * order the rb->head read and this write.
> >  	 */
> > +	smp_mb ();
> >  	rb->user_page->data_head = head;
> >  
> >  	/*

Right; so that would indeed be what the comment suggests it should be.
However I think the comment is now actively wrong too :-)

Since on the kernel side the buffer is strictly per-cpu, we don't need
memory barriers there.

> I think we want this ordering:
> 
>     Kernel                             User
> 
>    READ rb->user_page->data_tail       READ rb->user_page->data_head
>    smp_mb()                            smp_mb()
>    WRITE rb data                       READ rb  data
>    smp_mb()                            smp_mb()
>    rb->user_page->data_head            WRITE rb->user_page->data_tail
> 

I would argue for:

  READ ->data_tail			READ ->data_head
  smp_rmb()	(A)			smp_rmb()	(C)
  WRITE $data				READ $data
  smp_wmb()	(B)			smp_mb()	(D)
  STORE ->data_head			WRITE ->data_tail

Where A pairs with D, and B pairs with C.

I don't think A needs to be a full barrier because we won't in fact
write data until we see the store from userspace. So we simply don't
issue the data WRITE until we observe it.

OTOH, D needs to be a full barrier since it separates the data READ from
the tail WRITE.

For B a WMB is sufficient since it separates two WRITEs, and for C an
RMB is sufficient since it separates two READs.
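
The userspace side isn't in the tree, but a minimal consumer sketch would look
something like the below; rmb()/mb() stand for whatever barrier wrappers the
tool provides, and the helper layout (data/mask) is only for illustration:

	#include <stdint.h>
	#include <linux/perf_event.h>

	/* provided by the tool; expand to suitable compiler/CPU barriers */
	extern void rmb(void);
	extern void mb(void);

	/*
	 * Drain complete records from the mmap()ed buffer; @data is the data
	 * area following the control page, @mask is data_size - 1.  Records
	 * wrapping the end of the buffer are ignored here to keep it short.
	 */
	static void drain(struct perf_event_mmap_page *pg, char *data,
			  uint64_t mask)
	{
		uint64_t tail = pg->data_tail;
		uint64_t head = pg->data_head;	/* READ ->data_head */

		rmb();				/* (C) */

		while (tail != head) {
			struct perf_event_header *hdr =
				(void *)(data + (tail & mask));

			/* READ $data: consume the record at hdr */

			tail += hdr->size;
		}

		mb();				/* (D) */
		pg->data_tail = tail;		/* WRITE ->data_tail */
	}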

---
 kernel/events/ring_buffer.c | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index cd55144270b5..c91274ef4e23 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -87,10 +87,31 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
 		goto out;
 
 	/*
-	 * Publish the known good head. Rely on the full barrier implied
-	 * by atomic_dec_and_test() order the rb->head read and this
-	 * write.
+	 * Since the mmap() consumer (userspace) can run on a different CPU:
+	 *
+	 *   kernel				user
+	 *
+	 *   READ ->data_tail			READ ->data_head
+	 *   smp_rmb()	(A)			smp_rmb()	(C)
+	 *   WRITE $data			READ $data
+	 *   smp_wmb()	(B)			smp_mb()	(D)
+	 *   STORE ->data_head			WRITE ->data_tail
+	 * 
+	 * Where A pairs with D, and B pairs with C.
+	 * 
+	 * I don't think A needs to be a full barrier because we won't in fact
+	 * write data until we see the store from userspace. So we simply don't
+	 * issue the data WRITE until we observe it.
+	 * 
+	 * OTOH, D needs to be a full barrier since it separates the data READ
+	 * from the tail WRITE.
+	 * 
+	 * For B a WMB is sufficient since it separates two WRITEs, and for C
+	 * an RMB is sufficient since it separates two READs.
+	 *
+	 * See perf_output_begin().
 	 */
+	smp_wmb();
 	rb->user_page->data_head = head;
 
 	/*
@@ -154,6 +175,8 @@ int perf_output_begin(struct perf_output_handle *handle,
 		 * Userspace could choose to issue a mb() before updating the
 		 * tail pointer. So that all reads will be completed before the
 		 * write is issued.
+		 *
+		 * See perf_output_put_handle().
 		 */
 		tail = ACCESS_ONCE(rb->user_page->data_tail);
 		smp_rmb();


* Re: perf events ring buffer memory barrier on powerpc
  2013-10-25 17:37   ` Peter Zijlstra
@ 2013-10-25 20:31     ` Michael Neuling
  2013-10-27  9:00     ` Victor Kaplansky
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 120+ messages in thread
From: Michael Neuling @ 2013-10-25 20:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, benh, anton, linux-kernel, Linux PPC dev,
	Victor Kaplansky, Mathieu Desnoyers, michael

> I would argue for:
> 
>   READ ->data_tail			READ ->data_head
>   smp_rmb()	(A)			smp_rmb()	(C)
>   WRITE $data				READ $data
>   smp_wmb()	(B)			smp_mb()	(D)
>   STORE ->data_head			WRITE ->data_tail
> 
> Where A pairs with D, and B pairs with C.
> 
> I don't think A needs to be a full barrier because we won't in fact
> write data until we see the store from userspace. So we simply don't
> issue the data WRITE until we observe it.
> 
> OTOH, D needs to be a full barrier since it separates the data READ from
> the tail WRITE.
> 
> For B a WMB is sufficient since it separates two WRITEs, and for C an
> RMB is sufficient since it separates two READs.

FWIW the testing Victor did confirms WMB is good enough on powerpc.

Thanks,
Mikey

> 
> ---
>  kernel/events/ring_buffer.c | 29 ++++++++++++++++++++++++++---
>  1 file changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index cd55144270b5..c91274ef4e23 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -87,10 +87,31 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
>  		goto out;
>  
>  	/*
> -	 * Publish the known good head. Rely on the full barrier implied
> -	 * by atomic_dec_and_test() order the rb->head read and this
> -	 * write.
> +	 * Since the mmap() consumer (userspace) can run on a different CPU:
> +	 *
> +	 *   kernel				user
> +	 *
> +	 *   READ ->data_tail			READ ->data_head
> +	 *   smp_rmb()	(A)			smp_rmb()	(C)
> +	 *   WRITE $data			READ $data
> +	 *   smp_wmb()	(B)			smp_mb()	(D)
> +	 *   STORE ->data_head			WRITE ->data_tail
> +	 * 
> +	 * Where A pairs with D, and B pairs with C.
> +	 * 
> +	 * I don't think A needs to be a full barrier because we won't in fact
> +	 * write data until we see the store from userspace. So we simply don't
> +	 * issue the data WRITE until we observe it.
> +	 * 
> +	 * OTOH, D needs to be a full barrier since it separates the data READ
> +	 * from the tail WRITE.
> +	 * 
> +	 * For B a WMB is sufficient since it separates two WRITEs, and for C
> +	 * an RMB is sufficient since it separates two READs.
> +	 *
> +	 * See perf_output_begin().
>  	 */
> +	smp_wmb();
>  	rb->user_page->data_head = head;
>  
>  	/*
> @@ -154,6 +175,8 @@ int perf_output_begin(struct perf_output_handle *handle,
>  		 * Userspace could choose to issue a mb() before updating the
>  		 * tail pointer. So that all reads will be completed before the
>  		 * write is issued.
> +		 *
> +		 * See perf_output_put_handle().
>  		 */
>  		tail = ACCESS_ONCE(rb->user_page->data_tail);
>  		smp_rmb();
> 


* Re: perf events ring buffer memory barrier on powerpc
  2013-10-25 17:37   ` Peter Zijlstra
  2013-10-25 20:31     ` Michael Neuling
@ 2013-10-27  9:00     ` Victor Kaplansky
  2013-10-28  9:22       ` Peter Zijlstra
  2013-10-28 10:02     ` Frederic Weisbecker
  2013-10-29 14:06     ` [tip:perf/urgent] perf: Fix perf ring buffer memory ordering tip-bot for Peter Zijlstra
  3 siblings, 1 reply; 120+ messages in thread
From: Victor Kaplansky @ 2013-10-27  9:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: anton, benh, Frederic Weisbecker, linux-kernel, Linux PPC dev,
	Mathieu Desnoyers, michael, Michael Neuling

Peter Zijlstra <peterz@infradead.org> wrote on 10/25/2013 07:37:49 PM:

> I would argue for:
>
>   READ ->data_tail         READ ->data_head
>     smp_rmb()   (A)          smp_rmb()   (C)
>   WRITE $data              READ $data
>     smp_wmb()   (B)          smp_mb()   (D)
>   STORE ->data_head        WRITE ->data_tail
>
> Where A pairs with D, and B pairs with C.

1. I agree. My only concern is that architectures which do implement atomic
operations with memory barriers will now issue two consecutive barriers,
which is sub-optimal.

2. I think the comment in "include/linux/perf_event.h" describing "data_head"
and "data_tail" for user space needs an update as well. Current version -

        /*
         * Control data for the mmap() data buffer.
         *
         * User-space reading the @data_head value should issue an rmb(), on
         * SMP capable platforms, after reading this value -- see
         * perf_event_wakeup().
         *
         * When the mapping is PROT_WRITE the @data_tail value should be
         * written by userspace to reflect the last read data. In this case
         * the kernel will not over-write unread data.
         */
        __u64   data_head;              /* head in the data section */
        __u64   data_tail;              /* user-space written tail */

- says nothing about the need for a memory barrier before the "data_tail"
write.

-- Victor




* Re: perf events ring buffer memory barrier on powerpc
  2013-10-27  9:00     ` Victor Kaplansky
@ 2013-10-28  9:22       ` Peter Zijlstra
  0 siblings, 0 replies; 120+ messages in thread
From: Peter Zijlstra @ 2013-10-28  9:22 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: anton, benh, Frederic Weisbecker, linux-kernel, Linux PPC dev,
	Mathieu Desnoyers, michael, Michael Neuling

On Sun, Oct 27, 2013 at 11:00:33AM +0200, Victor Kaplansky wrote:
> Peter Zijlstra <peterz@infradead.org> wrote on 10/25/2013 07:37:49 PM:
> 
> > I would argue for:
> >
> >   READ ->data_tail         READ ->data_head
> >     smp_rmb()   (A)          smp_rmb()   (C)
> >   WRITE $data              READ $data
> >     smp_wmb()   (B)          smp_mb()   (D)
> >   STORE ->data_head        WRITE ->data_tail
> >
> > Where A pairs with D, and B pairs with C.
> 
> 1. I agree. My only concern is that architectures which do use atomic
> operations
> with memory barriers, will issue two consecutive barriers now, which is
> sub-optimal.

Yeah, although that would be fairly easy for the CPUs themselves to optimize;
not sure they actually do this though.

But we don't really have much choice aside from introducing things like
smp_wmb__after_local_$op, and I'm fairly sure people won't like adding a
ton of conditional barriers like that either.
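
Just to sketch what I mean, such a primitive could look like the below; the
config symbol and the name are entirely hypothetical:

	/* Hypothetical sketch; no such primitive exists today. */
	#ifdef CONFIG_ARCH_LOCAL_OPS_IMPLY_MB
	/* the local op already implied a full barrier: compile away */
	#define smp_wmb__after_local_dec()	barrier()
	#else
	#define smp_wmb__after_local_dec()	smp_wmb()
	#endif

	/* usage in perf_output_put_handle() would then be: */
	if (!local_dec_and_test(&rb->nest))
		goto out;

	smp_wmb__after_local_dec();
	rb->user_page->data_head = head;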


> 2. I think the comment in "include/linux/perf_event.h" describing
> "data_head" and
> "data_tail" for user space need an update as well. Current version -

Oh, indeed. Thanks; I'll update that too!


* Re: perf events ring buffer memory barrier on powerpc
  2013-10-25 17:37   ` Peter Zijlstra
  2013-10-25 20:31     ` Michael Neuling
  2013-10-27  9:00     ` Victor Kaplansky
@ 2013-10-28 10:02     ` Frederic Weisbecker
  2013-10-28 12:38       ` Victor Kaplansky
  2013-10-29 14:06     ` [tip:perf/urgent] perf: Fix perf ring buffer memory ordering tip-bot for Peter Zijlstra
  3 siblings, 1 reply; 120+ messages in thread
From: Frederic Weisbecker @ 2013-10-28 10:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Michael Neuling, Benjamin Herrenschmidt, Anton Blanchard, LKML,
	Linux PPC dev, Victor Kaplansky, Mathieu Desnoyers,
	Michael Ellerman

2013/10/25 Peter Zijlstra <peterz@infradead.org>:
> On Wed, Oct 23, 2013 at 03:19:51PM +0100, Frederic Weisbecker wrote:
> I would argue for:
>
>   READ ->data_tail                      READ ->data_head
>   smp_rmb()     (A)                     smp_rmb()       (C)
>   WRITE $data                           READ $data
>   smp_wmb()     (B)                     smp_mb()        (D)
>   STORE ->data_head                     WRITE ->data_tail
>
> Where A pairs with D, and B pairs with C.
>
> I don't think A needs to be a full barrier because we won't in fact
> write data until we see the store from userspace. So we simply don't
> issue the data WRITE until we observe it.
>
> OTOH, D needs to be a full barrier since it separates the data READ from
> the tail WRITE.
>
> For B a WMB is sufficient since it separates two WRITEs, and for C an
> RMB is sufficient since it separates two READs.

Hmm, I need to defer to you on that; I'm not yet comfortable with
picking specific barrier flavours when both writes and reads are
involved on the same side :)


* Re: perf events ring buffer memory barrier on powerpc
  2013-10-28 10:02     ` Frederic Weisbecker
@ 2013-10-28 12:38       ` Victor Kaplansky
  2013-10-28 13:26         ` Peter Zijlstra
  0 siblings, 1 reply; 120+ messages in thread
From: Victor Kaplansky @ 2013-10-28 12:38 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Anton Blanchard, Benjamin Herrenschmidt, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling,
	Peter Zijlstra

> From: Frederic Weisbecker <fweisbec@gmail.com>
>
> 2013/10/25 Peter Zijlstra <peterz@infradead.org>:
> > On Wed, Oct 23, 2013 at 03:19:51PM +0100, Frederic Weisbecker wrote:
> > I would argue for
> >
> >   READ ->data_tail                      READ ->data_head
> >   smp_rmb()     (A)                     smp_rmb()       (C)
> >   WRITE $data                           READ $data
> >   smp_wmb()     (B)                     smp_mb()        (D)
> >   STORE ->data_head                     WRITE ->data_tail
> >
> > Where A pairs with D, and B pairs with C.
> >
> > I don't think A needs to be a full barrier because we won't in fact
> > write data until we see the store from userspace. So we simply don't
> > issue the data WRITE until we observe it.
> >
> > OTOH, D needs to be a full barrier since it separates the data READ
from
> > the tail WRITE.
> >
> > For B a WMB is sufficient since it separates two WRITEs, and for C an
> > RMB is sufficient since it separates two READs.
>
> Hmm, I need to defer on you for that, I'm not yet comfortable with
> picking specific barrier flavours when both write and read are
> involved in a same side :)

I think you have a point :) IMO, memory barrier (A) is superfluous.
At the producer side we need to ensure that "WRITE $data" is not committed to
memory before "READ ->data_tail" has seen a new value, in the case where the
old value indicated that there is not enough space for a new entry. All this
is already guaranteed by the control flow dependency on a single CPU - writes
will not be committed to memory if the value read from "data_tail" doesn't
indicate enough free space in the ring buffer.

Likewise, on the consumer side, we can make use of the natural data dependency
and the single-CPU memory ordering guarantees and try to replace "smp_mb" with
a more light-weight "smp_rmb":

READ ->data_tail                      READ ->data_head
// ...                                smp_rmb()       (C)
WRITE $data                           READ $data
smp_wmb()     (B)                     smp_rmb()       (D)
                                      READ $header_size
STORE ->data_head                     WRITE ->data_tail = $old_data_tail +
                                                          $header_size

We ensure that all $data is read before "data_tail" is written by doing
"READ $header_size" after all other data is read, and we rely on the natural
data dependency between the "data_tail" write and the "header_size" read.

-- Victor



* Re: perf events ring buffer memory barrier on powerpc
  2013-10-28 12:38       ` Victor Kaplansky
@ 2013-10-28 13:26         ` Peter Zijlstra
  2013-10-28 16:34           ` Paul E. McKenney
  2013-10-28 19:09           ` Oleg Nesterov
  0 siblings, 2 replies; 120+ messages in thread
From: Peter Zijlstra @ 2013-10-28 13:26 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: Frederic Weisbecker, Anton Blanchard, Benjamin Herrenschmidt,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Paul McKenney, Oleg Nesterov

On Mon, Oct 28, 2013 at 02:38:29PM +0200, Victor Kaplansky wrote:
> > 2013/10/25 Peter Zijlstra <peterz@infradead.org>:
> > > On Wed, Oct 23, 2013 at 03:19:51PM +0100, Frederic Weisbecker wrote:
> > > I would argue for
> > >
> > >   READ ->data_tail                      READ ->data_head
> > >   smp_rmb()     (A)                     smp_rmb()       (C)
> > >   WRITE $data                           READ $data
> > >   smp_wmb()     (B)                     smp_mb()        (D)
> > >   STORE ->data_head                     WRITE ->data_tail
> > >
> > > Where A pairs with D, and B pairs with C.
> > >
> > > I don't think A needs to be a full barrier because we won't in fact
> > > write data until we see the store from userspace. So we simply don't
> > > issue the data WRITE until we observe it.
> > >
> > > OTOH, D needs to be a full barrier since it separates the data READ from
> > > the tail WRITE.
> > >
> > > For B a WMB is sufficient since it separates two WRITEs, and for C an
> > > RMB is sufficient since it separates two READs.

<snip>

> I think you have a point :) IMO, memory barrier (A) is superfluous.
> At producer side we need to ensure that "WRITE $data" is not committed
> to memory before "READ ->data_tail" had seen a new value and if the
> old one indicated that there is no enough space for a new entry. All
> this is already guaranteed by control flow dependancy on single CPU -
> writes will not be committed to the memory if read value of
> "data_tail" doesn't specify enough free space in the ring buffer.
> 
> Likewise, on consumer side, we can make use of natural data dependency and
> memory ordering guarantee for single CPU and try to replace "smp_mb" by
> a more light-weight "smp_rmb":
> 
> READ ->data_tail                      READ ->data_head
> // ...                                smp_rmb()       (C)
> WRITE $data                           READ $data
> smp_wmb()     (B)                     smp_rmb()       (D)
> 						  READ $header_size
> STORE ->data_head                     WRITE ->data_tail = $old_data_tail +
> $header_size
> 
> We ensure that all $data is read before "data_tail" is written by
> doing "READ $header_size" after all other data is read and we rely on
> natural data dependancy between "data_tail" write and "header_size"
> read.

I'm not entirely sure I get the $header_size trickery; need to think
more on that. But yes, I did consider the other one. However, I had
trouble having no pairing barrier for (D).

ISTR something like Alpha being able to miss the update (for a long
while) if you don't issue the RMB.

Let's add Paul and Oleg to the thread; this is getting far more 'fun'
than it should be ;-)

For completeness; below the patch as I had queued it.
---
Subject: perf: Fix perf ring buffer memory ordering
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon Oct 28 13:55:29 CET 2013

The PPC64 people noticed a missing memory barrier and crufty old
comments in the perf ring buffer code. So update all the comments and
add the missing barrier.

When the architecture implements local_t using atomic_long_t there
will be double barriers issued; but short of introducing more
conditional barrier primitives this is the best we can do.

Cc: anton@samba.org
Cc: benh@kernel.crashing.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: michael@ellerman.id.au
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reported-by: Victor Kaplansky <victork@il.ibm.com>
Tested-by: Victor Kaplansky <victork@il.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131025173749.GG19466@laptop.lan
---
 include/uapi/linux/perf_event.h |   12 +++++++-----
 kernel/events/ring_buffer.c     |   29 ++++++++++++++++++++++++++---
 2 files changed, 33 insertions(+), 8 deletions(-)

Index: linux-2.6/include/uapi/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/uapi/linux/perf_event.h
+++ linux-2.6/include/uapi/linux/perf_event.h
@@ -479,13 +479,15 @@ struct perf_event_mmap_page {
 	/*
 	 * Control data for the mmap() data buffer.
 	 *
-	 * User-space reading the @data_head value should issue an rmb(), on
-	 * SMP capable platforms, after reading this value -- see
-	 * perf_event_wakeup().
+	 * User-space reading the @data_head value should issue an smp_rmb(),
+	 * after reading this value.
 	 *
 	 * When the mapping is PROT_WRITE the @data_tail value should be
-	 * written by userspace to reflect the last read data. In this case
-	 * the kernel will not over-write unread data.
+	 * written by userspace to reflect the last read data, after issueing
+	 * an smp_mb() to separate the data read from the ->data_tail store.
+	 * In this case the kernel will not over-write unread data.
+	 *
+	 * See perf_output_put_handle() for the data ordering.
 	 */
 	__u64   data_head;		/* head in the data section */
 	__u64	data_tail;		/* user-space written tail */
Index: linux-2.6/kernel/events/ring_buffer.c
===================================================================
--- linux-2.6.orig/kernel/events/ring_buffer.c
+++ linux-2.6/kernel/events/ring_buffer.c
@@ -87,10 +87,31 @@ static void perf_output_put_handle(struc
 		goto out;
 
 	/*
-	 * Publish the known good head. Rely on the full barrier implied
-	 * by atomic_dec_and_test() order the rb->head read and this
-	 * write.
+	 * Since the mmap() consumer (userspace) can run on a different CPU:
+	 *
+	 *   kernel				user
+	 *
+	 *   READ ->data_tail			READ ->data_head
+	 *   smp_rmb()	(A)			smp_rmb()	(C)
+	 *   WRITE $data			READ $data
+	 *   smp_wmb()	(B)			smp_mb()	(D)
+	 *   STORE ->data_head			WRITE ->data_tail
+	 *
+	 * Where A pairs with D, and B pairs with C.
+	 *
+	 * I don't think A needs to be a full barrier because we won't in fact
+	 * write data until we see the store from userspace. So we simply don't
+	 * issue the data WRITE until we observe it.
+	 *
+	 * OTOH, D needs to be a full barrier since it separates the data READ
+	 * from the tail WRITE.
+	 *
+	 * For B a WMB is sufficient since it separates two WRITEs, and for C
+	 * an RMB is sufficient since it separates two READs.
+	 *
+	 * See perf_output_begin().
 	 */
+	smp_wmb();
 	rb->user_page->data_head = head;
 
 	/*
@@ -154,6 +175,8 @@ int perf_output_begin(struct perf_output
 		 * Userspace could choose to issue a mb() before updating the
 		 * tail pointer. So that all reads will be completed before the
 		 * write is issued.
+		 *
+		 * See perf_output_put_handle().
 		 */
 		tail = ACCESS_ONCE(rb->user_page->data_tail);
 		smp_rmb();


* Re: perf events ring buffer memory barrier on powerpc
  2013-10-28 13:26         ` Peter Zijlstra
@ 2013-10-28 16:34           ` Paul E. McKenney
  2013-10-28 20:17             ` Oleg Nesterov
  2013-10-28 19:09           ` Oleg Nesterov
  1 sibling, 1 reply; 120+ messages in thread
From: Paul E. McKenney @ 2013-10-28 16:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Victor Kaplansky, Frederic Weisbecker, Anton Blanchard,
	Benjamin Herrenschmidt, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Oleg Nesterov

On Mon, Oct 28, 2013 at 02:26:34PM +0100, Peter Zijlstra wrote:
> On Mon, Oct 28, 2013 at 02:38:29PM +0200, Victor Kaplansky wrote:
> > > 2013/10/25 Peter Zijlstra <peterz@infradead.org>:
> > > > On Wed, Oct 23, 2013 at 03:19:51PM +0100, Frederic Weisbecker wrote:
> > > > I would argue for
> > > >
> > > >   READ ->data_tail                      READ ->data_head
> > > >   smp_rmb()     (A)                     smp_rmb()       (C)
> > > >   WRITE $data                           READ $data
> > > >   smp_wmb()     (B)                     smp_mb()        (D)
> > > >   STORE ->data_head                     WRITE ->data_tail
> > > >
> > > > Where A pairs with D, and B pairs with C.
> > > >
> > > > I don't think A needs to be a full barrier because we won't in fact
> > > > write data until we see the store from userspace. So we simply don't
> > > > issue the data WRITE until we observe it.
> > > >
> > > > OTOH, D needs to be a full barrier since it separates the data READ from
> > > > the tail WRITE.
> > > >
> > > > For B a WMB is sufficient since it separates two WRITEs, and for C an
> > > > RMB is sufficient since it separates two READs.
> 
> <snip>
> 
> > I think you have a point :) IMO, memory barrier (A) is superfluous.
> > At producer side we need to ensure that "WRITE $data" is not committed
> > to memory before "READ ->data_tail" had seen a new value and if the
> > old one indicated that there is no enough space for a new entry. All
> > this is already guaranteed by control flow dependancy on single CPU -
> > writes will not be committed to the memory if read value of
> > "data_tail" doesn't specify enough free space in the ring buffer.
> > 
> > Likewise, on consumer side, we can make use of natural data dependency and
> > memory ordering guarantee for single CPU and try to replace "smp_mb" by
> > a more light-weight "smp_rmb":
> > 
> > READ ->data_tail                      READ ->data_head
> > // ...                                smp_rmb()       (C)
> > WRITE $data                           READ $data
> > smp_wmb()     (B)                     smp_rmb()       (D)
> > 						  READ $header_size
> > STORE ->data_head                     WRITE ->data_tail = $old_data_tail +
> > $header_size
> > 
> > We ensure that all $data is read before "data_tail" is written by
> > doing "READ $header_size" after all other data is read and we rely on
> > natural data dependancy between "data_tail" write and "header_size"
> > read.
> 
> I'm not entirely sure I get the $header_size trickery; need to think
> more on that. But yes, I did consider the other one. However, I had
> trouble having no pairing barrier for (D).
> 
> ISTR something like Alpha being able to miss the update (for a long
> while) if you don't issue the RMB.
> 
> Lets add Paul and Oleg to the thread; this is getting far more 'fun'
> that it should be ;-)
> 
> For completeness; below the patch as I had queued it.
> ---
> Subject: perf: Fix perf ring buffer memory ordering
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon Oct 28 13:55:29 CET 2013
> 
> The PPC64 people noticed a missing memory barrier and crufty old
> comments in the perf ring buffer code. So update all the comments and
> add the missing barrier.
> 
> When the architecture implements local_t using atomic_long_t there
> will be double barriers issued; but short of introducing more
> conditional barrier primitives this is the best we can do.
> 
> Cc: anton@samba.org
> Cc: benh@kernel.crashing.org
> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Cc: michael@ellerman.id.au
> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Reported-by: Victor Kaplansky <victork@il.ibm.com>
> Tested-by: Victor Kaplansky <victork@il.ibm.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/20131025173749.GG19466@laptop.lan
> ---
>  include/uapi/linux/perf_event.h |   12 +++++++-----
>  kernel/events/ring_buffer.c     |   29 ++++++++++++++++++++++++++---
>  2 files changed, 33 insertions(+), 8 deletions(-)
> 
> Index: linux-2.6/include/uapi/linux/perf_event.h
> ===================================================================
> --- linux-2.6.orig/include/uapi/linux/perf_event.h
> +++ linux-2.6/include/uapi/linux/perf_event.h
> @@ -479,13 +479,15 @@ struct perf_event_mmap_page {
>  	/*
>  	 * Control data for the mmap() data buffer.
>  	 *
> -	 * User-space reading the @data_head value should issue an rmb(), on
> -	 * SMP capable platforms, after reading this value -- see
> -	 * perf_event_wakeup().
> +	 * User-space reading the @data_head value should issue an smp_rmb(),
> +	 * after reading this value.
>  	 *
>  	 * When the mapping is PROT_WRITE the @data_tail value should be
> -	 * written by userspace to reflect the last read data. In this case
> -	 * the kernel will not over-write unread data.
> +	 * written by userspace to reflect the last read data, after issueing
> +	 * an smp_mb() to separate the data read from the ->data_tail store.
> +	 * In this case the kernel will not over-write unread data.
> +	 *
> +	 * See perf_output_put_handle() for the data ordering.
>  	 */
>  	__u64   data_head;		/* head in the data section */
>  	__u64	data_tail;		/* user-space written tail */
> Index: linux-2.6/kernel/events/ring_buffer.c
> ===================================================================
> --- linux-2.6.orig/kernel/events/ring_buffer.c
> +++ linux-2.6/kernel/events/ring_buffer.c
> @@ -87,10 +87,31 @@ static void perf_output_put_handle(struc
>  		goto out;
> 
>  	/*
> -	 * Publish the known good head. Rely on the full barrier implied
> -	 * by atomic_dec_and_test() order the rb->head read and this
> -	 * write.
> +	 * Since the mmap() consumer (userspace) can run on a different CPU:
> +	 *
> +	 *   kernel				user
> +	 *
> +	 *   READ ->data_tail			READ ->data_head
> +	 *   smp_rmb()	(A)			smp_rmb()	(C)

Given that both of the kernel's subsequent operations are stores/writes,
doesn't (A) need to be smp_mb()?

							Thanx, Paul

> +	 *   WRITE $data			READ $data
> +	 *   smp_wmb()	(B)			smp_mb()	(D)
> +	 *   STORE ->data_head			WRITE ->data_tail
> +	 *
> +	 * Where A pairs with D, and B pairs with C.
> +	 *
> +	 * I don't think A needs to be a full barrier because we won't in fact
> +	 * write data until we see the store from userspace. So we simply don't
> +	 * issue the data WRITE until we observe it.
> +	 *
> +	 * OTOH, D needs to be a full barrier since it separates the data READ
> +	 * from the tail WRITE.
> +	 *
> +	 * For B a WMB is sufficient since it separates two WRITEs, and for C
> +	 * an RMB is sufficient since it separates two READs.
> +	 *
> +	 * See perf_output_begin().
>  	 */
> +	smp_wmb();
>  	rb->user_page->data_head = head;
> 
>  	/*
> @@ -154,6 +175,8 @@ int perf_output_begin(struct perf_output
>  		 * Userspace could choose to issue a mb() before updating the
>  		 * tail pointer. So that all reads will be completed before the
>  		 * write is issued.
> +		 *
> +		 * See perf_output_put_handle().
>  		 */
>  		tail = ACCESS_ONCE(rb->user_page->data_tail);
>  		smp_rmb();
> 



* Re: perf events ring buffer memory barrier on powerpc
  2013-10-28 13:26         ` Peter Zijlstra
  2013-10-28 16:34           ` Paul E. McKenney
@ 2013-10-28 19:09           ` Oleg Nesterov
  1 sibling, 0 replies; 120+ messages in thread
From: Oleg Nesterov @ 2013-10-28 19:09 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Victor Kaplansky, Frederic Weisbecker, Anton Blanchard,
	Benjamin Herrenschmidt, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Paul McKenney

On 10/28, Peter Zijlstra wrote:
>
> Lets add Paul and Oleg to the thread; this is getting far more 'fun'
> that it should be ;-)

Heh. All I can say is that I would like to see the authoritative answer;
you know who can shed light on this ;)

But to avoid confusion: the wmb() added by this patch looks "obviously
correct" to me.

> +	 * Since the mmap() consumer (userspace) can run on a different CPU:
> +	 *
> +	 *   kernel				user
> +	 *
> +	 *   READ ->data_tail			READ ->data_head
> +	 *   smp_rmb()	(A)			smp_rmb()	(C)
> +	 *   WRITE $data			READ $data
> +	 *   smp_wmb()	(B)			smp_mb()	(D)
> +	 *   STORE ->data_head			WRITE ->data_tail
> +	 *
> +	 * Where A pairs with D, and B pairs with C.
> +	 *
> +	 * I don't think A needs to be a full barrier because we won't in fact
> +	 * write data until we see the store from userspace.

this matches the intuition, but ...

> So we simply don't
> +	 * issue the data WRITE until we observe it.

why do we need any barrier (rmb) then? How can it help to serialize with
"WRITE $data"?

(of course there could be other reasons for this rmb(), I just can't
 really understand "A pairs with D").

And this reminds me about the memory barrier in kfifo.c which I was not
able to understand. Hmm, it has already gone away, and now I do not
understand kfifo.c at all ;) But I have found the commit, attached below.

Note that that commit added the full barrier into __kfifo_put(), and to
me it looks the same as "A" above. Following the logic above we could say
that we do not need a barrier (at least not the full one): we won't in fact
write into the "unread" area until we see the store to ->out from
__kfifo_get()?


In short: I am confused. I _feel_ that "A" has to be a full barrier, but
I can't prove it. So let me suggest an artificial/simplified example,

	bool	BUSY;
	data_t 	DATA;

	bool try_to_get(data_t *data)
	{
		if (!BUSY)
			return false;

		rmb();

		*data = DATA;
		mb();
		BUSY = false;

		return true;
	}

	bool try_to_put(data_t *data)
	{
		if (BUSY)
			return false;

		mb();	// XXXXXXXX: do we really need it? I think yes.

		DATA = *data;
		wmb();
		BUSY = true;

		return true;
	}

Again, following the description above we could turn the mb() in _put()
into barrier(), but I do not think we can rely on the control dependency.

Oleg.
---

commit a45bce49545739a940f8bd4ca85c3b7435564893
Author: Paul E. McKenney <paulmck@us.ibm.com>
Date:   Fri Sep 29 02:00:11 2006 -0700

    [PATCH] memory ordering in __kfifo primitives

    Both __kfifo_put() and __kfifo_get() have header comments stating that if
    there is but one concurrent reader and one concurrent writer, locking is not
    necessary.  This is almost the case, but a couple of memory barriers are
    needed.  Another option would be to change the header comments to remove the
    bit about locking not being needed, and to change the those callers who
    currently don't use locking to add the required locking.  The attachment
    analyzes this approach, but the patch below seems simpler.

    Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
    Cc: Stelian Pop <stelian@popies.net>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

diff --git a/kernel/kfifo.c b/kernel/kfifo.c
index 64ab045..5d1d907 100644
--- a/kernel/kfifo.c
+++ b/kernel/kfifo.c
@@ -122,6 +122,13 @@ unsigned int __kfifo_put(struct kfifo *fifo,
 
 	len = min(len, fifo->size - fifo->in + fifo->out);
 
+	/*
+	 * Ensure that we sample the fifo->out index -before- we
+	 * start putting bytes into the kfifo.
+	 */
+
+	smp_mb();
+
 	/* first put the data starting from fifo->in to buffer end */
 	l = min(len, fifo->size - (fifo->in & (fifo->size - 1)));
 	memcpy(fifo->buffer + (fifo->in & (fifo->size - 1)), buffer, l);
@@ -129,6 +136,13 @@ unsigned int __kfifo_put(struct kfifo *fifo,
 	/* then put the rest (if any) at the beginning of the buffer */
 	memcpy(fifo->buffer, buffer + l, len - l);
 
+	/*
+	 * Ensure that we add the bytes to the kfifo -before-
+	 * we update the fifo->in index.
+	 */
+
+	smp_wmb();
+
 	fifo->in += len;
 
 	return len;
@@ -154,6 +168,13 @@ unsigned int __kfifo_get(struct kfifo *fifo,
 
 	len = min(len, fifo->in - fifo->out);
 
+	/*
+	 * Ensure that we sample the fifo->in index -before- we
+	 * start removing bytes from the kfifo.
+	 */
+
+	smp_rmb();
+
 	/* first get the data from fifo->out until the end of the buffer */
 	l = min(len, fifo->size - (fifo->out & (fifo->size - 1)));
 	memcpy(buffer, fifo->buffer + (fifo->out & (fifo->size - 1)), l);
@@ -161,6 +182,13 @@ unsigned int __kfifo_get(struct kfifo *fifo,
 	/* then get the rest (if any) from the beginning of the buffer */
 	memcpy(buffer + l, fifo->buffer, len - l);
 
+	/*
+	 * Ensure that we remove the bytes from the kfifo -before-
+	 * we update the fifo->out index.
+	 */
+
+	smp_mb();
+
 	fifo->out += len;
 
 	return len;



* Re: perf events ring buffer memory barrier on powerpc
  2013-10-28 16:34           ` Paul E. McKenney
@ 2013-10-28 20:17             ` Oleg Nesterov
  2013-10-28 20:58               ` Victor Kaplansky
  0 siblings, 1 reply; 120+ messages in thread
From: Oleg Nesterov @ 2013-10-28 20:17 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Victor Kaplansky, Frederic Weisbecker,
	Anton Blanchard, Benjamin Herrenschmidt, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On 10/28, Paul E. McKenney wrote:
>
> On Mon, Oct 28, 2013 at 02:26:34PM +0100, Peter Zijlstra wrote:
> > --- linux-2.6.orig/kernel/events/ring_buffer.c
> > +++ linux-2.6/kernel/events/ring_buffer.c
> > @@ -87,10 +87,31 @@ static void perf_output_put_handle(struc
> >  		goto out;
> >
> >  	/*
> > -	 * Publish the known good head. Rely on the full barrier implied
> > -	 * by atomic_dec_and_test() order the rb->head read and this
> > -	 * write.
> > +	 * Since the mmap() consumer (userspace) can run on a different CPU:
> > +	 *
> > +	 *   kernel				user
> > +	 *
> > +	 *   READ ->data_tail			READ ->data_head
> > +	 *   smp_rmb()	(A)			smp_rmb()	(C)
>
> Given that both of the kernel's subsequent operations are stores/writes,
> doesn't (A) need to be smp_mb()?

Yes, this is my understanding^Wfeeling too, but I have to admit that
I can't really explain to myself why _exactly_ we need mb() here.

And let me copy-and-paste the artificial example from my previous
email,

	bool	BUSY;
	data_t 	DATA;

	bool try_to_get(data_t *data)
	{
		if (!BUSY)
			return false;

		rmb();

		*data = DATA;
		mb();
		BUSY = false;

		return true;
	}

	bool try_to_put(data_t *data)
	{
		if (BUSY)
			return false;

		mb();	// XXXXXXXX: do we really need it? I think yes.

		DATA = *data;
		wmb();
		BUSY = true;

		return true;
	}

(just in case, the code above obviously assumes that _get or _put
 can't race with itself, but they can race with each other).

Could you confirm that try_to_put() actually needs mb() between
LOAD(BUSY) and STORE(DATA) ?

I am sure it actually does, but I would appreciate it if you could
explain why. IOW, how is it possible that without the mb() try_to_put()
can overwrite DATA before try_to_get() completes its "*data = DATA"
in this particular case?

Perhaps this can happen if, say, reader and writer share a level of
cache or something like this...

Oleg.



* Re: perf events ring buffer memory barrier on powerpc
  2013-10-28 20:17             ` Oleg Nesterov
@ 2013-10-28 20:58               ` Victor Kaplansky
  2013-10-29 10:21                 ` Peter Zijlstra
  2013-10-30  9:27                 ` Paul E. McKenney
  0 siblings, 2 replies; 120+ messages in thread
From: Victor Kaplansky @ 2013-10-28 20:58 UTC (permalink / raw)
  To: Oleg Nesterov
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Paul E. McKenney, Peter Zijlstra

Oleg Nesterov <oleg@redhat.com> wrote on 10/28/2013 10:17:35 PM:

>       mb();   // XXXXXXXX: do we really need it? I think yes.

Oh, it is hard to argue with feelings. Also, it is easy to be on the
conservative side and put the barrier here just in case.
But I still insist that the barrier is redundant in your example.

-- Victor



* Re: perf events ring buffer memory barrier on powerpc
  2013-10-28 20:58               ` Victor Kaplansky
@ 2013-10-29 10:21                 ` Peter Zijlstra
  2013-10-29 10:30                   ` Peter Zijlstra
  2013-10-30  9:27                 ` Paul E. McKenney
  1 sibling, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-10-29 10:21 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Paul E. McKenney

On Mon, Oct 28, 2013 at 10:58:58PM +0200, Victor Kaplansky wrote:
> Oleg Nesterov <oleg@redhat.com> wrote on 10/28/2013 10:17:35 PM:
> 
> >       mb();   // XXXXXXXX: do we really need it? I think yes.
> 
> Oh, it is hard to argue with feelings. Also, it is easy to be on
> conservative side and put the barrier here just in case.

I'll make it a full mb for now, and I too am curious to see the end of this
discussion explaining things ;-)


* Re: perf events ring buffer memory barrier on powerpc
  2013-10-29 10:21                 ` Peter Zijlstra
@ 2013-10-29 10:30                   ` Peter Zijlstra
  2013-10-29 10:35                     ` Peter Zijlstra
                                       ` (2 more replies)
  0 siblings, 3 replies; 120+ messages in thread
From: Peter Zijlstra @ 2013-10-29 10:30 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Paul E. McKenney

On Tue, Oct 29, 2013 at 11:21:31AM +0100, Peter Zijlstra wrote:
> On Mon, Oct 28, 2013 at 10:58:58PM +0200, Victor Kaplansky wrote:
> > Oleg Nesterov <oleg@redhat.com> wrote on 10/28/2013 10:17:35 PM:
> > 
> > >       mb();   // XXXXXXXX: do we really need it? I think yes.
> > 
> > Oh, it is hard to argue with feelings. Also, it is easy to be on
> > conservative side and put the barrier here just in case.
> 
> I'll make it a full mb for now and too am curious to see the end of this
> discussion explaining things ;-)

That is, I've now got this queued:

---
Subject: perf: Fix perf ring buffer memory ordering
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon Oct 28 13:55:29 CET 2013

The PPC64 people noticed a missing memory barrier and crufty old
comments in the perf ring buffer code. So update all the comments and
add the missing barrier.

When the architecture implements local_t using atomic_long_t there
will be double barriers issued; but short of introducing more
conditional barrier primitives this is the best we can do.

Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: michael@ellerman.id.au
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: anton@samba.org
Cc: benh@kernel.crashing.org
Reported-by: Victor Kaplansky <victork@il.ibm.com>
Tested-by: Victor Kaplansky <victork@il.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131025173749.GG19466@laptop.lan
---
 include/uapi/linux/perf_event.h |   12 +++++++-----
 kernel/events/ring_buffer.c     |   31 +++++++++++++++++++++++++++----
 2 files changed, 34 insertions(+), 9 deletions(-)

Index: linux-2.6/include/uapi/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/uapi/linux/perf_event.h
+++ linux-2.6/include/uapi/linux/perf_event.h
@@ -479,13 +479,15 @@ struct perf_event_mmap_page {
 	/*
 	 * Control data for the mmap() data buffer.
 	 *
-	 * User-space reading the @data_head value should issue an rmb(), on
-	 * SMP capable platforms, after reading this value -- see
-	 * perf_event_wakeup().
+	 * User-space reading the @data_head value should issue an smp_rmb(),
+	 * after reading this value.
 	 *
 	 * When the mapping is PROT_WRITE the @data_tail value should be
-	 * written by userspace to reflect the last read data. In this case
-	 * the kernel will not over-write unread data.
+	 * written by userspace to reflect the last read data, after issueing
+	 * an smp_mb() to separate the data read from the ->data_tail store.
+	 * In this case the kernel will not over-write unread data.
+	 *
+	 * See perf_output_put_handle() for the data ordering.
 	 */
 	__u64   data_head;		/* head in the data section */
 	__u64	data_tail;		/* user-space written tail */
Index: linux-2.6/kernel/events/ring_buffer.c
===================================================================
--- linux-2.6.orig/kernel/events/ring_buffer.c
+++ linux-2.6/kernel/events/ring_buffer.c
@@ -87,10 +87,31 @@ static void perf_output_put_handle(struc
 		goto out;
 
 	/*
-	 * Publish the known good head. Rely on the full barrier implied
-	 * by atomic_dec_and_test() order the rb->head read and this
-	 * write.
+	 * Since the mmap() consumer (userspace) can run on a different CPU:
+	 *
+	 *   kernel				user
+	 *
+	 *   READ ->data_tail			READ ->data_head
+	 *   smp_mb()	(A)			smp_rmb()	(C)
+	 *   WRITE $data			READ $data
+	 *   smp_wmb()	(B)			smp_mb()	(D)
+	 *   STORE ->data_head			WRITE ->data_tail
+	 *
+	 * Where A pairs with D, and B pairs with C.
+	 *
+	 * I don't think A needs to be a full barrier because we won't in fact
+	 * write data until we see the store from userspace. So we simply don't
+	 * issue the data WRITE until we observe it. Be conservative for now.
+	 *
+	 * OTOH, D needs to be a full barrier since it separates the data READ
+	 * from the tail WRITE.
+	 *
+	 * For B a WMB is sufficient since it separates two WRITEs, and for C
+	 * an RMB is sufficient since it separates two READs.
+	 *
+	 * See perf_output_begin().
 	 */
+	smp_wmb();
 	rb->user_page->data_head = head;
 
 	/*
@@ -154,9 +175,11 @@ int perf_output_begin(struct perf_output
 		 * Userspace could choose to issue a mb() before updating the
 		 * tail pointer. So that all reads will be completed before the
 		 * write is issued.
+		 *
+		 * See perf_output_put_handle().
 		 */
 		tail = ACCESS_ONCE(rb->user_page->data_tail);
-		smp_rmb();
+		smp_mb();
 		offset = head = local_read(&rb->head);
 		head += size;
 		if (unlikely(!perf_output_space(rb, tail, offset, head)))


* Re: perf events ring buffer memory barrier on powerpc
  2013-10-29 10:30                   ` Peter Zijlstra
@ 2013-10-29 10:35                     ` Peter Zijlstra
  2013-10-29 20:15                       ` Oleg Nesterov
  2013-10-29 19:27                     ` Vince Weaver
  2013-10-29 21:23                     ` perf events ring buffer memory barrier on powerpc Michael Neuling
  2 siblings, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-10-29 10:35 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Paul E. McKenney

On Tue, Oct 29, 2013 at 11:30:57AM +0100, Peter Zijlstra wrote:
> @@ -154,9 +175,11 @@ int perf_output_begin(struct perf_output
>  		 * Userspace could choose to issue a mb() before updating the
>  		 * tail pointer. So that all reads will be completed before the
>  		 * write is issued.
> +		 *
> +		 * See perf_output_put_handle().
>  		 */
>  		tail = ACCESS_ONCE(rb->user_page->data_tail);
> -		smp_rmb();
> +		smp_mb();
>  		offset = head = local_read(&rb->head);
>  		head += size;
>  		if (unlikely(!perf_output_space(rb, tail, offset, head)))

That said; it would be very nice to be able to remove this barrier. This
is in every event write path :/

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [tip:perf/urgent] perf: Fix perf ring buffer memory ordering
  2013-10-25 17:37   ` Peter Zijlstra
                       ` (2 preceding siblings ...)
  2013-10-28 10:02     ` Frederic Weisbecker
@ 2013-10-29 14:06     ` tip-bot for Peter Zijlstra
  3 siblings, 0 replies; 120+ messages in thread
From: tip-bot for Peter Zijlstra @ 2013-10-29 14:06 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, mathieu.desnoyers, hpa, mingo, peterz, victork,
	paulmck, fweisbec, tglx, mikey

Commit-ID:  bf378d341e4873ed928dc3c636252e6895a21f50
Gitweb:     http://git.kernel.org/tip/bf378d341e4873ed928dc3c636252e6895a21f50
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Mon, 28 Oct 2013 13:55:29 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Tue, 29 Oct 2013 12:01:19 +0100

perf: Fix perf ring buffer memory ordering

The PPC64 people noticed a missing memory barrier and crufty old
comments in the perf ring buffer code. So update all the comments and
add the missing barrier.

When the architecture implements local_t using atomic_long_t there
will be double barriers issued; but short of introducing more
conditional barrier primitives this is the best we can do.

Reported-by: Victor Kaplansky <victork@il.ibm.com>
Tested-by: Victor Kaplansky <victork@il.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: michael@ellerman.id.au
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: anton@samba.org
Cc: benh@kernel.crashing.org
Link: http://lkml.kernel.org/r/20131025173749.GG19466@laptop.lan
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/uapi/linux/perf_event.h | 12 +++++++-----
 kernel/events/ring_buffer.c     | 31 +++++++++++++++++++++++++++----
 2 files changed, 34 insertions(+), 9 deletions(-)

diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 009a655..2fc1602 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -456,13 +456,15 @@ struct perf_event_mmap_page {
 	/*
 	 * Control data for the mmap() data buffer.
 	 *
-	 * User-space reading the @data_head value should issue an rmb(), on
-	 * SMP capable platforms, after reading this value -- see
-	 * perf_event_wakeup().
+	 * User-space reading the @data_head value should issue an smp_rmb(),
+	 * after reading this value.
 	 *
 	 * When the mapping is PROT_WRITE the @data_tail value should be
-	 * written by userspace to reflect the last read data. In this case
-	 * the kernel will not over-write unread data.
+	 * written by userspace to reflect the last read data, after issuing
+	 * an smp_mb() to separate the data read from the ->data_tail store.
+	 * In this case the kernel will not over-write unread data.
+	 *
+	 * See perf_output_put_handle() for the data ordering.
 	 */
 	__u64   data_head;		/* head in the data section */
 	__u64	data_tail;		/* user-space written tail */
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index cd55144..9c2ddfb 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -87,10 +87,31 @@ again:
 		goto out;
 
 	/*
-	 * Publish the known good head. Rely on the full barrier implied
-	 * by atomic_dec_and_test() order the rb->head read and this
-	 * write.
+	 * Since the mmap() consumer (userspace) can run on a different CPU:
+	 *
+	 *   kernel				user
+	 *
+	 *   READ ->data_tail			READ ->data_head
+	 *   smp_mb()	(A)			smp_rmb()	(C)
+	 *   WRITE $data			READ $data
+	 *   smp_wmb()	(B)			smp_mb()	(D)
+	 *   STORE ->data_head			WRITE ->data_tail
+	 *
+	 * Where A pairs with D, and B pairs with C.
+	 *
+	 * I don't think A needs to be a full barrier because we won't in fact
+	 * write data until we see the store from userspace. So we simply don't
+	 * issue the data WRITE until we observe it. Be conservative for now.
+	 *
+	 * OTOH, D needs to be a full barrier since it separates the data READ
+	 * from the tail WRITE.
+	 *
+	 * For B a WMB is sufficient since it separates two WRITEs, and for C
+	 * an RMB is sufficient since it separates two READs.
+	 *
+	 * See perf_output_begin().
 	 */
+	smp_wmb();
 	rb->user_page->data_head = head;
 
 	/*
@@ -154,9 +175,11 @@ int perf_output_begin(struct perf_output_handle *handle,
 		 * Userspace could choose to issue a mb() before updating the
 		 * tail pointer. So that all reads will be completed before the
 		 * write is issued.
+		 *
+		 * See perf_output_put_handle().
 		 */
 		tail = ACCESS_ONCE(rb->user_page->data_tail);
-		smp_rmb();
+		smp_mb();
 		offset = head = local_read(&rb->head);
 		head += size;
 		if (unlikely(!perf_output_space(rb, tail, offset, head)))

^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-29 10:30                   ` Peter Zijlstra
  2013-10-29 10:35                     ` Peter Zijlstra
@ 2013-10-29 19:27                     ` Vince Weaver
  2013-10-30 10:42                       ` Peter Zijlstra
  2013-10-29 21:23                     ` perf events ring buffer memory barrier on powerpc Michael Neuling
  2 siblings, 1 reply; 120+ messages in thread
From: Vince Weaver @ 2013-10-29 19:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling,
	Paul E. McKenney

On Tue, 29 Oct 2013, Peter Zijlstra wrote:

> On Tue, Oct 29, 2013 at 11:21:31AM +0100, Peter Zijlstra wrote:
> --- linux-2.6.orig/include/uapi/linux/perf_event.h
> +++ linux-2.6/include/uapi/linux/perf_event.h
> @@ -479,13 +479,15 @@ struct perf_event_mmap_page {
>  	/*
>  	 * Control data for the mmap() data buffer.
>  	 *
> -	 * User-space reading the @data_head value should issue an rmb(), on
> -	 * SMP capable platforms, after reading this value -- see
> -	 * perf_event_wakeup().
> +	 * User-space reading the @data_head value should issue an smp_rmb(),
> +	 * after reading this value.

so where's the patch fixing perf to use the new recommendations?

Is this purely a performance thing or a correctness change?

A change like this is a bit of a pain, especially as userspace doesn't really 
have nice access to smp_mb() defines so a lot of cut-and-pasting will 
ensue for everyone who's trying to parse the mmap buffer.

Vince

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-29 10:35                     ` Peter Zijlstra
@ 2013-10-29 20:15                       ` Oleg Nesterov
  0 siblings, 0 replies; 120+ messages in thread
From: Oleg Nesterov @ 2013-10-29 20:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Paul E. McKenney

On 10/29, Peter Zijlstra wrote:
>
> On Tue, Oct 29, 2013 at 11:30:57AM +0100, Peter Zijlstra wrote:
> > @@ -154,9 +175,11 @@ int perf_output_begin(struct perf_output
> >  		 * Userspace could choose to issue a mb() before updating the
> >  		 * tail pointer. So that all reads will be completed before the
> >  		 * write is issued.
> > +		 *
> > +		 * See perf_output_put_handle().
> >  		 */
> >  		tail = ACCESS_ONCE(rb->user_page->data_tail);
> > -		smp_rmb();
> > +		smp_mb();
> >  		offset = head = local_read(&rb->head);
> >  		head += size;
> >  		if (unlikely(!perf_output_space(rb, tail, offset, head)))
>
> That said; it would be very nice to be able to remove this barrier. This
> is in every event write path :/

Yes.. And I'm very much afraid that I simply confused you. Perhaps Victor
is right and we do not need this mb(). So I am waiting for the end of
this story too.

And btw I do not understand why we need it (or smp_rmb) right after
ACCESS_ONCE(data_tail).

Oleg.


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-29 10:30                   ` Peter Zijlstra
  2013-10-29 10:35                     ` Peter Zijlstra
  2013-10-29 19:27                     ` Vince Weaver
@ 2013-10-29 21:23                     ` Michael Neuling
  2 siblings, 0 replies; 120+ messages in thread
From: Michael Neuling @ 2013-10-29 21:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Paul E. McKenney

Peter Zijlstra <peterz@infradead.org> wrote:

> On Tue, Oct 29, 2013 at 11:21:31AM +0100, Peter Zijlstra wrote:
> > On Mon, Oct 28, 2013 at 10:58:58PM +0200, Victor Kaplansky wrote:
> > > Oleg Nesterov <oleg@redhat.com> wrote on 10/28/2013 10:17:35 PM:
> > > 
> > > >       mb();   // XXXXXXXX: do we really need it? I think yes.
> > > 
> > > Oh, it is hard to argue with feelings. Also, it is easy to be on
> > > conservative side and put the barrier here just in case.
> > 
> > I'll make it a full mb for now and too am curious to see the end of this
> > discussion explaining things ;-)
> 
> That is, I've now got this queued:

Can we also CC stable@kernel.org?  This has been around for a while.

Mikey

> 
> ---
> Subject: perf: Fix perf ring buffer memory ordering
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon Oct 28 13:55:29 CET 2013
> 
> The PPC64 people noticed a missing memory barrier and crufty old
> comments in the perf ring buffer code. So update all the comments and
> add the missing barrier.
> 
> When the architecture implements local_t using atomic_long_t there
> will be double barriers issued; but short of introducing more
> conditional barrier primitives this is the best we can do.
> 
> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Cc: michael@ellerman.id.au
> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: anton@samba.org
> Cc: benh@kernel.crashing.org
> Reported-by: Victor Kaplansky <victork@il.ibm.com>
> Tested-by: Victor Kaplansky <victork@il.ibm.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/20131025173749.GG19466@laptop.lan
> ---
>  include/uapi/linux/perf_event.h |   12 +++++++-----
>  kernel/events/ring_buffer.c     |   31 +++++++++++++++++++++++++++----
>  2 files changed, 34 insertions(+), 9 deletions(-)
> 
> Index: linux-2.6/include/uapi/linux/perf_event.h
> ===================================================================
> --- linux-2.6.orig/include/uapi/linux/perf_event.h
> +++ linux-2.6/include/uapi/linux/perf_event.h
> @@ -479,13 +479,15 @@ struct perf_event_mmap_page {
>  	/*
>  	 * Control data for the mmap() data buffer.
>  	 *
> -	 * User-space reading the @data_head value should issue an rmb(), on
> -	 * SMP capable platforms, after reading this value -- see
> -	 * perf_event_wakeup().
> +	 * User-space reading the @data_head value should issue an smp_rmb(),
> +	 * after reading this value.
>  	 *
>  	 * When the mapping is PROT_WRITE the @data_tail value should be
> -	 * written by userspace to reflect the last read data. In this case
> -	 * the kernel will not over-write unread data.
> +	 * written by userspace to reflect the last read data, after issuing
> +	 * an smp_mb() to separate the data read from the ->data_tail store.
> +	 * In this case the kernel will not over-write unread data.
> +	 *
> +	 * See perf_output_put_handle() for the data ordering.
>  	 */
>  	__u64   data_head;		/* head in the data section */
>  	__u64	data_tail;		/* user-space written tail */
> Index: linux-2.6/kernel/events/ring_buffer.c
> ===================================================================
> --- linux-2.6.orig/kernel/events/ring_buffer.c
> +++ linux-2.6/kernel/events/ring_buffer.c
> @@ -87,10 +87,31 @@ static void perf_output_put_handle(struc
>  		goto out;
>  
>  	/*
> -	 * Publish the known good head. Rely on the full barrier implied
> -	 * by atomic_dec_and_test() order the rb->head read and this
> -	 * write.
> +	 * Since the mmap() consumer (userspace) can run on a different CPU:
> +	 *
> +	 *   kernel				user
> +	 *
> +	 *   READ ->data_tail			READ ->data_head
> +	 *   smp_mb()	(A)			smp_rmb()	(C)
> +	 *   WRITE $data			READ $data
> +	 *   smp_wmb()	(B)			smp_mb()	(D)
> +	 *   STORE ->data_head			WRITE ->data_tail
> +	 *
> +	 * Where A pairs with D, and B pairs with C.
> +	 *
> +	 * I don't think A needs to be a full barrier because we won't in fact
> +	 * write data until we see the store from userspace. So we simply don't
> +	 * issue the data WRITE until we observe it. Be conservative for now.
> +	 *
> +	 * OTOH, D needs to be a full barrier since it separates the data READ
> +	 * from the tail WRITE.
> +	 *
> +	 * For B a WMB is sufficient since it separates two WRITEs, and for C
> +	 * an RMB is sufficient since it separates two READs.
> +	 *
> +	 * See perf_output_begin().
>  	 */
> +	smp_wmb();
>  	rb->user_page->data_head = head;
>  
>  	/*
> @@ -154,9 +175,11 @@ int perf_output_begin(struct perf_output
>  		 * Userspace could choose to issue a mb() before updating the
>  		 * tail pointer. So that all reads will be completed before the
>  		 * write is issued.
> +		 *
> +		 * See perf_output_put_handle().
>  		 */
>  		tail = ACCESS_ONCE(rb->user_page->data_tail);
> -		smp_rmb();
> +		smp_mb();
>  		offset = head = local_read(&rb->head);
>  		head += size;
>  		if (unlikely(!perf_output_space(rb, tail, offset, head)))
> 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-28 20:58               ` Victor Kaplansky
  2013-10-29 10:21                 ` Peter Zijlstra
@ 2013-10-30  9:27                 ` Paul E. McKenney
  2013-10-30 11:25                   ` Peter Zijlstra
  2013-10-30 13:28                   ` Victor Kaplansky
  1 sibling, 2 replies; 120+ messages in thread
From: Paul E. McKenney @ 2013-10-30  9:27 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Peter Zijlstra

On Mon, Oct 28, 2013 at 10:58:58PM +0200, Victor Kaplansky wrote:
> Oleg Nesterov <oleg@redhat.com> wrote on 10/28/2013 10:17:35 PM:
> 
> >       mb();   // XXXXXXXX: do we really need it? I think yes.
> 
> Oh, it is hard to argue with feelings. Also, it is easy to be on
> conservative side and put the barrier here just in case.
> But I still insist that the barrier is redundant in your example.

If you were to back up that insistence with a description of the orderings
you are relying on, why other orderings are not important, and how the
important orderings are enforced, I might be tempted to pay attention
to your opinion.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-29 19:27                     ` Vince Weaver
@ 2013-10-30 10:42                       ` Peter Zijlstra
  2013-10-30 11:48                         ` James Hogan
  2013-11-06 13:19                         ` [tip:perf/core] tools/perf: Add required memory barriers tip-bot for Peter Zijlstra
  0 siblings, 2 replies; 120+ messages in thread
From: Peter Zijlstra @ 2013-10-30 10:42 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling,
	Paul E. McKenney, james.hogan

On Tue, Oct 29, 2013 at 03:27:05PM -0400, Vince Weaver wrote:
> On Tue, 29 Oct 2013, Peter Zijlstra wrote:
> 
> > On Tue, Oct 29, 2013 at 11:21:31AM +0100, Peter Zijlstra wrote:
> > --- linux-2.6.orig/include/uapi/linux/perf_event.h
> > +++ linux-2.6/include/uapi/linux/perf_event.h
> > @@ -479,13 +479,15 @@ struct perf_event_mmap_page {
> >  	/*
> >  	 * Control data for the mmap() data buffer.
> >  	 *
> > -	 * User-space reading the @data_head value should issue an rmb(), on
> > -	 * SMP capable platforms, after reading this value -- see
> > -	 * perf_event_wakeup().
> > +	 * User-space reading the @data_head value should issue an smp_rmb(),
> > +	 * after reading this value.
> 
> so where's the patch fixing perf to use the new recommendations?

Fair enough, thanks for reminding me about that. See below.

> Is this purely a performance thing or a correctness change?

Correctness, although I suppose on most archs you'd be hard pushed to
notice.

> A change like this is a bit of a pain, especially as userspace doesn't really 
> have nice access to smp_mb() defines so a lot of cut-and-pasting will 
> ensue for everyone who's trying to parse the mmap buffer.

Agreed; we should maybe push for a user visible asm/barrier.h or so.

---
Subject: perf, tool: Add required memory barriers

To match patch bf378d341e48 ("perf: Fix perf ring buffer memory
ordering") change userspace to also adhere to the ordering outlined.

Most barrier implementations were gleaned from
arch/*/include/asm/barrier.h and with the exception of metag I'm fairly
sure they're correct.

Cc: James Hogan <james.hogan@imgtec.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 tools/perf/perf.h        | 39 +++++++++++++++++++++++++++++++++++++--
 tools/perf/util/evlist.h |  2 +-
 2 files changed, 38 insertions(+), 3 deletions(-)

diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index f61c230beec4..1b8a0a2a63d4 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -4,6 +4,8 @@
 #include <asm/unistd.h>
 
 #if defined(__i386__)
+#define mb()		asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
+#define wmb()		asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
 #define rmb()		asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
 #define cpu_relax()	asm volatile("rep; nop" ::: "memory");
 #define CPUINFO_PROC	"model name"
@@ -13,6 +15,8 @@
 #endif
 
 #if defined(__x86_64__)
+#define mb()		asm volatile("mfence" ::: "memory")
+#define wmb()		asm volatile("sfence" ::: "memory")
 #define rmb()		asm volatile("lfence" ::: "memory")
 #define cpu_relax()	asm volatile("rep; nop" ::: "memory");
 #define CPUINFO_PROC	"model name"
@@ -23,20 +27,28 @@
 
 #ifdef __powerpc__
 #include "../../arch/powerpc/include/uapi/asm/unistd.h"
+#define mb()		asm volatile ("sync" ::: "memory")
+#define wmb()		asm volatile ("sync" ::: "memory")
 #define rmb()		asm volatile ("sync" ::: "memory")
 #define cpu_relax()	asm volatile ("" ::: "memory");
 #define CPUINFO_PROC	"cpu"
 #endif
 
 #ifdef __s390__
+#define mb()		asm volatile("bcr 15,0" ::: "memory")
+#define wmb()		asm volatile("bcr 15,0" ::: "memory")
 #define rmb()		asm volatile("bcr 15,0" ::: "memory")
 #define cpu_relax()	asm volatile("" ::: "memory");
 #endif
 
 #ifdef __sh__
 #if defined(__SH4A__) || defined(__SH5__)
+# define mb()		asm volatile("synco" ::: "memory")
+# define wmb()		asm volatile("synco" ::: "memory")
 # define rmb()		asm volatile("synco" ::: "memory")
 #else
+# define mb()		asm volatile("" ::: "memory")
+# define wmb()		asm volatile("" ::: "memory")
 # define rmb()		asm volatile("" ::: "memory")
 #endif
 #define cpu_relax()	asm volatile("" ::: "memory")
@@ -44,24 +56,38 @@
 #endif
 
 #ifdef __hppa__
+#define mb()		asm volatile("" ::: "memory")
+#define wmb()		asm volatile("" ::: "memory")
 #define rmb()		asm volatile("" ::: "memory")
 #define cpu_relax()	asm volatile("" ::: "memory");
 #define CPUINFO_PROC	"cpu"
 #endif
 
 #ifdef __sparc__
+#ifdef __LP64__
+#define mb()		asm volatile("ba,pt %%xcc, 1f\n"	\
+				     "membar #StoreLoad\n"	\
+				     "1:\n" ::: "memory")
+#else
+#define mb()		asm volatile("":::"memory")
+#endif
+#define wmb()		asm volatile("":::"memory")
 #define rmb()		asm volatile("":::"memory")
 #define cpu_relax()	asm volatile("":::"memory")
 #define CPUINFO_PROC	"cpu"
 #endif
 
 #ifdef __alpha__
+#define mb()		asm volatile("mb" ::: "memory")
+#define wmb()		asm volatile("wmb" ::: "memory")
 #define rmb()		asm volatile("mb" ::: "memory")
 #define cpu_relax()	asm volatile("" ::: "memory")
 #define CPUINFO_PROC	"cpu model"
 #endif
 
 #ifdef __ia64__
+#define mb()		asm volatile ("mf" ::: "memory")
+#define wmb()		asm volatile ("mf" ::: "memory")
 #define rmb()		asm volatile ("mf" ::: "memory")
 #define cpu_relax()	asm volatile ("hint @pause" ::: "memory")
 #define CPUINFO_PROC	"model name"
@@ -72,35 +98,44 @@
  * Use the __kuser_memory_barrier helper in the CPU helper page. See
  * arch/arm/kernel/entry-armv.S in the kernel source for details.
  */
+#define mb()		((void(*)(void))0xffff0fa0)()
+#define wmb()		((void(*)(void))0xffff0fa0)()
 #define rmb()		((void(*)(void))0xffff0fa0)()
 #define cpu_relax()	asm volatile("":::"memory")
 #define CPUINFO_PROC	"Processor"
 #endif
 
 #ifdef __aarch64__
-#define rmb()		asm volatile("dmb ld" ::: "memory")
+#define mb()		asm volatile("dmb ish" ::: "memory")
+#define wmb()		asm volatile("dmb ishst" ::: "memory")
+#define rmb()		asm volatile("dmb ishld" ::: "memory")
 #define cpu_relax()	asm volatile("yield" ::: "memory")
 #endif
 
 #ifdef __mips__
-#define rmb()		asm volatile(					\
+#define mb()		asm volatile(					\
 				".set	mips2\n\t"			\
 				"sync\n\t"				\
 				".set	mips0"				\
 				: /* no output */			\
 				: /* no input */			\
 				: "memory")
+#define wmb()	mb()
+#define rmb()	mb()
 #define cpu_relax()	asm volatile("" ::: "memory")
 #define CPUINFO_PROC	"cpu model"
 #endif
 
 #ifdef __arc__
+#define mb()		asm volatile("" ::: "memory")
+#define wmb()		asm volatile("" ::: "memory")
 #define rmb()		asm volatile("" ::: "memory")
 #define cpu_relax()	rmb()
 #define CPUINFO_PROC	"Processor"
 #endif
 
 #ifdef __metag__
+/* XXX no clue */
 #define rmb()		asm volatile("" ::: "memory")
 #define cpu_relax()	asm volatile("" ::: "memory")
 #define CPUINFO_PROC	"CPU"
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 6e8acc9abe38..8ab1b5ae4a0e 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -189,7 +189,7 @@ static inline void perf_mmap__write_tail(struct perf_mmap *md,
 	/*
 	 * ensure all reads are done before we write the tail out.
 	 */
-	/* mb(); */
+	mb();
 	pc->data_tail = tail;
 }
 

^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-30  9:27                 ` Paul E. McKenney
@ 2013-10-30 11:25                   ` Peter Zijlstra
  2013-10-30 14:52                     ` Victor Kaplansky
  2013-10-31  6:40                     ` Paul E. McKenney
  2013-10-30 13:28                   ` Victor Kaplansky
  1 sibling, 2 replies; 120+ messages in thread
From: Peter Zijlstra @ 2013-10-30 11:25 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On Wed, Oct 30, 2013 at 02:27:25AM -0700, Paul E. McKenney wrote:
> On Mon, Oct 28, 2013 at 10:58:58PM +0200, Victor Kaplansky wrote:
> > Oleg Nesterov <oleg@redhat.com> wrote on 10/28/2013 10:17:35 PM:
> > 
> > >       mb();   // XXXXXXXX: do we really need it? I think yes.
> > 
> > Oh, it is hard to argue with feelings. Also, it is easy to be on
> > conservative side and put the barrier here just in case.
> > But I still insist that the barrier is redundant in your example.
> 
> If you were to back up that insistence with a description of the orderings
> you are relying on, why other orderings are not important, and how the
> important orderings are enforced, I might be tempted to pay attention
> to your opinion.

OK, so let me try.. a slightly less convoluted version of the code in
kernel/events/ring_buffer.c coupled with a userspace consumer would look
something like the below.

One important detail is that the kbuf part and the kbuf_writer() are
strictly per cpu and we can thus rely on implicit ordering for those.

Only the userspace consumer can possibly run on another cpu, and thus we
need to ensure data consistency for those. 

struct buffer {
	u64 size;
	u64 tail;
	u64 head;
	void *data;
};

struct buffer *kbuf, *ubuf;

/*
 * Determine there's space in the buffer to store data at @offset to
 * @head without overwriting data at @tail.
 */
bool space(u64 tail, u64 offset, u64 head)
{
	offset = (offset - tail) % kbuf->size;
	head   = (head   - tail) % kbuf->size;

	return (s64)(head - offset) >= 0;
}

/*
 * If there's space in the buffer, store the data @buf; otherwise
 * discard it.
 */
void kbuf_write(int sz, void *buf)
{
	u64 tail = ACCESS_ONCE(ubuf->tail); /* last location userspace read */
	u64 offset = kbuf->head; /* we already know where we last wrote */
	u64 head = offset + sz;

	if (!space(tail, offset, head)) {
		/* discard @buf */
		return;
	}

	/*
	 * Ensure that if we see the userspace tail (ubuf->tail) such
	 * that there is space to write @buf without overwriting data
	 * userspace hasn't seen yet, we won't in fact store data before
	 * that read completes.
	 */

	smp_mb(); /* A, matches with D */

	write(kbuf->data + offset, buf, sz);
	kbuf->head = head % kbuf->size;

	/*
	 * Ensure that we write all the @buf data before we update the
	 * userspace visible ubuf->head pointer.
	 */
	smp_wmb(); /* B, matches with C */

	ubuf->head = kbuf->head;
}

/*
 * Consume the buffer data and update the tail pointer to indicate to
 * kernel space there's 'free' space.
 */
void ubuf_read(void)
{
	u64 head, tail;

	tail = ACCESS_ONCE(ubuf->tail);
	head = ACCESS_ONCE(ubuf->head);

	/*
	 * Ensure we read the buffer boundaries before the actual buffer
	 * data...
	 */
	smp_rmb(); /* C, matches with B */

	while (tail != head) {
		obj = ubuf->data + tail;
		/* process obj */
		tail += obj->size;
		tail %= ubuf->size;
	}

	/*
	 * Ensure all data reads are complete before we issue the
	 * ubuf->tail update; once that update hits, kbuf_write() can
	 * observe and overwrite data.
	 */
	smp_mb(); /* D, matches with A */

	ubuf->tail = tail;
}


Now the whole crux of the question is if we need barrier A at all, since
the STORES issued by the @buf writes are dependent on the ubuf->tail
read.

If the read shows no available space, we simply will not issue those
writes -- therefore we could argue we can avoid the memory barrier.

However, that leaves D unpaired and me confused. We must have D because
otherwise the CPU could reorder that write into the reads previous and
the kernel could start overwriting data we're still reading.. which
seems like a bad deal.

Also, I'm not entirely sure on C, that too seems like a dependency, we
simply cannot read the buffer @tail before we've read the tail itself,
now can we? Similarly we cannot compare tail to head without having the
head read completed.


Could we replace A and C with an smp_read_barrier_depends()?
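
As a side note on the space() helper above, a small self-contained sketch
(not from the mail; the size is passed explicitly here and assumed to be a
power of two, as the real ring buffer guarantees) showing that the
wrap-around arithmetic keeps working once the write position has wrapped
past the end of the buffer:

#include <assert.h>
#include <stdint.h>

static int space(uint64_t size, uint64_t tail, uint64_t offset, uint64_t head)
{
	offset = (offset - tail) % size;	/* bytes already queued         */
	head   = (head   - tail) % size;	/* bytes queued after the write */

	return (int64_t)(head - offset) >= 0;
}

int main(void)
{
	/* size 16, tail = 14, write position (offset) = 2:
	 * 4 bytes are still unread, so 12 bytes are free. */
	assert( space(16, 14, 2, 2 + 8));	/* an 8-byte record fits  */
	assert(!space(16, 14, 2, 2 + 13));	/* a 13-byte record would
						 * overwrite unread data  */
	return 0;
}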

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-30 10:42                       ` Peter Zijlstra
@ 2013-10-30 11:48                         ` James Hogan
  2013-10-30 12:48                           ` Peter Zijlstra
  2013-11-06 13:19                         ` [tip:perf/core] tools/perf: Add required memory barriers tip-bot for Peter Zijlstra
  1 sibling, 1 reply; 120+ messages in thread
From: James Hogan @ 2013-10-30 11:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vince Weaver, Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling,
	Paul E. McKenney, linux-metag

Hi Peter,

On 30/10/13 10:42, Peter Zijlstra wrote:
> Subject: perf, tool: Add required memory barriers
> 
> To match patch bf378d341e48 ("perf: Fix perf ring buffer memory
> ordering") change userspace to also adhere to the ordering outlined.
> 
> Most barrier implementations were gleaned from
> arch/*/include/asm/barrier.h and with the exception of metag I'm fairly
> sure they're correct.

Yeh...

Short answer:
For Meta you're probably best off assuming
CONFIG_METAG_SMP_WRITE_REORDERING=n and just using compiler barriers.

Long answer:
The issue with write reordering between Meta's hardware threads beyond
the cache is only with a particular SoC, and SMP is not used in
production on it.
It is possible to make the LINSYSEVENT_WR_COMBINE_FLUSH register
writable to userspace (it's in a non-mappable region already) but even
then the write to that register needs odd placement to be effective
(before the shmem write rather than after - which isn't a place any
existing barriers are guaranteed to be placed). I'm fairly confident we
get away with it in the kernel, and userland normally just uses linked
load/store instructions for atomicity which works fine.

Cheers
James


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-30 11:48                         ` James Hogan
@ 2013-10-30 12:48                           ` Peter Zijlstra
  0 siblings, 0 replies; 120+ messages in thread
From: Peter Zijlstra @ 2013-10-30 12:48 UTC (permalink / raw)
  To: James Hogan
  Cc: Vince Weaver, Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling,
	Paul E. McKenney, linux-metag

On Wed, Oct 30, 2013 at 11:48:44AM +0000, James Hogan wrote:
> Hi Peter,
> 
> On 30/10/13 10:42, Peter Zijlstra wrote:
> > Subject: perf, tool: Add required memory barriers
> > 
> > To match patch bf378d341e48 ("perf: Fix perf ring buffer memory
> > ordering") change userspace to also adhere to the ordering outlined.
> > 
> > Most barrier implementations were gleaned from
> > arch/*/include/asm/barrier.h and with the exception of metag I'm fairly
> > sure they're correct.
> 
> Yeh...
> 
> Short answer:
> For Meta you're probably best off assuming
> CONFIG_METAG_SMP_WRITE_REORDERING=n and just using compiler barriers.

Thanks, fixed it that way.

> Long answer:
> The issue with write reordering between Meta's hardware threads beyond
> the cache is only with a particular SoC, and SMP is not used in
> production on it.
> It is possible to make the LINSYSEVENT_WR_COMBINE_FLUSH register
> writable to userspace (it's in a non-mappable region already) but even
> then the write to that register needs odd placement to be effective
> (before the shmem write rather than after - which isn't a place any
> existing barriers are guaranteed to be placed). I'm fairly confident we
> get away with it in the kernel, and userland normally just uses linked
> load/store instructions for atomicity which works fine.

Urgh.. sounds like way 'fun' for you ;-)

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-30  9:27                 ` Paul E. McKenney
  2013-10-30 11:25                   ` Peter Zijlstra
@ 2013-10-30 13:28                   ` Victor Kaplansky
  2013-10-30 15:51                     ` Peter Zijlstra
  2013-10-31  4:32                     ` Paul E. McKenney
  1 sibling, 2 replies; 120+ messages in thread
From: Victor Kaplansky @ 2013-10-30 13:28 UTC (permalink / raw)
  To: paulmck
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Peter Zijlstra

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/30/2013
11:27:25 AM:

> If you were to back up that insistence with a description of the orderings
> you are relying on, why other orderings are not important, and how the
> important orderings are enforced, I might be tempted to pay attention
> to your opinion.
>
>                      Thanx, Paul

NP, though, I feel too embarrassed to explain things about memory barriers
when one of the authors of Documentation/memory-barriers.txt is on cc: list ;-)

Disclaimer: it is anyway impossible to prove lack of *any* problem.

Having said that, lets look into an example in
Documentation/circular-buffers.txt:

> THE PRODUCER
> ------------
>
> The producer will look something like this:
>
>       spin_lock(&producer_lock);
>
>       unsigned long head = buffer->head;
>       unsigned long tail = ACCESS_ONCE(buffer->tail);
>
>       if (CIRC_SPACE(head, tail, buffer->size) >= 1) {
>               /* insert one item into the buffer */
>               struct item *item = buffer[head];
>
>               produce_item(item);
>
>               smp_wmb(); /* commit the item before incrementing the head */
>
>               buffer->head = (head + 1) & (buffer->size - 1);
>
>               /* wake_up() will make sure that the head is committed before
>                * waking anyone up */
>               wake_up(consumer);
>       }
>
>       spin_unlock(&producer_lock);

We can see that the authors of the document didn't put any memory barrier
after the "buffer->tail" read and before "produce_item(item)", and I think
they have a good reason.

Let's consider an imaginary smp_mb() right before "produce_item(item);".
Such a barrier will ensure that -

    - the memory read of "buffer->tail" is completed before the store to
      the memory pointed to by "item" is committed.

However, the store to the memory pointed to by "item" cannot be completed
before the conditional branch implied by the "if ()" is proven to take the
body of the if(), and the latter cannot be proven before the read of
"buffer->tail" is completed.

Let's look at this another way.  Let's imagine that somehow a store to the
data pointed to by "item" is completed before we read "buffer->tail".  That
would mean that the store was completed speculatively.  But speculative
execution of conditional stores is prohibited by the C/C++ standard,
otherwise any conditional store at any random place in the code could
pollute shared memory.

On the other hand, if the compiler or processor can prove that the
condition in the above if() is going to be true (or if the speculative
store writes the same value as was there before the write), the speculative
store *is* allowed.  In this case we should not be bothered by the fact
that the memory pointed to by "item" is written before the read from
"buffer->tail" is completed.

Regards,
-- Victor


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-30 11:25                   ` Peter Zijlstra
@ 2013-10-30 14:52                     ` Victor Kaplansky
  2013-10-30 15:39                       ` Peter Zijlstra
  2013-10-31  6:16                       ` Paul E. McKenney
  2013-10-31  6:40                     ` Paul E. McKenney
  1 sibling, 2 replies; 120+ messages in thread
From: Victor Kaplansky @ 2013-10-30 14:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Paul E. McKenney

Peter Zijlstra <peterz@infradead.org> wrote on 10/30/2013 01:25:26 PM:

> Also, I'm not entirely sure on C, that too seems like a dependency, we
> simply cannot read the buffer @tail before we've read the tail itself,
> now can we? Similarly we cannot compare tail to head without having the
> head read completed.

No, this one we cannot omit, because our problem on consumer side is not
with @tail, which is written exclusively by consumer, but with @head.

BTW, it is why you also don't need ACCESS_ONCE() around @tail, but only
around the @head read.

-- Victor


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-30 14:52                     ` Victor Kaplansky
@ 2013-10-30 15:39                       ` Peter Zijlstra
  2013-10-30 17:14                         ` Victor Kaplansky
  2013-10-31  6:16                       ` Paul E. McKenney
  1 sibling, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-10-30 15:39 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Paul E. McKenney

On Wed, Oct 30, 2013 at 04:52:05PM +0200, Victor Kaplansky wrote:
> Peter Zijlstra <peterz@infradead.org> wrote on 10/30/2013 01:25:26 PM:
> 
> > Also, I'm not entirely sure on C, that too seems like a dependency, we
> > simply cannot read the buffer @tail before we've read the tail itself,
> > now can we? Similarly we cannot compare tail to head without having the
> > head read completed.
> 
> No, this one we cannot omit, because our problem on consumer side is not
> with @tail, which is written exclusively by consumer, but with @head.

Ah indeed, my argument was flawed in that @head is the important part.
But we still do a comparison of @tail against @head before we do further
reads.

Although I suppose speculative reads are allowed -- they don't have the
destructive behaviour speculative writes have -- and thus we could in
fact get reorder issues.

But since it is still a dependent load in that we do that @tail vs @head
comparison before doing other loads, wouldn't a read_barrier_depends()
be sufficient? Or do we still need a complete rmb?

> BTW, it is why you also don't need ACCESS_ONCE() around @tail, but only
> around the @head read.

Agreed, the ACCESS_ONCE() around tail is superfluous since we're the one
updating tail, so there's no problem with the value changing
unexpectedly.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-30 13:28                   ` Victor Kaplansky
@ 2013-10-30 15:51                     ` Peter Zijlstra
  2013-10-30 18:29                       ` Peter Zijlstra
  2013-10-31  4:33                       ` Paul E. McKenney
  2013-10-31  4:32                     ` Paul E. McKenney
  1 sibling, 2 replies; 120+ messages in thread
From: Peter Zijlstra @ 2013-10-30 15:51 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: paulmck, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Oleg Nesterov

On Wed, Oct 30, 2013 at 03:28:54PM +0200, Victor Kaplansky wrote:
> one of the authors of Documentation/memory-barriers.txt is on cc: list ;-)
> 
> Disclaimer: it is anyway impossible to prove lack of *any* problem.
> 
> Having said that, lets look into an example in
> Documentation/circular-buffers.txt:

> 
> We can see that authors of the document didn't put any memory barrier

Note that both documents have the same author list ;-)

Anyway, I didn't know about the circular thing, I suppose I should use
CIRC_SPACE() thing :-)
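
For readers following along, those helpers live in include/linux/circ_buf.h
and looked roughly like this at the time (head and tail are indices into a
power-of-two sized buffer):

/* Return count in buffer. */
#define CIRC_CNT(head, tail, size)	(((head) - (tail)) & ((size) - 1))

/* Return space available, 0..size-1.  One slot is always left free,
 * since a completely full buffer (head == tail) would otherwise be
 * indistinguishable from an empty one. */
#define CIRC_SPACE(head, tail, size)	CIRC_CNT((tail), ((head) + 1), (size))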

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-30 15:39                       ` Peter Zijlstra
@ 2013-10-30 17:14                         ` Victor Kaplansky
  2013-10-30 17:44                           ` Peter Zijlstra
  0 siblings, 1 reply; 120+ messages in thread
From: Victor Kaplansky @ 2013-10-30 17:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Paul E. McKenney

Peter Zijlstra <peterz@infradead.org> wrote on 10/30/2013 05:39:31 PM:

> Although I suppose speculative reads are allowed -- they don't have the
> destructive behaviour speculative writes have -- and thus we could in
> fact get reorder issues.

I agree.

>
> But since it is still a dependent load in that we do that @tail vs @head
> comparison before doing other loads, wouldn't a read_barrier_depends()
> be sufficient? Or do we still need a complete rmb?

We need a complete rmb() here IMO.  I think there is a fundamental
difference between loads and stores in this respect.  Loads are allowed to
be hoisted by the compiler or executed speculatively by the HW.  To prevent
the load "*(ubuf->data + tail)" from being hoisted beyond the "ubuf->head"
load you would need something like this:

void
ubuf_read(void)
{
        u64 head, tail;

        tail = ubuf->tail;
        head = ACCESS_ONCE(ubuf->head);

        /*
         * Ensure we read the buffer boundaries before the actual buffer
         * data...
         */

        while (tail != head) {
		    smp_read_barrier_depends();         /* for Alpha */
                obj = *(ubuf->data + head - 128);
                /* process obj */
                tail += obj->size;
                tail %= ubuf->size;
        }

        /*
         * Ensure all data reads are complete before we issue the
         * ubuf->tail update; once that update hits, kbuf_write() can
         * observe and overwrite data.
         */
        smp_mb();               /* D, matches with A */

        ubuf->tail = tail;
}

(note that "head" is part of address calculation of obj load now).

But, even in this demo example some "smp_read_barrier_depends()" before
"obj = *(ubuf->data + head - 100);" is required for architectures
like Alpha. Though, on more sane architectures "smp_read_barrier_depends()"
will be translated to nothing.
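
For reference, "translated to nothing" refers to definitions roughly along
these lines (as they stood at the time): the generic version is a no-op and
only Alpha has to emit a real barrier between dependent loads.

/* Most architectures: the data dependency alone orders the two loads. */
#define smp_read_barrier_depends()	do { } while (0)

/* DEC Alpha: smp_read_barrier_depends() expands to a real barrier
 * (arch/alpha/include/asm/barrier.h, roughly): */
#define read_barrier_depends()	__asm__ __volatile__("mb" : : : "memory")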


Regards,
-- Victor


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-30 17:14                         ` Victor Kaplansky
@ 2013-10-30 17:44                           ` Peter Zijlstra
  0 siblings, 0 replies; 120+ messages in thread
From: Peter Zijlstra @ 2013-10-30 17:44 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Paul E. McKenney

On Wed, Oct 30, 2013 at 07:14:29PM +0200, Victor Kaplansky wrote:
> We need a complete rmb() here IMO.  I think there is a fundamental
> difference between loads and stores in this respect.  Loads are allowed to
> be hoisted by the compiler or executed speculatively by the HW.  To prevent
> the load "*(ubuf->data + tail)" from being hoisted beyond the "ubuf->head"
> load you would need something like this:

Indeed, we could compute and load ->data + tail the moment we've
completed the tail load but before we've completed the head load and
done the comparison.

So yes, full rmb() it is.

> void
> ubuf_read(void)
> {
>         u64 head, tail;
> 
>         tail = ubuf->tail;
>         head = ACCESS_ONCE(ubuf->head);
> 
>         /*
>          * Ensure we read the buffer boundaries before the actual buffer
>          * data...
>          */
> 
>         while (tail != head) {
> 		    smp_read_barrier_depends();         /* for Alpha */
>                 obj = *(ubuf->data + head - 128);
>                 /* process obj */
>                 tail += obj->size;
>                 tail %= ubuf->size;
>         }
> 
>         /*
>          * Ensure all data reads are complete before we issue the
>          * ubuf->tail update; once that update hits, kbuf_write() can
>          * observe and overwrite data.
>          */
>         smp_mb();               /* D, matches with A */
> 
>         ubuf->tail = tail;
> }
> 
> (note that "head" is part of address calculation of obj load now).

Right, explicit dependent loads; I was hoping the conditional in between
might be enough, but as argued above it is not. The above cannot work in
our case though, we must use tail to find the obj since we have variable
size objects.

> But, even in this demo example some "smp_read_barrier_depends()" before
> "obj = *(ubuf->data + head - 100);" is required for architectures
> like Alpha. Though, on more sane architectures "smp_read_barrier_depends()"
> will be translated to nothing.

Sure.. I know all about that.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-30 15:51                     ` Peter Zijlstra
@ 2013-10-30 18:29                       ` Peter Zijlstra
  2013-10-30 19:11                         ` Peter Zijlstra
  2013-10-31  4:33                       ` Paul E. McKenney
  1 sibling, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-10-30 18:29 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: paulmck, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Oleg Nesterov

On Wed, Oct 30, 2013 at 04:51:16PM +0100, Peter Zijlstra wrote:
> On Wed, Oct 30, 2013 at 03:28:54PM +0200, Victor Kaplansky wrote:
> > one of the authors of Documentation/memory-barriers.txt is on cc: list ;-)
> > 
> > Disclaimer: it is anyway impossible to prove lack of *any* problem.
> > 
> > Having said that, lets look into an example in
> > Documentation/circular-buffers.txt:
> 
> > 
> > We can see that authors of the document didn't put any memory barrier
> 
> Note that both documents have the same author list ;-)
> 
> Anyway, I didn't know about the circular thing, I suppose I should use
> CIRC_SPACE() thing :-)

The below removes 80 bytes from ring_buffer.o, of which 50 bytes are from
perf_output_begin(); it also removes 30 lines of code, so yay!

(x86_64 build)

And it appears to still work.. although I've not stressed the no-space
bits.

---
 kernel/events/ring_buffer.c | 74 ++++++++++++++-------------------------------
 1 file changed, 22 insertions(+), 52 deletions(-)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 9c2ddfbf4525..e4a51fa10595 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -12,40 +12,10 @@
 #include <linux/perf_event.h>
 #include <linux/vmalloc.h>
 #include <linux/slab.h>
+#include <linux/circ_buf.h>
 
 #include "internal.h"
 
-static bool perf_output_space(struct ring_buffer *rb, unsigned long tail,
-			      unsigned long offset, unsigned long head)
-{
-	unsigned long sz = perf_data_size(rb);
-	unsigned long mask = sz - 1;
-
-	/*
-	 * check if user-writable
-	 * overwrite : over-write its own tail
-	 * !overwrite: buffer possibly drops events.
-	 */
-	if (rb->overwrite)
-		return true;
-
-	/*
-	 * verify that payload is not bigger than buffer
-	 * otherwise masking logic may fail to detect
-	 * the "not enough space" condition
-	 */
-	if ((head - offset) > sz)
-		return false;
-
-	offset = (offset - tail) & mask;
-	head   = (head   - tail) & mask;
-
-	if ((int)(head - offset) < 0)
-		return false;
-
-	return true;
-}
-
 static void perf_output_wakeup(struct perf_output_handle *handle)
 {
 	atomic_set(&handle->rb->poll, POLL_IN);
@@ -115,8 +85,7 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
 	rb->user_page->data_head = head;
 
 	/*
-	 * Now check if we missed an update, rely on the (compiler)
-	 * barrier in atomic_dec_and_test() to re-read rb->head.
+	 * Now check if we missed an update.
 	 */
 	if (unlikely(head != local_read(&rb->head))) {
 		local_inc(&rb->nest);
@@ -135,7 +104,7 @@ int perf_output_begin(struct perf_output_handle *handle,
 {
 	struct ring_buffer *rb;
 	unsigned long tail, offset, head;
-	int have_lost;
+	int have_lost, page_shift;
 	struct perf_sample_data sample_data;
 	struct {
 		struct perf_event_header header;
@@ -161,7 +130,7 @@ int perf_output_begin(struct perf_output_handle *handle,
 		goto out;
 
 	have_lost = local_read(&rb->lost);
-	if (have_lost) {
+	if (unlikely(have_lost)) {
 		lost_event.header.size = sizeof(lost_event);
 		perf_event_header__init_id(&lost_event.header, &sample_data,
 					   event);
@@ -171,32 +140,33 @@ int perf_output_begin(struct perf_output_handle *handle,
 	perf_output_get_handle(handle);
 
 	do {
-		/*
-		 * Userspace could choose to issue a mb() before updating the
-		 * tail pointer. So that all reads will be completed before the
-		 * write is issued.
-		 *
-		 * See perf_output_put_handle().
-		 */
 		tail = ACCESS_ONCE(rb->user_page->data_tail);
-		smp_mb();
 		offset = head = local_read(&rb->head);
-		head += size;
-		if (unlikely(!perf_output_space(rb, tail, offset, head)))
+		if (!rb->overwrite &&
+		    unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
 			goto fail;
+		head += size;
 	} while (local_cmpxchg(&rb->head, offset, head) != offset);
 
+	/*
+	 * Userspace SHOULD issue an MB before writing the tail; see
+	 * perf_output_put_handle().
+	 */
+	smp_mb();
+
 	if (head - local_read(&rb->wakeup) > rb->watermark)
 		local_add(rb->watermark, &rb->wakeup);
 
-	handle->page = offset >> (PAGE_SHIFT + page_order(rb));
-	handle->page &= rb->nr_pages - 1;
-	handle->size = offset & ((PAGE_SIZE << page_order(rb)) - 1);
-	handle->addr = rb->data_pages[handle->page];
-	handle->addr += handle->size;
-	handle->size = (PAGE_SIZE << page_order(rb)) - handle->size;
+	page_shift = PAGE_SHIFT + page_order(rb);
+
+	handle->page = (offset >> page_shift) & (rb->nr_pages - 1);
+
+	offset &= page_shift - 1;
+
+	handle->addr = rb->data_pages[handle->page] + offset;
+	handle->size = (1 << page_shift) - offset;
 
-	if (have_lost) {
+	if (unlikely(have_lost)) {
 		lost_event.header.type = PERF_RECORD_LOST;
 		lost_event.header.misc = 0;
 		lost_event.id          = event->id;

^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-30 18:29                       ` Peter Zijlstra
@ 2013-10-30 19:11                         ` Peter Zijlstra
  0 siblings, 0 replies; 120+ messages in thread
From: Peter Zijlstra @ 2013-10-30 19:11 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: paulmck, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Oleg Nesterov

On Wed, Oct 30, 2013 at 07:29:30PM +0100, Peter Zijlstra wrote:
> +	page_shift = PAGE_SHIFT + page_order(rb);
> +
> +	handle->page = (offset >> page_shift) & (rb->nr_pages - 1);
> +
> +	offset &= page_shift - 1;

offset &= (1UL << page_shift) - 1;

Weird that it even appeared to work.. /me wonders if he even booted the
right kernel.

> +
> +	handle->addr = rb->data_pages[handle->page] + offset;
> +	handle->size = (1 << page_shift) - offset;
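
To see what the corrected mask computes, a worked instance of the index math
above with made-up values (PAGE_SHIFT = 12, page_order(rb) = 0,
rb->nr_pages = 8, offset = 0x5678):

	page_shift   = 12 + 0                   = 12
	handle->page = (0x5678 >> 12) & 7       = 5
	offset      &= (1UL << 12) - 1  (0xfff) -> 0x678
	handle->addr = rb->data_pages[5] + 0x678
	handle->size = (1 << 12) - 0x678        = 0x988 bytes left in that page

With the original "offset &= page_shift - 1" the mask would have been
12 - 1 = 0xb, throwing away almost the entire in-page offset, hence the
correction above.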

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-30 13:28                   ` Victor Kaplansky
  2013-10-30 15:51                     ` Peter Zijlstra
@ 2013-10-31  4:32                     ` Paul E. McKenney
  2013-10-31  9:04                       ` Peter Zijlstra
  2013-10-31  9:59                       ` Victor Kaplansky
  1 sibling, 2 replies; 120+ messages in thread
From: Paul E. McKenney @ 2013-10-31  4:32 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Peter Zijlstra

On Wed, Oct 30, 2013 at 03:28:54PM +0200, Victor Kaplansky wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/30/2013
> 11:27:25 AM:
> 
> > If you were to back up that insistence with a description of the orderings
> > you are relying on, why other orderings are not important, and how the
> > important orderings are enforced, I might be tempted to pay attention
> > to your opinion.
> >
> >                      Thanx, Paul
> 
> NP, though, I feel too embarrassed to explain things about memory barriers
> when one of the authors of Documentation/memory-barriers.txt is on cc: list ;-)
> 
> Disclaimer: it is anyway impossible to prove lack of *any* problem.

If you want to play the "omit memory barriers" game, then proving a
negative is in fact the task before you.

> Having said that, lets look into an example in
> Documentation/circular-buffers.txt:

And the correctness of this code has been called into question.  :-(
An embarrassingly long time ago -- I need to get this either proven
or fixed.

> > THE PRODUCER
> > ------------
> >
> > The producer will look something like this:
> >
> >       spin_lock(&producer_lock);
> >
> >       unsigned long head = buffer->head;
> >       unsigned long tail = ACCESS_ONCE(buffer->tail);
> >
> >       if (CIRC_SPACE(head, tail, buffer->size) >= 1) {
> >               /* insert one item into the buffer */
> >               struct item *item = buffer[head];
> >
> >               produce_item(item);
> >
> >               smp_wmb(); /* commit the item before incrementing the head */
> >
> >               buffer->head = (head + 1) & (buffer->size - 1);
> >
> >               /* wake_up() will make sure that the head is committed before
> >                * waking anyone up */
> >               wake_up(consumer);
> >       }
> >
> >       spin_unlock(&producer_lock);
> 
> We can see that the authors of the document didn't put any memory barrier
> after the "buffer->tail" read and before "produce_item(item)", and I think
> they have a good reason.
> 
> Let's consider an imaginary smp_mb() right before "produce_item(item);".
> Such a barrier will ensure that -
> 
>     - the memory read of "buffer->tail" is completed before the store to
>       the memory pointed to by "item" is committed.
> 
> However, the store to the memory pointed to by "item" cannot be completed
> before the conditional branch implied by the "if ()" is proven to take the
> body of the if(), and the latter cannot be proven before the read of
> "buffer->tail" is completed.
> 
> Let's look at this another way.  Let's imagine that somehow a store to the
> data pointed to by "item" is completed before we read "buffer->tail".  That
> would mean that the store was completed speculatively.  But speculative
> execution of conditional stores is prohibited by the C/C++ standard,
> otherwise any conditional store at any random place in the code could
> pollute shared memory.

Before C/C++11, the closest thing to such a prohibition is use of
volatile, for example, ACCESS_ONCE().  Even in C/C++11, you have to
use atomics to get anything resembling this prohibition.

If you just use normal variables, the compiler is within its rights
to transform something like the following:

	if (a)
		b = 1;
	else
		b = 42;

Into:

	b = 42;
	if (a)
		b = 1;

Many other similar transformations are permitted.  Some are used to allow
vector instructions to be used -- the compiler can do a write with an
overly wide vector instruction, then clean up the clobbered variables
later, if it wishes.  Again, if the variables are not marked volatile,
or, in C/C++11, atomic.
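
As a small illustration of the volatile escape hatch mentioned above (a
sketch, not part of the original mail): marking the stores with
ACCESS_ONCE() pins them to the abstract machine's behaviour, so the
transformation shown is no longer permitted.

	if (a)
		ACCESS_ONCE(b) = 1;
	else
		ACCESS_ONCE(b) = 42;

	/* The abstract machine performs exactly one volatile store to b,
	 * so the compiler may not emit an unconditional b = 42 first. */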

> On the other hand, if the compiler or processor can prove that the
> condition in the above if() is going to be true (or if the speculative
> store writes the same value as was there before the write), the speculative
> store *is* allowed.  In this case we should not be bothered by the fact
> that the memory pointed to by "item" is written before the read from
> "buffer->tail" is completed.

The compilers don't always know as much as they might about the underlying
hardware's memory model.  Of course, if this code is architecture specific,
it can avoid DEC Alpha's fun and games, which could also violate your
assumptions in the above paragraph:

	http://www.openvms.compaq.com/wizard/wiz_2637.html

Anyway, proving or fixing the code in Documentation/circular-buffers.txt
has been on my list for too long, so I will take a closer look at it.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-30 15:51                     ` Peter Zijlstra
  2013-10-30 18:29                       ` Peter Zijlstra
@ 2013-10-31  4:33                       ` Paul E. McKenney
  1 sibling, 0 replies; 120+ messages in thread
From: Paul E. McKenney @ 2013-10-31  4:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Oleg Nesterov

On Wed, Oct 30, 2013 at 04:51:16PM +0100, Peter Zijlstra wrote:
> On Wed, Oct 30, 2013 at 03:28:54PM +0200, Victor Kaplansky wrote:
> > one of the authors of Documentation/memory-barriers.txt is on cc: list ;-)
> > 
> > Disclaimer: it is anyway impossible to prove lack of *any* problem.
> > 
> > Having said that, lets look into an example in
> > Documentation/circular-buffers.txt:
> 
> > 
> > We can see that authors of the document didn't put any memory barrier
> 
> Note that both documents have the same author list ;-)
> 
> Anyway, I didn't know about the circular thing, I suppose I should use
> CIRC_SPACE() thing :-)

Interesting that we didn't seem to supply a definition...  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-30 14:52                     ` Victor Kaplansky
  2013-10-30 15:39                       ` Peter Zijlstra
@ 2013-10-31  6:16                       ` Paul E. McKenney
  2013-11-01 13:12                         ` Victor Kaplansky
  1 sibling, 1 reply; 120+ messages in thread
From: Paul E. McKenney @ 2013-10-31  6:16 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: Peter Zijlstra, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Oleg Nesterov

On Wed, Oct 30, 2013 at 04:52:05PM +0200, Victor Kaplansky wrote:
> Peter Zijlstra <peterz@infradead.org> wrote on 10/30/2013 01:25:26 PM:
> 
> > Also, I'm not entirely sure on C, that too seems like a dependency, we
> > simply cannot read the buffer @tail before we've read the tail itself,
> > now can we? Similarly we cannot compare tail to head without having the
> > head read completed.
> 
> No, this one we cannot omit, because our problem on consumer side is not
> with @tail, which is written exclusively by consumer, but with @head.
> 
> BTW, it is why you also don't need ACCESS_ONCE() around @tail, but only
> around
> @head read.

If you omit the ACCESS_ONCE() calls around @tail, the compiler is within
its rights to combine adjacent operations and also to invent loads and
stores, for example, in cases of register pressure.  It is also within
its rights to do piece-at-a-time loads and stores, which might sound
unlikely, but which has actually happened when the compiler figures
out exactly what is to be stored at compile time, especially on hardware
that only allows small immediate values.

So the ACCESS_ONCE() calls are not optional, the current contents of
Documentation/circular-buffers.txt notwithstanding.
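
To illustrate with a userspace sketch (function names made up, ACCESS_ONCE()
as in include/linux/compiler.h):

	#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

	unsigned long read_tail_plain(unsigned long *tail)
	{
		return *tail;			/* may be reloaded later, fused
						 * with other loads, or even
						 * done piece-at-a-time       */
	}

	unsigned long read_tail_once(unsigned long *tail)
	{
		return ACCESS_ONCE(*tail);	/* exactly one full-width load */
	}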

							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-30 11:25                   ` Peter Zijlstra
  2013-10-30 14:52                     ` Victor Kaplansky
@ 2013-10-31  6:40                     ` Paul E. McKenney
  2013-11-01 14:25                       ` Victor Kaplansky
                                         ` (3 more replies)
  1 sibling, 4 replies; 120+ messages in thread
From: Paul E. McKenney @ 2013-10-31  6:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On Wed, Oct 30, 2013 at 12:25:26PM +0100, Peter Zijlstra wrote:
> On Wed, Oct 30, 2013 at 02:27:25AM -0700, Paul E. McKenney wrote:
> > On Mon, Oct 28, 2013 at 10:58:58PM +0200, Victor Kaplansky wrote:
> > > Oleg Nesterov <oleg@redhat.com> wrote on 10/28/2013 10:17:35 PM:
> > > 
> > > >       mb();   // XXXXXXXX: do we really need it? I think yes.
> > > 
> > > Oh, it is hard to argue with feelings. Also, it is easy to be on
> > > conservative side and put the barrier here just in case.
> > > But I still insist that the barrier is redundant in your example.
> > 
> > If you were to back up that insistence with a description of the orderings
> > you are relying on, why other orderings are not important, and how the
> > important orderings are enforced, I might be tempted to pay attention
> > to your opinion.
> 
> OK, so let me try.. a slightly less convoluted version of the code in
> kernel/events/ring_buffer.c coupled with a userspace consumer would look
> something like the below.
> 
> One important detail is that the kbuf part and the kbuf_writer() are
> strictly per cpu and we can thus rely on implicit ordering for those.
> 
> Only the userspace consumer can possibly run on another cpu, and thus we
> need to ensure data consistency for those. 
> 
> struct buffer {
> 	u64 size;
> 	u64 tail;
> 	u64 head;
> 	void *data;
> };
> 
> struct buffer *kbuf, *ubuf;
> 
> /*
>  * Determine there's space in the buffer to store data at @offset to
>  * @head without overwriting data at @tail.
>  */
> bool space(u64 tail, u64 offset, u64 head)
> {
> 	offset = (offset - tail) % kbuf->size;
> 	head   = (head   - tail) % kbuf->size;
> 
> 	return (s64)(head - offset) >= 0;
> }
> 
> /*
>  * If there's space in the buffer; store the data @buf; otherwise
>  * discard it.
>  */
> void kbuf_write(int sz, void *buf)
> {
> 	u64 tail = ACCESS_ONCE(ubuf->tail); /* last location userspace read */
> 	u64 offset = kbuf->head; /* we already know where we last wrote */
> 	u64 head = offset + sz;
> 
> 	if (!space(tail, offset, head)) {
> 		/* discard @buf */
> 		return;
> 	}
> 
> 	/*
> 	 * Ensure that if we see the userspace tail (ubuf->tail) such
> 	 * that there is space to write @buf without overwriting data
> 	 * userspace hasn't seen yet, we won't in fact store data before
> 	 * that read completes.
> 	 */
> 
> 	smp_mb(); /* A, matches with D */
> 
> 	write(kbuf->data + offset, buf, sz);
> 	kbuf->head = head % kbuf->size;
> 
> 	/*
> 	 * Ensure that we write all the @buf data before we update the
> 	 * userspace visible ubuf->head pointer.
> 	 */
> 	smp_wmb(); /* B, matches with C */
> 
> 	ubuf->head = kbuf->head;
> }
> 
> /*
>  * Consume the buffer data and update the tail pointer to indicate to
>  * kernel space there's 'free' space.
>  */
> void ubuf_read(void)
> {
> 	u64 head, tail;
> 
> 	tail = ACCESS_ONCE(ubuf->tail);
> 	head = ACCESS_ONCE(ubuf->head);
> 
> 	/*
> 	 * Ensure we read the buffer boundaries before the actual buffer
> 	 * data...
> 	 */
> 	smp_rmb(); /* C, matches with B */
> 
> 	while (tail != head) {
> 		obj = ubuf->data + tail;
> 		/* process obj */
> 		tail += obj->size;
> 		tail %= ubuf->size;
> 	}
> 
> 	/*
> 	 * Ensure all data reads are complete before we issue the
> 	 * ubuf->tail update; once that update hits, kbuf_write() can
> 	 * observe and overwrite data.
> 	 */
> 	smp_mb(); /* D, matches with A */
> 
> 	ubuf->tail = tail;
> }
> 
> 
> Now the whole crux of the question is if we need barrier A at all, since
> the STORES issued by the @buf writes are dependent on the ubuf->tail
> read.

The dependency you are talking about is via the "if" statement?
Even C/C++11 is not required to respect control dependencies.

This one is a bit annoying.  The x86 TSO means that you really only
need barrier(), ARM (recent ARM, anyway) and Power could use a weaker
barrier, and so on -- but smp_mb() emits a full barrier.

Perhaps a new smp_tmb() for TSO semantics, where reads are ordered
before reads, writes before writes, and reads before writes, but not
writes before reads?  Another approach would be to define a per-arch
barrier for this particular case.
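
Purely as a sketch of that second option -- no such primitive exists, and the
config symbols are only illustrative:

	#if defined(CONFIG_X86)
	#define smp_tmb()	barrier()	/* TSO already gives these orderings */
	#elif defined(CONFIG_PPC)
	#define smp_tmb()	__asm__ __volatile__ ("lwsync" : : : "memory")
	#else
	#define smp_tmb()	smp_mb()	/* safe fallback */
	#endif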

> If the read shows no available space, we simply will not issue those
> writes -- therefore we could argue we can avoid the memory barrier.

Proving that means iterating through the permitted combinations of
compilers and architectures...  There is always hand-coded assembly
language, I suppose.

> However, that leaves D unpaired and me confused. We must have D because
> otherwise the CPU could reorder that write into the reads previous and
> the kernel could start overwriting data we're still reading.. which
> seems like a bad deal.

Yep.  If you were hand-coding only for x86 and s390, D would pair with
the required barrier() asm.
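
For reference, that barrier() is just the compiler-only fence from
include/linux/compiler-gcc.h; it constrains gcc but emits no instructions:

	#define barrier() __asm__ __volatile__("": : :"memory")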

> Also, I'm not entirely sure on C, that too seems like a dependency, we
> simply cannot read the buffer @tail before we've read the tail itself,
> now can we? Similarly we cannot compare tail to head without having the
> head read completed.
> 
> Could we replace A and C with an smp_read_barrier_depends()?

C, yes, given that you have ACCESS_ONCE() on the fetch from ->tail
and that the value fetched from ->tail feeds into the address used for
the "obj =" assignment.  A, not so much -- again, compilers are not
required to respect control dependencies.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-31  4:32                     ` Paul E. McKenney
@ 2013-10-31  9:04                       ` Peter Zijlstra
  2013-10-31 15:07                         ` Paul E. McKenney
  2013-10-31  9:59                       ` Victor Kaplansky
  1 sibling, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-10-31  9:04 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Oleg Nesterov

On Wed, Oct 30, 2013 at 09:32:58PM -0700, Paul E. McKenney wrote:
> Before C/C++11, the closest thing to such a prohibition is use of
> volatile, for example, ACCESS_ONCE().  Even in C/C++11, you have to
> use atomics to get anything resembing this prohibition.
> 
> If you just use normal variables, the compiler is within its rights
> to transform something like the following:
> 
> 	if (a)
> 		b = 1;
> 	else
> 		b = 42;
> 
> Into:
> 
> 	b = 42;
> 	if (a)
> 		b = 1;
> 
> Many other similar transformations are permitted.  Some are used to all
> vector instructions to be used -- the compiler can do a write with an
> overly wide vector instruction, then clean up the clobbered variables
> later, if it wishes.  Again, if the variables are not marked volatile,
> or, in C/C++11, atomic.

While I've heard you tell this story before, my mind keeps boggling how
we've been able to use shared memory at all, all these years.

It seems to me stuff should have broken left, right and center if
compilers were really aggressive about this.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-31  4:32                     ` Paul E. McKenney
  2013-10-31  9:04                       ` Peter Zijlstra
@ 2013-10-31  9:59                       ` Victor Kaplansky
  2013-10-31 12:28                         ` David Laight
  2013-10-31 15:25                         ` Paul E. McKenney
  1 sibling, 2 replies; 120+ messages in thread
From: Victor Kaplansky @ 2013-10-31  9:59 UTC (permalink / raw)
  To: paulmck
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Peter Zijlstra

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/31/2013
06:32:58 AM:

> If you want to play the "omit memory barriers" game, then proving a
> negative is in fact the task before you.

Generally it is not fair. Otherwise, anyone could put an smp_mb() at a
random place, and expect others to "prove" that it is not needed.

It is also not fair because it should be virtually impossible to prove the lack
of any problem. OTOH, if a problem exists, it should be easy for proponents
of a memory barrier to build a test case or design a scenario demonstrating
the problem.

Actually, advocates of the memory barrier in our case do have an argument --
the rule of thumb saying that barriers should be paired. I consider this
rule only as a general recommendation to look into potentially risky
places. And indeed, in our case if the store to the circular buffer wasn't
conditional, it would require a memory barrier to prevent the store from being
performed before the read of @tail. But in our case the store is conditional,
so no memory barrier is required.

> And the correctness of this code has been called into question.  :-(
> An embarrassingly long time ago -- I need to get this either proven
> or fixed.

I agree.

> Before C/C++11, the closest thing to such a prohibition is use of
> volatile, for example, ACCESS_ONCE().  Even in C/C++11, you have to
> use atomics to get anything resembing this prohibition.
>
> If you just use normal variables, the compiler is within its rights
> to transform something like the following:
>
>    if (a)
>       b = 1;
>    else
>       b = 42;
>
> Into:
>
>    b = 42;
>    if (a)
>       b = 1;
>
> Many other similar transformations are permitted.  Some are used to all
> vector instructions to be used -- the compiler can do a write with an
> overly wide vector instruction, then clean up the clobbered variables
> later, if it wishes.  Again, if the variables are not marked volatile,
> or, in C/C++11, atomic.

All this can justify only a compiler barrier(), which is almost free from a
performance point of view, since current gcc doesn't perform the store-hoisting
optimization in our case anyway.

(And I'm not getting into a philosophical discussion of whether kernel code
should consider future possible bugs/features in gcc or the C/C++11
standard).


> The compilers don't always know as much as they might about the
> underlying hardware's memory model.

That's correct in general. But can you point out a problem that really
exists?

> Of course, if this code is architecture specific,
> it can avoid DEC Alpha's fun and games, which could also violate your
> assumptions in the above paragraph:
>
>    http://www.openvms.compaq.com/wizard/wiz_2637.html

Are you talking about this paragraph from the above link:

"For instance, your producer must issue a "memory barrier" instruction
  after writing the data to shared memory and before inserting it on
  the queue; likewise, your consumer must issue a memory barrier
  instruction after removing an item from the queue and before reading
  from its memory.  Otherwise, you risk seeing stale data, since, while
  the Alpha processor does provide coherent memory, it does not provide
  implicit ordering of reads and writes.  (That is, the write of the
  producer's data might reach memory after the write of the queue, such
  that the consumer might read the new item from the queue but get the
  previous values from the item's memory."

If yes, I don't think it explains the need for a memory barrier on Alpha
in our case (we all agree about the need for smp_wmb() right before the @head
update by the producer). If not, could you please point to the specific paragraph?

>
> Anyway, proving or fixing the code in Documentation/circular-buffers.txt
> has been on my list for too long, so I will take a closer look at it.

Thanks!

I'm more concerned about the performance overhead imposed by the full memory
barrier in kfifo circular buffers. Even if it is needed on Alpha (I don't
understand why), we could try to solve this with some memory barrier which
is effective only on architectures that really need it.

Regards,
-- Victor


^ permalink raw reply	[flat|nested] 120+ messages in thread

* RE: perf events ring buffer memory barrier on powerpc
  2013-10-31  9:59                       ` Victor Kaplansky
@ 2013-10-31 12:28                         ` David Laight
  2013-10-31 12:55                           ` Victor Kaplansky
  2013-10-31 15:25                         ` Paul E. McKenney
  1 sibling, 1 reply; 120+ messages in thread
From: David Laight @ 2013-10-31 12:28 UTC (permalink / raw)
  To: Victor Kaplansky, paulmck
  Cc: Michael Neuling, Mathieu Desnoyers, Peter Zijlstra, LKML,
	Oleg Nesterov, Linux PPC dev, Anton Blanchard,
	Frederic Weisbecker

> "For instance, your producer must issue a "memory barrier" instruction
>   after writing the data to shared memory and before inserting it on
>   the queue; likewise, your consumer must issue a memory barrier
>   instruction after removing an item from the queue and before reading
>   from its memory.  Otherwise, you risk seeing stale data, since, while
>   the Alpha processor does provide coherent memory, it does not provide
>   implicit ordering of reads and writes.  (That is, the write of the
>   producer's data might reach memory after the write of the queue, such
>   that the consumer might read the new item from the queue but get the
>   previous values from the item's memory."
> 
> If yes, I don't think it explains the need of memory barrier on Alpha
> in our case (we all agree about the need of smp_wmb() right before @head
> update by producer). If not, could you please point to specific paragraph?

My understanding is that the extra read barrier the alpha needs isn't to
control the order the cpu performs the memory cycles in, but rather to
wait while the cache system performs all outstanding operations.
So even though the wmb() in the writer ensures the writes are correctly
ordered, the reader can read the old value from the second location from
its local cache.
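
Spelling the pattern out (names are illustrative, the usual kernel primitives
are assumed):

	struct item { int val; };
	struct item *queue[16];

	/* producer, CPU 0 */
	void publish(struct item *item, unsigned long head)
	{
		item->val = 42;
		smp_wmb();			/* order payload before pointer */
		ACCESS_ONCE(queue[head]) = item;
	}

	/* consumer, CPU 1 */
	int consume(unsigned long tail)
	{
		struct item *item = ACCESS_ONCE(queue[tail]);

		smp_read_barrier_depends();	/* no-op except on Alpha, where
						 * it waits for the cache system
						 * so that item->val below is
						 * not stale                    */
		return item->val;
	}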

	David




^ permalink raw reply	[flat|nested] 120+ messages in thread

* RE: perf events ring buffer memory barrier on powerpc
  2013-10-31 12:28                         ` David Laight
@ 2013-10-31 12:55                           ` Victor Kaplansky
  0 siblings, 0 replies; 120+ messages in thread
From: Victor Kaplansky @ 2013-10-31 12:55 UTC (permalink / raw)
  To: David Laight
  Cc: Anton Blanchard, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Neuling, Oleg Nesterov, paulmck,
	Peter Zijlstra

"David Laight" <David.Laight@aculab.com> wrote on 10/31/2013 02:28:56 PM:

> So even though the wmb() in the writer ensures the writes are correctly
> ordered, the reader can read the old value from the second location from
> its local cache.

In the case of a circular buffer, the only thing the producer reads is @tail,
and nothing wrong will happen if the producer reads an old value of @tail.
Moreover, adherents of smp_mb() insert it *after* the read of @tail, so it
cannot prevent reading an old value anyway.
-- Victor


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-31  9:04                       ` Peter Zijlstra
@ 2013-10-31 15:07                         ` Paul E. McKenney
  2013-10-31 15:19                           ` Peter Zijlstra
  0 siblings, 1 reply; 120+ messages in thread
From: Paul E. McKenney @ 2013-10-31 15:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Oleg Nesterov

On Thu, Oct 31, 2013 at 10:04:57AM +0100, Peter Zijlstra wrote:
> On Wed, Oct 30, 2013 at 09:32:58PM -0700, Paul E. McKenney wrote:
> > Before C/C++11, the closest thing to such a prohibition is use of
> > volatile, for example, ACCESS_ONCE().  Even in C/C++11, you have to
> > use atomics to get anything resembing this prohibition.
> > 
> > If you just use normal variables, the compiler is within its rights
> > to transform something like the following:
> > 
> > 	if (a)
> > 		b = 1;
> > 	else
> > 		b = 42;
> > 
> > Into:
> > 
> > 	b = 42;
> > 	if (a)
> > 		b = 1;
> > 
> > Many other similar transformations are permitted.  Some are used to all
> > vector instructions to be used -- the compiler can do a write with an
> > overly wide vector instruction, then clean up the clobbered variables
> > later, if it wishes.  Again, if the variables are not marked volatile,
> > or, in C/C++11, atomic.
> 
> While I've heard you tell this story before, my mind keeps boggling how
> we've been able to use shared memory at all, all these years.
> 
> It seems to me stuff should have broken left, right and center if
> compilers were really aggressive about this.

Sometimes having stupid compilers is a good thing.  But they really are
getting more aggressive.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-31 15:07                         ` Paul E. McKenney
@ 2013-10-31 15:19                           ` Peter Zijlstra
  2013-11-01  9:28                             ` Paul E. McKenney
  0 siblings, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-10-31 15:19 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Oleg Nesterov

On Thu, Oct 31, 2013 at 08:07:56AM -0700, Paul E. McKenney wrote:
> On Thu, Oct 31, 2013 at 10:04:57AM +0100, Peter Zijlstra wrote:
> > On Wed, Oct 30, 2013 at 09:32:58PM -0700, Paul E. McKenney wrote:
> > > Before C/C++11, the closest thing to such a prohibition is use of
> > > volatile, for example, ACCESS_ONCE().  Even in C/C++11, you have to
> > > use atomics to get anything resembing this prohibition.
> > > 
> > > If you just use normal variables, the compiler is within its rights
> > > to transform something like the following:
> > > 
> > > 	if (a)
> > > 		b = 1;
> > > 	else
> > > 		b = 42;
> > > 
> > > Into:
> > > 
> > > 	b = 42;
> > > 	if (a)
> > > 		b = 1;
> > > 
> > > Many other similar transformations are permitted.  Some are used to all
> > > vector instructions to be used -- the compiler can do a write with an
> > > overly wide vector instruction, then clean up the clobbered variables
> > > later, if it wishes.  Again, if the variables are not marked volatile,
> > > or, in C/C++11, atomic.
> > 
> > While I've heard you tell this story before, my mind keeps boggling how
> > we've been able to use shared memory at all, all these years.
> > 
> > It seems to me stuff should have broken left, right and center if
> > compilers were really aggressive about this.
> 
> Sometimes having stupid compilers is a good thing.  But they really are
> getting more aggressive.

But surely we cannot go mark all data structures lodged in shared memory
as volatile, that's insane.

I'm sure you're quite worried about this as well. Suppose we have:

struct foo {
	unsigned long value;
	void *ptr;
	unsigned long value1;
};

And our ptr member is RCU managed. Then while the assignment using
rcu_assign_pointer() will use the volatile cast, what stops the compiler
from wrecking ptr while writing either of the value* members and
'fixing' it up after?

This is a completely untenable position.

How do the C/C++ people propose to deal with this?

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-31  9:59                       ` Victor Kaplansky
  2013-10-31 12:28                         ` David Laight
@ 2013-10-31 15:25                         ` Paul E. McKenney
  2013-11-01 16:06                           ` Victor Kaplansky
  1 sibling, 1 reply; 120+ messages in thread
From: Paul E. McKenney @ 2013-10-31 15:25 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Peter Zijlstra

On Thu, Oct 31, 2013 at 11:59:21AM +0200, Victor Kaplansky wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/31/2013
> 06:32:58 AM:
> 
> > If you want to play the "omit memory barriers" game, then proving a
> > negative is in fact the task before you.
> 
> Generally it is not fair. Otherwise, anyone could put an smp_mb() at a
> random place, and expect others to "prove" that it is not needed.
> 
> It is not fair also because it should be virtually impossible to prove lack
> of any problem. OTH, if a problem exists, it should be easy for proponents
> of a memory barrier to build a test case or design a scenario demonstrating
> the problem.

I really don't care about "fair" -- I care instead about the kernel
working reliably.

And it should also be easy for proponents of removing memory barriers to
clearly articulate what orderings their code does and does not need.

> Actually, advocates of the memory barrier in our case do have an argument -
> - the rule of thumb saying that barriers should be paired. I consider this
> rule only as a general recommendation to look into potentially risky
> places.
> And indeed, in our case if the store to circular wasn't conditional, it
> would require a memory barrier to prevent the store to be performed before
> the read of @tail. But in our case the store is conditional, so no memory
> barrier is required.

You are assuming control dependencies that the C language does not
provide.  Now, for all I know right now, there might well be some other
reason why a full barrier is not required, but the "if" statement cannot
be that reason.

Please review section 1.10 of the C++11 standard (or the corresponding
section of the C11 standard, if you prefer).  The point is that the
C/C++11 covers only data dependencies, not control dependencies.

> > And the correctness of this code has been called into question.  :-(
> > An embarrassingly long time ago -- I need to get this either proven
> > or fixed.
> 
> I agree.

Glad we agree on something!

> > Before C/C++11, the closest thing to such a prohibition is use of
> > volatile, for example, ACCESS_ONCE().  Even in C/C++11, you have to
> > use atomics to get anything resembing this prohibition.
> >
> > If you just use normal variables, the compiler is within its rights
> > to transform something like the following:
> >
> >    if (a)
> >       b = 1;
> >    else
> >       b = 42;
> >
> > Into:
> >
> >    b = 42;
> >    if (a)
> >       b = 1;
> >
> > Many other similar transformations are permitted.  Some are used to all
> > vector instructions to be used -- the compiler can do a write with an
> > overly wide vector instruction, then clean up the clobbered variables
> > later, if it wishes.  Again, if the variables are not marked volatile,
> > or, in C/C++11, atomic.
> 
> All this can justify only compiler barrier() which is almost free from
> performance point of view, since current gcc anyway doesn't perform store
> hoisting optimization in our case.

If the above example doesn't get you to give up your incorrect assumption
about "if" statements having much effect on ordering, you need more help
than I can give you just now.

> (And I'm not getting into philosophical discussion whether kernel code
> should consider future possible bugs/features in gcc or C/C++11
> standard).

Should you wish to get into that discussion in the near future, you
will need to find someone else to discuss it with.

> > The compilers don't always know as much as they might about the
> underlying
> > hardware's memory model.
> 
> That's correct in general. But can you point out a problem that really
> exists?

We will see.

In the meantime, can you summarize the ordering requirements of your
code?

> > Of course, if this code is architecture specific,
> > it can avoid DEC Alpha's fun and games, which could also violate your
> > assumptions in the above paragraph:
> >
> >    http://www.openvms.compaq.com/wizard/wiz_2637.html
> 
> Are you talking about this paragraph from above link:
> 
> "For instance, your producer must issue a "memory barrier" instruction
>   after writing the data to shared memory and before inserting it on
>   the queue; likewise, your consumer must issue a memory barrier
>   instruction after removing an item from the queue and before reading
>   from its memory.  Otherwise, you risk seeing stale data, since, while
>   the Alpha processor does provide coherent memory, it does not provide
>   implicit ordering of reads and writes.  (That is, the write of the
>   producer's data might reach memory after the write of the queue, such
>   that the consumer might read the new item from the queue but get the
>   previous values from the item's memory."
> 
> If yes, I don't think it explains the need of memory barrier on Alpha
> in our case (we all agree about the need of smp_wmb() right before @head
> update by producer). If not, could you please point to specific paragraph?

Did you miss the following passage in the paragraph you quoted?

	"... likewise, your consumer must issue a memory barrier
	instruction after removing an item from the queue and before
	reading from its memory."

That is why DEC Alpha readers need a read-side memory barrier -- it says
so right there.  And as either you or Peter noted earlier in this thread,
this barrier can be supplied by smp_read_barrier_depends().

I can sympathize if you are having trouble believing this.  After all,
it took the DEC Alpha architects a full hour to convince me, and that was
in a face-to-face meeting instead of over email.  (Just for the record,
it took me even longer to convince them that their earlier documentation
did not clearly indicate the need for these read-side barriers.)  But
regardless of whether or not I sympathize, DEC Alpha is what it is.

> > Anyway, proving or fixing the code in Documentation/circular-buffers.txt
> > has been on my list for too long, so I will take a closer look at it.
> 
> Thanks!
> 
> I'm concerned more about performance overhead imposed by the full memory
> barrier in kfifo circular buffers. Even if it is needed on Alpha (I don't
> understand why) we could try to solve this with some memory barrier which
> is effective only on architectures which really need it.

By exactly how much does the memory barrier slow your code down on some
example system?  (Yes, I can believe that it is a problem, but is it
really a problem in your exact situation?)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-31 15:19                           ` Peter Zijlstra
@ 2013-11-01  9:28                             ` Paul E. McKenney
  2013-11-01 10:30                               ` Peter Zijlstra
  0 siblings, 1 reply; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-01  9:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Oleg Nesterov

On Thu, Oct 31, 2013 at 04:19:55PM +0100, Peter Zijlstra wrote:
> On Thu, Oct 31, 2013 at 08:07:56AM -0700, Paul E. McKenney wrote:
> > On Thu, Oct 31, 2013 at 10:04:57AM +0100, Peter Zijlstra wrote:
> > > On Wed, Oct 30, 2013 at 09:32:58PM -0700, Paul E. McKenney wrote:
> > > > Before C/C++11, the closest thing to such a prohibition is use of
> > > > volatile, for example, ACCESS_ONCE().  Even in C/C++11, you have to
> > > > use atomics to get anything resembing this prohibition.
> > > > 
> > > > If you just use normal variables, the compiler is within its rights
> > > > to transform something like the following:
> > > > 
> > > > 	if (a)
> > > > 		b = 1;
> > > > 	else
> > > > 		b = 42;
> > > > 
> > > > Into:
> > > > 
> > > > 	b = 42;
> > > > 	if (a)
> > > > 		b = 1;
> > > > 
> > > > Many other similar transformations are permitted.  Some are used to all
> > > > vector instructions to be used -- the compiler can do a write with an
> > > > overly wide vector instruction, then clean up the clobbered variables
> > > > later, if it wishes.  Again, if the variables are not marked volatile,
> > > > or, in C/C++11, atomic.
> > > 
> > > While I've heard you tell this story before, my mind keeps boggling how
> > > we've been able to use shared memory at all, all these years.
> > > 
> > > It seems to me stuff should have broken left, right and center if
> > > compilers were really aggressive about this.
> > 
> > Sometimes having stupid compilers is a good thing.  But they really are
> > getting more aggressive.
> 
> But surely we cannot go mark all data structures lodged in shared memory
> as volatile, that's insane.
> 
> I'm sure you're quite worried about this as well. Suppose we have:
> 
> struct foo {
> 	unsigned long value;
> 	void *ptr;
> 	unsigned long value1;
> };
> 
> And our ptr member is RCU managed. Then while the assignment using:
> rcu_assign_ptr() will use the volatile cast, what stops the compiler
> from wrecking ptr while writing either of the value* members and
> 'fixing' her up after?

Nothing at all!

We can reduce the probability by putting the pointer at one end or the
other, so that if the compiler uses (say) vector instructions to aggregate
individual assignments to the other fields, it will be less likely to hit
"ptr".  But yes, this is ugly and it would be really hard to get all
this right, and would often conflict with cache-locality needs.

> This is a completely untenable position.

Indeed it is!

C/C++ never was intended to be used for parallel programming, and this is
but one of the problems that can arise when we nevertheless use it for
parallel programming.  As compilers get smarter (for some definition of
"smarter") and as more systems have special-purpose hardware (such as
vector units) that are visible to the compiler, we can expect more of
this kind of trouble.

This was one of many reasons that I decided to help with the C/C++11
effort, whatever anyone might think about the results.

> How do the C/C++ people propose to deal with this?

By marking "ptr" as atomic, thus telling the compiler not to mess with it.
And thus requiring that all accesses to it be decorated, which in the
case of RCU could be buried in the RCU accessors.
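
As a rough C11-flavoured sketch of what "decorated" means (illustrative only;
the kernel does not use <stdatomic.h>, and these function names are made up):

	#include <stdatomic.h>

	struct foo {
		unsigned long value;
		_Atomic(void *) ptr;		/* compiler may not invent
						 * accesses to this member */
		unsigned long value1;
	};

	/* rcu_assign_pointer()-like publish */
	void set_ptr(struct foo *f, void *p)
	{
		atomic_store_explicit(&f->ptr, p, memory_order_release);
	}

	/* rcu_dereference()-like read */
	void *get_ptr(struct foo *f)
	{
		return atomic_load_explicit(&f->ptr, memory_order_consume);
	}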

							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-11-01  9:28                             ` Paul E. McKenney
@ 2013-11-01 10:30                               ` Peter Zijlstra
  2013-11-02 15:20                                 ` Paul E. McKenney
  0 siblings, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-01 10:30 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Oleg Nesterov

On Fri, Nov 01, 2013 at 02:28:14AM -0700, Paul E. McKenney wrote:
> > This is a completely untenable position.
> 
> Indeed it is!
> 
> C/C++ never was intended to be used for parallel programming, 

And yet pretty much all kernels ever written for SMP systems are written
in it; what drugs are those people smoking?

Furthermore there's a gazillion parallel userspace programs.

> and this is
> but one of the problems that can arise when we nevertheless use it for
> parallel programming.  As compilers get smarter (for some definition of
> "smarter") and as more systems have special-purpose hardware (such as
> vector units) that are visible to the compiler, we can expect more of
> this kind of trouble.
> 
> This was one of many reasons that I decided to help with the C/C++11
> effort, whatever anyone might think about the results.

Well, I applaud your efforts, but given the results I think the C/C++
people are all completely insane.

> > How do the C/C++ people propose to deal with this?
> 
> By marking "ptr" as atomic, thus telling the compiler not to mess with it.
> And thus requiring that all accesses to it be decorated, which in the
> case of RCU could be buried in the RCU accessors.

This seems contradictory; marking it atomic would look like:

struct foo {
	unsigned long value;
	__atomic void *ptr;
	unsigned long value1;
};

Clearly we cannot hide this definition in accessors, because then
accesses to value* won't see the annotation.

That said; mandating we mark all 'shared' data with __atomic is
completely untenable and is not backwards compatible.

To be safe we must assume all data shared unless indicated otherwise.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-31  6:16                       ` Paul E. McKenney
@ 2013-11-01 13:12                         ` Victor Kaplansky
  2013-11-02 16:36                           ` Paul E. McKenney
  0 siblings, 1 reply; 120+ messages in thread
From: Victor Kaplansky @ 2013-11-01 13:12 UTC (permalink / raw)
  To: paulmck
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Peter Zijlstra

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/31/2013
08:16:02 AM:

> > BTW, it is why you also don't need ACCESS_ONCE() around @tail, but only
> > around
> > @head read.

Just to be sure that we are talking about the same code - I was considering
the ACCESS_ONCE() around @tail at point AAA in the following example from
Documentation/circular-buffers.txt for the CONSUMER:

        unsigned long head = ACCESS_ONCE(buffer->head);
        unsigned long tail = buffer->tail;      /* AAA */

        if (CIRC_CNT(head, tail, buffer->size) >= 1) {
                /* read index before reading contents at that index */
                smp_read_barrier_depends();

                /* extract one item from the buffer */
                struct item *item = buffer[tail];

                consume_item(item);

                smp_mb(); /* finish reading descriptor before incrementing
tail */

                buffer->tail = (tail + 1) & (buffer->size - 1); /* BBB */
        }

>
> If you omit the ACCESS_ONCE() calls around @tail, the compiler is within
> its rights to combine adjacent operations and also to invent loads and
> stores, for example, in cases of register pressure.

Right. And I was completely aware of these possible transformations when I
said that the ACCESS_ONCE() around @tail at point AAA is redundant. Moved, or
even completely eliminated, reads of @tail in the consumer code are not a
problem at all, since @tail is written exclusively by the CONSUMER side.


> It is also within
> its rights to do piece-at-a-time loads and stores, which might sound
> unlikely, but which can actually has happened when the compiler figures
> out exactly what is to be stored at compile time, especially on hardware
> that only allows small immediate values.

As for writes to @tail, the ACCESS_ONCE() around @tail at point AAA
doesn't in any way prevent an imaginary super-optimizing compiler
from moving around the store to @tail (which appears in the code at
point BBB).

That is why the ACCESS_ONCE() at point AAA is completely redundant.

-- Victor


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-31  6:40                     ` Paul E. McKenney
@ 2013-11-01 14:25                       ` Victor Kaplansky
  2013-11-02 17:28                         ` Paul E. McKenney
  2013-11-01 14:56                       ` Peter Zijlstra
                                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 120+ messages in thread
From: Victor Kaplansky @ 2013-11-01 14:25 UTC (permalink / raw)
  To: paulmck
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Peter Zijlstra

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/31/2013
08:40:15 AM:

> > void ubuf_read(void)
> > {
> >    u64 head, tail;
> >
> >    tail = ACCESS_ONCE(ubuf->tail);
> >    head = ACCESS_ONCE(ubuf->head);
> >
> >    /*
> >     * Ensure we read the buffer boundaries before the actual buffer
> >     * data...
> >     */
> >    smp_rmb(); /* C, matches with B */
> >
> >    while (tail != head) {
> >       obj = ubuf->data + tail;
> >       /* process obj */
> >       tail += obj->size;
> >       tail %= ubuf->size;
> >    }
> >
> >    /*
> >     * Ensure all data reads are complete before we issue the
> >     * ubuf->tail update; once that update hits, kbuf_write() can
> >     * observe and overwrite data.
> >     */
> >    smp_mb(); /* D, matches with A */
> >
> >    ubuf->tail = tail;
> > }

> > Could we replace A and C with an smp_read_barrier_depends()?
>
> C, yes, given that you have ACCESS_ONCE() on the fetch from ->tail
> and that the value fetch from ->tail feeds into the address used for
> the "obj =" assignment.

No! You must have a full smp_rmb() at C. The race on the reader side
is not between the fetch of @tail and the read from the address pointed to
by @tail. The real race here is between the fetch of @head and the read of
obj from the memory pointed to by @tail.

Regards,
-- Victor


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-31  6:40                     ` Paul E. McKenney
  2013-11-01 14:25                       ` Victor Kaplansky
@ 2013-11-01 14:56                       ` Peter Zijlstra
  2013-11-02 17:32                         ` Paul E. McKenney
  2013-11-01 16:11                       ` Peter Zijlstra
  2013-11-01 16:18                       ` Peter Zijlstra
  3 siblings, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-01 14:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote:
> > Now the whole crux of the question is if we need barrier A at all, since
> > the STORES issued by the @buf writes are dependent on the ubuf->tail
> > read.
> 
> The dependency you are talking about is via the "if" statement?
> Even C/C++11 is not required to respect control dependencies.
> 
> This one is a bit annoying.  The x86 TSO means that you really only
> need barrier(), ARM (recent ARM, anyway) and Power could use a weaker
> barrier, and so on -- but smp_mb() emits a full barrier.
> 
> Perhaps a new smp_tmb() for TSO semantics, where reads are ordered
> before reads, writes before writes, and reads before writes, but not
> writes before reads?  Another approach would be to define a per-arch
> barrier for this particular case.

I suppose we can only introduce new barrier primitives if there's more
than 1 use-case.

> > If the read shows no available space, we simply will not issue those
> > writes -- therefore we could argue we can avoid the memory barrier.
> 
> Proving that means iterating through the permitted combinations of
> compilers and architectures...  There is always hand-coded assembly
> language, I suppose.

I'm starting to think that while the C/C++ language spec says they can
wreck the world by doing these silly optimizations, real-world users will
push back against breaking their existing code.

I'm fairly sure the GCC people _will_ get shouted at _loudly_ when they
break the kernel by doing crazy shit like that.

Given it's near impossible to write a correct program in C/C++ and
tagging the entire kernel with __atomic is equally not going to happen,
I think we must find a practical solution.

Either that, or we really need to consider forking the language and
compiler :-(

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-31 15:25                         ` Paul E. McKenney
@ 2013-11-01 16:06                           ` Victor Kaplansky
  2013-11-01 16:25                             ` David Laight
  2013-11-02 15:46                             ` Paul E. McKenney
  0 siblings, 2 replies; 120+ messages in thread
From: Victor Kaplansky @ 2013-11-01 16:06 UTC (permalink / raw)
  To: paulmck
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Peter Zijlstra

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/31/2013
05:25:43 PM:

> I really don't care about "fair" -- I care instead about the kernel
> working reliably.

Though I don't see how putting in a memory barrier without a deep
understanding of why it is needed helps kernel reliability, I do agree that
reliability is more important than performance.

> And it should also be easy for proponents of removing memory barriers to
> clearly articulate what orderings their code does and does not need.

I intentionally took a simplified example of a circular buffer from
Documentation/circular-buffers.txt. I think both sides agree about the
memory ordering requirements in the example. At least I didn't see anyone
argue about them.

> You are assuming control dependencies that the C language does not
> provide.  Now, for all I know right now, there might well be some other
> reason why a full barrier is not required, but the "if" statement cannot
> be that reason.
>
> Please review section 1.10 of the C++11 standard (or the corresponding
> section of the C11 standard, if you prefer).  The point is that the
> C/C++11 covers only data dependencies, not control dependencies.

I feel you made a wrong assumption about my expertise in compilers. I don't
need to reread section 1.10 of the C++11 standard, because I do agree that
potentially a compiler can break the code in our case. And I do agree that
a compiler barrier() or some other means (including a change of the standard)
may be required in the future to prevent a compiler from moving memory accesses
around.

But a "broken" compiler is a much wider issue than can be deeply discussed in
this thread. I'm pretty sure that the kernel has tons of very subtle
code that actually creates locks and memory ordering. Such code
usually just uses the "barrier()" approach to tell gcc not to combine
or move memory accesses around it.

Let's just agree for the sake of this memory barrier discussion that we
*do* need a compiler barrier to tell gcc not to combine or move memory
accesses around it.

> Glad we agree on something!

I'm glad too!

> Did you miss the following passage in the paragraph you quoted?
>
>    "... likewise, your consumer must issue a memory barrier
>    instruction after removing an item from the queue and before
>    reading from its memory."
>
> That is why DEC Alpha readers need a read-side memory barrier -- it says
> so right there.  And as either you or Peter noted earlier in this thread,
> this barrier can be supplied by smp_read_barrier_depends().

I did not miss that passage. That passage explains why a consumer on an Alpha
processor, after reading @head, is required to execute an additional
smp_read_barrier_depends() before it can *read* from the memory pointed to by
@tail. And I think that I understand why - because the reader has to wait
till the local caches are fully updated, and only then can it read data from
the data buffer.

But on the producer side, after we read @tail, we don't need to wait for the
update of local caches before we start *writing* data to the buffer, since
the producer is the only one who writes data there!

>
> I can sympathize if you are having trouble believing this.  After all,
> it took the DEC Alpha architects a full hour to convince me, and that was
> in a face-to-face meeting instead of over email.  (Just for the record,
> it took me even longer to convince them that their earlier documentation
> did not clearly indicate the need for these read-side barriers.)  But
> regardless of whether or not I sympathize, DEC Alpha is what it is.

Again, I do understand the quirkiness of DEC Alpha, and I still think that
there is no need for a *full* memory barrier on the producer side - the one
before writing data to the buffer, which you've put in the kfifo
implementation.

Regard,
-- Victor


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-31  6:40                     ` Paul E. McKenney
  2013-11-01 14:25                       ` Victor Kaplansky
  2013-11-01 14:56                       ` Peter Zijlstra
@ 2013-11-01 16:11                       ` Peter Zijlstra
  2013-11-02 17:46                         ` Paul E. McKenney
  2013-11-01 16:18                       ` Peter Zijlstra
  3 siblings, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-01 16:11 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote:
> > void kbuf_write(int sz, void *buf)
> > {
> > 	u64 tail = ACCESS_ONCE(ubuf->tail); /* last location userspace read */
> > 	u64 offset = kbuf->head; /* we already know where we last wrote */
> > 	u64 head = offset + sz;
> > 
> > 	if (!space(tail, offset, head)) {
> > 		/* discard @buf */
> > 		return;
> > 	}
> > 
> > 	/*
> > 	 * Ensure that if we see the userspace tail (ubuf->tail) such
> > 	 * that there is space to write @buf without overwriting data
> > 	 * userspace hasn't seen yet, we won't in fact store data before
> > 	 * that read completes.
> > 	 */
> > 
> > 	smp_mb(); /* A, matches with D */
> > 
> > 	write(kbuf->data + offset, buf, sz);
> > 	kbuf->head = head % kbuf->size;
> > 
> > 	/*
> > 	 * Ensure that we write all the @buf data before we update the
> > 	 * userspace visible ubuf->head pointer.
> > 	 */
> > 	smp_wmb(); /* B, matches with C */
> > 
> > 	ubuf->head = kbuf->head;
> > }

> > Now the whole crux of the question is if we need barrier A at all, since
> > the STORES issued by the @buf writes are dependent on the ubuf->tail
> > read.
> 
> The dependency you are talking about is via the "if" statement?
> Even C/C++11 is not required to respect control dependencies.

But surely we must be able to make it so; otherwise you'd never be able
to write:

void *ptr = obj1;

void foo(void)
{

	/* create obj2, obj3 */

	smp_wmb(); /* ensure the objs are complete */

	/* expose either obj2 or obj3 */
	if (x)
		ptr = obj2;
	else
		ptr = obj3;


	/* free the unused one */
	if (x)
		free(obj3);
	else
		free(obj2);
}

Earlier you said that 'volatile' or '__atomic' avoids speculative
writes; so would:

void * volatile ptr = obj1;

Make the compiler respect control dependencies again? If so, could we
somehow mark that !space() condition volatile?

Currently the above would be considered a valid pattern. But you're
saying it's not, because the compiler is free to expose both obj2 and obj3
(for however short a time) and thus the free of the 'unused' object is
incorrect and can cause use-after-free.

In fact, how can we be sure that:

void *ptr = NULL;

void bar(void)
{
	void *obj = malloc(...);

	/* fill obj */

	if (!err)
		rcu_assign_pointer(ptr, obj);
	else
		free(obj);
}

Does not get 'optimized' into:

void bar(void)
{
	void *obj = malloc(...);
	void *old_ptr = ptr;

	/* fill obj */

	rcu_assign_pointer(ptr, obj);
	if (err) { /* because runtime profile data says this is unlikely */
		ptr = old_ptr;
		free(obj);
	}
}

We _MUST_ be able to rely on control flow, otherwise we might as well
all go back to writing kernels in asm.


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-10-31  6:40                     ` Paul E. McKenney
                                         ` (2 preceding siblings ...)
  2013-11-01 16:11                       ` Peter Zijlstra
@ 2013-11-01 16:18                       ` Peter Zijlstra
  2013-11-02 17:49                         ` Paul E. McKenney
  3 siblings, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-01 16:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote:
> The dependency you are talking about is via the "if" statement?
> Even C/C++11 is not required to respect control dependencies.
> 
> This one is a bit annoying.  The x86 TSO means that you really only
> need barrier(), ARM (recent ARM, anyway) and Power could use a weaker
> barrier, and so on -- but smp_mb() emits a full barrier.
> 
> Perhaps a new smp_tmb() for TSO semantics, where reads are ordered
> before reads, writes before writes, and reads before writes, but not
> writes before reads?  Another approach would be to define a per-arch
> barrier for this particular case.

Supposing a sane language where we can rely on control flow; would that
change the story?

I'm afraid I'm now terminally confused between actual proper memory
model issues and fucked compilers.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* RE: perf events ring buffer memory barrier on powerpc
  2013-11-01 16:06                           ` Victor Kaplansky
@ 2013-11-01 16:25                             ` David Laight
  2013-11-01 16:30                               ` Victor Kaplansky
  2013-11-02 15:46                             ` Paul E. McKenney
  1 sibling, 1 reply; 120+ messages in thread
From: David Laight @ 2013-11-01 16:25 UTC (permalink / raw)
  To: Victor Kaplansky, paulmck
  Cc: Michael Neuling, Mathieu Desnoyers, Peter Zijlstra, LKML,
	Oleg Nesterov, Linux PPC dev, Anton Blanchard,
	Frederic Weisbecker

> But "broken" compiler is much wider issue to be deeply discussed in this
> thread. I'm pretty sure that kernel have tons of very subtle
> code that actually creates locks and memory ordering. Such code
> usually just use the "barrier()"  approach to tell gcc not to combine
> or move memory accesses around it.

gcc will do unexpected memory accesses for bit fields that are
adjacent to volatile data.
In particular it may generate 64bit sized (and aligned) RMW cycles
when accessing bit fields.
And yes, this has caused real problems.
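
A sketch of the kind of layout that bites (illustrative only, not taken from
any real driver):

	struct hw_state {
		unsigned int		ready:1;
		unsigned int		error:1;
		volatile unsigned int	irq_status;	/* updated elsewhere */
	};

	void set_ready(struct hw_state *s)
	{
		s->ready = 1;	/* affected gcc versions may compile this as a
				 * 64-bit read-modify-write that also rewrites
				 * irq_status                                  */
	}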

	David



^ permalink raw reply	[flat|nested] 120+ messages in thread

* RE: perf events ring buffer memory barrier on powerpc
  2013-11-01 16:25                             ` David Laight
@ 2013-11-01 16:30                               ` Victor Kaplansky
  2013-11-03 20:57                                 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 120+ messages in thread
From: Victor Kaplansky @ 2013-11-01 16:30 UTC (permalink / raw)
  To: David Laight
  Cc: Anton Blanchard, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Neuling, Oleg Nesterov, paulmck,
	Peter Zijlstra

"David Laight" <David.Laight@aculab.com> wrote on 11/01/2013 06:25:29 PM:
> gcc will do unexpected memory accesses for bit fields that are
> adjacent to volatile data.
> In particular it may generate 64bit sized (and aligned) RMW cycles
> when accessing bit fields.
> And yes, this has caused real problems.

Thanks, I am aware of this bug/feature in gcc.
-- Victor


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-11-01 10:30                               ` Peter Zijlstra
@ 2013-11-02 15:20                                 ` Paul E. McKenney
  2013-11-04  9:07                                   ` Peter Zijlstra
  0 siblings, 1 reply; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-02 15:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Oleg Nesterov

On Fri, Nov 01, 2013 at 11:30:17AM +0100, Peter Zijlstra wrote:
> On Fri, Nov 01, 2013 at 02:28:14AM -0700, Paul E. McKenney wrote:
> > > This is a completely untenable position.
> > 
> > Indeed it is!
> > 
> > C/C++ never was intended to be used for parallel programming, 
> 
> And yet pretty much all kernels ever written for SMP systems are written
> in it; what drugs are those people smoking?

There was a time when I wished that the C/C++ standards people had added
concurrency to the language 30 years ago, but I eventually realized that
any attempt at that time would have been totally broken.

> Furthermore there's a gazillion parallel userspace programs.

Most of which have very unaggressive concurrency designs.

> > and this is
> > but one of the problems that can arise when we nevertheless use it for
> > parallel programming.  As compilers get smarter (for some definition of
> > "smarter") and as more systems have special-purpose hardware (such as
> > vector units) that are visible to the compiler, we can expect more of
> > this kind of trouble.
> > 
> > This was one of many reasons that I decided to help with the C/C++11
> > effort, whatever anyone might think about the results.
> 
> Well, I applaud your efforts, but given the results I think the C/C++
> people are all completely insane.

If it makes you feel any better, they have the same opinion of all of
us who use C/C++ for concurrency given that the standard provides no
guarantee.

> > > How do the C/C++ people propose to deal with this?
> > 
> > By marking "ptr" as atomic, thus telling the compiler not to mess with it.
> > And thus requiring that all accesses to it be decorated, which in the
> > case of RCU could be buried in the RCU accessors.
> 
> This seems contradictory; marking it atomic would look like:
> 
> struct foo {
> 	unsigned long value;
> 	__atomic void *ptr;
> 	unsigned long value1;
> };
> 
> Clearly we cannot hide this definition in accessors, because then
> accesses to value* won't see the annotation.

#define __rcu __atomic

Though there are probably placement restrictions for __atomic that
current use of __rcu doesn't pay attention to.
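
For illustration only, here is a sketch of how the decoration could be hidden
behind an accessor using C11 atomics (hypothetical names, not the kernel's
actual implementation):

	#include <stdatomic.h>

	struct foo {
		unsigned long value;
		void * _Atomic ptr;	/* the __atomic/__rcu-marked field */
		unsigned long value1;
	};

	/* hypothetical accessor that hides the decorated load */
	#define my_rcu_dereference(p) \
		atomic_load_explicit(&(p), memory_order_consume)

Accesses to value and value1 stay plain; only ptr itself is annotated.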

> That said; mandating we mark all 'shared' data with __atomic is
> completely untenable and is not backwards compatible.
> 
> To be safe we must assume all data shared unless indicated otherwise.

Something similar to the compiler directives forcing twos-complement
interpretation of signed overflow could be attractive.  Not sure what
it would do to code generation, though.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-11-01 16:06                           ` Victor Kaplansky
  2013-11-01 16:25                             ` David Laight
@ 2013-11-02 15:46                             ` Paul E. McKenney
  1 sibling, 0 replies; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-02 15:46 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Peter Zijlstra

On Fri, Nov 01, 2013 at 06:06:58PM +0200, Victor Kaplansky wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/31/2013
> 05:25:43 PM:
> 
> > I really don't care about "fair" -- I care instead about the kernel
> > working reliably.
> 
> Though I don't see how putting in a memory barrier without a deep
> understanding of why it is needed helps kernel reliability, I do agree
> that reliability is more important than performance.

True enough.  Of course, the same applies to removing memory barriers.

> > And it should also be easy for proponents of removing memory barriers to
> > clearly articulate what orderings their code does and does not need.
> 
> I intentionally took a simplified example of a circular buffer from
> Documentation/circular-buffers.txt. I think both sides agree about the
> memory-ordering requirements in the example. At least I didn't see anyone
> argue about them.

Hard to say.  No one has actually stated them clearly, so how could we
know whether or not we agree.

> > You are assuming control dependencies that the C language does not
> > provide.  Now, for all I know right now, there might well be some other
> > reason why a full barrier is not required, but the "if" statement cannot
> > be that reason.
> >
> > Please review section 1.10 of the C++11 standard (or the corresponding
> > section of the C11 standard, if you prefer).  The point is that the
> > C/C++11 covers only data dependencies, not control dependencies.
> 
> I feel you made a wrong assumption about my expertise in compilers. I don't
> need to reread section 1.10 of the C++11 standard, because I do agree that
> the compiler can potentially break the code in our case. And I do agree
> that a compiler barrier() or some other means (including a change of the
> standard) may be required in the future to prevent a compiler from moving
> memory accesses around.

I was simply reacting to what seemed to me to be your statement that
control dependencies affect ordering.  They don't.  The C/C++ standard
does not in any way respect control dependencies.  In fact, there are
implementations that do not respect control dependencies.  But don't
take my word for it, actually try it out on a weakly ordered system.
Or try out either ppcmem or armmem, which does a full state-space search.

Here is the paper:

	http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/pldi105-sarkar.pdf

And here is the web-based tool:

	http://www.cl.cam.ac.uk/~pes20/ppcmem/

And here is a much faster version that you can run locally:

	http://www.cl.cam.ac.uk/~pes20/weakmemory/index.html

> But "broken" compiler is much wider issue to be deeply discussed in this
> thread. I'm pretty sure that kernel have tons of very subtle
> code that actually creates locks and memory ordering. Such code
> usually just use the "barrier()"  approach to tell gcc not to combine
> or move memory accesses around it.
> 
> Let's just agree for the sake of this memory barrier discussion that we
> *do* need compiler barrier to tell gcc not to combine or move memory
> accesses around it.

Sometimes barrier() is indeed all you need, other times more is needed.

> > Glad we agree on something!
> 
> I'm glad too!
> 
> > Did you miss the following passage in the paragraph you quoted?
> >
> >    "... likewise, your consumer must issue a memory barrier
> >    instruction after removing an item from the queue and before
> >    reading from its memory."
> >
> > That is why DEC Alpha readers need a read-side memory barrier -- it says
> > so right there.  And as either you or Peter noted earlier in this thread,
> > this barrier can be supplied by smp_read_barrier_depends().
> 
> I did not miss that passage. That passage explains why the consumer on an
> Alpha processor, after reading @head, is required to execute an additional
> smp_read_barrier_depends() before it can *read* from the memory pointed to
> by @tail. And I think that I understand why - because the reader has to
> wait until its local caches are fully updated, and only then can it read
> data from the data buffer.
> 
> But on the producer side, after we read @tail, we don't need to wait for
> the update of local caches before we start to *write* data to the buffer,
> since the producer is the only one who writes data there!

Well, we cannot allow the producer to clobber data while the consumer
is reading it out.  That said, I do agree that we should get some help
from the fact that one element of the array is left empty, so that the
producer goes through a full write before clobbering the cell that the
consumer just vacated.

> > I can sympathize if you are having trouble believing this.  After all,
> > it took the DEC Alpha architects a full hour to convince me, and that was
> > in a face-to-face meeting instead of over email.  (Just for the record,
> > it took me even longer to convince them that their earlier documentation
> > did not clearly indicate the need for these read-side barriers.)  But
> > regardless of whether or not I sympathize, DEC Alpha is what it is.
> 
> Again, I do understand the quirkiness of DEC Alpha, and I still think that
> there is no need for a *full* memory barrier on the producer side - the
> one before writing data to the buffer, which you've put in the kfifo
> implementation.

There really does need to be some sort of memory barrier there to
order the read of the index before the write into the array element.
Now, it might well be that this barrier is supplied by the unlock-lock
pair guarding the producer, but either way, there does need to be some
ordering.
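
For concreteness, a sketch (not a proposed patch) of where an explicit
barrier would sit if the producer ran without the lock; the names follow the
circular-buffers.txt example:

	unsigned long head = buffer->head;
	unsigned long tail = ACCESS_ONCE(buffer->tail);

	if (CIRC_SPACE(head, tail, buffer->size) >= 1) {
		struct item *item = buffer[head];

		smp_mb(); /* order the ->tail read before the element write */
		produce_item(item);

		smp_wmb(); /* commit the item before publishing the head */
		ACCESS_ONCE(buffer->head) = (head + 1) & (buffer->size - 1);
	}

With producer_lock in place, the unlock-lock pair can stand in for that
smp_mb(), which is the point above.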

							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-11-01 13:12                         ` Victor Kaplansky
@ 2013-11-02 16:36                           ` Paul E. McKenney
  2013-11-02 17:26                             ` Paul E. McKenney
  0 siblings, 1 reply; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-02 16:36 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Peter Zijlstra

On Fri, Nov 01, 2013 at 03:12:58PM +0200, Victor Kaplansky wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/31/2013
> 08:16:02 AM:
> 
> > > BTW, it is why you also don't need ACCESS_ONCE() around @tail, but only
> > > around
> > > @head read.
> 
> Just to be sure that we are talking about the same code - I was
> considering
> ACCESS_ONCE() around @tail in point AAA in the following example from
> Documentation/circular-buffers.txt for CONSUMER:
> 
>         unsigned long head = ACCESS_ONCE(buffer->head);
>         unsigned long tail = buffer->tail;      /* AAA */
> 
>         if (CIRC_CNT(head, tail, buffer->size) >= 1) {
>                 /* read index before reading contents at that index */
>                 smp_read_barrier_depends();
> 
>                 /* extract one item from the buffer */
>                 struct item *item = buffer[tail];
> 
>                 consume_item(item);
> 
>                 smp_mb(); /* finish reading descriptor before incrementing
> tail */
> 
>                 buffer->tail = (tail + 1) & (buffer->size - 1); /* BBB */
>         }

Hmmm...  I believe that we need to go back to the original code in
Documentation/circular-buffers.txt.  I do so at the bottom of this email.

> > If you omit the ACCESS_ONCE() calls around @tail, the compiler is within
> > its rights to combine adjacent operations and also to invent loads and
> > stores, for example, in cases of register pressure.
> 
> Right. And I was completely aware of these possible transformations when I
> said that ACCESS_ONCE() around @tail at point AAA is redundant. Moved, or
> even completely eliminated, reads of @tail in the consumer code are not a
> problem at all, since @tail is written exclusively by the CONSUMER side.

I believe that the lack of ACCESS_ONCE() around the consumer's store
to buffer->tail is at least a documentation problem.  In the original
consumer code, it is trapped between an smp_mb() and a spin_unlock(),
but it is updating something that is read without synchronization by
some other thread.

> > It is also within
> > its rights to do piece-at-a-time loads and stores, which might sound
> > unlikely, but which has actually happened when the compiler figures
> > out exactly what is to be stored at compile time, especially on hardware
> > that only allows small immediate values.
> 
> As for writes to @tail, the ACCESS_ONCE around @tail at point AAA,
> doesn't prevent in any way an imaginary super-optimizing compiler
> from moving around the store to @tail (which appears in the code at point
> BBB).
> 
> It is why ACCESS_ONCE at point AAA is completely redundant.

Agreed, it is under the lock that guards modifications, so AAA does not
need ACCESS_ONCE().

OK, here is the producer from Documentation/circular-buffers.txt, with
some comments added:

	spin_lock(&producer_lock);

	unsigned long head = buffer->head;
	unsigned long tail = ACCESS_ONCE(buffer->tail); /* PT */

	if (CIRC_SPACE(head, tail, buffer->size) >= 1) {
		/* insert one item into the buffer */
		struct item *item = buffer[head];

		produce_item(item); /* PD */

		smp_wmb(); /* commit the item before incrementing the head */

		buffer->head = (head + 1) & (buffer->size - 1);  /* PH */

		/* wake_up() will make sure that the head is committed before
		 * waking anyone up */
		wake_up(consumer);
	}

	spin_unlock(&producer_lock);

And here is the consumer, also from Documentation/circular-buffers.txt:

	spin_lock(&consumer_lock);

	unsigned long head = ACCESS_ONCE(buffer->head); /* CH */
	unsigned long tail = buffer->tail;

	if (CIRC_CNT(head, tail, buffer->size) >= 1) {
		/* read index before reading contents at that index */
		smp_read_barrier_depends();

		/* extract one item from the buffer */
		struct item *item = buffer[tail]; /* CD */

		consume_item(item);

		smp_mb(); /* finish reading descriptor before incrementing tail */

		buffer->tail = (tail + 1) & (buffer->size - 1); /* CT */
	}

	spin_unlock(&consumer_lock);

Here are the ordering requirements as I see them:

1.	The producer is not allowed to clobber a location that the
	consumer is in the process of reading from.

2.	The consumer is not allowed to read from a location that the
	producer has not yet completed writing to.

#1 is helped out by the fact that there is always an empty element in
the array, so that the producer will need to produce twice in a row
to catch up to where the consumer is currently consuming.  #2 has no
such benefit: The consumer can consume an item that has just now been
produced.
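
(For reference, the helpers used in the code above are defined in
include/linux/circ_buf.h roughly as follows; the "+ 1" in CIRC_SPACE() is
what keeps that one element of the array permanently empty.)

	/* Return count in buffer. */
	#define CIRC_CNT(head, tail, size)   (((head) - (tail)) & ((size)-1))

	/* Return space available, 0..size-1.  One slot is always left free;
	 * otherwise a completely full buffer (head == tail) would be
	 * indistinguishable from an empty one.
	 */
	#define CIRC_SPACE(head, tail, size) CIRC_CNT((tail), ((head)+1), (size))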

#1 requires that CD is ordered before CT in a way that pairs with the
ordering of PT and PD.  There is of course no effective ordering between
PT and PD within a given call to the producer, but we only need the
ordering between the read from PT for one call to the producer and the
PD of the -next- call to the producer, courtesy of the fact that there
is always one empty cell in the array.  Therefore, the required ordering
between PT of one call and PD of the next is provided by the unlock-lock
pair.  The ordering of CD and CT is of course provided by the smp_mb().
(And yes, I was missing the unlock-lock pair earlier.  In my defense,
you did leave this unlock-lock pair out of your example.)

So ordering requirement #1 is handled by the original, but only if you
leave the locking in place.  The producer's smp_wmb() does not necessarily
order prior loads against subsequent stores, and the wake_up() only
guarantees ordering if something was actually awakened.  As noted earlier,
the "if" does not necessarily provide ordering.

On to ordering requirement #2.

This requires that CH and CD be ordered in a way that pairs with the
ordering between PD and PH.  PD and PH are both writes, so the smp_wmb()
does the trick there.  The consumer side is a bit strange.  On DEC Alpha,
smp_read_barrier_depends() turns into smp_mb(), so that case is covered
(though by accident).  On other architectures, smp_read_barrier_depends()
generates no code, and there is no data dependency between CH and CD.
The dependency is instead between the read from ->tail and the write,
and as you noted, ->tail is written by the consumer, not the producer.

But my battery is dying, so more later, including ACCESS_ONCE().

							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-11-02 16:36                           ` Paul E. McKenney
@ 2013-11-02 17:26                             ` Paul E. McKenney
  0 siblings, 0 replies; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-02 17:26 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Peter Zijlstra, lfomicki,
	dhowells, mbatty

[ Adding David Howells, Lech Fomicki, and Mark Batty on CC for their
  thoughts given previous discussions. ]

On Sat, Nov 02, 2013 at 09:36:18AM -0700, Paul E. McKenney wrote:
> On Fri, Nov 01, 2013 at 03:12:58PM +0200, Victor Kaplansky wrote:
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/31/2013
> > 08:16:02 AM:
> > 
> > > > BTW, it is why you also don't need ACCESS_ONCE() around @tail, but only
> > > > around
> > > > @head read.
> > 
> > Just to be sure that we are talking about the same code - I was
> > considering
> > ACCESS_ONCE() around @tail in point AAA in the following example from
> > Documentation/circular-buffers.txt for CONSUMER:
> > 
> >         unsigned long head = ACCESS_ONCE(buffer->head);
> >         unsigned long tail = buffer->tail;      /* AAA */
> > 
> >         if (CIRC_CNT(head, tail, buffer->size) >= 1) {
> >                 /* read index before reading contents at that index */
> >                 smp_read_barrier_depends();
> > 
> >                 /* extract one item from the buffer */
> >                 struct item *item = buffer[tail];
> > 
> >                 consume_item(item);
> > 
> >                 smp_mb(); /* finish reading descriptor before incrementing
> > tail */
> > 
> >                 buffer->tail = (tail + 1) & (buffer->size - 1); /* BBB */
> >         }
> 
> Hmmm...  I believe that we need to go back to the original code in
> Documentation/circular-buffers.txt.  I do so at the bottom of this email.
> 
> > > If you omit the ACCESS_ONCE() calls around @tail, the compiler is within
> > > its rights to combine adjacent operations and also to invent loads and
> > > stores, for example, in cases of register pressure.
> > 
> > Right. And I was completely aware of these possible transformations when I
> > said that ACCESS_ONCE() around @tail at point AAA is redundant. Moved, or
> > even completely eliminated, reads of @tail in the consumer code are not a
> > problem at all, since @tail is written exclusively by the CONSUMER side.
> 
> I believe that the lack of ACCESS_ONCE() around the consumer's store
> to buffer->tail is at least a documentation problem.  In the original
> consumer code, it is trapped between an smp_mb() and a spin_unlock(),
> but it is updating something that is read without synchronization by
> some other thread.
> 
> > > It is also within
> > > its rights to do piece-at-a-time loads and stores, which might sound
> > > unlikely, but which has actually happened when the compiler figures
> > > out exactly what is to be stored at compile time, especially on hardware
> > > that only allows small immediate values.
> > 
> > As for writes to @tail, the ACCESS_ONCE around @tail at point AAA,
> > doesn't prevent in any way an imaginary super-optimizing compiler
> > from moving around the store to @tail (which appears in the code at point
> > BBB).
> > 
> > It is why ACCESS_ONCE at point AAA is completely redundant.
> 
> Agreed, it is under the lock that guards modifications, so AAA does not
> need ACCESS_ONCE().
> 
> OK, here is the producer from Documentation/circular-buffers.txt, with
> some comments added:
> 
> 	spin_lock(&producer_lock);
> 
> 	unsigned long head = buffer->head;

The above is updated only under producer_lock, which we hold, so no
ACCESS_ONCE() is needed for buffer->head.

> 	unsigned long tail = ACCESS_ONCE(buffer->tail); /* PT */
> 
> 	if (CIRC_SPACE(head, tail, buffer->size) >= 1) {
> 		/* insert one item into the buffer */
> 		struct item *item = buffer[head];
> 
> 		produce_item(item); /* PD */
> 
> 		smp_wmb(); /* commit the item before incrementing the head */
> 
> 		buffer->head = (head + 1) & (buffer->size - 1);  /* PH */

The above needs to be something like:

		ACCESS_ONCE(buffer->head) = (head + 1) & (buffer->size - 1);

This is because we are writing to a shared variable that might be being
read concurrently.
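
(For reference, ACCESS_ONCE() is essentially a volatile cast,

	#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

which is what forbids the compiler from tearing, fusing, or re-issuing the
access.)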

> 		/* wake_up() will make sure that the head is committed before
> 		 * waking anyone up */
> 		wake_up(consumer);
> 	}
> 
> 	spin_unlock(&producer_lock);
> 
> And here is the consumer, also from Documentation/circular-buffers.txt:
> 
> 	spin_lock(&consumer_lock);
> 
> 	unsigned long head = ACCESS_ONCE(buffer->head); /* CH */
> 	unsigned long tail = buffer->tail;

The above is updated only under consumer_lock, which we hold, so no
ACCESS_ONCE() is needed for buffer->tail.

> 
> 	if (CIRC_CNT(head, tail, buffer->size) >= 1) {
> 		/* read index before reading contents at that index */
> 		smp_read_barrier_depends();
> 
> 		/* extract one item from the buffer */
> 		struct item *item = buffer[tail]; /* CD */
> 
> 		consume_item(item);
> 
> 		smp_mb(); /* finish reading descriptor before incrementing tail */
> 
> 		buffer->tail = (tail + 1) & (buffer->size - 1); /* CT */

And here, for no-execution-cost documentation, if nothing else:

		ACCESS_ONCE(buffer->tail) = (tail + 1) & (buffer->size - 1);

> 	}
> 
> 	spin_unlock(&consumer_lock);
> 
> Here are the ordering requirements as I see them:
> 
> 1.	The producer is not allowed to clobber a location that the
> 	consumer is in the process of reading from.
> 
> 2.	The consumer is not allowed to read from a location that the
> 	producer has not yet completed writing to.
> 
> #1 is helped out by the fact that there is always an empty element in
> the array, so that the producer will need to produce twice in a row
> to catch up to where the consumer is currently consuming.  #2 has no
> such benefit: The consumer can consume an item that has just now been
> produced.
> 
> #1 requires that CD is ordered before CT in a way that pairs with the
> ordering of PT and PD.  There is of course no effective ordering between
> PT and PD within a given call to the producer, but we only need the
> ordering between the read from PT for one call to the producer and the
> PD of the -next- call to the producer, courtesy of the fact that there
> is always one empty cell in the array.  Therefore, the required ordering
> between PT of one call and PD of the next is provided by the unlock-lock
> pair.  The ordering of CD and CT is of course provided by the smp_mb().
> (And yes, I was missing the unlock-lock pair earlier.  In my defense,
> you did leave this unlock-lock pair out of your example.)
> 
> So ordering requirement #1 is handled by the original, but only if you
> leave the locking in place.  The producer's smp_wmb() does not necessarily
> order prior loads against subsequent stores, and the wake_up() only
> guarantees ordering if something was actually awakened.  As noted earlier,
> the "if" does not necessarily provide ordering.
> 
> On to ordering requirement #2.
> 
> This requires that CH and CD be ordered in a way that pairs with the
> ordering between PD and PH.  PD and PH are both writes, so the smp_wmb()
> does the trick there.  The consumer side is a bit strange.  On DEC Alpha,
> smp_read_barrier_depends() turns into smp_mb(), so that case is covered
> (though by accident).  On other architectures, smp_read_barrier_depends()
> generates no code, and there is no data dependency between CH and CD.
> The dependency is instead between the read from ->tail and the write,

Sigh.  Make that "The dependency is instead between the read from ->tail
and the read from the array."

> and as you noted, ->tail is written by the consumer, not the producer.

And non-dependent reads -can- be speculated, so the
smp_read_barrier_depends() needs to be at least an smp_rmb().

Again, don't take my word for it, try it with either ppcmem or real
weakly ordered hardware.

I am not 100% confident of the patch below, but am getting there.
If a change is really needed, it must of course be propagated to the
uses within the Linux kernel.

							Thanx, Paul

> But my battery is dying, so more later, including ACCESS_ONCE().

documentation: Fix circular-buffer example.

The code sample in Documentation/circular-buffers.txt appears to have a
few ordering bugs.  This patch therefore applies the needed fixes.

Reported-by: Lech Fomicki <lfomicki@poczta.fm>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/Documentation/circular-buffers.txt b/Documentation/circular-buffers.txt
index 8117e5bf6065..a36bed3db4ee 100644
--- a/Documentation/circular-buffers.txt
+++ b/Documentation/circular-buffers.txt
@@ -170,7 +170,7 @@ The producer will look something like this:
 
 		smp_wmb(); /* commit the item before incrementing the head */
 
-		buffer->head = (head + 1) & (buffer->size - 1);
+		ACCESS_ONCE(buffer->head) = (head + 1) & (buffer->size - 1);
 
 		/* wake_up() will make sure that the head is committed before
 		 * waking anyone up */
@@ -183,9 +183,14 @@ This will instruct the CPU that the contents of the new item must be written
 before the head index makes it available to the consumer and then instructs the
 CPU that the revised head index must be written before the consumer is woken.
 
-Note that wake_up() doesn't have to be the exact mechanism used, but whatever
-is used must guarantee a (write) memory barrier between the update of the head
-index and the change of state of the consumer, if a change of state occurs.
+Note that wake_up() does not guarantee any sort of barrier unless something
+is actually awakened.  We therefore cannot rely on it for ordering.  However,
+there is always one element of the array left empty.  Therefore, the
+producer must produce two elements before it could possibly corrupt the
+element currently being read by the consumer.  Therefore, the unlock-lock
+pair between consecutive invocations of the consumer provides the necessary
+ordering between the read of the index indicating that the consumer has
+vacated a given element and the write by the producer to that same element.
 
 
 THE CONSUMER
@@ -200,7 +205,7 @@ The consumer will look something like this:
 
 	if (CIRC_CNT(head, tail, buffer->size) >= 1) {
 		/* read index before reading contents at that index */
-		smp_read_barrier_depends();
+		smp_rmb();
 
 		/* extract one item from the buffer */
 		struct item *item = buffer[tail];
@@ -209,7 +214,7 @@ The consumer will look something like this:
 
 		smp_mb(); /* finish reading descriptor before incrementing tail */
 
-		buffer->tail = (tail + 1) & (buffer->size - 1);
+		ACCESS_ONCE(buffer->tail) = (tail + 1) & (buffer->size - 1);
 	}
 
 	spin_unlock(&consumer_lock);
@@ -223,7 +228,10 @@ Note the use of ACCESS_ONCE() in both algorithms to read the opposition index.
 This prevents the compiler from discarding and reloading its cached value -
 which some compilers will do across smp_read_barrier_depends().  This isn't
 strictly needed if you can be sure that the opposition index will _only_ be
-used the once.
+used the once.  Similarly, ACCESS_ONCE() is used in both algorithms to
+write the thread's index.  This documents the fact that we are writing
+to something that can be read concurrently and also prevents the compiler
+from tearing the store.
 
 
 ===============


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-11-01 14:25                       ` Victor Kaplansky
@ 2013-11-02 17:28                         ` Paul E. McKenney
  0 siblings, 0 replies; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-02 17:28 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Peter Zijlstra

On Fri, Nov 01, 2013 at 04:25:42PM +0200, Victor Kaplansky wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/31/2013
> 08:40:15 AM:
> 
> > > void ubuf_read(void)
> > > {
> > >    u64 head, tail;
> > >
> > >    tail = ACCESS_ONCE(ubuf->tail);
> > >    head = ACCESS_ONCE(ubuf->head);
> > >
> > >    /*
> > >     * Ensure we read the buffer boundaries before the actual buffer
> > >     * data...
> > >     */
> > >    smp_rmb(); /* C, matches with B */
> > >
> > >    while (tail != head) {
> > >       obj = ubuf->data + tail;
> > >       /* process obj */
> > >       tail += obj->size;
> > >       tail %= ubuf->size;
> > >    }
> > >
> > >    /*
> > >     * Ensure all data reads are complete before we issue the
> > >     * ubuf->tail update; once that update hits, kbuf_write() can
> > >     * observe and overwrite data.
> > >     */
> > >    smp_mb(); /* D, matches with A */
> > >
> > >    ubuf->tail = tail;
> > > }
> 
> > > Could we replace A and C with an smp_read_barrier_depends()?
> >
> > C, yes, given that you have ACCESS_ONCE() on the fetch from ->tail
> > and that the value fetch from ->tail feeds into the address used for
> > the "obj =" assignment.
> 
> No! You must have a full smp_rmb() at C. The race on the reader side
> is not between the fetch of @tail and the read from the address pointed
> to by @tail. The real race here is between the fetch of @head and the
> read of obj from the memory pointed to by @tail.

I believe you are in fact correct, good catch.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-11-01 14:56                       ` Peter Zijlstra
@ 2013-11-02 17:32                         ` Paul E. McKenney
  2013-11-03 14:40                           ` Paul E. McKenney
  0 siblings, 1 reply; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-02 17:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On Fri, Nov 01, 2013 at 03:56:34PM +0100, Peter Zijlstra wrote:
> On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote:
> > > Now the whole crux of the question is if we need barrier A at all, since
> > > the STORES issued by the @buf writes are dependent on the ubuf->tail
> > > read.
> > 
> > The dependency you are talking about is via the "if" statement?
> > Even C/C++11 is not required to respect control dependencies.
> > 
> > This one is a bit annoying.  The x86 TSO means that you really only
> > need barrier(), ARM (recent ARM, anyway) and Power could use a weaker
> > barrier, and so on -- but smp_mb() emits a full barrier.
> > 
> > Perhaps a new smp_tmb() for TSO semantics, where reads are ordered
> > before reads, writes before writes, and reads before writes, but not
> > writes before reads?  Another approach would be to define a per-arch
> > barrier for this particular case.
> 
> I suppose we can only introduce new barrier primitives if there's more
> than 1 use-case.

There probably are others.

> > > If the read shows no available space, we simply will not issue those
> > > writes -- therefore we could argue we can avoid the memory barrier.
> > 
> > Proving that means iterating through the permitted combinations of
> > compilers and architectures...  There is always hand-coded assembly
> > language, I suppose.
> 
> I'm starting to think that while the C/C++ language spec says they can
> wreck the world by doing these silly optimizations, real-world users will
> push back for breaking their existing code.
> 
> I'm fairly sure the GCC people _will_ get shouted at _loudly_ when they
> break the kernel by doing crazy shit like that.
> 
> Given it's near impossible to write a correct program in C/C++ and
> tagging the entire kernel with __atomic is equally not going to happen,
> I think we must find a practical solution.
> 
> Either that, or we really need to consider forking the language and
> compiler :-(

Depends on how much benefit the optimizations provide.  If they provide
little or no benefit, I am with you, otherwise we will need to bite some
bullet or another.  Keep in mind that there is a lot of code in the
kernel that runs sequentially (e.g., due to being fully protected by
locks), and aggressive optimizations for that sort of code are harmless.

Can't say I know the answer at the moment, though.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-11-01 16:11                       ` Peter Zijlstra
@ 2013-11-02 17:46                         ` Paul E. McKenney
  0 siblings, 0 replies; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-02 17:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On Fri, Nov 01, 2013 at 05:11:29PM +0100, Peter Zijlstra wrote:
> On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote:
> > > void kbuf_write(int sz, void *buf)
> > > {
> > > 	u64 tail = ACCESS_ONCE(ubuf->tail); /* last location userspace read */
> > > 	u64 offset = kbuf->head; /* we already know where we last wrote */
> > > 	u64 head = offset + sz;
> > > 
> > > 	if (!space(tail, offset, head)) {
> > > 		/* discard @buf */
> > > 		return;
> > > 	}
> > > 
> > > 	/*
> > > 	 * Ensure that if we see the userspace tail (ubuf->tail) such
> > > 	 * that there is space to write @buf without overwriting data
> > > 	 * userspace hasn't seen yet, we won't in fact store data before
> > > 	 * that read completes.
> > > 	 */
> > > 
> > > 	smp_mb(); /* A, matches with D */
> > > 
> > > 	write(kbuf->data + offset, buf, sz);
> > > 	kbuf->head = head % kbuf->size;
> > > 
> > > 	/*
> > > 	 * Ensure that we write all the @buf data before we update the
> > > 	 * userspace visible ubuf->head pointer.
> > > 	 */
> > > 	smp_wmb(); /* B, matches with C */
> > > 
> > > 	ubuf->head = kbuf->head;
> > > }
> 
> > > Now the whole crux of the question is if we need barrier A at all, since
> > > the STORES issued by the @buf writes are dependent on the ubuf->tail
> > > read.
> > 
> > The dependency you are talking about is via the "if" statement?
> > Even C/C++11 is not required to respect control dependencies.
> 
> But surely we must be able to make it so; otherwise you'd never be able
> to write:
> 
> void *ptr = obj1;
> 
> void foo(void)
> {
> 
> 	/* create obj2, obj3 */
> 
> 	smp_wmb(); /* ensure the objs are complete */
> 
> 	/* expose either obj2 or obj3 */
> 	if (x)
> 		ptr = obj2;
> 	else
> 		ptr = obj3;

OK, the smp_wmb() orders the creation and the exposing.  But the
compiler can do this:

	ptr = obj3;
	if (x)
		ptr = obj2;

And that could momentarily expose obj3 to readers, and these readers
might be fatally disappointed by the free() below.  If you instead said:

	if (x)
		ACCESS_ONCE(ptr) = obj2;
	else
		ACCESS_ONCE(ptr) = obj3;

then the general consensus appears to be that the compiler would not
be permitted to carry out the above optimization.  Since you have
the smp_wmb(), readers that are properly ordered (e.g., smp_rmb() or
rcu_dereference()) would be prevented from seeing pre-initialization
state.

> 	/* free the unused one */
> 	if (x)
> 		free(obj3);
> 	else
> 		free(obj2);
> }
> 
> Earlier you said that 'volatile' or '__atomic' avoids speculative
> writes; so would:
> 
> volatile void *ptr = obj1;
> 
> Make the compiler respect control dependencies again? If so, could we
> somehow mark that !space() condition volatile?

The compiler should, but the CPU is still free to ignore the control
dependencies in the general case.

We might be able to rely on weakly ordered hardware refraining
from speculating stores, but not sure that this applies across all
architectures of interest.  We definitely can -not- rely on weakly
ordered hardware refraining from speculating loads.
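
As a sketch of the reader-side hazard, using the ubuf names from earlier in
the thread (the payload field is made up):

	tail = ACCESS_ONCE(ubuf->tail);
	head = ACCESS_ONCE(ubuf->head);

	if (tail != head) {
		obj = ubuf->data + tail;
		/* Without an smp_rmb() between the ->head load and this
		 * load, the CPU may satisfy it from a value speculated
		 * before ->head was read; the "tail != head" control
		 * dependency does not order a load against a later load.
		 */
		val = obj->payload;
	}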

> Currently the above would be considered a valid pattern. But you're
> saying it's not, because the compiler is free to expose both obj2 and obj3
> (for however short a time) and thus the free of the 'unused' object is
> incorrect and can cause use-after-free.

Yes, it is definitely unsafe and invalid in absence of ACCESS_ONCE().

> In fact; how can we be sure that:
> 
> void *ptr = NULL;
> 
> void bar(void)
> {
> 	void *obj = malloc(...);
> 
> 	/* fill obj */
> 
> 	if (!err)
> 		rcu_assign_pointer(ptr, obj);
> 	else
> 		free(obj);
> }
> 
> Does not get 'optimized' into:
> 
> void bar(void)
> {
> 	void *obj = malloc(...);
> 	void *old_ptr = ptr;
> 
> 	/* fill obj */
> 
> 	rcu_assign_pointer(ptr, obj);
> 	if (err) { /* because runtime profile data says this is unlikely */
> 		ptr = old_ptr;
> 		free(obj);
> 	}
> }

In this particular case, the barrier() implied by the smp_wmb() in
rcu_assign_pointer() will prevent this "optimization".  However, other
"optimizations" are the reason why I am working to introduce ACCESS_ONCE()
into rcu_assign_pointer.
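
For context, a simplified sketch of rcu_assign_pointer() as it stands
(ignoring the sparse annotations and debug checks), next to the variant with
the decorated store being discussed:

	/* today, roughly: */
	#define rcu_assign_pointer(p, v) \
		({ \
			smp_wmb(); \
			(p) = (v); \
		})

	/* with ACCESS_ONCE() on the store: */
	#define rcu_assign_pointer(p, v) \
		({ \
			smp_wmb(); \
			ACCESS_ONCE(p) = (v); \
		})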

> We _MUST_ be able to rely on control flow, otherwise me might as well
> all go back to writing kernels in asm.

It isn't -that- bad!  ;-)

							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-11-01 16:18                       ` Peter Zijlstra
@ 2013-11-02 17:49                         ` Paul E. McKenney
  0 siblings, 0 replies; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-02 17:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling, tony.luck

On Fri, Nov 01, 2013 at 05:18:19PM +0100, Peter Zijlstra wrote:
> On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote:
> > The dependency you are talking about is via the "if" statement?
> > Even C/C++11 is not required to respect control dependencies.
> > 
> > This one is a bit annoying.  The x86 TSO means that you really only
> > need barrier(), ARM (recent ARM, anyway) and Power could use a weaker
> > barrier, and so on -- but smp_mb() emits a full barrier.
> > 
> > Perhaps a new smp_tmb() for TSO semantics, where reads are ordered
> > before reads, writes before writes, and reads before writes, but not
> > writes before reads?  Another approach would be to define a per-arch
> > barrier for this particular case.
> 
> Supposing a sane language where we can rely on control flow; would that
> change the story?
> 
> I'm afraid I'm now terminally confused between actual proper memory
> model issues and fucked compilers.

Power and ARM won't speculate stores, but they will happily speculate
loads.  Not sure about Itanium, perhaps Tony knows.  And yes, reordering
by the compilers and CPUs does sometimes seem a bit intertwined.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-11-02 17:32                         ` Paul E. McKenney
@ 2013-11-03 14:40                           ` Paul E. McKenney
  2013-11-03 15:17                             ` [RFC] arch: Introduce new TSO memory barrier smp_tmb() Peter Zijlstra
  2013-11-03 17:07                             ` perf events ring buffer memory barrier on powerpc Will Deacon
  0 siblings, 2 replies; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-03 14:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On Sat, Nov 02, 2013 at 10:32:39AM -0700, Paul E. McKenney wrote:
> On Fri, Nov 01, 2013 at 03:56:34PM +0100, Peter Zijlstra wrote:
> > On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote:
> > > > Now the whole crux of the question is if we need barrier A at all, since
> > > > the STORES issued by the @buf writes are dependent on the ubuf->tail
> > > > read.
> > > 
> > > The dependency you are talking about is via the "if" statement?
> > > Even C/C++11 is not required to respect control dependencies.
> > > 
> > > This one is a bit annoying.  The x86 TSO means that you really only
> > > need barrier(), ARM (recent ARM, anyway) and Power could use a weaker
> > > barrier, and so on -- but smp_mb() emits a full barrier.
> > > 
> > > Perhaps a new smp_tmb() for TSO semantics, where reads are ordered
> > > before reads, writes before writes, and reads before writes, but not
> > > writes before reads?  Another approach would be to define a per-arch
> > > barrier for this particular case.
> > 
> > I suppose we can only introduce new barrier primitives if there's more
> > than 1 use-case.
> 
> There probably are others.

If there was an smp_tmb(), I would likely use it in rcu_assign_pointer().
There are some corner cases that can happen with the current smp_wmb()
that would be prevented by smp_tmb().  These corner cases are a bit
strange, as follows:

	struct foo gp;

	void P0(void)
	{
		struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

		if (!p)
			return;
		ACCESS_ONCE(p->a) = 0;
		BUG_ON(ACCESS_ONCE(p->a));
		rcu_assign_pointer(gp, p);
	}

	void P1(void)
	{
		struct foo *p = rcu_dereference(gp);

		if (!p)
			return;
		ACCESS_ONCE(p->a) = 1;
	}

With smp_wmb(), the BUG_ON() can occur because smp_wmb() does
not prevent the CPU from reordering the read in the BUG_ON() with the
rcu_assign_pointer().  With smp_tmb(), it could not.

Now, I am not too worried about this because I cannot think of any use
for code like that in P0() and P1().  But if there was an smp_tmb(),
it would be cleaner to make the BUG_ON() impossible.

							Thanx, Paul

> > > > If the read shows no available space, we simply will not issue those
> > > > writes -- therefore we could argue we can avoid the memory barrier.
> > > 
> > > Proving that means iterating through the permitted combinations of
> > > compilers and architectures...  There is always hand-coded assembly
> > > language, I suppose.
> > 
> > I'm starting to think that while the C/C++ language spec says they can
> > wreck the world by doing these silly optimizations, real-world users will
> > push back for breaking their existing code.
> > 
> > I'm fairly sure the GCC people _will_ get shouted at _loudly_ when they
> > break the kernel by doing crazy shit like that.
> > 
> > Given it's near impossible to write a correct program in C/C++ and
> > tagging the entire kernel with __atomic is equally not going to happen,
> > I think we must find a practical solution.
> > 
> > Either that, or we really need to consider forking the language and
> > compiler :-(
> 
> Depends on how much benefit the optimizations provide.  If they provide
> little or no benefit, I am with you, otherwise we will need to bite some
> bullet or another.  Keep in mind that there is a lot of code in the
> kernel that runs sequentially (e.g., due to being fully protected by
> locks), and aggressive optimizations for that sort of code are harmless.
> 
> Can't say I know the answer at the moment, though.
> 
> 							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-03 14:40                           ` Paul E. McKenney
@ 2013-11-03 15:17                             ` Peter Zijlstra
  2013-11-03 18:08                               ` Linus Torvalds
  2013-11-03 20:59                               ` Benjamin Herrenschmidt
  2013-11-03 17:07                             ` perf events ring buffer memory barrier on powerpc Will Deacon
  1 sibling, 2 replies; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-03 15:17 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling,
	Linus Torvalds

On Sun, Nov 03, 2013 at 06:40:17AM -0800, Paul E. McKenney wrote:
> If there was an smp_tmb(), I would likely use it in rcu_assign_pointer().

Well, I'm obviously all for introducing this new barrier, for it will
reduce a full mfence on x86 to a compiler barrier. And ppc can use
lwsync as opposed to sync afaict. Not sure ARM can do better.

---
Subject: arch: Introduce new TSO memory barrier smp_tmb()

A few sites could be downgraded from smp_mb() to smp_tmb(), and a few
sites that are now using smp_wmb() should be upgraded to smp_tmb().

XXX hope PaulMck explains things better..

X86 (!OOSTORE), SPARC have native TSO memory models and smp_tmb()
reduces to barrier().

PPC can use lwsync instead of sync

For the other archs, have smp_tmb map to smp_mb, as the stronger barrier
is always correct but possibly suboptimal.

Suggested-by: Paul McKenney <paulmck@linux.vnet.ibm.com>
Not-Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 arch/alpha/include/asm/barrier.h      | 2 ++
 arch/arc/include/asm/barrier.h        | 2 ++
 arch/arm/include/asm/barrier.h        | 2 ++
 arch/arm64/include/asm/barrier.h      | 2 ++
 arch/avr32/include/asm/barrier.h      | 1 +
 arch/blackfin/include/asm/barrier.h   | 1 +
 arch/cris/include/asm/barrier.h       | 2 ++
 arch/frv/include/asm/barrier.h        | 1 +
 arch/h8300/include/asm/barrier.h      | 2 ++
 arch/hexagon/include/asm/barrier.h    | 1 +
 arch/ia64/include/asm/barrier.h       | 2 ++
 arch/m32r/include/asm/barrier.h       | 2 ++
 arch/m68k/include/asm/barrier.h       | 1 +
 arch/metag/include/asm/barrier.h      | 3 +++
 arch/microblaze/include/asm/barrier.h | 1 +
 arch/mips/include/asm/barrier.h       | 3 +++
 arch/mn10300/include/asm/barrier.h    | 2 ++
 arch/parisc/include/asm/barrier.h     | 1 +
 arch/powerpc/include/asm/barrier.h    | 2 ++
 arch/s390/include/asm/barrier.h       | 1 +
 arch/score/include/asm/barrier.h      | 1 +
 arch/sh/include/asm/barrier.h         | 2 ++
 arch/sparc/include/asm/barrier_32.h   | 1 +
 arch/sparc/include/asm/barrier_64.h   | 3 +++
 arch/tile/include/asm/barrier.h       | 2 ++
 arch/unicore32/include/asm/barrier.h  | 1 +
 arch/x86/include/asm/barrier.h        | 3 +++
 arch/xtensa/include/asm/barrier.h     | 1 +
 28 files changed, 48 insertions(+)

diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
index ce8860a0b32d..02ea63897038 100644
--- a/arch/alpha/include/asm/barrier.h
+++ b/arch/alpha/include/asm/barrier.h
@@ -18,12 +18,14 @@ __asm__ __volatile__("mb": : :"memory")
 #ifdef CONFIG_SMP
 #define __ASM_SMP_MB	"\tmb\n"
 #define smp_mb()	mb()
+#define smp_tmb()	mb()
 #define smp_rmb()	rmb()
 #define smp_wmb()	wmb()
 #define smp_read_barrier_depends()	read_barrier_depends()
 #else
 #define __ASM_SMP_MB
 #define smp_mb()	barrier()
+#define smp_tmb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
 #define smp_read_barrier_depends()	do { } while (0)
diff --git a/arch/arc/include/asm/barrier.h b/arch/arc/include/asm/barrier.h
index f6cb7c4ffb35..456c790fa1ad 100644
--- a/arch/arc/include/asm/barrier.h
+++ b/arch/arc/include/asm/barrier.h
@@ -22,10 +22,12 @@
 /* TODO-vineetg verify the correctness of macros here */
 #ifdef CONFIG_SMP
 #define smp_mb()        mb()
+#define smp_tmb()	mb()
 #define smp_rmb()       rmb()
 #define smp_wmb()       wmb()
 #else
 #define smp_mb()        barrier()
+#define smp_tmb()	barrier()
 #define smp_rmb()       barrier()
 #define smp_wmb()       barrier()
 #endif
diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index 60f15e274e6d..bc88a8505673 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -51,10 +51,12 @@
 
 #ifndef CONFIG_SMP
 #define smp_mb()	barrier()
+#define smp_tmb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
 #else
 #define smp_mb()	dmb(ish)
+#define smp_tmb()	smp_mb()
 #define smp_rmb()	smp_mb()
 #define smp_wmb()	dmb(ishst)
 #endif
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index d4a63338a53c..ec0531f4892f 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -33,10 +33,12 @@
 
 #ifndef CONFIG_SMP
 #define smp_mb()	barrier()
+#define smp_tmb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
 #else
 #define smp_mb()	asm volatile("dmb ish" : : : "memory")
+#define smp_tmb()	asm volatile("dmb ish" : : : "memory")
 #define smp_rmb()	asm volatile("dmb ishld" : : : "memory")
 #define smp_wmb()	asm volatile("dmb ishst" : : : "memory")
 #endif
diff --git a/arch/avr32/include/asm/barrier.h b/arch/avr32/include/asm/barrier.h
index 0961275373db..6c6ccb9cf290 100644
--- a/arch/avr32/include/asm/barrier.h
+++ b/arch/avr32/include/asm/barrier.h
@@ -20,6 +20,7 @@
 # error "The AVR32 port does not support SMP"
 #else
 # define smp_mb()		barrier()
+# define smp_tmb()		barrier()
 # define smp_rmb()		barrier()
 # define smp_wmb()		barrier()
 # define smp_read_barrier_depends() do { } while(0)
diff --git a/arch/blackfin/include/asm/barrier.h b/arch/blackfin/include/asm/barrier.h
index ebb189507dd7..100f49121a18 100644
--- a/arch/blackfin/include/asm/barrier.h
+++ b/arch/blackfin/include/asm/barrier.h
@@ -40,6 +40,7 @@
 #endif /* !CONFIG_SMP */
 
 #define smp_mb()  mb()
+#define smp_tmb() mb()
 #define smp_rmb() rmb()
 #define smp_wmb() wmb()
 #define set_mb(var, value) do { var = value; mb(); } while (0)
diff --git a/arch/cris/include/asm/barrier.h b/arch/cris/include/asm/barrier.h
index 198ad7fa6b25..679c33738b4c 100644
--- a/arch/cris/include/asm/barrier.h
+++ b/arch/cris/include/asm/barrier.h
@@ -12,11 +12,13 @@
 
 #ifdef CONFIG_SMP
 #define smp_mb()        mb()
+#define smp_tmb()       mb()
 #define smp_rmb()       rmb()
 #define smp_wmb()       wmb()
 #define smp_read_barrier_depends()     read_barrier_depends()
 #else
 #define smp_mb()        barrier()
+#define smp_tmb()       barrier()
 #define smp_rmb()       barrier()
 #define smp_wmb()       barrier()
 #define smp_read_barrier_depends()     do { } while(0)
diff --git a/arch/frv/include/asm/barrier.h b/arch/frv/include/asm/barrier.h
index 06776ad9f5e9..60354ce13ba0 100644
--- a/arch/frv/include/asm/barrier.h
+++ b/arch/frv/include/asm/barrier.h
@@ -20,6 +20,7 @@
 #define read_barrier_depends()	do { } while (0)
 
 #define smp_mb()			barrier()
+#define smp_tmb()			barrier()
 #define smp_rmb()			barrier()
 #define smp_wmb()			barrier()
 #define smp_read_barrier_depends()	do {} while(0)
diff --git a/arch/h8300/include/asm/barrier.h b/arch/h8300/include/asm/barrier.h
index 9e0aa9fc195d..e8e297fa4e9a 100644
--- a/arch/h8300/include/asm/barrier.h
+++ b/arch/h8300/include/asm/barrier.h
@@ -16,11 +16,13 @@
 
 #ifdef CONFIG_SMP
 #define smp_mb()	mb()
+#define smp_tmb()	mb()
 #define smp_rmb()	rmb()
 #define smp_wmb()	wmb()
 #define smp_read_barrier_depends()	read_barrier_depends()
 #else
 #define smp_mb()	barrier()
+#define smp_tmb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
 #define smp_read_barrier_depends()	do { } while(0)
diff --git a/arch/hexagon/include/asm/barrier.h b/arch/hexagon/include/asm/barrier.h
index 1041a8e70ce8..2dd5b2ad4d21 100644
--- a/arch/hexagon/include/asm/barrier.h
+++ b/arch/hexagon/include/asm/barrier.h
@@ -28,6 +28,7 @@
 #define smp_rmb()			barrier()
 #define smp_read_barrier_depends()	barrier()
 #define smp_wmb()			barrier()
+#define smp_tmb()			barrier()
 #define smp_mb()			barrier()
 #define smp_mb__before_atomic_dec()	barrier()
 #define smp_mb__after_atomic_dec()	barrier()
diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
index 60576e06b6fb..a5f92146b091 100644
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -42,11 +42,13 @@
 
 #ifdef CONFIG_SMP
 # define smp_mb()	mb()
+# define smp_tmb()	mb()
 # define smp_rmb()	rmb()
 # define smp_wmb()	wmb()
 # define smp_read_barrier_depends()	read_barrier_depends()
 #else
 # define smp_mb()	barrier()
+# define smp_tmb()	barrier()
 # define smp_rmb()	barrier()
 # define smp_wmb()	barrier()
 # define smp_read_barrier_depends()	do { } while(0)
diff --git a/arch/m32r/include/asm/barrier.h b/arch/m32r/include/asm/barrier.h
index 6976621efd3f..a6fa29facd7a 100644
--- a/arch/m32r/include/asm/barrier.h
+++ b/arch/m32r/include/asm/barrier.h
@@ -79,12 +79,14 @@
 
 #ifdef CONFIG_SMP
 #define smp_mb()	mb()
+#define smp_tmb()	mb()
 #define smp_rmb()	rmb()
 #define smp_wmb()	wmb()
 #define smp_read_barrier_depends()	read_barrier_depends()
 #define set_mb(var, value) do { (void) xchg(&var, value); } while (0)
 #else
 #define smp_mb()	barrier()
+#define smp_tmb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
 #define smp_read_barrier_depends()	do { } while (0)
diff --git a/arch/m68k/include/asm/barrier.h b/arch/m68k/include/asm/barrier.h
index 445ce22c23cb..8ecf52c87847 100644
--- a/arch/m68k/include/asm/barrier.h
+++ b/arch/m68k/include/asm/barrier.h
@@ -13,6 +13,7 @@
 #define set_mb(var, value)	({ (var) = (value); wmb(); })
 
 #define smp_mb()	barrier()
+#define smp_tmb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
 #define smp_read_barrier_depends()	((void)0)
diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
index c90bfc6bf648..eb179fbce580 100644
--- a/arch/metag/include/asm/barrier.h
+++ b/arch/metag/include/asm/barrier.h
@@ -50,6 +50,7 @@ static inline void wmb(void)
 #ifndef CONFIG_SMP
 #define fence()		do { } while (0)
 #define smp_mb()        barrier()
+#define smp_tmb()       barrier()
 #define smp_rmb()       barrier()
 #define smp_wmb()       barrier()
 #else
@@ -70,11 +71,13 @@ static inline void fence(void)
 	*flushptr = 0;
 }
 #define smp_mb()        fence()
+#define smp_tmb()       fence()
 #define smp_rmb()       fence()
 #define smp_wmb()       barrier()
 #else
 #define fence()		do { } while (0)
 #define smp_mb()        barrier()
+#define smp_tmb()       barrier()
 #define smp_rmb()       barrier()
 #define smp_wmb()       barrier()
 #endif
diff --git a/arch/microblaze/include/asm/barrier.h b/arch/microblaze/include/asm/barrier.h
index df5be3e87044..d573c170a717 100644
--- a/arch/microblaze/include/asm/barrier.h
+++ b/arch/microblaze/include/asm/barrier.h
@@ -21,6 +21,7 @@
 #define set_wmb(var, value)	do { var = value; wmb(); } while (0)
 
 #define smp_mb()		mb()
+#define smp_tmb()		mb()
 #define smp_rmb()		rmb()
 #define smp_wmb()		wmb()
 
diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
index 314ab5532019..535e699eec3b 100644
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -144,15 +144,18 @@
 #if defined(CONFIG_WEAK_ORDERING) && defined(CONFIG_SMP)
 # ifdef CONFIG_CPU_CAVIUM_OCTEON
 #  define smp_mb()	__sync()
+#  define smp_tmb()	__sync()
 #  define smp_rmb()	barrier()
 #  define smp_wmb()	__syncw()
 # else
 #  define smp_mb()	__asm__ __volatile__("sync" : : :"memory")
+#  define smp_tmb()	__asm__ __volatile__("sync" : : :"memory")
 #  define smp_rmb()	__asm__ __volatile__("sync" : : :"memory")
 #  define smp_wmb()	__asm__ __volatile__("sync" : : :"memory")
 # endif
 #else
 #define smp_mb()	barrier()
+#define smp_tmb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
 #endif
diff --git a/arch/mn10300/include/asm/barrier.h b/arch/mn10300/include/asm/barrier.h
index 2bd97a5c8af7..a345b0776e5f 100644
--- a/arch/mn10300/include/asm/barrier.h
+++ b/arch/mn10300/include/asm/barrier.h
@@ -19,11 +19,13 @@
 
 #ifdef CONFIG_SMP
 #define smp_mb()	mb()
+#define smp_tmb()	mb()
 #define smp_rmb()	rmb()
 #define smp_wmb()	wmb()
 #define set_mb(var, value)  do { xchg(&var, value); } while (0)
 #else  /* CONFIG_SMP */
 #define smp_mb()	barrier()
+#define smp_tmb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
 #define set_mb(var, value)  do { var = value;  mb(); } while (0)
diff --git a/arch/parisc/include/asm/barrier.h b/arch/parisc/include/asm/barrier.h
index e77d834aa803..f53196b589ec 100644
--- a/arch/parisc/include/asm/barrier.h
+++ b/arch/parisc/include/asm/barrier.h
@@ -25,6 +25,7 @@
 #define rmb()		mb()
 #define wmb()		mb()
 #define smp_mb()	mb()
+#define smp_tmb()	mb()
 #define smp_rmb()	mb()
 #define smp_wmb()	mb()
 #define smp_read_barrier_depends()	do { } while(0)
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index ae782254e731..d7e8a560f1fe 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -46,11 +46,13 @@
 #endif
 
 #define smp_mb()	mb()
+#define smp_tmb()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
 #define smp_rmb()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
 #define smp_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
 #define smp_read_barrier_depends()	read_barrier_depends()
 #else
 #define smp_mb()	barrier()
+#define smp_tmb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
 #define smp_read_barrier_depends()	do { } while(0)
diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
index 16760eeb79b0..f0409a874243 100644
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -24,6 +24,7 @@
 #define wmb()				mb()
 #define read_barrier_depends()		do { } while(0)
 #define smp_mb()			mb()
+#define smp_tmb()			mb()
 #define smp_rmb()			rmb()
 #define smp_wmb()			wmb()
 #define smp_read_barrier_depends()	read_barrier_depends()
diff --git a/arch/score/include/asm/barrier.h b/arch/score/include/asm/barrier.h
index 0eacb6471e6d..865652083dde 100644
--- a/arch/score/include/asm/barrier.h
+++ b/arch/score/include/asm/barrier.h
@@ -5,6 +5,7 @@
 #define rmb()		barrier()
 #define wmb()		barrier()
 #define smp_mb()	barrier()
+#define smp_tmb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
 
diff --git a/arch/sh/include/asm/barrier.h b/arch/sh/include/asm/barrier.h
index 72c103dae300..f8dce7926432 100644
--- a/arch/sh/include/asm/barrier.h
+++ b/arch/sh/include/asm/barrier.h
@@ -39,11 +39,13 @@
 
 #ifdef CONFIG_SMP
 #define smp_mb()	mb()
+#define smp_tmb()	mb()
 #define smp_rmb()	rmb()
 #define smp_wmb()	wmb()
 #define smp_read_barrier_depends()	read_barrier_depends()
 #else
 #define smp_mb()	barrier()
+#define smp_tmb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
 #define smp_read_barrier_depends()	do { } while(0)
diff --git a/arch/sparc/include/asm/barrier_32.h b/arch/sparc/include/asm/barrier_32.h
index c1b76654ee76..1037ce189cee 100644
--- a/arch/sparc/include/asm/barrier_32.h
+++ b/arch/sparc/include/asm/barrier_32.h
@@ -8,6 +8,7 @@
 #define read_barrier_depends()	do { } while(0)
 #define set_mb(__var, __value)  do { __var = __value; mb(); } while(0)
 #define smp_mb()	__asm__ __volatile__("":::"memory")
+#define smp_tmb()	__asm__ __volatile__("":::"memory")
 #define smp_rmb()	__asm__ __volatile__("":::"memory")
 #define smp_wmb()	__asm__ __volatile__("":::"memory")
 #define smp_read_barrier_depends()	do { } while(0)
diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
index 95d45986f908..0f3c2fdb86b8 100644
--- a/arch/sparc/include/asm/barrier_64.h
+++ b/arch/sparc/include/asm/barrier_64.h
@@ -34,6 +34,7 @@ do {	__asm__ __volatile__("ba,pt	%%xcc, 1f\n\t" \
  * memory ordering than required by the specifications.
  */
 #define mb()	membar_safe("#StoreLoad")
+#define tmb()	__asm__ __volatile__("":::"memory")
 #define rmb()	__asm__ __volatile__("":::"memory")
 #define wmb()	__asm__ __volatile__("":::"memory")
 
@@ -43,10 +44,12 @@ do {	__asm__ __volatile__("ba,pt	%%xcc, 1f\n\t" \
 
 #ifdef CONFIG_SMP
 #define smp_mb()	mb()
+#define smp_tmb()	tmb()
 #define smp_rmb()	rmb()
 #define smp_wmb()	wmb()
 #else
 #define smp_mb()	__asm__ __volatile__("":::"memory")
+#define smp_tmb()	__asm__ __volatile__("":::"memory")
 #define smp_rmb()	__asm__ __volatile__("":::"memory")
 #define smp_wmb()	__asm__ __volatile__("":::"memory")
 #endif
diff --git a/arch/tile/include/asm/barrier.h b/arch/tile/include/asm/barrier.h
index a9a73da5865d..cad3c6ae28bf 100644
--- a/arch/tile/include/asm/barrier.h
+++ b/arch/tile/include/asm/barrier.h
@@ -127,11 +127,13 @@ mb_incoherent(void)
 
 #ifdef CONFIG_SMP
 #define smp_mb()	mb()
+#define smp_tmb()	mb()
 #define smp_rmb()	rmb()
 #define smp_wmb()	wmb()
 #define smp_read_barrier_depends()	read_barrier_depends()
 #else
 #define smp_mb()	barrier()
+#define smp_tmb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
 #define smp_read_barrier_depends()	do { } while (0)
diff --git a/arch/unicore32/include/asm/barrier.h b/arch/unicore32/include/asm/barrier.h
index a6620e5336b6..8b341fffbda6 100644
--- a/arch/unicore32/include/asm/barrier.h
+++ b/arch/unicore32/include/asm/barrier.h
@@ -18,6 +18,7 @@
 #define rmb()				barrier()
 #define wmb()				barrier()
 #define smp_mb()			barrier()
+#define smp_tmb()			barrier()
 #define smp_rmb()			barrier()
 #define smp_wmb()			barrier()
 #define read_barrier_depends()		do { } while (0)
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index c6cd358a1eec..480201d83af1 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -86,14 +86,17 @@
 # define smp_rmb()	barrier()
 #endif
 #ifdef CONFIG_X86_OOSTORE
+# define smp_tmb()	mb()
 # define smp_wmb() 	wmb()
 #else
+# define smp_tmb()	barrier()
 # define smp_wmb()	barrier()
 #endif
 #define smp_read_barrier_depends()	read_barrier_depends()
 #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
 #else
 #define smp_mb()	barrier()
+#define smp_tmb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
 #define smp_read_barrier_depends()	do { } while (0)
diff --git a/arch/xtensa/include/asm/barrier.h b/arch/xtensa/include/asm/barrier.h
index ef021677d536..7839db843ea5 100644
--- a/arch/xtensa/include/asm/barrier.h
+++ b/arch/xtensa/include/asm/barrier.h
@@ -20,6 +20,7 @@
 #error smp_* not defined
 #else
 #define smp_mb()	barrier()
+#define smp_tmb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
 #endif


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-11-03 14:40                           ` Paul E. McKenney
  2013-11-03 15:17                             ` [RFC] arch: Introduce new TSO memory barrier smp_tmb() Peter Zijlstra
@ 2013-11-03 17:07                             ` Will Deacon
  2013-11-03 22:47                               ` Paul E. McKenney
  1 sibling, 1 reply; 120+ messages in thread
From: Will Deacon @ 2013-11-03 17:07 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev

On Sun, Nov 03, 2013 at 02:40:17PM +0000, Paul E. McKenney wrote:
> On Sat, Nov 02, 2013 at 10:32:39AM -0700, Paul E. McKenney wrote:
> > On Fri, Nov 01, 2013 at 03:56:34PM +0100, Peter Zijlstra wrote:
> > > On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote:
> > > > > Now the whole crux of the question is if we need barrier A at all, since
> > > > > the STORES issued by the @buf writes are dependent on the ubuf->tail
> > > > > read.
> > > > 
> > > > The dependency you are talking about is via the "if" statement?
> > > > Even C/C++11 is not required to respect control dependencies.
> > > > 
> > > > This one is a bit annoying.  The x86 TSO means that you really only
> > > > need barrier(), ARM (recent ARM, anyway) and Power could use a weaker
> > > > barrier, and so on -- but smp_mb() emits a full barrier.
> > > > 
> > > > Perhaps a new smp_tmb() for TSO semantics, where reads are ordered
> > > > before reads, writes before writes, and reads before writes, but not
> > > > writes before reads?  Another approach would be to define a per-arch
> > > > barrier for this particular case.
> > > 
> > > I suppose we can only introduce new barrier primitives if there's more
> > > than 1 use-case.

Which barrier did you have in mind when you refer to `recent ARM' above? It
seems to me like you'd need a combination of dmb ishld and dmb ishst, since
the former doesn't order writes before writes.
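
For concreteness, that combination might look something like the sketch
below (illustration only, not a proposal; smp_tmb is the name from the
RFC and the mnemonics are the ARMv8 ones):

	/*
	 * dmb ishld orders prior loads against later loads and stores,
	 * dmb ishst orders prior stores against later stores; together
	 * that covers everything except store->load.
	 */
	#define smp_tmb()	asm volatile("dmb ishld\n\tdmb ishst" : : : "memory")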

Will

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-03 15:17                             ` [RFC] arch: Introduce new TSO memory barrier smp_tmb() Peter Zijlstra
@ 2013-11-03 18:08                               ` Linus Torvalds
  2013-11-03 20:01                                 ` Peter Zijlstra
  2013-11-03 20:59                               ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 120+ messages in thread
From: Linus Torvalds @ 2013-11-03 18:08 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Victor Kaplansky, Oleg Nesterov,
	Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling

On Sun, Nov 3, 2013 at 7:17 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sun, Nov 03, 2013 at 06:40:17AM -0800, Paul E. McKenney wrote:
>> If there was an smp_tmb(), I would likely use it in rcu_assign_pointer().
>
> Well, I'm obviously all for introducing this new barrier, for it will
> reduce a full mfence on x86 to a compiler barrier. And ppc can use
> lwsync as opposed to sync afaict. Not sure ARM can do better.
>
> ---
> Subject: arch: Introduce new TSO memory barrier smp_tmb()

This is specialized enough that I would *really* like the name to be
more descriptive. Compare to the special "smp_read_barrier_depends()"
macro: it's unusual, and it has very specific semantics, so it gets a
long and descriptive name.

Memory ordering is subtle enough without then using names that are
subtle in themselves. mb/rmb/wmb are conceptually pretty simple
operations, and very basic when talking about memory ordering.
"acquire" and "release" are less simple, but have descriptive names
and have very specific uses in locking.

In contrast "smp_tmb()" is a *horrible* name, because TSO is a
description of the memory ordering, not of a particular barrier. It's
also not even clear that you can have a "tso barrier", since the
ordering (like acquire/release) presumably is really about one
particular *store*, not about some kind of barrier between different
operations.

So please describe exactly what the semantics that barrier has, and
then name the barrier that way.

I assume that in this particular case, the semantics RCU wants is
"write barrier, and no preceding reads can move past this point".

Calling that "smp_tmb()" is f*cking insane, imnsho.

              Linus

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-03 18:08                               ` Linus Torvalds
@ 2013-11-03 20:01                                 ` Peter Zijlstra
  2013-11-03 22:42                                   ` Paul E. McKenney
  0 siblings, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-03 20:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul E. McKenney, Victor Kaplansky, Oleg Nesterov,
	Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling

On Sun, Nov 03, 2013 at 10:08:14AM -0800, Linus Torvalds wrote:
> On Sun, Nov 3, 2013 at 7:17 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Sun, Nov 03, 2013 at 06:40:17AM -0800, Paul E. McKenney wrote:
> >> If there was an smp_tmb(), I would likely use it in rcu_assign_pointer().
> >
> > Well, I'm obviously all for introducing this new barrier, for it will
> > reduce a full mfence on x86 to a compiler barrier. And ppc can use
> > lwsync as opposed to sync afaict. Not sure ARM can do better.
> >
> > ---
> > Subject: arch: Introduce new TSO memory barrier smp_tmb()
> 
> This is specialized enough that I would *really* like the name to be
> more descriptive. Compare to the special "smp_read_barrier_depends()"
> macro: it's unusual, and it has very specific semantics, so it gets a
> long and descriptive name.
> 
> Memory ordering is subtle enough without then using names that are
> subtle in themselves. mb/rmb/wmb are conceptually pretty simple
> operations, and very basic when talking about memory ordering.
> "acquire" and "release" are less simple, but have descriptive names
> and have very specific uses in locking.
> 
> In contrast "smp_tmb()" is a *horrible* name, because TSO is a
> description of the memory ordering, not of a particular barrier. It's
> also not even clear that you can have a "tso barrier", since the
> ordering (like acquire/release) presumably is really about one
> particular *store*, not about some kind of barrier between different
> operations.
> 
> So please describe exactly what the semantics that barrier has, and
> then name the barrier that way.
> 
> I assume that in this particular case, the semantics RCU wants is
> "write barrier, and no preceding reads can move past this point".
> 
> Calling that "smp_tmb()" is f*cking insane, imnsho.

Fair enough; from what I could gather the proposed semantics are
RELEASE+WMB, such that neither reads nor writes can cross over, writes
can't cross back, but reads could.

Since both RELEASE and WMB are trivial under TSO the entire thing
collapses.

Now I'm completely confused as to what C/C++ wrecks vs. actual
memory-ordering issues, let alone fully comprehending the case that
started all this.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-11-01 16:30                               ` Victor Kaplansky
@ 2013-11-03 20:57                                 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 120+ messages in thread
From: Benjamin Herrenschmidt @ 2013-11-03 20:57 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: David Laight, Michael Neuling, Mathieu Desnoyers, Peter Zijlstra,
	Frederic Weisbecker, LKML, Oleg Nesterov, Linux PPC dev,
	Anton Blanchard, paulmck

On Fri, 2013-11-01 at 18:30 +0200, Victor Kaplansky wrote:
> "David Laight" <David.Laight@aculab.com> wrote on 11/01/2013 06:25:29 PM:
> > gcc will do unexpected memory accesses for bit fields that are
> > adjacent to volatile data.
> > In particular it may generate 64bit sized (and aligned) RMW cycles
> > when accessing bit fields.
> > And yes, this has caused real problems.
> 
> Thanks, I am aware about this bug/feature in gcc.

AFAIK, this has been fixed in 4.8 and 4.7.3 ... 
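
(For anyone who has not run into it, a minimal illustration; made-up
structure, not taken from any real code:)

	struct s {
		int counter;		/* updated concurrently elsewhere */
		unsigned int flag:1;	/* bit field adjacent to it */
	};

	/*
	 * On the affected gcc versions, "p->flag = 1" could be emitted as a
	 * 64-bit read-modify-write covering "counter" as well, silently
	 * corrupting concurrent updates to that field.
	 */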

Cheers,
Ben.

> -- Victor
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-03 15:17                             ` [RFC] arch: Introduce new TSO memory barrier smp_tmb() Peter Zijlstra
  2013-11-03 18:08                               ` Linus Torvalds
@ 2013-11-03 20:59                               ` Benjamin Herrenschmidt
  2013-11-03 22:43                                 ` Paul E. McKenney
  1 sibling, 1 reply; 120+ messages in thread
From: Benjamin Herrenschmidt @ 2013-11-03 20:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Victor Kaplansky, Oleg Nesterov,
	Anton Blanchard, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling,
	Linus Torvalds

On Sun, 2013-11-03 at 16:17 +0100, Peter Zijlstra wrote:
> On Sun, Nov 03, 2013 at 06:40:17AM -0800, Paul E. McKenney wrote:
> > If there was an smp_tmb(), I would likely use it in rcu_assign_pointer().
> 
> Well, I'm obviously all for introducing this new barrier, for it will
> reduce a full mfence on x86 to a compiler barrier. And ppc can use
> lwsync as opposed to sync afaict. Not sure ARM can do better.

The patch at the *very least* needs a good description of the semantics
of the barrier: what it orders against what, etc.

Cheers,
Ben.

> ---
> Subject: arch: Introduce new TSO memory barrier smp_tmb()
> 
> A few sites could be downgraded from smp_mb() to smp_tmb(), and a few
> sites that are now using smp_wmb() should be upgraded to smp_tmb().
> 
> XXX hope PaulMck explains things better..
> 
> X86 (!OOSTORE), SPARC have native TSO memory models and smp_tmb()
> reduces to barrier().
> 
> PPC can use lwsync instead of sync
> 
> For the other archs, have smp_tmb map to smp_mb, as the stronger barrier
> is always correct but possibly suboptimal.
> 
> Suggested-by: Paul McKenney <paulmck@linux.vnet.ibm.com>
> Not-Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> ---
>  arch/alpha/include/asm/barrier.h      | 2 ++
>  arch/arc/include/asm/barrier.h        | 2 ++
>  arch/arm/include/asm/barrier.h        | 2 ++
>  arch/arm64/include/asm/barrier.h      | 2 ++
>  arch/avr32/include/asm/barrier.h      | 1 +
>  arch/blackfin/include/asm/barrier.h   | 1 +
>  arch/cris/include/asm/barrier.h       | 2 ++
>  arch/frv/include/asm/barrier.h        | 1 +
>  arch/h8300/include/asm/barrier.h      | 2 ++
>  arch/hexagon/include/asm/barrier.h    | 1 +
>  arch/ia64/include/asm/barrier.h       | 2 ++
>  arch/m32r/include/asm/barrier.h       | 2 ++
>  arch/m68k/include/asm/barrier.h       | 1 +
>  arch/metag/include/asm/barrier.h      | 3 +++
>  arch/microblaze/include/asm/barrier.h | 1 +
>  arch/mips/include/asm/barrier.h       | 3 +++
>  arch/mn10300/include/asm/barrier.h    | 2 ++
>  arch/parisc/include/asm/barrier.h     | 1 +
>  arch/powerpc/include/asm/barrier.h    | 2 ++
>  arch/s390/include/asm/barrier.h       | 1 +
>  arch/score/include/asm/barrier.h      | 1 +
>  arch/sh/include/asm/barrier.h         | 2 ++
>  arch/sparc/include/asm/barrier_32.h   | 1 +
>  arch/sparc/include/asm/barrier_64.h   | 3 +++
>  arch/tile/include/asm/barrier.h       | 2 ++
>  arch/unicore32/include/asm/barrier.h  | 1 +
>  arch/x86/include/asm/barrier.h        | 3 +++
>  arch/xtensa/include/asm/barrier.h     | 1 +
>  28 files changed, 48 insertions(+)
> 
> diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
> index ce8860a0b32d..02ea63897038 100644
> --- a/arch/alpha/include/asm/barrier.h
> +++ b/arch/alpha/include/asm/barrier.h
> @@ -18,12 +18,14 @@ __asm__ __volatile__("mb": : :"memory")
>  #ifdef CONFIG_SMP
>  #define __ASM_SMP_MB	"\tmb\n"
>  #define smp_mb()	mb()
> +#define smp_tmb()	mb()
>  #define smp_rmb()	rmb()
>  #define smp_wmb()	wmb()
>  #define smp_read_barrier_depends()	read_barrier_depends()
>  #else
>  #define __ASM_SMP_MB
>  #define smp_mb()	barrier()
> +#define smp_tmb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
>  #define smp_read_barrier_depends()	do { } while (0)
> diff --git a/arch/arc/include/asm/barrier.h b/arch/arc/include/asm/barrier.h
> index f6cb7c4ffb35..456c790fa1ad 100644
> --- a/arch/arc/include/asm/barrier.h
> +++ b/arch/arc/include/asm/barrier.h
> @@ -22,10 +22,12 @@
>  /* TODO-vineetg verify the correctness of macros here */
>  #ifdef CONFIG_SMP
>  #define smp_mb()        mb()
> +#define smp_tmb()	mb()
>  #define smp_rmb()       rmb()
>  #define smp_wmb()       wmb()
>  #else
>  #define smp_mb()        barrier()
> +#define smp_tmb()	barrier()
>  #define smp_rmb()       barrier()
>  #define smp_wmb()       barrier()
>  #endif
> diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> index 60f15e274e6d..bc88a8505673 100644
> --- a/arch/arm/include/asm/barrier.h
> +++ b/arch/arm/include/asm/barrier.h
> @@ -51,10 +51,12 @@
>  
>  #ifndef CONFIG_SMP
>  #define smp_mb()	barrier()
> +#define smp_tmb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
>  #else
>  #define smp_mb()	dmb(ish)
> +#define smp_tmb()	smp_mb()
>  #define smp_rmb()	smp_mb()
>  #define smp_wmb()	dmb(ishst)
>  #endif
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index d4a63338a53c..ec0531f4892f 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -33,10 +33,12 @@
>  
>  #ifndef CONFIG_SMP
>  #define smp_mb()	barrier()
> +#define smp_tmb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
>  #else
>  #define smp_mb()	asm volatile("dmb ish" : : : "memory")
> +#define smp_tmb()	asm volatile("dmb ish" : : : "memory")
>  #define smp_rmb()	asm volatile("dmb ishld" : : : "memory")
>  #define smp_wmb()	asm volatile("dmb ishst" : : : "memory")
>  #endif
> diff --git a/arch/avr32/include/asm/barrier.h b/arch/avr32/include/asm/barrier.h
> index 0961275373db..6c6ccb9cf290 100644
> --- a/arch/avr32/include/asm/barrier.h
> +++ b/arch/avr32/include/asm/barrier.h
> @@ -20,6 +20,7 @@
>  # error "The AVR32 port does not support SMP"
>  #else
>  # define smp_mb()		barrier()
> +# define smp_tmb()		barrier()
>  # define smp_rmb()		barrier()
>  # define smp_wmb()		barrier()
>  # define smp_read_barrier_depends() do { } while(0)
> diff --git a/arch/blackfin/include/asm/barrier.h b/arch/blackfin/include/asm/barrier.h
> index ebb189507dd7..100f49121a18 100644
> --- a/arch/blackfin/include/asm/barrier.h
> +++ b/arch/blackfin/include/asm/barrier.h
> @@ -40,6 +40,7 @@
>  #endif /* !CONFIG_SMP */
>  
>  #define smp_mb()  mb()
> +#define smp_tmb() mb()
>  #define smp_rmb() rmb()
>  #define smp_wmb() wmb()
>  #define set_mb(var, value) do { var = value; mb(); } while (0)
> diff --git a/arch/cris/include/asm/barrier.h b/arch/cris/include/asm/barrier.h
> index 198ad7fa6b25..679c33738b4c 100644
> --- a/arch/cris/include/asm/barrier.h
> +++ b/arch/cris/include/asm/barrier.h
> @@ -12,11 +12,13 @@
>  
>  #ifdef CONFIG_SMP
>  #define smp_mb()        mb()
> +#define smp_tmb()       mb()
>  #define smp_rmb()       rmb()
>  #define smp_wmb()       wmb()
>  #define smp_read_barrier_depends()     read_barrier_depends()
>  #else
>  #define smp_mb()        barrier()
> +#define smp_tmb()       barrier()
>  #define smp_rmb()       barrier()
>  #define smp_wmb()       barrier()
>  #define smp_read_barrier_depends()     do { } while(0)
> diff --git a/arch/frv/include/asm/barrier.h b/arch/frv/include/asm/barrier.h
> index 06776ad9f5e9..60354ce13ba0 100644
> --- a/arch/frv/include/asm/barrier.h
> +++ b/arch/frv/include/asm/barrier.h
> @@ -20,6 +20,7 @@
>  #define read_barrier_depends()	do { } while (0)
>  
>  #define smp_mb()			barrier()
> +#define smp_tmb()			barrier()
>  #define smp_rmb()			barrier()
>  #define smp_wmb()			barrier()
>  #define smp_read_barrier_depends()	do {} while(0)
> diff --git a/arch/h8300/include/asm/barrier.h b/arch/h8300/include/asm/barrier.h
> index 9e0aa9fc195d..e8e297fa4e9a 100644
> --- a/arch/h8300/include/asm/barrier.h
> +++ b/arch/h8300/include/asm/barrier.h
> @@ -16,11 +16,13 @@
>  
>  #ifdef CONFIG_SMP
>  #define smp_mb()	mb()
> +#define smp_tmb()	mb()
>  #define smp_rmb()	rmb()
>  #define smp_wmb()	wmb()
>  #define smp_read_barrier_depends()	read_barrier_depends()
>  #else
>  #define smp_mb()	barrier()
> +#define smp_tmb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
>  #define smp_read_barrier_depends()	do { } while(0)
> diff --git a/arch/hexagon/include/asm/barrier.h b/arch/hexagon/include/asm/barrier.h
> index 1041a8e70ce8..2dd5b2ad4d21 100644
> --- a/arch/hexagon/include/asm/barrier.h
> +++ b/arch/hexagon/include/asm/barrier.h
> @@ -28,6 +28,7 @@
>  #define smp_rmb()			barrier()
>  #define smp_read_barrier_depends()	barrier()
>  #define smp_wmb()			barrier()
> +#define smp_tmb()			barrier()
>  #define smp_mb()			barrier()
>  #define smp_mb__before_atomic_dec()	barrier()
>  #define smp_mb__after_atomic_dec()	barrier()
> diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
> index 60576e06b6fb..a5f92146b091 100644
> --- a/arch/ia64/include/asm/barrier.h
> +++ b/arch/ia64/include/asm/barrier.h
> @@ -42,11 +42,13 @@
>  
>  #ifdef CONFIG_SMP
>  # define smp_mb()	mb()
> +# define smp_tmb()	mb()
>  # define smp_rmb()	rmb()
>  # define smp_wmb()	wmb()
>  # define smp_read_barrier_depends()	read_barrier_depends()
>  #else
>  # define smp_mb()	barrier()
> +# define smp_tmb()	barrier()
>  # define smp_rmb()	barrier()
>  # define smp_wmb()	barrier()
>  # define smp_read_barrier_depends()	do { } while(0)
> diff --git a/arch/m32r/include/asm/barrier.h b/arch/m32r/include/asm/barrier.h
> index 6976621efd3f..a6fa29facd7a 100644
> --- a/arch/m32r/include/asm/barrier.h
> +++ b/arch/m32r/include/asm/barrier.h
> @@ -79,12 +79,14 @@
>  
>  #ifdef CONFIG_SMP
>  #define smp_mb()	mb()
> +#define smp_tmb()	mb()
>  #define smp_rmb()	rmb()
>  #define smp_wmb()	wmb()
>  #define smp_read_barrier_depends()	read_barrier_depends()
>  #define set_mb(var, value) do { (void) xchg(&var, value); } while (0)
>  #else
>  #define smp_mb()	barrier()
> +#define smp_tmb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
>  #define smp_read_barrier_depends()	do { } while (0)
> diff --git a/arch/m68k/include/asm/barrier.h b/arch/m68k/include/asm/barrier.h
> index 445ce22c23cb..8ecf52c87847 100644
> --- a/arch/m68k/include/asm/barrier.h
> +++ b/arch/m68k/include/asm/barrier.h
> @@ -13,6 +13,7 @@
>  #define set_mb(var, value)	({ (var) = (value); wmb(); })
>  
>  #define smp_mb()	barrier()
> +#define smp_tmb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
>  #define smp_read_barrier_depends()	((void)0)
> diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
> index c90bfc6bf648..eb179fbce580 100644
> --- a/arch/metag/include/asm/barrier.h
> +++ b/arch/metag/include/asm/barrier.h
> @@ -50,6 +50,7 @@ static inline void wmb(void)
>  #ifndef CONFIG_SMP
>  #define fence()		do { } while (0)
>  #define smp_mb()        barrier()
> +#define smp_tmb()       barrier()
>  #define smp_rmb()       barrier()
>  #define smp_wmb()       barrier()
>  #else
> @@ -70,11 +71,13 @@ static inline void fence(void)
>  	*flushptr = 0;
>  }
>  #define smp_mb()        fence()
> +#define smp_tmb()       fence()
>  #define smp_rmb()       fence()
>  #define smp_wmb()       barrier()
>  #else
>  #define fence()		do { } while (0)
>  #define smp_mb()        barrier()
> +#define smp_tmb()       barrier()
>  #define smp_rmb()       barrier()
>  #define smp_wmb()       barrier()
>  #endif
> diff --git a/arch/microblaze/include/asm/barrier.h b/arch/microblaze/include/asm/barrier.h
> index df5be3e87044..d573c170a717 100644
> --- a/arch/microblaze/include/asm/barrier.h
> +++ b/arch/microblaze/include/asm/barrier.h
> @@ -21,6 +21,7 @@
>  #define set_wmb(var, value)	do { var = value; wmb(); } while (0)
>  
>  #define smp_mb()		mb()
> +#define smp_tmb()		mb()
>  #define smp_rmb()		rmb()
>  #define smp_wmb()		wmb()
>  
> diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
> index 314ab5532019..535e699eec3b 100644
> --- a/arch/mips/include/asm/barrier.h
> +++ b/arch/mips/include/asm/barrier.h
> @@ -144,15 +144,18 @@
>  #if defined(CONFIG_WEAK_ORDERING) && defined(CONFIG_SMP)
>  # ifdef CONFIG_CPU_CAVIUM_OCTEON
>  #  define smp_mb()	__sync()
> +#  define smp_tmb()	__sync()
>  #  define smp_rmb()	barrier()
>  #  define smp_wmb()	__syncw()
>  # else
>  #  define smp_mb()	__asm__ __volatile__("sync" : : :"memory")
> +#  define smp_tmb()	__asm__ __volatile__("sync" : : :"memory")
>  #  define smp_rmb()	__asm__ __volatile__("sync" : : :"memory")
>  #  define smp_wmb()	__asm__ __volatile__("sync" : : :"memory")
>  # endif
>  #else
>  #define smp_mb()	barrier()
> +#define smp_tmb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
>  #endif
> diff --git a/arch/mn10300/include/asm/barrier.h b/arch/mn10300/include/asm/barrier.h
> index 2bd97a5c8af7..a345b0776e5f 100644
> --- a/arch/mn10300/include/asm/barrier.h
> +++ b/arch/mn10300/include/asm/barrier.h
> @@ -19,11 +19,13 @@
>  
>  #ifdef CONFIG_SMP
>  #define smp_mb()	mb()
> +#define smp_tmb()	mb()
>  #define smp_rmb()	rmb()
>  #define smp_wmb()	wmb()
>  #define set_mb(var, value)  do { xchg(&var, value); } while (0)
>  #else  /* CONFIG_SMP */
>  #define smp_mb()	barrier()
> +#define smp_tmb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
>  #define set_mb(var, value)  do { var = value;  mb(); } while (0)
> diff --git a/arch/parisc/include/asm/barrier.h b/arch/parisc/include/asm/barrier.h
> index e77d834aa803..f53196b589ec 100644
> --- a/arch/parisc/include/asm/barrier.h
> +++ b/arch/parisc/include/asm/barrier.h
> @@ -25,6 +25,7 @@
>  #define rmb()		mb()
>  #define wmb()		mb()
>  #define smp_mb()	mb()
> +#define smp_tmb()	mb()
>  #define smp_rmb()	mb()
>  #define smp_wmb()	mb()
>  #define smp_read_barrier_depends()	do { } while(0)
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index ae782254e731..d7e8a560f1fe 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -46,11 +46,13 @@
>  #endif
>  
>  #define smp_mb()	mb()
> +#define smp_tmb()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
>  #define smp_rmb()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
>  #define smp_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
>  #define smp_read_barrier_depends()	read_barrier_depends()
>  #else
>  #define smp_mb()	barrier()
> +#define smp_tmb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
>  #define smp_read_barrier_depends()	do { } while(0)
> diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
> index 16760eeb79b0..f0409a874243 100644
> --- a/arch/s390/include/asm/barrier.h
> +++ b/arch/s390/include/asm/barrier.h
> @@ -24,6 +24,7 @@
>  #define wmb()				mb()
>  #define read_barrier_depends()		do { } while(0)
>  #define smp_mb()			mb()
> +#define smp_tmb()			mb()
>  #define smp_rmb()			rmb()
>  #define smp_wmb()			wmb()
>  #define smp_read_barrier_depends()	read_barrier_depends()
> diff --git a/arch/score/include/asm/barrier.h b/arch/score/include/asm/barrier.h
> index 0eacb6471e6d..865652083dde 100644
> --- a/arch/score/include/asm/barrier.h
> +++ b/arch/score/include/asm/barrier.h
> @@ -5,6 +5,7 @@
>  #define rmb()		barrier()
>  #define wmb()		barrier()
>  #define smp_mb()	barrier()
> +#define smp_tmb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
>  
> diff --git a/arch/sh/include/asm/barrier.h b/arch/sh/include/asm/barrier.h
> index 72c103dae300..f8dce7926432 100644
> --- a/arch/sh/include/asm/barrier.h
> +++ b/arch/sh/include/asm/barrier.h
> @@ -39,11 +39,13 @@
>  
>  #ifdef CONFIG_SMP
>  #define smp_mb()	mb()
> +#define smp_tmb()	mb()
>  #define smp_rmb()	rmb()
>  #define smp_wmb()	wmb()
>  #define smp_read_barrier_depends()	read_barrier_depends()
>  #else
>  #define smp_mb()	barrier()
> +#define smp_tmb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
>  #define smp_read_barrier_depends()	do { } while(0)
> diff --git a/arch/sparc/include/asm/barrier_32.h b/arch/sparc/include/asm/barrier_32.h
> index c1b76654ee76..1037ce189cee 100644
> --- a/arch/sparc/include/asm/barrier_32.h
> +++ b/arch/sparc/include/asm/barrier_32.h
> @@ -8,6 +8,7 @@
>  #define read_barrier_depends()	do { } while(0)
>  #define set_mb(__var, __value)  do { __var = __value; mb(); } while(0)
>  #define smp_mb()	__asm__ __volatile__("":::"memory")
> +#define smp_tmb()	__asm__ __volatile__("":::"memory")
>  #define smp_rmb()	__asm__ __volatile__("":::"memory")
>  #define smp_wmb()	__asm__ __volatile__("":::"memory")
>  #define smp_read_barrier_depends()	do { } while(0)
> diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
> index 95d45986f908..0f3c2fdb86b8 100644
> --- a/arch/sparc/include/asm/barrier_64.h
> +++ b/arch/sparc/include/asm/barrier_64.h
> @@ -34,6 +34,7 @@ do {	__asm__ __volatile__("ba,pt	%%xcc, 1f\n\t" \
>   * memory ordering than required by the specifications.
>   */
>  #define mb()	membar_safe("#StoreLoad")
> +#define tmb()	__asm__ __volatile__("":::"memory")
>  #define rmb()	__asm__ __volatile__("":::"memory")
>  #define wmb()	__asm__ __volatile__("":::"memory")
>  
> @@ -43,10 +44,12 @@ do {	__asm__ __volatile__("ba,pt	%%xcc, 1f\n\t" \
>  
>  #ifdef CONFIG_SMP
>  #define smp_mb()	mb()
> +#define smp_tmb()	tmb()
>  #define smp_rmb()	rmb()
>  #define smp_wmb()	wmb()
>  #else
>  #define smp_mb()	__asm__ __volatile__("":::"memory")
> +#define smp_tmb()	__asm__ __volatile__("":::"memory")
>  #define smp_rmb()	__asm__ __volatile__("":::"memory")
>  #define smp_wmb()	__asm__ __volatile__("":::"memory")
>  #endif
> diff --git a/arch/tile/include/asm/barrier.h b/arch/tile/include/asm/barrier.h
> index a9a73da5865d..cad3c6ae28bf 100644
> --- a/arch/tile/include/asm/barrier.h
> +++ b/arch/tile/include/asm/barrier.h
> @@ -127,11 +127,13 @@ mb_incoherent(void)
>  
>  #ifdef CONFIG_SMP
>  #define smp_mb()	mb()
> +#define smp_tmb()	mb()
>  #define smp_rmb()	rmb()
>  #define smp_wmb()	wmb()
>  #define smp_read_barrier_depends()	read_barrier_depends()
>  #else
>  #define smp_mb()	barrier()
> +#define smp_tmb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
>  #define smp_read_barrier_depends()	do { } while (0)
> diff --git a/arch/unicore32/include/asm/barrier.h b/arch/unicore32/include/asm/barrier.h
> index a6620e5336b6..8b341fffbda6 100644
> --- a/arch/unicore32/include/asm/barrier.h
> +++ b/arch/unicore32/include/asm/barrier.h
> @@ -18,6 +18,7 @@
>  #define rmb()				barrier()
>  #define wmb()				barrier()
>  #define smp_mb()			barrier()
> +#define smp_tmb()			barrier()
>  #define smp_rmb()			barrier()
>  #define smp_wmb()			barrier()
>  #define read_barrier_depends()		do { } while (0)
> diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
> index c6cd358a1eec..480201d83af1 100644
> --- a/arch/x86/include/asm/barrier.h
> +++ b/arch/x86/include/asm/barrier.h
> @@ -86,14 +86,17 @@
>  # define smp_rmb()	barrier()
>  #endif
>  #ifdef CONFIG_X86_OOSTORE
> +# define smp_tmb()	mb()
>  # define smp_wmb() 	wmb()
>  #else
> +# define smp_tmb()	barrier()
>  # define smp_wmb()	barrier()
>  #endif
>  #define smp_read_barrier_depends()	read_barrier_depends()
>  #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
>  #else
>  #define smp_mb()	barrier()
> +#define smp_tmb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
>  #define smp_read_barrier_depends()	do { } while (0)
> diff --git a/arch/xtensa/include/asm/barrier.h b/arch/xtensa/include/asm/barrier.h
> index ef021677d536..7839db843ea5 100644
> --- a/arch/xtensa/include/asm/barrier.h
> +++ b/arch/xtensa/include/asm/barrier.h
> @@ -20,6 +20,7 @@
>  #error smp_* not defined
>  #else
>  #define smp_mb()	barrier()
> +#define smp_tmb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
>  #endif
> 



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-03 20:01                                 ` Peter Zijlstra
@ 2013-11-03 22:42                                   ` Paul E. McKenney
  2013-11-03 23:34                                     ` Linus Torvalds
  0 siblings, 1 reply; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-03 22:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On Sun, Nov 03, 2013 at 09:01:24PM +0100, Peter Zijlstra wrote:
> On Sun, Nov 03, 2013 at 10:08:14AM -0800, Linus Torvalds wrote:
> > On Sun, Nov 3, 2013 at 7:17 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Sun, Nov 03, 2013 at 06:40:17AM -0800, Paul E. McKenney wrote:
> > >> If there was an smp_tmb(), I would likely use it in rcu_assign_pointer().
> > >
> > > Well, I'm obviously all for introducing this new barrier, for it will
> > > reduce a full mfence on x86 to a compiler barrier. And ppc can use
> > > lwsync as opposed to sync afaict. Not sure ARM can do better.
> > >
> > > ---
> > > Subject: arch: Introduce new TSO memory barrier smp_tmb()
> > 
> > This is specialized enough that I would *really* like the name to be
> > more descriptive. Compare to the special "smp_read_barrier_depends()"
> > macro: it's unusual, and it has very specific semantics, so it gets a
> > long and descriptive name.
> > 
> > Memory ordering is subtle enough without then using names that are
> > subtle in themselves. mb/rmb/wmb are conceptually pretty simple
> > operations, and very basic when talking about memory ordering.
> > "acquire" and "release" are less simple, but have descriptive names
> > and have very specific uses in locking.
> > 
> > In contrast "smp_tmb()" is a *horrible* name, because TSO is a
> > description of the memory ordering, not of a particular barrier. It's
> > also not even clear that you can have a "tso barrier", since the
> > ordering (like acquire/release) presumably is really about one
> > particular *store*, not about some kind of barrier between different
> > operations.
> > 
> > So please describe exactly what the semantics that barrier has, and
> > then name the barrier that way.
> > 
> > I assume that in this particular case, the semantics RCU wants is
> > "write barrier, and no preceding reads can move past this point".

Its semantics order prior reads against subsequent reads, prior reads
against subsequent writes, and prior writes against subsequent writes.
It does -not- order prior writes against subsequent reads.
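
As a rough litmus-style sketch (illustration only, not part of the
patch), using ACCESS_ONCE() for the accesses:

	int data, flag;

	void writer(void)			/* CPU 0 */
	{
		ACCESS_ONCE(data) = 1;
		smp_tmb();			/* orders the two stores */
		ACCESS_ONCE(flag) = 1;
	}

	void reader(void)			/* CPU 1 */
	{
		int r1, r2;

		r1 = ACCESS_ONCE(flag);
		smp_tmb();			/* orders the two loads */
		r2 = ACCESS_ONCE(data);
		BUG_ON(r1 == 1 && r2 == 0);	/* forbidden by the above */
	}

By contrast, a store followed by smp_tmb() followed by a load of some
other variable can still be reordered; that is exactly the write->read
case left out above.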

> > Calling that "smp_tmb()" is f*cking insane, imnsho.
> 
> Fair enough; from what I could gather the proposed semantics are
> RELEASE+WMB, such that neither reads nor writes can cross over, writes
> can't cross back, but reads could.
> 
> Since both RELEASE and WMB are trivial under TSO the entire thing
> collapses.

And here are some candidate names, with no attempt to sort sanity from
insanity:

smp_storebuffer_mb() -- A barrier that enforces those orderings
	that do not invalidate the hardware store-buffer optimization.

smp_not_w_r_mb() -- A barrier that orders everything except prior
	writes against subsequent reads.

smp_acqrel_mb() -- A barrier that combines C/C++ acquire and release
	semantics.  (C/C++ "acquire" orders a specific load against
	subsequent loads and stores, while C/C++ "release" orders
	a specific store against prior loads and stores.)

Others?
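
On that last one, the rough C11 analogy (analogy only, fence-to-fence
subtleties aside; the function name is just for illustration) would be
an acquire fence immediately followed by a release fence:

	#include <stdatomic.h>

	static inline void smp_acqrel_mb_sketch(void)
	{
		/* prior loads ordered against later loads and stores */
		atomic_thread_fence(memory_order_acquire);
		/* prior loads and stores ordered against later stores */
		atomic_thread_fence(memory_order_release);
	}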

> Now I'm completely confused as to what C/C++ wrecks vs. actual
> memory-ordering issues, let alone fully comprehending the case that
> started all this.

Each can result in similar wreckage.  In either case, it is about failing
to guarantee needed orderings.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-03 20:59                               ` Benjamin Herrenschmidt
@ 2013-11-03 22:43                                 ` Paul E. McKenney
  0 siblings, 0 replies; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-03 22:43 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Peter Zijlstra, Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Linus Torvalds

On Mon, Nov 04, 2013 at 07:59:23AM +1100, Benjamin Herrenschmidt wrote:
> On Sun, 2013-11-03 at 16:17 +0100, Peter Zijlstra wrote:
> > On Sun, Nov 03, 2013 at 06:40:17AM -0800, Paul E. McKenney wrote:
> > > If there was an smp_tmb(), I would likely use it in rcu_assign_pointer().
> > 
> > Well, I'm obviously all for introducing this new barrier, for it will
> > reduce a full mfence on x86 to a compiler barrier. And ppc can use
> > lwsync as opposed to sync afaict. Not sure ARM can do better.
> 
> The patch at the *very least* needs a good description of the semantics
> of the barrier, what does it order vs. what etc...

Agreed.  Also it needs a name that people can live with.  We will get
there.  ;-)

							Thanx, Paul

> Cheers,
> Ben.
> 
> > ---
> > Subject: arch: Introduce new TSO memory barrier smp_tmb()
> > 
> > A few sites could be downgraded from smp_mb() to smp_tmb(), and a few
> > sites that are now using smp_wmb() should be upgraded to smp_tmb().
> > 
> > XXX hope PaulMck explains things better..
> > 
> > X86 (!OOSTORE), SPARC have native TSO memory models and smp_tmb()
> > reduces to barrier().
> > 
> > PPC can use lwsync instead of sync
> > 
> > For the other archs, have smp_tmb map to smp_mb, as the stronger barrier
> > is always correct but possibly suboptimal.
> > 
> > Suggested-by: Paul McKenney <paulmck@linux.vnet.ibm.com>
> > Not-Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> > ---
> >  arch/alpha/include/asm/barrier.h      | 2 ++
> >  arch/arc/include/asm/barrier.h        | 2 ++
> >  arch/arm/include/asm/barrier.h        | 2 ++
> >  arch/arm64/include/asm/barrier.h      | 2 ++
> >  arch/avr32/include/asm/barrier.h      | 1 +
> >  arch/blackfin/include/asm/barrier.h   | 1 +
> >  arch/cris/include/asm/barrier.h       | 2 ++
> >  arch/frv/include/asm/barrier.h        | 1 +
> >  arch/h8300/include/asm/barrier.h      | 2 ++
> >  arch/hexagon/include/asm/barrier.h    | 1 +
> >  arch/ia64/include/asm/barrier.h       | 2 ++
> >  arch/m32r/include/asm/barrier.h       | 2 ++
> >  arch/m68k/include/asm/barrier.h       | 1 +
> >  arch/metag/include/asm/barrier.h      | 3 +++
> >  arch/microblaze/include/asm/barrier.h | 1 +
> >  arch/mips/include/asm/barrier.h       | 3 +++
> >  arch/mn10300/include/asm/barrier.h    | 2 ++
> >  arch/parisc/include/asm/barrier.h     | 1 +
> >  arch/powerpc/include/asm/barrier.h    | 2 ++
> >  arch/s390/include/asm/barrier.h       | 1 +
> >  arch/score/include/asm/barrier.h      | 1 +
> >  arch/sh/include/asm/barrier.h         | 2 ++
> >  arch/sparc/include/asm/barrier_32.h   | 1 +
> >  arch/sparc/include/asm/barrier_64.h   | 3 +++
> >  arch/tile/include/asm/barrier.h       | 2 ++
> >  arch/unicore32/include/asm/barrier.h  | 1 +
> >  arch/x86/include/asm/barrier.h        | 3 +++
> >  arch/xtensa/include/asm/barrier.h     | 1 +
> >  28 files changed, 48 insertions(+)
> > 
> > diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
> > index ce8860a0b32d..02ea63897038 100644
> > --- a/arch/alpha/include/asm/barrier.h
> > +++ b/arch/alpha/include/asm/barrier.h
> > @@ -18,12 +18,14 @@ __asm__ __volatile__("mb": : :"memory")
> >  #ifdef CONFIG_SMP
> >  #define __ASM_SMP_MB	"\tmb\n"
> >  #define smp_mb()	mb()
> > +#define smp_tmb()	mb()
> >  #define smp_rmb()	rmb()
> >  #define smp_wmb()	wmb()
> >  #define smp_read_barrier_depends()	read_barrier_depends()
> >  #else
> >  #define __ASM_SMP_MB
> >  #define smp_mb()	barrier()
> > +#define smp_tmb()	barrier()
> >  #define smp_rmb()	barrier()
> >  #define smp_wmb()	barrier()
> >  #define smp_read_barrier_depends()	do { } while (0)
> > diff --git a/arch/arc/include/asm/barrier.h b/arch/arc/include/asm/barrier.h
> > index f6cb7c4ffb35..456c790fa1ad 100644
> > --- a/arch/arc/include/asm/barrier.h
> > +++ b/arch/arc/include/asm/barrier.h
> > @@ -22,10 +22,12 @@
> >  /* TODO-vineetg verify the correctness of macros here */
> >  #ifdef CONFIG_SMP
> >  #define smp_mb()        mb()
> > +#define smp_tmb()	mb()
> >  #define smp_rmb()       rmb()
> >  #define smp_wmb()       wmb()
> >  #else
> >  #define smp_mb()        barrier()
> > +#define smp_tmb()	barrier()
> >  #define smp_rmb()       barrier()
> >  #define smp_wmb()       barrier()
> >  #endif
> > diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> > index 60f15e274e6d..bc88a8505673 100644
> > --- a/arch/arm/include/asm/barrier.h
> > +++ b/arch/arm/include/asm/barrier.h
> > @@ -51,10 +51,12 @@
> >  
> >  #ifndef CONFIG_SMP
> >  #define smp_mb()	barrier()
> > +#define smp_tmb()	barrier()
> >  #define smp_rmb()	barrier()
> >  #define smp_wmb()	barrier()
> >  #else
> >  #define smp_mb()	dmb(ish)
> > +#define smp_tmb()	smp_mb()
> >  #define smp_rmb()	smp_mb()
> >  #define smp_wmb()	dmb(ishst)
> >  #endif
> > diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> > index d4a63338a53c..ec0531f4892f 100644
> > --- a/arch/arm64/include/asm/barrier.h
> > +++ b/arch/arm64/include/asm/barrier.h
> > @@ -33,10 +33,12 @@
> >  
> >  #ifndef CONFIG_SMP
> >  #define smp_mb()	barrier()
> > +#define smp_tmb()	barrier()
> >  #define smp_rmb()	barrier()
> >  #define smp_wmb()	barrier()
> >  #else
> >  #define smp_mb()	asm volatile("dmb ish" : : : "memory")
> > +#define smp_tmb()	asm volatile("dmb ish" : : : "memory")
> >  #define smp_rmb()	asm volatile("dmb ishld" : : : "memory")
> >  #define smp_wmb()	asm volatile("dmb ishst" : : : "memory")
> >  #endif
> > diff --git a/arch/avr32/include/asm/barrier.h b/arch/avr32/include/asm/barrier.h
> > index 0961275373db..6c6ccb9cf290 100644
> > --- a/arch/avr32/include/asm/barrier.h
> > +++ b/arch/avr32/include/asm/barrier.h
> > @@ -20,6 +20,7 @@
> >  # error "The AVR32 port does not support SMP"
> >  #else
> >  # define smp_mb()		barrier()
> > +# define smp_tmb()		barrier()
> >  # define smp_rmb()		barrier()
> >  # define smp_wmb()		barrier()
> >  # define smp_read_barrier_depends() do { } while(0)
> > diff --git a/arch/blackfin/include/asm/barrier.h b/arch/blackfin/include/asm/barrier.h
> > index ebb189507dd7..100f49121a18 100644
> > --- a/arch/blackfin/include/asm/barrier.h
> > +++ b/arch/blackfin/include/asm/barrier.h
> > @@ -40,6 +40,7 @@
> >  #endif /* !CONFIG_SMP */
> >  
> >  #define smp_mb()  mb()
> > +#define smp_tmb() mb()
> >  #define smp_rmb() rmb()
> >  #define smp_wmb() wmb()
> >  #define set_mb(var, value) do { var = value; mb(); } while (0)
> > diff --git a/arch/cris/include/asm/barrier.h b/arch/cris/include/asm/barrier.h
> > index 198ad7fa6b25..679c33738b4c 100644
> > --- a/arch/cris/include/asm/barrier.h
> > +++ b/arch/cris/include/asm/barrier.h
> > @@ -12,11 +12,13 @@
> >  
> >  #ifdef CONFIG_SMP
> >  #define smp_mb()        mb()
> > +#define smp_tmb()       mb()
> >  #define smp_rmb()       rmb()
> >  #define smp_wmb()       wmb()
> >  #define smp_read_barrier_depends()     read_barrier_depends()
> >  #else
> >  #define smp_mb()        barrier()
> > +#define smp_tmb()       barrier()
> >  #define smp_rmb()       barrier()
> >  #define smp_wmb()       barrier()
> >  #define smp_read_barrier_depends()     do { } while(0)
> > diff --git a/arch/frv/include/asm/barrier.h b/arch/frv/include/asm/barrier.h
> > index 06776ad9f5e9..60354ce13ba0 100644
> > --- a/arch/frv/include/asm/barrier.h
> > +++ b/arch/frv/include/asm/barrier.h
> > @@ -20,6 +20,7 @@
> >  #define read_barrier_depends()	do { } while (0)
> >  
> >  #define smp_mb()			barrier()
> > +#define smp_tmb()			barrier()
> >  #define smp_rmb()			barrier()
> >  #define smp_wmb()			barrier()
> >  #define smp_read_barrier_depends()	do {} while(0)
> > diff --git a/arch/h8300/include/asm/barrier.h b/arch/h8300/include/asm/barrier.h
> > index 9e0aa9fc195d..e8e297fa4e9a 100644
> > --- a/arch/h8300/include/asm/barrier.h
> > +++ b/arch/h8300/include/asm/barrier.h
> > @@ -16,11 +16,13 @@
> >  
> >  #ifdef CONFIG_SMP
> >  #define smp_mb()	mb()
> > +#define smp_tmb()	mb()
> >  #define smp_rmb()	rmb()
> >  #define smp_wmb()	wmb()
> >  #define smp_read_barrier_depends()	read_barrier_depends()
> >  #else
> >  #define smp_mb()	barrier()
> > +#define smp_tmb()	barrier()
> >  #define smp_rmb()	barrier()
> >  #define smp_wmb()	barrier()
> >  #define smp_read_barrier_depends()	do { } while(0)
> > diff --git a/arch/hexagon/include/asm/barrier.h b/arch/hexagon/include/asm/barrier.h
> > index 1041a8e70ce8..2dd5b2ad4d21 100644
> > --- a/arch/hexagon/include/asm/barrier.h
> > +++ b/arch/hexagon/include/asm/barrier.h
> > @@ -28,6 +28,7 @@
> >  #define smp_rmb()			barrier()
> >  #define smp_read_barrier_depends()	barrier()
> >  #define smp_wmb()			barrier()
> > +#define smp_tmb()			barrier()
> >  #define smp_mb()			barrier()
> >  #define smp_mb__before_atomic_dec()	barrier()
> >  #define smp_mb__after_atomic_dec()	barrier()
> > diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
> > index 60576e06b6fb..a5f92146b091 100644
> > --- a/arch/ia64/include/asm/barrier.h
> > +++ b/arch/ia64/include/asm/barrier.h
> > @@ -42,11 +42,13 @@
> >  
> >  #ifdef CONFIG_SMP
> >  # define smp_mb()	mb()
> > +# define smp_tmb()	mb()
> >  # define smp_rmb()	rmb()
> >  # define smp_wmb()	wmb()
> >  # define smp_read_barrier_depends()	read_barrier_depends()
> >  #else
> >  # define smp_mb()	barrier()
> > +# define smp_tmb()	barrier()
> >  # define smp_rmb()	barrier()
> >  # define smp_wmb()	barrier()
> >  # define smp_read_barrier_depends()	do { } while(0)
> > diff --git a/arch/m32r/include/asm/barrier.h b/arch/m32r/include/asm/barrier.h
> > index 6976621efd3f..a6fa29facd7a 100644
> > --- a/arch/m32r/include/asm/barrier.h
> > +++ b/arch/m32r/include/asm/barrier.h
> > @@ -79,12 +79,14 @@
> >  
> >  #ifdef CONFIG_SMP
> >  #define smp_mb()	mb()
> > +#define smp_tmb()	mb()
> >  #define smp_rmb()	rmb()
> >  #define smp_wmb()	wmb()
> >  #define smp_read_barrier_depends()	read_barrier_depends()
> >  #define set_mb(var, value) do { (void) xchg(&var, value); } while (0)
> >  #else
> >  #define smp_mb()	barrier()
> > +#define smp_tmb()	barrier()
> >  #define smp_rmb()	barrier()
> >  #define smp_wmb()	barrier()
> >  #define smp_read_barrier_depends()	do { } while (0)
> > diff --git a/arch/m68k/include/asm/barrier.h b/arch/m68k/include/asm/barrier.h
> > index 445ce22c23cb..8ecf52c87847 100644
> > --- a/arch/m68k/include/asm/barrier.h
> > +++ b/arch/m68k/include/asm/barrier.h
> > @@ -13,6 +13,7 @@
> >  #define set_mb(var, value)	({ (var) = (value); wmb(); })
> >  
> >  #define smp_mb()	barrier()
> > +#define smp_tmb()	barrier()
> >  #define smp_rmb()	barrier()
> >  #define smp_wmb()	barrier()
> >  #define smp_read_barrier_depends()	((void)0)
> > diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
> > index c90bfc6bf648..eb179fbce580 100644
> > --- a/arch/metag/include/asm/barrier.h
> > +++ b/arch/metag/include/asm/barrier.h
> > @@ -50,6 +50,7 @@ static inline void wmb(void)
> >  #ifndef CONFIG_SMP
> >  #define fence()		do { } while (0)
> >  #define smp_mb()        barrier()
> > +#define smp_tmb()       barrier()
> >  #define smp_rmb()       barrier()
> >  #define smp_wmb()       barrier()
> >  #else
> > @@ -70,11 +71,13 @@ static inline void fence(void)
> >  	*flushptr = 0;
> >  }
> >  #define smp_mb()        fence()
> > +#define smp_tmb()       fence()
> >  #define smp_rmb()       fence()
> >  #define smp_wmb()       barrier()
> >  #else
> >  #define fence()		do { } while (0)
> >  #define smp_mb()        barrier()
> > +#define smp_tmb()       barrier()
> >  #define smp_rmb()       barrier()
> >  #define smp_wmb()       barrier()
> >  #endif
> > diff --git a/arch/microblaze/include/asm/barrier.h b/arch/microblaze/include/asm/barrier.h
> > index df5be3e87044..d573c170a717 100644
> > --- a/arch/microblaze/include/asm/barrier.h
> > +++ b/arch/microblaze/include/asm/barrier.h
> > @@ -21,6 +21,7 @@
> >  #define set_wmb(var, value)	do { var = value; wmb(); } while (0)
> >  
> >  #define smp_mb()		mb()
> > +#define smp_tmb()		mb()
> >  #define smp_rmb()		rmb()
> >  #define smp_wmb()		wmb()
> >  
> > diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
> > index 314ab5532019..535e699eec3b 100644
> > --- a/arch/mips/include/asm/barrier.h
> > +++ b/arch/mips/include/asm/barrier.h
> > @@ -144,15 +144,18 @@
> >  #if defined(CONFIG_WEAK_ORDERING) && defined(CONFIG_SMP)
> >  # ifdef CONFIG_CPU_CAVIUM_OCTEON
> >  #  define smp_mb()	__sync()
> > +#  define smp_tmb()	__sync()
> >  #  define smp_rmb()	barrier()
> >  #  define smp_wmb()	__syncw()
> >  # else
> >  #  define smp_mb()	__asm__ __volatile__("sync" : : :"memory")
> > +#  define smp_tmb()	__asm__ __volatile__("sync" : : :"memory")
> >  #  define smp_rmb()	__asm__ __volatile__("sync" : : :"memory")
> >  #  define smp_wmb()	__asm__ __volatile__("sync" : : :"memory")
> >  # endif
> >  #else
> >  #define smp_mb()	barrier()
> > +#define smp_tmb()	barrier()
> >  #define smp_rmb()	barrier()
> >  #define smp_wmb()	barrier()
> >  #endif
> > diff --git a/arch/mn10300/include/asm/barrier.h b/arch/mn10300/include/asm/barrier.h
> > index 2bd97a5c8af7..a345b0776e5f 100644
> > --- a/arch/mn10300/include/asm/barrier.h
> > +++ b/arch/mn10300/include/asm/barrier.h
> > @@ -19,11 +19,13 @@
> >  
> >  #ifdef CONFIG_SMP
> >  #define smp_mb()	mb()
> > +#define smp_tmb()	mb()
> >  #define smp_rmb()	rmb()
> >  #define smp_wmb()	wmb()
> >  #define set_mb(var, value)  do { xchg(&var, value); } while (0)
> >  #else  /* CONFIG_SMP */
> >  #define smp_mb()	barrier()
> > +#define smp_tmb()	barrier()
> >  #define smp_rmb()	barrier()
> >  #define smp_wmb()	barrier()
> >  #define set_mb(var, value)  do { var = value;  mb(); } while (0)
> > diff --git a/arch/parisc/include/asm/barrier.h b/arch/parisc/include/asm/barrier.h
> > index e77d834aa803..f53196b589ec 100644
> > --- a/arch/parisc/include/asm/barrier.h
> > +++ b/arch/parisc/include/asm/barrier.h
> > @@ -25,6 +25,7 @@
> >  #define rmb()		mb()
> >  #define wmb()		mb()
> >  #define smp_mb()	mb()
> > +#define smp_tmb()	mb()
> >  #define smp_rmb()	mb()
> >  #define smp_wmb()	mb()
> >  #define smp_read_barrier_depends()	do { } while(0)
> > diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> > index ae782254e731..d7e8a560f1fe 100644
> > --- a/arch/powerpc/include/asm/barrier.h
> > +++ b/arch/powerpc/include/asm/barrier.h
> > @@ -46,11 +46,13 @@
> >  #endif
> >  
> >  #define smp_mb()	mb()
> > +#define smp_tmb()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> >  #define smp_rmb()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> >  #define smp_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
> >  #define smp_read_barrier_depends()	read_barrier_depends()
> >  #else
> >  #define smp_mb()	barrier()
> > +#define smp_tmb()	barrier()
> >  #define smp_rmb()	barrier()
> >  #define smp_wmb()	barrier()
> >  #define smp_read_barrier_depends()	do { } while(0)
> > diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
> > index 16760eeb79b0..f0409a874243 100644
> > --- a/arch/s390/include/asm/barrier.h
> > +++ b/arch/s390/include/asm/barrier.h
> > @@ -24,6 +24,7 @@
> >  #define wmb()				mb()
> >  #define read_barrier_depends()		do { } while(0)
> >  #define smp_mb()			mb()
> > +#define smp_tmb()			mb()
> >  #define smp_rmb()			rmb()
> >  #define smp_wmb()			wmb()
> >  #define smp_read_barrier_depends()	read_barrier_depends()
> > diff --git a/arch/score/include/asm/barrier.h b/arch/score/include/asm/barrier.h
> > index 0eacb6471e6d..865652083dde 100644
> > --- a/arch/score/include/asm/barrier.h
> > +++ b/arch/score/include/asm/barrier.h
> > @@ -5,6 +5,7 @@
> >  #define rmb()		barrier()
> >  #define wmb()		barrier()
> >  #define smp_mb()	barrier()
> > +#define smp_tmb()	barrier()
> >  #define smp_rmb()	barrier()
> >  #define smp_wmb()	barrier()
> >  
> > diff --git a/arch/sh/include/asm/barrier.h b/arch/sh/include/asm/barrier.h
> > index 72c103dae300..f8dce7926432 100644
> > --- a/arch/sh/include/asm/barrier.h
> > +++ b/arch/sh/include/asm/barrier.h
> > @@ -39,11 +39,13 @@
> >  
> >  #ifdef CONFIG_SMP
> >  #define smp_mb()	mb()
> > +#define smp_tmb()	mb()
> >  #define smp_rmb()	rmb()
> >  #define smp_wmb()	wmb()
> >  #define smp_read_barrier_depends()	read_barrier_depends()
> >  #else
> >  #define smp_mb()	barrier()
> > +#define smp_tmb()	barrier()
> >  #define smp_rmb()	barrier()
> >  #define smp_wmb()	barrier()
> >  #define smp_read_barrier_depends()	do { } while(0)
> > diff --git a/arch/sparc/include/asm/barrier_32.h b/arch/sparc/include/asm/barrier_32.h
> > index c1b76654ee76..1037ce189cee 100644
> > --- a/arch/sparc/include/asm/barrier_32.h
> > +++ b/arch/sparc/include/asm/barrier_32.h
> > @@ -8,6 +8,7 @@
> >  #define read_barrier_depends()	do { } while(0)
> >  #define set_mb(__var, __value)  do { __var = __value; mb(); } while(0)
> >  #define smp_mb()	__asm__ __volatile__("":::"memory")
> > +#define smp_tmb()	__asm__ __volatile__("":::"memory")
> >  #define smp_rmb()	__asm__ __volatile__("":::"memory")
> >  #define smp_wmb()	__asm__ __volatile__("":::"memory")
> >  #define smp_read_barrier_depends()	do { } while(0)
> > diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
> > index 95d45986f908..0f3c2fdb86b8 100644
> > --- a/arch/sparc/include/asm/barrier_64.h
> > +++ b/arch/sparc/include/asm/barrier_64.h
> > @@ -34,6 +34,7 @@ do {	__asm__ __volatile__("ba,pt	%%xcc, 1f\n\t" \
> >   * memory ordering than required by the specifications.
> >   */
> >  #define mb()	membar_safe("#StoreLoad")
> > +#define tmb()	__asm__ __volatile__("":::"memory")
> >  #define rmb()	__asm__ __volatile__("":::"memory")
> >  #define wmb()	__asm__ __volatile__("":::"memory")
> >  
> > @@ -43,10 +44,12 @@ do {	__asm__ __volatile__("ba,pt	%%xcc, 1f\n\t" \
> >  
> >  #ifdef CONFIG_SMP
> >  #define smp_mb()	mb()
> > +#define smp_tmb()	tmb()
> >  #define smp_rmb()	rmb()
> >  #define smp_wmb()	wmb()
> >  #else
> >  #define smp_mb()	__asm__ __volatile__("":::"memory")
> > +#define smp_tmb()	__asm__ __volatile__("":::"memory")
> >  #define smp_rmb()	__asm__ __volatile__("":::"memory")
> >  #define smp_wmb()	__asm__ __volatile__("":::"memory")
> >  #endif
> > diff --git a/arch/tile/include/asm/barrier.h b/arch/tile/include/asm/barrier.h
> > index a9a73da5865d..cad3c6ae28bf 100644
> > --- a/arch/tile/include/asm/barrier.h
> > +++ b/arch/tile/include/asm/barrier.h
> > @@ -127,11 +127,13 @@ mb_incoherent(void)
> >  
> >  #ifdef CONFIG_SMP
> >  #define smp_mb()	mb()
> > +#define smp_tmb()	mb()
> >  #define smp_rmb()	rmb()
> >  #define smp_wmb()	wmb()
> >  #define smp_read_barrier_depends()	read_barrier_depends()
> >  #else
> >  #define smp_mb()	barrier()
> > +#define smp_tmb()	barrier()
> >  #define smp_rmb()	barrier()
> >  #define smp_wmb()	barrier()
> >  #define smp_read_barrier_depends()	do { } while (0)
> > diff --git a/arch/unicore32/include/asm/barrier.h b/arch/unicore32/include/asm/barrier.h
> > index a6620e5336b6..8b341fffbda6 100644
> > --- a/arch/unicore32/include/asm/barrier.h
> > +++ b/arch/unicore32/include/asm/barrier.h
> > @@ -18,6 +18,7 @@
> >  #define rmb()				barrier()
> >  #define wmb()				barrier()
> >  #define smp_mb()			barrier()
> > +#define smp_tmb()			barrier()
> >  #define smp_rmb()			barrier()
> >  #define smp_wmb()			barrier()
> >  #define read_barrier_depends()		do { } while (0)
> > diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
> > index c6cd358a1eec..480201d83af1 100644
> > --- a/arch/x86/include/asm/barrier.h
> > +++ b/arch/x86/include/asm/barrier.h
> > @@ -86,14 +86,17 @@
> >  # define smp_rmb()	barrier()
> >  #endif
> >  #ifdef CONFIG_X86_OOSTORE
> > +# define smp_tmb()	mb()
> >  # define smp_wmb() 	wmb()
> >  #else
> > +# define smp_tmb()	barrier()
> >  # define smp_wmb()	barrier()
> >  #endif
> >  #define smp_read_barrier_depends()	read_barrier_depends()
> >  #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
> >  #else
> >  #define smp_mb()	barrier()
> > +#define smp_tmb()	barrier()
> >  #define smp_rmb()	barrier()
> >  #define smp_wmb()	barrier()
> >  #define smp_read_barrier_depends()	do { } while (0)
> > diff --git a/arch/xtensa/include/asm/barrier.h b/arch/xtensa/include/asm/barrier.h
> > index ef021677d536..7839db843ea5 100644
> > --- a/arch/xtensa/include/asm/barrier.h
> > +++ b/arch/xtensa/include/asm/barrier.h
> > @@ -20,6 +20,7 @@
> >  #error smp_* not defined
> >  #else
> >  #define smp_mb()	barrier()
> > +#define smp_tmb()	barrier()
> >  #define smp_rmb()	barrier()
> >  #define smp_wmb()	barrier()
> >  #endif
> > 
> 
> 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-11-03 17:07                             ` perf events ring buffer memory barrier on powerpc Will Deacon
@ 2013-11-03 22:47                               ` Paul E. McKenney
  2013-11-04  9:57                                 ` Will Deacon
  0 siblings, 1 reply; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-03 22:47 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev

On Sun, Nov 03, 2013 at 05:07:59PM +0000, Will Deacon wrote:
> On Sun, Nov 03, 2013 at 02:40:17PM +0000, Paul E. McKenney wrote:
> > On Sat, Nov 02, 2013 at 10:32:39AM -0700, Paul E. McKenney wrote:
> > > On Fri, Nov 01, 2013 at 03:56:34PM +0100, Peter Zijlstra wrote:
> > > > On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote:
> > > > > > Now the whole crux of the question is if we need barrier A at all, since
> > > > > > the STORES issued by the @buf writes are dependent on the ubuf->tail
> > > > > > read.
> > > > > 
> > > > > The dependency you are talking about is via the "if" statement?
> > > > > Even C/C++11 is not required to respect control dependencies.
> > > > > 
> > > > > This one is a bit annoying.  The x86 TSO means that you really only
> > > > > need barrier(), ARM (recent ARM, anyway) and Power could use a weaker
> > > > > barrier, and so on -- but smp_mb() emits a full barrier.
> > > > > 
> > > > > Perhaps a new smp_tmb() for TSO semantics, where reads are ordered
> > > > > before reads, writes before writes, and reads before writes, but not
> > > > > writes before reads?  Another approach would be to define a per-arch
> > > > > barrier for this particular case.
> > > > 
> > > > I suppose we can only introduce new barrier primitives if there's more
> > > > than 1 use-case.
> 
> Which barrier did you have in mind when you refer to `recent ARM' above? It
> seems to me like you'd need a combination of dmb ishld and dmb ishst, since
> the former doesn't order writes before writes.

I heard a rumor that ARM had recently added a new dmb variant that acted
similarly to PowerPC's lwsync, and it was on my list to follow up.

Given your response, I am guessing that there is no truth to this rumor...

							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-03 22:42                                   ` Paul E. McKenney
@ 2013-11-03 23:34                                     ` Linus Torvalds
  2013-11-04 10:51                                       ` Paul E. McKenney
  2013-11-04 11:05                                       ` Will Deacon
  0 siblings, 2 replies; 120+ messages in thread
From: Linus Torvalds @ 2013-11-03 23:34 UTC (permalink / raw)
  To: Paul McKenney
  Cc: Peter Zijlstra, Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On Sun, Nov 3, 2013 at 2:42 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> smp_storebuffer_mb() -- A barrier that enforces those orderings
>         that do not invalidate the hardware store-buffer optimization.

Ugh. Maybe. Can you guarantee that those are the correct semantics?
And why talk about the hardware semantics, when you really want
specific semantics for the *software*.

> smp_not_w_r_mb() -- A barrier that orders everything except prior
>         writes against subsequent reads.

Ok, that sounds more along the lines of "these are the semantics we
want", but I have to say, it also doesn't make me go "ahh, ok".

> smp_acqrel_mb() -- A barrier that combines C/C++ acquire and release
>         semantics.  (C/C++ "acquire" orders a specific load against
>         subsequent loads and stores, while C/C++ "release" orders
>         a specific store against prior loads and stores.)

I don't think this is true. acquire+release is much stronger than what
you're looking for - it doesn't allow subsequent reads to move past
the write (because that would violate the acquire part). On x86, for
example, you'd need to have a locked cycle for smp_acqrel_mb().

So again, what are the guarantees you actually want? Describe those.
And then make a name.

I _think_ the guarantees you want is:
 - SMP write barrier
 - *local* read barrier for reads preceding the write.

but the problem is that the "preceding reads" part is really
specifically about the write that you had. The barrier should really
be attached to the *particular* write operation, it cannot be a
standalone barrier.

So it would *kind* of act like a "smp_wmb() + smp_rmb()", but the
problem is that a "smp_rmb()" doesn't really "attach" to the preceding
write.

This is analogous to a "acquire" operation: you cannot make an
"acquire" barrier, because it's not a barrier *between* two ops, it's
associated with one particular op.

So what I *think* you actually really really want is a "store with
release consistency, followed by a write barrier".

In TSO, afaik all stores have release consistency, and all writes are
ordered, which is why this is a no-op in TSO. And x86 also has that
"all stores have release consistency, and all writes are ordered"
model, even if TSO doesn't really describe the x86 model.

But on ARM64, for example, I think you'd really want the store itself
to be done with "stlr" (store with release), and then follow up with a
"dsb st" after that.

And notice how that requires you to mark the store itself. There is no
actual barrier *after* the store that does the optimized model.

Of course, it's entirely possible that it's not worth worrying about
this on ARM64, and that just doing it as a "normal store followed by a
full memory barrier" is good enough. But at least in *theory* a
microarchitecture might make it much cheaper to do a "store with
release consistency" followed by "write barrier".

Anyway, having talked exhaustively about exactly what semantics you
are after, I *think* the best model would be to just have a

  #define smp_store_with_release_semantics(x, y) ...

and use that *and* a "smp_wmb()" for this (possibly a special
"smp_wmb_after_release()" if that allows people to avoid double
barriers). On x86 (and TSO systems), the
smp_store_with_release_semantics() would be just a regular store, and
the smp_wmb() is obviously a no-op. Other platforms would end up doing
other things.
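
On x86/TSO that would presumably boil down to something as simple as the
following sketch (names taken from the proposal above, the barrier()s there
only to keep the compiler honest):

	#define smp_store_with_release_semantics(x, y) \
	do { barrier(); ACCESS_ONCE(x) = (y); } while (0)

	#define smp_wmb_after_release()	barrier()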

Hmm?

         Linus

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-11-02 15:20                                 ` Paul E. McKenney
@ 2013-11-04  9:07                                   ` Peter Zijlstra
  2013-11-04 10:00                                     ` Paul E. McKenney
  0 siblings, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-04  9:07 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Oleg Nesterov

On Sat, Nov 02, 2013 at 08:20:48AM -0700, Paul E. McKenney wrote:
> On Fri, Nov 01, 2013 at 11:30:17AM +0100, Peter Zijlstra wrote:
> > Furthermore there's a gazillion parallel userspace programs.
> 
> Most of which have very unaggressive concurrency designs.

pthread_mutex_t A, B;

char data_A[x];
int  counter_B = 1;

void funA(void)
{
	pthread_mutex_lock(&A);
	memset(data_A, 0, sizeof(data_A));
	pthread_mutex_unlock(&A);
}

void funB(void)
{
	pthread_mutex_lock(&B);
	counter_B++;
	pthread_mutex_unlock(&B);
}

void funC(void)
{
	pthread_mutex_lock(&B)
	printf("%d\n", counter_B);
	pthread_mutex_unlock(&B);
}

Then run: funA, funB, funC concurrently, and end with a funC.

Then explain to userman that his unaggressive program can return:
0
1

Because the memset() thought it might be a cute idea to overwrite
counter_B and fix it up 'later'. Which if I understood you right is
valid in C/C++ :-(

Not that any actual memset implementation exhibiting this trait wouldn't
be shot on the spot.

> > > By marking "ptr" as atomic, thus telling the compiler not to mess with it.
> > > And thus requiring that all accesses to it be decorated, which in the
> > > case of RCU could be buried in the RCU accessors.
> > 
> > This seems contradictory; marking it atomic would look like:
> > 
> > struct foo {
> > 	unsigned long value;
> > 	__atomic void *ptr;
> > 	unsigned long value1;
> > };
> > 
> > Clearly we cannot hide this definition in accessors, because then
> > accesses to value* won't see the annotation.
> 
> #define __rcu __atomic

Yeah, except we don't use __rcu all that consistently; in fact I don't
know if I ever added it.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-11-03 22:47                               ` Paul E. McKenney
@ 2013-11-04  9:57                                 ` Will Deacon
  2013-11-04 10:52                                   ` Paul E. McKenney
  0 siblings, 1 reply; 120+ messages in thread
From: Will Deacon @ 2013-11-04  9:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML

Hi Paul,

On Sun, Nov 03, 2013 at 10:47:12PM +0000, Paul E. McKenney wrote:
> On Sun, Nov 03, 2013 at 05:07:59PM +0000, Will Deacon wrote:
> > On Sun, Nov 03, 2013 at 02:40:17PM +0000, Paul E. McKenney wrote:
> > > On Sat, Nov 02, 2013 at 10:32:39AM -0700, Paul E. McKenney wrote:
> > > > On Fri, Nov 01, 2013 at 03:56:34PM +0100, Peter Zijlstra wrote:
> > > > > On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote:
> > > > > > > Now the whole crux of the question is if we need barrier A at all, since
> > > > > > > the STORES issued by the @buf writes are dependent on the ubuf->tail
> > > > > > > read.
> > > > > > 
> > > > > > The dependency you are talking about is via the "if" statement?
> > > > > > Even C/C++11 is not required to respect control dependencies.
> > > > > > 
> > > > > > This one is a bit annoying.  The x86 TSO means that you really only
> > > > > > need barrier(), ARM (recent ARM, anyway) and Power could use a weaker
> > > > > > barrier, and so on -- but smp_mb() emits a full barrier.
> > > > > > 
> > > > > > Perhaps a new smp_tmb() for TSO semantics, where reads are ordered
> > > > > > before reads, writes before writes, and reads before writes, but not
> > > > > > writes before reads?  Another approach would be to define a per-arch
> > > > > > barrier for this particular case.
> > > > > 
> > > > > I suppose we can only introduce new barrier primitives if there's more
> > > > > than 1 use-case.
> > 
> > Which barrier did you have in mind when you refer to `recent ARM' above? It
> > seems to me like you'd need a combination of dmb ishld and dmb ishst, since
> > the former doesn't order writes before writes.
> 
> I heard a rumor that ARM had recently added a new dmb variant that acted
> similarly to PowerPC's lwsync, and it was on my list to follow up.
> 
> Given your response, I am guessing that there is no truth to this rumor...

I think you're talking about the -ld option to dmb, which was introduced in
ARMv8. That option orders loads against loads and stores, but doesn't order
writes against writes. So you could do:

	dmb ishld
	dmb ishst

but it's questionable whether that performs better than a dmb ish.

Will

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-11-04  9:07                                   ` Peter Zijlstra
@ 2013-11-04 10:00                                     ` Paul E. McKenney
  0 siblings, 0 replies; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-04 10:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Oleg Nesterov

On Mon, Nov 04, 2013 at 10:07:44AM +0100, Peter Zijlstra wrote:
> On Sat, Nov 02, 2013 at 08:20:48AM -0700, Paul E. McKenney wrote:
> > On Fri, Nov 01, 2013 at 11:30:17AM +0100, Peter Zijlstra wrote:
> > > Furthermore there's a gazillion parallel userspace programs.
> > 
> > Most of which have very unaggressive concurrency designs.
> 
> pthread_mutex_t A, B;
> 
> char data_A[x];
> int  counter_B = 1;
> 
> void funA(void)
> {
> 	pthread_mutex_lock(&A);
> 	memset(data_A, 0, sizeof(data_A));
> 	pthread_mutex_unlock(&A);
> }
> 
> void funB(void)
> {
> 	pthread_mutex_lock(&B);
> 	counter_B++;
> 	pthread_mutex_unlock(&B);
> }
> 
> void funC(void)
> {
> 	pthread_mutex_lock(&B)
> 	printf("%d\n", counter_B);
> 	pthread_mutex_unlock(&B);
> }
> 
> Then run: funA, funB, funC concurrently, and end with a funC.
> 
> Then explain to userman that his unaggressive program can return:
> 0
> 1
> 
> Because the memset() thought it might be a cute idea to overwrite
> counter_B and fix it up 'later'. Which if I understood you right is
> valid in C/C++ :-(
> 
> Not that any actual memset implementation exhibiting this trait wouldn't
> be shot on the spot.

Even without such a malicious memset() implementation I must still explain
about false sharing when the developer notices that the unaggressive
program isn't running as fast as expected.

> > > > By marking "ptr" as atomic, thus telling the compiler not to mess with it.
> > > > And thus requiring that all accesses to it be decorated, which in the
> > > > case of RCU could be buried in the RCU accessors.
> > > 
> > > This seems contradictory; marking it atomic would look like:
> > > 
> > > struct foo {
> > > 	unsigned long value;
> > > 	__atomic void *ptr;
> > > 	unsigned long value1;
> > > };
> > > 
> > > Clearly we cannot hide this definition in accessors, because then
> > > accesses to value* won't see the annotation.
> > 
> > #define __rcu __atomic
> 
> Yeah, except we don't use __rcu all that consistently; in fact I don't
> know if I ever added it.

There are more than 300 of them in the kernel.  Plus sparse can be
convinced to yell at you if you don't use them.  So lack of __rcu could
be fixed without too much trouble.

The C/C++11 need to annotate functions that take arguments or return
values taken from rcu_dereference() is another story.  But the compilers
have to get significantly more aggressive or developers have to be doing
unusual things that result in rcu_dereference() returning something whose
value the compiler can predict exactly.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-03 23:34                                     ` Linus Torvalds
@ 2013-11-04 10:51                                       ` Paul E. McKenney
  2013-11-04 11:22                                         ` Peter Zijlstra
  2013-11-04 11:05                                       ` Will Deacon
  1 sibling, 1 reply; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-04 10:51 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On Sun, Nov 03, 2013 at 03:34:00PM -0800, Linus Torvalds wrote:
> On Sun, Nov 3, 2013 at 2:42 PM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > smp_storebuffer_mb() -- A barrier that enforces those orderings
> >         that do not invalidate the hardware store-buffer optimization.
> 
> Ugh. Maybe. Can you guarantee that those are the correct semantics?
> And why talk about the hardware semantics, when you really want
> specific semantics for the *software*.
> 
> > smp_not_w_r_mb() -- A barrier that orders everything except prior
> >         writes against subsequent reads.
> 
> Ok, that sounds more along the lines of "these are the semantics we
> want", but I have to say, it also doesn't make me go "ahh, ok".
> 
> > smp_acqrel_mb() -- A barrier that combines C/C++ acquire and release
> >         semantics.  (C/C++ "acquire" orders a specific load against
> >         subsequent loads and stores, while C/C++ "release" orders
> >         a specific store against prior loads and stores.)
> 
> I don't think this is true. acquire+release is much stronger than what
> you're looking for - it doesn't allow subsequent reads to move past
> the write (because that would violate the acquire part). On x86, for
> example, you'd need to have a locked cycle for smp_acqrel_mb().
> 
> So again, what are the guarantees you actually want? Describe those.
> And then make a name.

I was thinking in terms of the guarantee that TSO systems provide
given a barrier() directive, and that PowerPC provides given the lwsync
instruction.  This guarantee is that loads preceding the barrier will
not be reordered with memory referenced following the barrier, and that
stores preceding the barrier will not be reordered with stores following
the barrier.  But given how much easier RCU reviews became after burying
smp_wmb() and smp_read_barrier_depends() into rcu_assign_pointer() and
rcu_dereference(), respectively, I think I prefer an extension of your
idea below.

> I _think_ the guarantees you want is:
>  - SMP write barrier
>  - *local* read barrier for reads preceding the write.
> 
> but the problem is that the "preceding reads" part is really
> specifically about the write that you had. The barrier should really
> be attached to the *particular* write operation, it cannot be a
> standalone barrier.

Indeed, neither rcu_assign_pointer() nor the circular queue really needs a
standalone barrier, so that attaching the barrier to a particular memory
reference would work.  And as you note below, in the case of ARM this
would turn into one of their new memory-reference instructions.

> So it would *kind* of act like a "smp_wmb() + smp_rmb()", but the
> problem is that a "smp_rmb()" doesn't really "attach" to the preceding
> write.
> 
> This is analogous to a "acquire" operation: you cannot make an
> "acquire" barrier, because it's not a barrier *between* two ops, it's
> associated with one particular op.

But you -could- use any barrier that prevented reordering of any preceding
load with any subsequent memory reference.  Please note that I am -not-
advocating this anymore, because I like the idea of attaching the barrier
to a particular memory operation.  However, for completeness, here it is
in the case of TSO systems and PowerPC, respectively:

#define smp_acquire_mb() barrier();

#define smp_acquire_mb() \
	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory");

This functions correctly, but is a pain to review because you have to
figure out which of many possible preceding loads the smp_acquire_mb()
is supposed to attach to.  As you say, it is -way- better to attach the
barrier to a particular memory operation.

> So what I *think* you actually really really want is a "store with
> release consistency, followed by a write barrier".

I believe that the combination of "store with release consistency" and
"load with acquire consistency" should do the trick for the two use cases
at this point, which again are circular buffers and rcu_assign_pointer().
At this point, I don't see the need for "followed by a write barrier".
But I step through the circular buffers below.

> In TSO, afaik all stores have release consistency, and all writes are
> ordered, which is why this is a no-op in TSO. And x86 also has that
> "all stores have release consistency, and all writes are ordered"
> model, even if TSO doesn't really describe the x86 model.

Yep, as does the mainframe.  And these architectures also have all reads
having acquire consistency.

> But on ARM64, for example, I think you'd really want the store itself
> to be done with "stlr" (store with release), and then follow up with a
> "dsb st" after that.

Agree with the "stlr" but don't (yet, anyway) understand the need for
a subsequent "dsb st".

> And notice how that requires you to mark the store itself. There is no
> actual barrier *after* the store that does the optimized model.

And marking the store itself is a very good thing from my viewpoint.

> Of course, it's entirely possible that it's not worth worrying about
> this on ARM64, and that just doing it as a "normal store followed by a
> full memory barrier" is good enough. But at least in *theory* a
> microarchitecture might make it much cheaper to do a "store with
> release consistency" followed by "write barrier".
> 
> Anyway, having talked exhaustively about exactly what semantics you
> are after, I *think* the best model would be to just have a
> 
>   #define smp_store_with_release_semantics(x, y) ...
> 
> and use that *and* a "smp_wmb()" for this (possibly a special
> "smp_wmb_after_release()" if that allows people to avoid double
> barriers). On x86 (and TSO systems), the
> smp_store_with_release_semantics() would be just a regular store, and
> the smp_wmb() is obviously a no-op. Other platforms would end up doing
> other things.
> 
> Hmm?

OK, something like this for the definitions (though PowerPC might want
to locally abstract the lwsync expansion):

	#define smp_store_with_release_semantics(p, v) /* x86, s390, etc. */ \
	do { \
		barrier(); \
		ACCESS_ONCE(p) = (v); \
	} while (0)

	#define smp_store_with_release_semantics(p, v) /* PowerPC. */ \
	do { \
		__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory"); \
		ACCESS_ONCE(p) = (v); \
	} while (0)

	#define smp_load_with_acquire_semantics(p) /* x86, s390, etc. */ \
	({ \
		typeof(*p) *_________p1 = ACCESS_ONCE(p); \
		barrier(); \
		_________p1; \
	})

	#define smp_load_with_acquire_semantics(p) /* PowerPC. */ \
	({ \
		typeof(*p) *_________p1 = ACCESS_ONCE(p); \
		__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory"); \
		_________p1; \
	})

For ARM, smp_load_with_acquire_semantics() is a wrapper around the ARM
"ldar" instruction and smp_store_with_release_semantics() is a wrapper
around the ARM "stlr" instruction.
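
Something like the following, perhaps, though I am only guessing at the asm
and showing just the 64-bit access size (sketch only, not even compiled):

	#define smp_store_with_release_semantics(p, v) /* ARM64, sketch. */ \
	do { \
		asm volatile("stlr %1, %0" \
			     : "=Q" (p) : "r" (v) : "memory"); \
	} while (0)

	#define smp_load_with_acquire_semantics(p) /* ARM64, sketch. */ \
	({ \
		typeof(p) _________p1; \
		asm volatile("ldar %0, %1" \
			     : "=r" (_________p1) : "Q" (p) : "memory"); \
		_________p1; \
	})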

Then if I am not too confused (and I would expect Victor to let me know
in short order if I am), the following patch to the current mainline
version of Documentation/circular-buffers.txt would suffice.

Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/Documentation/circular-buffers.txt b/Documentation/circular-buffers.txt
index 8117e5bf6065..1846044bf6cc 100644
--- a/Documentation/circular-buffers.txt
+++ b/Documentation/circular-buffers.txt
@@ -160,6 +160,7 @@ The producer will look something like this:
 	spin_lock(&producer_lock);
 
 	unsigned long head = buffer->head;
+	/* The spin_unlock() and next spin_lock() provide needed ordering. */
 	unsigned long tail = ACCESS_ONCE(buffer->tail);
 
 	if (CIRC_SPACE(head, tail, buffer->size) >= 1) {
@@ -168,9 +169,8 @@ The producer will look something like this:
 
 		produce_item(item);
 
-		smp_wmb(); /* commit the item before incrementing the head */
-
-		buffer->head = (head + 1) & (buffer->size - 1);
+		smp_store_with_release_semantics(buffer->head,
+						 (head + 1) & (buffer->size - 1));
 
 		/* wake_up() will make sure that the head is committed before
 		 * waking anyone up */
@@ -183,9 +183,14 @@ This will instruct the CPU that the contents of the new item must be written
 before the head index makes it available to the consumer and then instructs the
 CPU that the revised head index must be written before the consumer is woken.
 
-Note that wake_up() doesn't have to be the exact mechanism used, but whatever
-is used must guarantee a (write) memory barrier between the update of the head
-index and the change of state of the consumer, if a change of state occurs.
+Note that wake_up() does not guarantee any sort of barrier unless something
+is actually awakened.  We therefore cannot rely on it for ordering.  However,
+there is always one element of the array left empty.  Therefore, the
+producer must produce two elements before it could possibly corrupt the
+element currently being read by the consumer.  Therefore, the unlock-lock
+pair between consecutive invocations of the consumer provides the necessary
+ordering between the read of the index indicating that the consumer has
+vacated a given element and the write by the producer to that same element.
 
 
 THE CONSUMER
@@ -195,21 +200,18 @@ The consumer will look something like this:
 
 	spin_lock(&consumer_lock);
 
-	unsigned long head = ACCESS_ONCE(buffer->head);
+	unsigned long head = smp_load_with_acquire_semantics(buffer->head);
 	unsigned long tail = buffer->tail;
 
 	if (CIRC_CNT(head, tail, buffer->size) >= 1) {
-		/* read index before reading contents at that index */
-		smp_read_barrier_depends();
 
 		/* extract one item from the buffer */
 		struct item *item = buffer[tail];
 
 		consume_item(item);
 
-		smp_mb(); /* finish reading descriptor before incrementing tail */
-
-		buffer->tail = (tail + 1) & (buffer->size - 1);
+		smp_store_with_release_semantics(buffer->tail,
+						 (tail + 1) & (buffer->size - 1));
 	}
 
 	spin_unlock(&consumer_lock);
@@ -218,12 +220,17 @@ This will instruct the CPU to make sure the index is up to date before reading
 the new item, and then it shall make sure the CPU has finished reading the item
 before it writes the new tail pointer, which will erase the item.
 
-
-Note the use of ACCESS_ONCE() in both algorithms to read the opposition index.
-This prevents the compiler from discarding and reloading its cached value -
-which some compilers will do across smp_read_barrier_depends().  This isn't
-strictly needed if you can be sure that the opposition index will _only_ be
-used the once.
+Note the use of ACCESS_ONCE() and smp_load_with_acquire_semantics()
+to read the opposition index.  This prevents the compiler from
+discarding and reloading its cached value - which some compilers will
+do across smp_read_barrier_depends().  This isn't strictly needed
+if you can be sure that the opposition index will _only_ be used
+the once.  The smp_load_with_acquire_semantics() additionally forces
+the CPU to order against subsequent memory references.  Similarly,
+smp_store_with_release_semantics() is used in both algorithms to write
+the thread's index.  This documents the fact that we are writing to
+something that can be read concurrently, prevents the compiler from
+tearing the store, and enforces ordering against previous accesses.
 
 
 ===============


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2013-11-04  9:57                                 ` Will Deacon
@ 2013-11-04 10:52                                   ` Paul E. McKenney
  0 siblings, 0 replies; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-04 10:52 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML

On Mon, Nov 04, 2013 at 09:57:17AM +0000, Will Deacon wrote:
> Hi Paul,
> 
> On Sun, Nov 03, 2013 at 10:47:12PM +0000, Paul E. McKenney wrote:
> > On Sun, Nov 03, 2013 at 05:07:59PM +0000, Will Deacon wrote:
> > > On Sun, Nov 03, 2013 at 02:40:17PM +0000, Paul E. McKenney wrote:
> > > > On Sat, Nov 02, 2013 at 10:32:39AM -0700, Paul E. McKenney wrote:
> > > > > On Fri, Nov 01, 2013 at 03:56:34PM +0100, Peter Zijlstra wrote:
> > > > > > On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote:
> > > > > > > > Now the whole crux of the question is if we need barrier A at all, since
> > > > > > > > the STORES issued by the @buf writes are dependent on the ubuf->tail
> > > > > > > > read.
> > > > > > > 
> > > > > > > The dependency you are talking about is via the "if" statement?
> > > > > > > Even C/C++11 is not required to respect control dependencies.
> > > > > > > 
> > > > > > > This one is a bit annoying.  The x86 TSO means that you really only
> > > > > > > need barrier(), ARM (recent ARM, anyway) and Power could use a weaker
> > > > > > > barrier, and so on -- but smp_mb() emits a full barrier.
> > > > > > > 
> > > > > > > Perhaps a new smp_tmb() for TSO semantics, where reads are ordered
> > > > > > > before reads, writes before writes, and reads before writes, but not
> > > > > > > writes before reads?  Another approach would be to define a per-arch
> > > > > > > barrier for this particular case.
> > > > > > 
> > > > > > I suppose we can only introduce new barrier primitives if there's more
> > > > > > than 1 use-case.
> > > 
> > > Which barrier did you have in mind when you refer to `recent ARM' above? It
> > > seems to me like you'd need a combination of dmb ishld and dmb ishst, since
> > > the former doesn't order writes before writes.
> > 
> > I heard a rumor that ARM had recently added a new dmb variant that acted
> > similarly to PowerPC's lwsync, and it was on my list to follow up.
> > 
> > Given your response, I am guessing that there is no truth to this rumor...
> 
> I think you're talking about the -ld option to dmb, which was introduced in
> ARMv8. That option orders loads against loads and stores, but doesn't order
> writes against writes. So you could do:
> 
> 	dmb ishld
> 	dmb ishst
> 
> but it's questionable whether that performs better than a dmb ish.

If Linus's smp_store_with_release_semantics() approach works out, ARM
should be able to use its shiny new ldar and stlr instructions.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-03 23:34                                     ` Linus Torvalds
  2013-11-04 10:51                                       ` Paul E. McKenney
@ 2013-11-04 11:05                                       ` Will Deacon
  2013-11-04 16:34                                         ` Paul E. McKenney
  1 sibling, 1 reply; 120+ messages in thread
From: Will Deacon @ 2013-11-04 11:05 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Paul McKenney, Peter Zijlstra, Victor Kaplansky, Oleg Nesterov,
	Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling

On Sun, Nov 03, 2013 at 11:34:00PM +0000, Linus Torvalds wrote:
> So it would *kind* of act like a "smp_wmb() + smp_rmb()", but the
> problem is that a "smp_rmb()" doesn't really "attach" to the preceding
> write.

Agreed.

> This is analogous to a "acquire" operation: you cannot make an
> "acquire" barrier, because it's not a barrier *between* two ops, it's
> associated with one particular op.
> 
> So what I *think* you actually really really want is a "store with
> release consistency, followed by a write barrier".

How does that order reads against reads? (Paul mentioned this as a
requirement). I'm not clear about the use case for this, so perhaps there is a
dependency that I'm not aware of.

> In TSO, afaik all stores have release consistency, and all writes are
> ordered, which is why this is a no-op in TSO. And x86 also has that
> "all stores have release consistency, and all writes are ordered"
> model, even if TSO doesn't really describe the x86 model.
> 
> But on ARM64, for example, I think you'd really want the store itself
> to be done with "stlr" (store with release), and then follow up with a
> "dsb st" after that.

So a dsb is pretty heavyweight here (it prevents execution of *any* further
instructions until all preceding stores have completed, as well as
ensuring completion of any ongoing cache flushes). In conjunction with the
store-release, that's going to hold everything up until the store-release
(and therefore any preceding memory accesses) have completed. Granted, I
think that gives Paul his read/read ordering, but it's a lot heavier than
what's required.

> And notice how that requires you to mark the store itself. There is no
> actual barrier *after* the store that does the optimized model.
> 
> Of course, it's entirely possible that it's not worth worrying about
> this on ARM64, and that just doing it as a "normal store followed by a
> full memory barrier" is good enough. But at least in *theory* a
> microarchitecture might make it much cheaper to do a "store with
> release consistency" followed by "write barrier".

I agree with the sentiment but, given that this stuff is so heavily
microarchitecture-dependent (and not simple to probe), a simple dmb ish
might be the best option after all. That's especially true if the
microarchitecture decided to ignore the barrier options and treat everything
as `all accesses, full system' in order to keep the hardware design simple.

Will

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-04 10:51                                       ` Paul E. McKenney
@ 2013-11-04 11:22                                         ` Peter Zijlstra
  2013-11-04 16:27                                           ` Paul E. McKenney
  2013-11-07 23:50                                           ` Mathieu Desnoyers
  0 siblings, 2 replies; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-04 11:22 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On Mon, Nov 04, 2013 at 02:51:00AM -0800, Paul E. McKenney wrote:
> OK, something like this for the definitions (though PowerPC might want
> to locally abstract the lwsync expansion):
> 
> 	#define smp_store_with_release_semantics(p, v) /* x86, s390, etc. */ \
> 	do { \
> 		barrier(); \
> 		ACCESS_ONCE(p) = (v); \
> 	} while (0)
> 
> 	#define smp_store_with_release_semantics(p, v) /* PowerPC. */ \
> 	do { \
> 		__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory"); \
> 		ACCESS_ONCE(p) = (v); \
> 	} while (0)
> 
> 	#define smp_load_with_acquire_semantics(p) /* x86, s390, etc. */ \
> 	({ \
> 		typeof(*p) *_________p1 = ACCESS_ONCE(p); \
> 		barrier(); \
> 		_________p1; \
> 	})
> 
> 	#define smp_load_with_acquire_semantics(p) /* PowerPC. */ \
> 	({ \
> 		typeof(*p) *_________p1 = ACCESS_ONCE(p); \
> 		__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory"); \
> 		_________p1; \
> 	})
> 
> For ARM, smp_load_with_acquire_semantics() is a wrapper around the ARM
> "ldar" instruction and smp_store_with_release_semantics() is a wrapper
> around the ARM "stlr" instruction.

This still leaves me confused as to what to do with my case :/

Slightly modified since last time -- as the simplified version was maybe
simplified too far.

To recap, I'd like to get rid of barrier A where possible, since that's
now a full barrier for every event written.

However, there's no immediate store I can attach it to; the obvious one
would be the kbuf->head store, but that's complicated by the
local_cmpxchg() thing.

And we need that cmpxchg loop because a hardware NMI event can
interleave with a software event.

And to be honest, I'm still totally confused about memory barriers vs
control flow vs C/C++. The only way we're ever getting to that memcpy is
if we've already observed ubuf->tail, so that LOAD has to be fully
processed and completed.

I'm really not seeing how a STORE from the memcpy() could possibly go
wrong; and if C/C++ can hoist the memcpy() over a compiler barrier()
then I suppose we should all just go home.

/me who wants A to be a barrier() but is terminally confused.

---


/*
 * One important detail is that the kbuf part and the kbuf_writer() are
 * strictly per cpu and we can thus rely on program order for those.
 *
 * Only the userspace consumer can possibly run on another cpu, and thus we
 * need to ensure data consistency for those.
 */

struct buffer {
        u64 size;
        u64 tail;
        u64 head;
        void *data;
};

struct buffer *kbuf, *ubuf;

/*
 * If there's space in the buffer; store the data @buf; otherwise
 * discard it.
 */
void kbuf_write(int sz, void *buf)
{
	u64 tail, head, offset;

	do {
		tail = ACCESS_ONCE(ubuf->tail);
		offset = head = kbuf->head;
		if (CIRC_SPACE(head, tail, kbuf->size) < sz) {
			/* discard @buf */
			return;
		}
		head += sz;
	} while (local_cmpxchg(&kbuf->head, offset, head) != offset);

        /*
         * Ensure that if we see the userspace tail (ubuf->tail) such
         * that there is space to write @buf without overwriting data
         * userspace hasn't seen yet, we won't in fact store data before
         * that read completes.
         */

        smp_mb(); /* A, matches with D */

        memcpy(kbuf->data + offset, buf, sz);

        /*
         * Ensure that we write all the @buf data before we update the
         * userspace visible ubuf->head pointer.
         */
        smp_wmb(); /* B, matches with C */

        ubuf->head = kbuf->head;
}

/*
 * Consume the buffer data and update the tail pointer to indicate to
 * kernel space there's 'free' space.
 */
void ubuf_read(void)
{
        u64 head, tail;

        tail = ACCESS_ONCE(ubuf->tail);
        head = ACCESS_ONCE(ubuf->head);

        /*
         * Ensure we read the buffer boundaries before the actual buffer
         * data...
         */
        smp_rmb(); /* C, matches with B */

        while (tail != head) {
                obj = ubuf->data + tail;
                /* process obj */
                tail += obj->size;
                tail %= ubuf->size;
        }

        /*
         * Ensure all data reads are complete before we issue the
         * ubuf->tail update; once that update hits, kbuf_write() can
         * observe and overwrite data.
         */
        smp_mb(); /* D, matches with A */

        ubuf->tail = tail;
}

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-04 11:22                                         ` Peter Zijlstra
@ 2013-11-04 16:27                                           ` Paul E. McKenney
  2013-11-04 16:48                                             ` Peter Zijlstra
  2013-11-04 19:11                                             ` Peter Zijlstra
  2013-11-07 23:50                                           ` Mathieu Desnoyers
  1 sibling, 2 replies; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-04 16:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On Mon, Nov 04, 2013 at 12:22:54PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 04, 2013 at 02:51:00AM -0800, Paul E. McKenney wrote:
> > OK, something like this for the definitions (though PowerPC might want
> > to locally abstract the lwsync expansion):
> > 
> > 	#define smp_store_with_release_semantics(p, v) /* x86, s390, etc. */ \
> > 	do { \
> > 		barrier(); \
> > 		ACCESS_ONCE(p) = (v); \
> > 	} while (0)
> > 
> > 	#define smp_store_with_release_semantics(p, v) /* PowerPC. */ \
> > 	do { \
> > 		__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory"); \
> > 		ACCESS_ONCE(p) = (v); \
> > 	} while (0)
> > 
> > 	#define smp_load_with_acquire_semantics(p) /* x86, s390, etc. */ \
> > 	({ \
> > 		typeof(*p) *_________p1 = ACCESS_ONCE(p); \
> > 		barrier(); \
> > 		_________p1; \
> > 	})
> > 
> > 	#define smp_load_with_acquire_semantics(p) /* PowerPC. */ \
> > 	({ \
> > 		typeof(*p) *_________p1 = ACCESS_ONCE(p); \
> > 		__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory"); \
> > 		_________p1; \
> > 	})
> > 
> > For ARM, smp_load_with_acquire_semantics() is a wrapper around the ARM
> > "ldar" instruction and smp_store_with_release_semantics() is a wrapper
> > around the ARM "stlr" instruction.
> 
> This still leaves me confused as to what to do with my case :/
> 
> Slightly modified since last time -- as the simplified version was maybe
> simplified too far.
> 
> To recap, I'd like to get rid of barrier A where possible, since that's
> now a full barrier for every event written.
> 
> However, there's no immediate store I can attach it to; the obvious one
> would be the kbuf->head store, but that's complicated by the
> local_cmpxchg() thing.
> 
> And we need that cmpxchg loop because a hardware NMI event can
> interleave with a software event.
> 
> And to be honest, I'm still totally confused about memory barriers vs
> control flow vs C/C++. The only way we're ever getting to that memcpy is
> if we've already observed ubuf->tail, so that LOAD has to be fully
> processed and completed.
> 
> I'm really not seeing how a STORE from the memcpy() could possibly go
> wrong; and if C/C++ can hoist the memcpy() over a compiler barrier()
> then I suppose we should all just go home.
> 
> /me who wants A to be a barrier() but is terminally confused.

Well, let's see...

> ---
> 
> 
> /*
>  * One important detail is that the kbuf part and the kbuf_writer() are
>  * strictly per cpu and we can thus rely on program order for those.
>  *
>  * Only the userspace consumer can possibly run on another cpu, and thus we
>  * need to ensure data consistency for those.
>  */
> 
> struct buffer {
>         u64 size;
>         u64 tail;
>         u64 head;
>         void *data;
> };
> 
> struct buffer *kbuf, *ubuf;
> 
> /*
>  * If there's space in the buffer; store the data @buf; otherwise
>  * discard it.
>  */
> void kbuf_write(int sz, void *buf)
> {
> 	u64 tail, head, offset;
> 
> 	do {
> 		tail = ACCESS_ONCE(ubuf->tail);

So the above load is the key load.  It determines whether or not we
have space in the buffer.  This of course assumes that only this CPU
writes to ->head.

If so, then:

		tail = smp_load_with_acquire_semantics(ubuf->tail); /* A -> D */

> 		offset = head = kbuf->head;
> 		if (CIRC_SPACE(head, tail, kbuf->size) < sz) {
> 			/* discard @buf */
> 			return;
> 		}
> 		head += sz;
> 	} while (local_cmpxchg(&kbuf->head, offset, head) != offset)

If there is an issue with kbuf->head, presumably local_cmpxchg() fails
and we retry.

But sheesh, do you think we could have buried the definitions of
local_cmpxchg() under a few more layers of macro expansion just to
keep things more obscure?  Anyway, griping aside...

o	__cmpxchg_local_generic() in include/asm-generic/cmpxchg-local.h
	doesn't seem to exclude NMIs, so is not safe for this usage.

o	__cmpxchg_local() in ARM handles NMI as long as the
	argument is 32 bits, otherwise, it uses the aforementioned
	__cmpxchg_local_generic(), which does not handle NMI.  Given your
	u64, this does not look good...

	And some ARM subarches (e.g., metag) seem to fail to handle NMI
	even in the 32-bit case.

o	FRV and M32r seem to act similar to ARM.

Or maybe these architectures don't do NMIs?  If they do, local_cmpxchg()
does not seem to be safe against NMIs in general.  :-/

That said, powerpc, 64-bit s390, sparc, and x86 seem to handle it.

Of course, x86's local_cmpxchg() has full memory barriers implicitly.

> 
>         /*
>          * Ensure that if we see the userspace tail (ubuf->tail) such
>          * that there is space to write @buf without overwriting data
>          * userspace hasn't seen yet, we won't in fact store data before
>          * that read completes.
>          */
> 
>         smp_mb(); /* A, matches with D */

Given a change to smp_load_with_acquire_semantics() above, you should not
need this smp_mb().

>         memcpy(kbuf->data + offset, buf, sz);
> 
>         /*
>          * Ensure that we write all the @buf data before we update the
>          * userspace visible ubuf->head pointer.
>          */
>         smp_wmb(); /* B, matches with C */
> 
>         ubuf->head = kbuf->head;

Replace the smp_wmb() and the assignment with:

	smp_store_with_release_semantics(ubuf->head, kbuf->head); /* B -> C */

> }
> 
> /*
>  * Consume the buffer data and update the tail pointer to indicate to
>  * kernel space there's 'free' space.
>  */
> void ubuf_read(void)
> {
>         u64 head, tail;
> 
>         tail = ACCESS_ONCE(ubuf->tail);

Does anyone else write tail?  Or is this defense against NMIs?

If no one else writes to tail and if NMIs cannot muck things up, then
the above ACCESS_ONCE() is not needed, though I would not object to its
staying.

>         head = ACCESS_ONCE(ubuf->head);

Make the above be:

	head = smp_load_with_acquire_semantics(ubuf->head);  /* C -> B */

>         /*
>          * Ensure we read the buffer boundaries before the actual buffer
>          * data...
>          */
>         smp_rmb(); /* C, matches with B */

And drop the above memory barrier.

>         while (tail != head) {
>                 obj = ubuf->data + tail;
>                 /* process obj */
>                 tail += obj->size;
>                 tail %= ubuf->size;
>         }
> 
>         /*
>          * Ensure all data reads are complete before we issue the
>          * ubuf->tail update; once that update hits, kbuf_write() can
>          * observe and overwrite data.
>          */
>         smp_mb(); /* D, matches with A */
> 
>         ubuf->tail = tail;

Replace the above barrier and the assignment with:

	smp_store_with_release_semantics(ubuf->tail, tail); /* D -> A. */

> }

All this is leading me to suggest the following shortenings of names:

	smp_load_with_acquire_semantics() -> smp_load_acquire()

	smp_store_with_release_semantics() -> smp_store_release()

But names aside, the above gets rid of explicit barriers on TSO architectures,
allows ARM to avoid full DMB, and allows PowerPC to use lwsync instead of
the heavier-weight sync.
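
For reference, folding the above suggestions back into your example gives
roughly the following (same simplifications and pseudo-code liberties as
your original, shortened names as suggested, and completely untested):

	void kbuf_write(int sz, void *buf)
	{
		u64 tail, head, offset;

		do {
			tail = smp_load_acquire(ubuf->tail); /* A -> D */
			offset = head = kbuf->head;
			if (CIRC_SPACE(head, tail, kbuf->size) < sz) {
				/* discard @buf */
				return;
			}
			head += sz;
		} while (local_cmpxchg(&kbuf->head, offset, head) != offset);

		/*
		 * No explicit barrier: the acquire above keeps the stores
		 * below from being reordered before the ubuf->tail read.
		 */
		memcpy(kbuf->data + offset, buf, sz);

		/* Order the @buf stores before the ubuf->head update. */
		smp_store_release(ubuf->head, kbuf->head); /* B -> C */
	}

	void ubuf_read(void)
	{
		u64 head, tail;

		tail = ACCESS_ONCE(ubuf->tail);
		head = smp_load_acquire(ubuf->head); /* C -> B */

		while (tail != head) {
			obj = ubuf->data + tail;
			/* process obj */
			tail += obj->size;
			tail %= ubuf->size;
		}

		/* Order the data reads before the ubuf->tail update. */
		smp_store_release(ubuf->tail, tail); /* D -> A */
	}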

								Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-04 11:05                                       ` Will Deacon
@ 2013-11-04 16:34                                         ` Paul E. McKenney
  0 siblings, 0 replies; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-04 16:34 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linus Torvalds, Peter Zijlstra, Victor Kaplansky, Oleg Nesterov,
	Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling

On Mon, Nov 04, 2013 at 11:05:53AM +0000, Will Deacon wrote:
> On Sun, Nov 03, 2013 at 11:34:00PM +0000, Linus Torvalds wrote:
> > So it would *kind* of act like a "smp_wmb() + smp_rmb()", but the
> > problem is that a "smp_rmb()" doesn't really "attach" to the preceding
> > write.
> 
> Agreed.
> 
> > This is analogous to a "acquire" operation: you cannot make an
> > "acquire" barrier, because it's not a barrier *between* two ops, it's
> > associated with one particular op.
> > 
> > So what I *think* you actually really really want is a "store with
> > release consistency, followed by a write barrier".
> 
> How does that order reads against reads? (Paul mentioned this as a
> requirement). I not clear about the use case for this, so perhaps there is a
> dependency that I'm not aware of.

An smp_store_with_release_semantics() orders against prior reads -and-
writes.  It maps to barrier() for x86, stlr for ARM, and lwsync for
PowerPC, as called out in my prototype definitions.

> > In TSO, afaik all stores have release consistency, and all writes are
> > ordered, which is why this is a no-op in TSO. And x86 also has that
> > "all stores have release consistency, and all writes are ordered"
> > model, even if TSO doesn't really describe the x86 model.
> > 
> > But on ARM64, for example, I think you'd really want the store itself
> > to be done with "stlr" (store with release), and then follow up with a
> > "dsb st" after that.
> 
> So a dsb is pretty heavyweight here (it prevents execution of *any* further
> instructions until all preceding stores have completed, as well as
> ensuring completion of any ongoing cache flushes). In conjunction with the
> store-release, that's going to hold everything up until the store-release
> (and therefore any preceding memory accesses) have completed. Granted, I
> think that gives Paul his read/read ordering, but it's a lot heavier than
> what's required.

I do not believe that we need the trailing "dsb st".

> > And notice how that requires you to mark the store itself. There is no
> > actual barrier *after* the store that does the optimized model.
> > 
> > Of course, it's entirely possible that it's not worth worrying about
> > this on ARM64, and that just doing it as a "normal store followed by a
> > full memory barrier" is good enough. But at least in *theory* a
> > microarchitecture might make it much cheaper to do a "store with
> > release consistency" followed by "write barrier".
> 
> I agree with the sentiment but, given that this stuff is so heavily
> microarchitecture-dependent (and not simple to probe), a simple dmb ish
> might be the best option after all. That's especially true if the
> microarchitecture decided to ignore the barrier options and treat everything
> as `all accesses, full system' in order to keep the hardware design simple.

I believe that we can do quite a bit better with current hardware
instructions (in the case of ARM, for a recent definition of "current")
and also simplify the memory ordering quite a bit.

								Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-04 16:27                                           ` Paul E. McKenney
@ 2013-11-04 16:48                                             ` Peter Zijlstra
  2013-11-04 19:11                                             ` Peter Zijlstra
  1 sibling, 0 replies; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-04 16:48 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On Mon, Nov 04, 2013 at 08:27:32AM -0800, Paul E. McKenney wrote:
> > 
> > 
> > /*
> >  * One important detail is that the kbuf part and the kbuf_writer() are
> >  * strictly per cpu and we can thus rely on program order for those.
> >  *
> >  * Only the userspace consumer can possibly run on another cpu, and thus we
> >  * need to ensure data consistency for those.
> >  */
> > 
> > struct buffer {
> >         u64 size;
> >         u64 tail;
> >         u64 head;
> >         void *data;
> > };
> > 
> > struct buffer *kbuf, *ubuf;
> > 
> > /*
> >  * If there's space in the buffer; store the data @buf; otherwise
> >  * discard it.
> >  */
> > void kbuf_write(int sz, void *buf)
> > {
> > 	u64 tail, head, offset;
> > 
> > 	do {
> > 		tail = ACCESS_ONCE(ubuf->tail);
> 
> So the above load is the key load.  It determines whether or not we
> have space in the buffer.  This of course assumes that only this CPU
> writes to ->head.

This assumption is true.

> If so, then:
> 
> 		tail = smp_load_with_acquire_semantics(ubuf->tail); /* A -> D */
> 

OK, the way I understand it, ACQUIRE semantics are the semi-permeable LOCK
semantics from Documentation/memory-barriers.txt. In which case the
relevant STORES below could be hoisted up here, but not across the READ,
which I suppose is sufficient.

> > 		offset = head = kbuf->head;
> > 		if (CIRC_SPACE(head, tail, kbuf->size) < sz) {
> > 			/* discard @buf */
> > 			return;
> > 		}
> > 		head += sz;
> > 	} while (local_cmpxchg(&kbuf->head, offset, head) != offset)
> 
> If there is an issue with kbuf->head, presumably local_cmpxchg() fails
> and we retry.
> 
> But sheesh, do you think we could have buried the definitions of
> local_cmpxchg() under a few more layers of macro expansion just to
> keep things more obscure?  Anyway, griping aside...
> 
> o	__cmpxchg_local_generic() in include/asm-generic/cmpxchg-local.h
> 	doesn't seem to exclude NMIs, so is not safe for this usage.
> 
> o	__cmpxchg_local() in ARM handles NMI as long as the
> 	argument is 32 bits, otherwise, it uses the aforementioned
> 	__cmpxchg_local_generic(), which does not handle NMI.  Given your
> 	u64, this does not look good...
> 
> 	And some ARM subarches (e.g., metag) seem to fail to handle NMI
> 	even in the 32-bit case.
> 
> o	FRV and M32r seem to act similar to ARM.
> 
> Or maybe these architectures don't do NMIs?  If they do, local_cmpxchg()
> does not seem to be safe against NMIs in general.  :-/
> 
> That said, powerpc, 64-bit s390, sparc, and x86 seem to handle it.

Ah my bad, so the in-kernel kbuf variant uses unsigned long, which on
all archs should be the native word size and cover the address space.

Only the public variant (ubuf) is u64 wide to not change data structure
layout on compat etc.

I suppose this was a victim on the simplification :/

And in case of 32bit the upper word will always be zero and the partial
reads should all work out just fine.

> Of course, x86's local_cmpxchg() has full memory barriers implicitly.

Not quite: the 'lock' prefix in __raw_cmpxchg() expands to "" due to
__cmpxchg_local(), etc.

> > 
> >         /*
> >          * Ensure that if we see the userspace tail (ubuf->tail) such
> >          * that there is space to write @buf without overwriting data
> >          * userspace hasn't seen yet, we won't in fact store data before
> >          * that read completes.
> >          */
> > 
> >         smp_mb(); /* A, matches with D */
> 
> Given a change to smp_load_with_acquire_semantics() above, you should not
> need this smp_mb().

Because the STORES can not be hoisted across the ACQUIRE, indeed.

> 
> >         memcpy(kbuf->data + offset, buf, sz);
> > 
> >         /*
> >          * Ensure that we write all the @buf data before we update the
> >          * userspace visible ubuf->head pointer.
> >          */
> >         smp_wmb(); /* B, matches with C */
> > 
> >         ubuf->head = kbuf->head;
> 
> Replace the smp_wmb() and the assignment with:
> 
> 	smp_store_with_release_semantics(ubuf->head, kbuf->head); /* B -> C */

And here the RELEASE semantics I assume are the same as the
semi-permeable UNLOCK from _The_ document? In which case the above
STORES cannot be lowered across this store and all should, again, be
well.

> > }
> > 
> > /*
> >  * Consume the buffer data and update the tail pointer to indicate to
> >  * kernel space there's 'free' space.
> >  */
> > void ubuf_read(void)
> > {
> >         u64 head, tail;
> > 
> >         tail = ACCESS_ONCE(ubuf->tail);
> 
> Does anyone else write tail?  Or is this defense against NMIs?

No, we're the sole writer; just general paranoia. Not sure the actual
userspace does this; /me checks. Nope, distinct lack of ACCESS_ONCE()
there, just the rmb(), which, since it includes a barrier(), should hopefully
accomplish similar things most of the time ;-)

I'll need to introduce ACCESS_ONCE to the userspace code.
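
Presumably just a copy of the kernel's definition (assuming a GCC-compatible
compiler in userspace for the typeof extension):

	#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))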

> If no one else writes to tail and if NMIs cannot muck things up, then
> the above ACCESS_ONCE() is not needed, though I would not object to its
> staying.
> 
> >         head = ACCESS_ONCE(ubuf->head);
> 
> Make the above be:
> 
> 	head = smp_load_with_acquire_semantics(ubuf->head);  /* C -> B */
> 
> >         /*
> >          * Ensure we read the buffer boundaries before the actual buffer
> >          * data...
> >          */
> >         smp_rmb(); /* C, matches with B */
> 
> And drop the above memory barrier.
> 
> >         while (tail != head) {
> >                 obj = ubuf->data + tail;
> >                 /* process obj */
> >                 tail += obj->size;
> >                 tail %= ubuf->size;
> >         }
> > 
> >         /*
> >          * Ensure all data reads are complete before we issue the
> >          * ubuf->tail update; once that update hits, kbuf_write() can
> >          * observe and overwrite data.
> >          */
> >         smp_mb(); /* D, matches with A */
> > 
> >         ubuf->tail = tail;
> 
> Replace the above barrier and the assignment with:
> 
> 	smp_store_with_release_semantics(ubuf->tail, tail); /* D -> B. */
> 
> > }

Right, so this consumer side isn't called that often and the two
barriers are only per consume, not per event, so I don't care too much
about these.

It would also mean hoisting the implementation of the proposed
primitives into userspace -- which reminds me: should we make
include/asm/barrier.h a uapi header?
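
For completeness, with your replacements folded in, and assuming the
primitives do end up visible to userspace in some form, the consumer
would read roughly like this sketch:

	void ubuf_read(void)
	{
		u64 head, tail;

		tail = ACCESS_ONCE(ubuf->tail);

		/* C -> B: no data reads can be hoisted above this load */
		head = smp_load_acquire(ubuf->head);

		while (tail != head) {
			obj = ubuf->data + tail;
			/* process obj */
			tail += obj->size;
			tail %= ubuf->size;
		}

		/* D -> A: all data reads complete before the new tail is visible */
		smp_store_release(ubuf->tail, tail);
	}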

> All this is leading me to suggest the following shortenings of names:
> 
> 	smp_load_with_acquire_semantics() -> smp_load_acquire()
> 
> 	smp_store_with_release_semantics() -> smp_store_release()
> 
> But names aside, the above gets rid of explicit barriers on TSO architectures,
> allows ARM to avoid full DMB, and allows PowerPC to use lwsync instead of
> the heavier-weight sync.

Totally awesome! ;-) And full ack on the shorter names.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-04 16:27                                           ` Paul E. McKenney
  2013-11-04 16:48                                             ` Peter Zijlstra
@ 2013-11-04 19:11                                             ` Peter Zijlstra
  2013-11-04 19:18                                               ` Peter Zijlstra
  2013-11-04 20:53                                               ` Paul E. McKenney
  1 sibling, 2 replies; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-04 19:11 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On Mon, Nov 04, 2013 at 08:27:32AM -0800, Paul E. McKenney wrote:
> All this is leading me to suggest the following shortenings of names:
> 
> 	smp_load_with_acquire_semantics() -> smp_load_acquire()
> 
> 	smp_store_with_release_semantics() -> smp_store_release()
> 
> But names aside, the above gets rid of explicit barriers on TSO architectures,
> allows ARM to avoid full DMB, and allows PowerPC to use lwsync instead of
> the heavier-weight sync.

A little something like this? Completely guessed at the arm/arm64/ia64
asm, but at least for those archs I found proper instructions (I hope).
For x86, sparc and s390, which are TSO, we can do with a barrier(), and
PPC, as said, can do with lwsync; all others fall back to smp_mb().

Should probably come with a proper changelog and an addition to _The_
document.

---
 arch/alpha/include/asm/barrier.h      | 13 +++++++++++
 arch/arc/include/asm/barrier.h        | 13 +++++++++++
 arch/arm/include/asm/barrier.h        | 26 +++++++++++++++++++++
 arch/arm64/include/asm/barrier.h      | 28 +++++++++++++++++++++++
 arch/avr32/include/asm/barrier.h      | 12 ++++++++++
 arch/blackfin/include/asm/barrier.h   | 13 +++++++++++
 arch/cris/include/asm/barrier.h       | 13 +++++++++++
 arch/frv/include/asm/barrier.h        | 13 +++++++++++
 arch/h8300/include/asm/barrier.h      | 13 +++++++++++
 arch/hexagon/include/asm/barrier.h    | 13 +++++++++++
 arch/ia64/include/asm/barrier.h       | 43 +++++++++++++++++++++++++++++++++++
 arch/m32r/include/asm/barrier.h       | 13 +++++++++++
 arch/m68k/include/asm/barrier.h       | 13 +++++++++++
 arch/metag/include/asm/barrier.h      | 13 +++++++++++
 arch/microblaze/include/asm/barrier.h | 13 +++++++++++
 arch/mips/include/asm/barrier.h       | 13 +++++++++++
 arch/mn10300/include/asm/barrier.h    | 13 +++++++++++
 arch/parisc/include/asm/barrier.h     | 13 +++++++++++
 arch/powerpc/include/asm/barrier.h    | 15 ++++++++++++
 arch/s390/include/asm/barrier.h       | 13 +++++++++++
 arch/score/include/asm/barrier.h      | 13 +++++++++++
 arch/sh/include/asm/barrier.h         | 13 +++++++++++
 arch/sparc/include/asm/barrier_32.h   | 13 +++++++++++
 arch/sparc/include/asm/barrier_64.h   | 13 +++++++++++
 arch/tile/include/asm/barrier.h       | 13 +++++++++++
 arch/unicore32/include/asm/barrier.h  | 13 +++++++++++
 arch/x86/include/asm/barrier.h        | 13 +++++++++++
 arch/xtensa/include/asm/barrier.h     | 13 +++++++++++
 28 files changed, 423 insertions(+)

diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
index ce8860a0b32d..464139feee97 100644
--- a/arch/alpha/include/asm/barrier.h
+++ b/arch/alpha/include/asm/barrier.h
@@ -29,6 +29,19 @@ __asm__ __volatile__("mb": : :"memory")
 #define smp_read_barrier_depends()	do { } while (0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #define set_mb(var, value) \
 do { var = value; mb(); } while (0)
 
diff --git a/arch/arc/include/asm/barrier.h b/arch/arc/include/asm/barrier.h
index f6cb7c4ffb35..a779da846fb5 100644
--- a/arch/arc/include/asm/barrier.h
+++ b/arch/arc/include/asm/barrier.h
@@ -30,6 +30,19 @@
 #define smp_wmb()       barrier()
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #define smp_mb__before_atomic_dec()	barrier()
 #define smp_mb__after_atomic_dec()	barrier()
 #define smp_mb__before_atomic_inc()	barrier()
diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index 60f15e274e6d..a804093d6891 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -53,10 +53,36 @@
 #define smp_mb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
+
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
 #else
 #define smp_mb()	dmb(ish)
 #define smp_rmb()	smp_mb()
 #define smp_wmb()	dmb(ishst)
+
+#define smp_store_release(p, v)						\
+do {									\
+	asm volatile ("stlr %w0 [%1]" : : "r" (v), "r" (&p) : "memory");\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1;						\
+	asm volatile ("ldar %w0, [%1]"					\
+			: "=r" (___p1) : "r" (&p) : "memory");		\
+	return ___p1;							\
+} while (0)
 #endif
 
 #define read_barrier_depends()		do { } while(0)
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index d4a63338a53c..0da2d4ebb9a8 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -35,10 +35,38 @@
 #define smp_mb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
+
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #else
+
 #define smp_mb()	asm volatile("dmb ish" : : : "memory")
 #define smp_rmb()	asm volatile("dmb ishld" : : : "memory")
 #define smp_wmb()	asm volatile("dmb ishst" : : : "memory")
+
+#define smp_store_release(p, v)						\
+do {									\
+	asm volatile ("stlr %w0 [%1]" : : "r" (v), "r" (&p) : "memory");\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1;						\
+	asm volatile ("ldar %w0, [%1]"					\
+			: "=r" (___p1) : "r" (&p) : "memory");		\
+	return ___p1;							\
+} while (0)
 #endif
 
 #define read_barrier_depends()		do { } while(0)
diff --git a/arch/avr32/include/asm/barrier.h b/arch/avr32/include/asm/barrier.h
index 0961275373db..a0c48ad684f8 100644
--- a/arch/avr32/include/asm/barrier.h
+++ b/arch/avr32/include/asm/barrier.h
@@ -25,5 +25,17 @@
 # define smp_read_barrier_depends() do { } while(0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
 
 #endif /* __ASM_AVR32_BARRIER_H */
diff --git a/arch/blackfin/include/asm/barrier.h b/arch/blackfin/include/asm/barrier.h
index ebb189507dd7..67889d9225d9 100644
--- a/arch/blackfin/include/asm/barrier.h
+++ b/arch/blackfin/include/asm/barrier.h
@@ -45,4 +45,17 @@
 #define set_mb(var, value) do { var = value; mb(); } while (0)
 #define smp_read_barrier_depends()	read_barrier_depends()
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _BLACKFIN_BARRIER_H */
diff --git a/arch/cris/include/asm/barrier.h b/arch/cris/include/asm/barrier.h
index 198ad7fa6b25..34243dc44ef1 100644
--- a/arch/cris/include/asm/barrier.h
+++ b/arch/cris/include/asm/barrier.h
@@ -22,4 +22,17 @@
 #define smp_read_barrier_depends()     do { } while(0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* __ASM_CRIS_BARRIER_H */
diff --git a/arch/frv/include/asm/barrier.h b/arch/frv/include/asm/barrier.h
index 06776ad9f5e9..92f89934d4ed 100644
--- a/arch/frv/include/asm/barrier.h
+++ b/arch/frv/include/asm/barrier.h
@@ -26,4 +26,17 @@
 #define set_mb(var, value) \
 	do { var = (value); barrier(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _ASM_BARRIER_H */
diff --git a/arch/h8300/include/asm/barrier.h b/arch/h8300/include/asm/barrier.h
index 9e0aa9fc195d..516e9d379e25 100644
--- a/arch/h8300/include/asm/barrier.h
+++ b/arch/h8300/include/asm/barrier.h
@@ -26,4 +26,17 @@
 #define smp_read_barrier_depends()	do { } while(0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _H8300_BARRIER_H */
diff --git a/arch/hexagon/include/asm/barrier.h b/arch/hexagon/include/asm/barrier.h
index 1041a8e70ce8..838a2ebe07a5 100644
--- a/arch/hexagon/include/asm/barrier.h
+++ b/arch/hexagon/include/asm/barrier.h
@@ -38,4 +38,17 @@
 #define set_mb(var, value) \
 	do { var = value; mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _ASM_BARRIER_H */
diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
index 60576e06b6fb..4598d390fabb 100644
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -45,11 +45,54 @@
 # define smp_rmb()	rmb()
 # define smp_wmb()	wmb()
 # define smp_read_barrier_depends()	read_barrier_depends()
+
+#define smp_store_release(p, v)						\
+do {									\
+	switch (sizeof(p)) {						\
+	case 4:								\
+		asm volatile ("st4.acq [%0]=%1" 			\
+				:: "r" (&p), "r" (v) : "memory");	\
+		break;							\
+	case 8:								\
+		asm volatile ("st8.acq [%0]=%1" 			\
+				:: "r" (&p), "r" (v) : "memory"); 	\
+		break;							\
+	}								\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1;						\
+	switch (sizeof(p)) {						\
+	case 4:								\
+		asm volatile ("ld4.rel %0=[%1]" 			\
+				: "=r"(___p1) : "r" (&p) : "memory"); 	\
+		break;							\
+	case 8:								\
+		asm volatile ("ld8.rel %0=[%1]" 			\
+				: "=r"(___p1) : "r" (&p) : "memory"); 	\
+		break;							\
+	}								\
+	return ___p1;							\
+} while (0)
 #else
 # define smp_mb()	barrier()
 # define smp_rmb()	barrier()
 # define smp_wmb()	barrier()
 # define smp_read_barrier_depends()	do { } while(0)
+
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
 #endif
 
 /*
diff --git a/arch/m32r/include/asm/barrier.h b/arch/m32r/include/asm/barrier.h
index 6976621efd3f..e5d42bcf90c5 100644
--- a/arch/m32r/include/asm/barrier.h
+++ b/arch/m32r/include/asm/barrier.h
@@ -91,4 +91,17 @@
 #define set_mb(var, value) do { var = value; barrier(); } while (0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _ASM_M32R_BARRIER_H */
diff --git a/arch/m68k/include/asm/barrier.h b/arch/m68k/include/asm/barrier.h
index 445ce22c23cb..eeb9ecf713cc 100644
--- a/arch/m68k/include/asm/barrier.h
+++ b/arch/m68k/include/asm/barrier.h
@@ -17,4 +17,17 @@
 #define smp_wmb()	barrier()
 #define smp_read_barrier_depends()	((void)0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _M68K_BARRIER_H */
diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
index c90bfc6bf648..d8e6f2e4a27c 100644
--- a/arch/metag/include/asm/barrier.h
+++ b/arch/metag/include/asm/barrier.h
@@ -82,4 +82,17 @@ static inline void fence(void)
 #define smp_read_barrier_depends()     do { } while (0)
 #define set_mb(var, value) do { var = value; smp_mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _ASM_METAG_BARRIER_H */
diff --git a/arch/microblaze/include/asm/barrier.h b/arch/microblaze/include/asm/barrier.h
index df5be3e87044..a890702061c9 100644
--- a/arch/microblaze/include/asm/barrier.h
+++ b/arch/microblaze/include/asm/barrier.h
@@ -24,4 +24,17 @@
 #define smp_rmb()		rmb()
 #define smp_wmb()		wmb()
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _ASM_MICROBLAZE_BARRIER_H */
diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
index 314ab5532019..e59bcd051f36 100644
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -180,4 +180,17 @@
 #define nudge_writes() mb()
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* __ASM_BARRIER_H */
diff --git a/arch/mn10300/include/asm/barrier.h b/arch/mn10300/include/asm/barrier.h
index 2bd97a5c8af7..0e6a0608d4a1 100644
--- a/arch/mn10300/include/asm/barrier.h
+++ b/arch/mn10300/include/asm/barrier.h
@@ -34,4 +34,17 @@
 #define read_barrier_depends()		do {} while (0)
 #define smp_read_barrier_depends()	do {} while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _ASM_BARRIER_H */
diff --git a/arch/parisc/include/asm/barrier.h b/arch/parisc/include/asm/barrier.h
index e77d834aa803..f1145a8594a0 100644
--- a/arch/parisc/include/asm/barrier.h
+++ b/arch/parisc/include/asm/barrier.h
@@ -32,4 +32,17 @@
 
 #define set_mb(var, value)		do { var = value; mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* __PARISC_BARRIER_H */
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index ae782254e731..b5cc36791f42 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -65,4 +65,19 @@
 #define data_barrier(x)	\
 	asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
 
+/* use smp_rmb() as that is either lwsync or a barrier() depending on SMP */
+
+#define smp_store_release(p, v)						\
+do {									\
+	smp_rmb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_rmb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _ASM_POWERPC_BARRIER_H */
diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
index 16760eeb79b0..e8989c40e11c 100644
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -32,4 +32,17 @@
 
 #define set_mb(var, value)		do { var = value; mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	barrier();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	barrier();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* __ASM_BARRIER_H */
diff --git a/arch/score/include/asm/barrier.h b/arch/score/include/asm/barrier.h
index 0eacb6471e6d..5f101ef8ade9 100644
--- a/arch/score/include/asm/barrier.h
+++ b/arch/score/include/asm/barrier.h
@@ -13,4 +13,17 @@
 
 #define set_mb(var, value) 		do {var = value; wmb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _ASM_SCORE_BARRIER_H */
diff --git a/arch/sh/include/asm/barrier.h b/arch/sh/include/asm/barrier.h
index 72c103dae300..611128c2f636 100644
--- a/arch/sh/include/asm/barrier.h
+++ b/arch/sh/include/asm/barrier.h
@@ -51,4 +51,17 @@
 
 #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* __ASM_SH_BARRIER_H */
diff --git a/arch/sparc/include/asm/barrier_32.h b/arch/sparc/include/asm/barrier_32.h
index c1b76654ee76..f47f9d51f326 100644
--- a/arch/sparc/include/asm/barrier_32.h
+++ b/arch/sparc/include/asm/barrier_32.h
@@ -12,4 +12,17 @@
 #define smp_wmb()	__asm__ __volatile__("":::"memory")
 #define smp_read_barrier_depends()	do { } while(0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* !(__SPARC_BARRIER_H) */
diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
index 95d45986f908..77cbe6982ca0 100644
--- a/arch/sparc/include/asm/barrier_64.h
+++ b/arch/sparc/include/asm/barrier_64.h
@@ -53,4 +53,17 @@ do {	__asm__ __volatile__("ba,pt	%%xcc, 1f\n\t" \
 
 #define smp_read_barrier_depends()	do { } while(0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	barrier();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	barrier();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* !(__SPARC64_BARRIER_H) */
diff --git a/arch/tile/include/asm/barrier.h b/arch/tile/include/asm/barrier.h
index a9a73da5865d..4d5330d4fd31 100644
--- a/arch/tile/include/asm/barrier.h
+++ b/arch/tile/include/asm/barrier.h
@@ -140,5 +140,18 @@ mb_incoherent(void)
 #define set_mb(var, value) \
 	do { var = value; mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_TILE_BARRIER_H */
diff --git a/arch/unicore32/include/asm/barrier.h b/arch/unicore32/include/asm/barrier.h
index a6620e5336b6..5471ff6aae10 100644
--- a/arch/unicore32/include/asm/barrier.h
+++ b/arch/unicore32/include/asm/barrier.h
@@ -25,4 +25,17 @@
 
 #define set_mb(var, value)		do { var = value; smp_mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* __UNICORE_BARRIER_H__ */
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index c6cd358a1eec..a7fd8201ab09 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -100,6 +100,19 @@
 #define set_mb(var, value) do { var = value; barrier(); } while (0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	barrier();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	barrier();							\
+	return ___p1;							\
+} while (0)
+
 /*
  * Stop RDTSC speculation. This is needed when you need to use RDTSC
  * (or get_cycles or vread that possibly accesses the TSC) in a defined
diff --git a/arch/xtensa/include/asm/barrier.h b/arch/xtensa/include/asm/barrier.h
index ef021677d536..703d511add49 100644
--- a/arch/xtensa/include/asm/barrier.h
+++ b/arch/xtensa/include/asm/barrier.h
@@ -26,4 +26,17 @@
 
 #define set_mb(var, value)	do { var = value; mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p, v)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _XTENSA_SYSTEM_H */

^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-04 19:11                                             ` Peter Zijlstra
@ 2013-11-04 19:18                                               ` Peter Zijlstra
  2013-11-04 20:54                                                 ` Paul E. McKenney
  2013-11-04 20:53                                               ` Paul E. McKenney
  1 sibling, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-04 19:18 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On Mon, Nov 04, 2013 at 08:11:27PM +0100, Peter Zijlstra wrote:
> +#define smp_load_acquire(p, v)						\

I R idiot!! :-)

---
 arch/alpha/include/asm/barrier.h      | 13 +++++++++++
 arch/arc/include/asm/barrier.h        | 13 +++++++++++
 arch/arm/include/asm/barrier.h        | 26 +++++++++++++++++++++
 arch/arm64/include/asm/barrier.h      | 28 +++++++++++++++++++++++
 arch/avr32/include/asm/barrier.h      | 12 ++++++++++
 arch/blackfin/include/asm/barrier.h   | 13 +++++++++++
 arch/cris/include/asm/barrier.h       | 13 +++++++++++
 arch/frv/include/asm/barrier.h        | 13 +++++++++++
 arch/h8300/include/asm/barrier.h      | 13 +++++++++++
 arch/hexagon/include/asm/barrier.h    | 13 +++++++++++
 arch/ia64/include/asm/barrier.h       | 43 +++++++++++++++++++++++++++++++++++
 arch/m32r/include/asm/barrier.h       | 13 +++++++++++
 arch/m68k/include/asm/barrier.h       | 13 +++++++++++
 arch/metag/include/asm/barrier.h      | 13 +++++++++++
 arch/microblaze/include/asm/barrier.h | 13 +++++++++++
 arch/mips/include/asm/barrier.h       | 13 +++++++++++
 arch/mn10300/include/asm/barrier.h    | 13 +++++++++++
 arch/parisc/include/asm/barrier.h     | 13 +++++++++++
 arch/powerpc/include/asm/barrier.h    | 15 ++++++++++++
 arch/s390/include/asm/barrier.h       | 13 +++++++++++
 arch/score/include/asm/barrier.h      | 13 +++++++++++
 arch/sh/include/asm/barrier.h         | 13 +++++++++++
 arch/sparc/include/asm/barrier_32.h   | 13 +++++++++++
 arch/sparc/include/asm/barrier_64.h   | 13 +++++++++++
 arch/tile/include/asm/barrier.h       | 13 +++++++++++
 arch/unicore32/include/asm/barrier.h  | 13 +++++++++++
 arch/x86/include/asm/barrier.h        | 13 +++++++++++
 arch/xtensa/include/asm/barrier.h     | 13 +++++++++++
 28 files changed, 423 insertions(+)

diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
index ce8860a0b32d..464139feee97 100644
--- a/arch/alpha/include/asm/barrier.h
+++ b/arch/alpha/include/asm/barrier.h
@@ -29,6 +29,19 @@ __asm__ __volatile__("mb": : :"memory")
 #define smp_read_barrier_depends()	do { } while (0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #define set_mb(var, value) \
 do { var = value; mb(); } while (0)
 
diff --git a/arch/arc/include/asm/barrier.h b/arch/arc/include/asm/barrier.h
index f6cb7c4ffb35..a779da846fb5 100644
--- a/arch/arc/include/asm/barrier.h
+++ b/arch/arc/include/asm/barrier.h
@@ -30,6 +30,19 @@
 #define smp_wmb()       barrier()
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #define smp_mb__before_atomic_dec()	barrier()
 #define smp_mb__after_atomic_dec()	barrier()
 #define smp_mb__before_atomic_inc()	barrier()
diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index 60f15e274e6d..4ada4720bdeb 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -53,10 +53,36 @@
 #define smp_mb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
+
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
 #else
 #define smp_mb()	dmb(ish)
 #define smp_rmb()	smp_mb()
 #define smp_wmb()	dmb(ishst)
+
+#define smp_store_release(p, v)						\
+do {									\
+	asm volatile ("stlr %w0 [%1]" : : "r" (v), "r" (&p) : "memory");\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1;						\
+	asm volatile ("ldar %w0, [%1]"					\
+			: "=r" (___p1) : "r" (&p) : "memory");		\
+	return ___p1;							\
+} while (0)
 #endif
 
 #define read_barrier_depends()		do { } while(0)
diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index d4a63338a53c..3dfddc0416f6 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -35,10 +35,38 @@
 #define smp_mb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
+
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #else
+
 #define smp_mb()	asm volatile("dmb ish" : : : "memory")
 #define smp_rmb()	asm volatile("dmb ishld" : : : "memory")
 #define smp_wmb()	asm volatile("dmb ishst" : : : "memory")
+
+#define smp_store_release(p, v)						\
+do {									\
+	asm volatile ("stlr %w0 [%1]" : : "r" (v), "r" (&p) : "memory");\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1;						\
+	asm volatile ("ldar %w0, [%1]" 					\
+			: "=r" (___p1) : "r" (&p) : "memory"); 		\
+	return ___p1;							\
+} while (0)
 #endif
 
 #define read_barrier_depends()		do { } while(0)
diff --git a/arch/avr32/include/asm/barrier.h b/arch/avr32/include/asm/barrier.h
index 0961275373db..8fd164648e71 100644
--- a/arch/avr32/include/asm/barrier.h
+++ b/arch/avr32/include/asm/barrier.h
@@ -25,5 +25,17 @@
 # define smp_read_barrier_depends() do { } while(0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
 
 #endif /* __ASM_AVR32_BARRIER_H */
diff --git a/arch/blackfin/include/asm/barrier.h b/arch/blackfin/include/asm/barrier.h
index ebb189507dd7..c8b85bba843f 100644
--- a/arch/blackfin/include/asm/barrier.h
+++ b/arch/blackfin/include/asm/barrier.h
@@ -45,4 +45,17 @@
 #define set_mb(var, value) do { var = value; mb(); } while (0)
 #define smp_read_barrier_depends()	read_barrier_depends()
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _BLACKFIN_BARRIER_H */
diff --git a/arch/cris/include/asm/barrier.h b/arch/cris/include/asm/barrier.h
index 198ad7fa6b25..26f21f5d1d15 100644
--- a/arch/cris/include/asm/barrier.h
+++ b/arch/cris/include/asm/barrier.h
@@ -22,4 +22,17 @@
 #define smp_read_barrier_depends()     do { } while(0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* __ASM_CRIS_BARRIER_H */
diff --git a/arch/frv/include/asm/barrier.h b/arch/frv/include/asm/barrier.h
index 06776ad9f5e9..4569028382fa 100644
--- a/arch/frv/include/asm/barrier.h
+++ b/arch/frv/include/asm/barrier.h
@@ -26,4 +26,17 @@
 #define set_mb(var, value) \
 	do { var = (value); barrier(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _ASM_BARRIER_H */
diff --git a/arch/h8300/include/asm/barrier.h b/arch/h8300/include/asm/barrier.h
index 9e0aa9fc195d..45d36738814d 100644
--- a/arch/h8300/include/asm/barrier.h
+++ b/arch/h8300/include/asm/barrier.h
@@ -26,4 +26,17 @@
 #define smp_read_barrier_depends()	do { } while(0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _H8300_BARRIER_H */
diff --git a/arch/hexagon/include/asm/barrier.h b/arch/hexagon/include/asm/barrier.h
index 1041a8e70ce8..d88d54bd2e6e 100644
--- a/arch/hexagon/include/asm/barrier.h
+++ b/arch/hexagon/include/asm/barrier.h
@@ -38,4 +38,17 @@
 #define set_mb(var, value) \
 	do { var = value; mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _ASM_BARRIER_H */
diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
index 60576e06b6fb..b7f1a8aa03af 100644
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -45,11 +45,54 @@
 # define smp_rmb()	rmb()
 # define smp_wmb()	wmb()
 # define smp_read_barrier_depends()	read_barrier_depends()
+
+#define smp_store_release(p, v)						\
+do {									\
+	switch (sizeof(p)) {						\
+	case 4:								\
+		asm volatile ("st4.acq [%0]=%1"				\
+				:: "r" (&p), "r" (v) : "memory");	\
+		break;							\
+	case 8:								\
+		asm volatile ("st8.acq [%0]=%1"				\
+				:: "r" (&p), "r" (v) : "memory");	\
+		break;							\
+	}								\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1;						\
+	switch (sizeof(p)) {						\
+	case 4:								\
+		asm volatile ("ld4.rel %0=[%1]"				\
+				: "=r"(___p1) : "r" (&p) : "memory");	\
+		break;							\
+	case 8:								\
+		asm volatile ("ld8.rel %0=[%1]"				\
+				: "=r"(___p1) : "r" (&p) : "memory");	\
+		break;							\
+	}								\
+	return ___p1;							\
+} while (0)
 #else
 # define smp_mb()	barrier()
 # define smp_rmb()	barrier()
 # define smp_wmb()	barrier()
 # define smp_read_barrier_depends()	do { } while(0)
+
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
 #endif
 
 /*
diff --git a/arch/m32r/include/asm/barrier.h b/arch/m32r/include/asm/barrier.h
index 6976621efd3f..d78612289cb2 100644
--- a/arch/m32r/include/asm/barrier.h
+++ b/arch/m32r/include/asm/barrier.h
@@ -91,4 +91,17 @@
 #define set_mb(var, value) do { var = value; barrier(); } while (0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _ASM_M32R_BARRIER_H */
diff --git a/arch/m68k/include/asm/barrier.h b/arch/m68k/include/asm/barrier.h
index 445ce22c23cb..1e63b11c424c 100644
--- a/arch/m68k/include/asm/barrier.h
+++ b/arch/m68k/include/asm/barrier.h
@@ -17,4 +17,17 @@
 #define smp_wmb()	barrier()
 #define smp_read_barrier_depends()	((void)0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _M68K_BARRIER_H */
diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
index c90bfc6bf648..9ffd0b167f07 100644
--- a/arch/metag/include/asm/barrier.h
+++ b/arch/metag/include/asm/barrier.h
@@ -82,4 +82,17 @@ static inline void fence(void)
 #define smp_read_barrier_depends()     do { } while (0)
 #define set_mb(var, value) do { var = value; smp_mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _ASM_METAG_BARRIER_H */
diff --git a/arch/microblaze/include/asm/barrier.h b/arch/microblaze/include/asm/barrier.h
index df5be3e87044..db0b5e205ce3 100644
--- a/arch/microblaze/include/asm/barrier.h
+++ b/arch/microblaze/include/asm/barrier.h
@@ -24,4 +24,17 @@
 #define smp_rmb()		rmb()
 #define smp_wmb()		wmb()
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _ASM_MICROBLAZE_BARRIER_H */
diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
index 314ab5532019..8031afcc7f64 100644
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -180,4 +180,17 @@
 #define nudge_writes() mb()
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* __ASM_BARRIER_H */
diff --git a/arch/mn10300/include/asm/barrier.h b/arch/mn10300/include/asm/barrier.h
index 2bd97a5c8af7..e822ff76f498 100644
--- a/arch/mn10300/include/asm/barrier.h
+++ b/arch/mn10300/include/asm/barrier.h
@@ -34,4 +34,17 @@
 #define read_barrier_depends()		do {} while (0)
 #define smp_read_barrier_depends()	do {} while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _ASM_BARRIER_H */
diff --git a/arch/parisc/include/asm/barrier.h b/arch/parisc/include/asm/barrier.h
index e77d834aa803..58757747f873 100644
--- a/arch/parisc/include/asm/barrier.h
+++ b/arch/parisc/include/asm/barrier.h
@@ -32,4 +32,17 @@
 
 #define set_mb(var, value)		do { var = value; mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* __PARISC_BARRIER_H */
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index ae782254e731..54922626b356 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -65,4 +65,19 @@
 #define data_barrier(x)	\
 	asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
 
+/* use smp_rmb() as that is either lwsync or a barrier() depending on SMP */
+
+#define smp_store_release(p, v)						\
+do {									\
+	smp_rmb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_rmb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _ASM_POWERPC_BARRIER_H */
diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
index 16760eeb79b0..babf928649a4 100644
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -32,4 +32,17 @@
 
 #define set_mb(var, value)		do { var = value; mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	barrier();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	barrier();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* __ASM_BARRIER_H */
diff --git a/arch/score/include/asm/barrier.h b/arch/score/include/asm/barrier.h
index 0eacb6471e6d..5905ea57a104 100644
--- a/arch/score/include/asm/barrier.h
+++ b/arch/score/include/asm/barrier.h
@@ -13,4 +13,17 @@
 
 #define set_mb(var, value) 		do {var = value; wmb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _ASM_SCORE_BARRIER_H */
diff --git a/arch/sh/include/asm/barrier.h b/arch/sh/include/asm/barrier.h
index 72c103dae300..379f500023b6 100644
--- a/arch/sh/include/asm/barrier.h
+++ b/arch/sh/include/asm/barrier.h
@@ -51,4 +51,17 @@
 
 #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* __ASM_SH_BARRIER_H */
diff --git a/arch/sparc/include/asm/barrier_32.h b/arch/sparc/include/asm/barrier_32.h
index c1b76654ee76..1649081d1b86 100644
--- a/arch/sparc/include/asm/barrier_32.h
+++ b/arch/sparc/include/asm/barrier_32.h
@@ -12,4 +12,17 @@
 #define smp_wmb()	__asm__ __volatile__("":::"memory")
 #define smp_read_barrier_depends()	do { } while(0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* !(__SPARC_BARRIER_H) */
diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
index 95d45986f908..5e23ced0a29a 100644
--- a/arch/sparc/include/asm/barrier_64.h
+++ b/arch/sparc/include/asm/barrier_64.h
@@ -53,4 +53,17 @@ do {	__asm__ __volatile__("ba,pt	%%xcc, 1f\n\t" \
 
 #define smp_read_barrier_depends()	do { } while(0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	barrier();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	barrier();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* !(__SPARC64_BARRIER_H) */
diff --git a/arch/tile/include/asm/barrier.h b/arch/tile/include/asm/barrier.h
index a9a73da5865d..1f08318db3c0 100644
--- a/arch/tile/include/asm/barrier.h
+++ b/arch/tile/include/asm/barrier.h
@@ -140,5 +140,18 @@ mb_incoherent(void)
 #define set_mb(var, value) \
 	do { var = value; mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_TILE_BARRIER_H */
diff --git a/arch/unicore32/include/asm/barrier.h b/arch/unicore32/include/asm/barrier.h
index a6620e5336b6..fa8bf69d9a09 100644
--- a/arch/unicore32/include/asm/barrier.h
+++ b/arch/unicore32/include/asm/barrier.h
@@ -25,4 +25,17 @@
 
 #define set_mb(var, value)		do { var = value; smp_mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* __UNICORE_BARRIER_H__ */
diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
index c6cd358a1eec..115ef72b3784 100644
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -100,6 +100,19 @@
 #define set_mb(var, value) do { var = value; barrier(); } while (0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	barrier();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	barrier();							\
+	return ___p1;							\
+} while (0)
+
 /*
  * Stop RDTSC speculation. This is needed when you need to use RDTSC
  * (or get_cycles or vread that possibly accesses the TSC) in a defined
diff --git a/arch/xtensa/include/asm/barrier.h b/arch/xtensa/include/asm/barrier.h
index ef021677d536..e96a674c337a 100644
--- a/arch/xtensa/include/asm/barrier.h
+++ b/arch/xtensa/include/asm/barrier.h
@@ -26,4 +26,17 @@
 
 #define set_mb(var, value)	do { var = value; mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	smp_mb();							\
+	ACCESS_ONCE(p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+do {									\
+	typeof(p) ___p1 = ACCESS_ONCE(p);				\
+	smp_mb();							\
+	return ___p1;							\
+} while (0)
+
 #endif /* _XTENSA_SYSTEM_H */

^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-04 19:11                                             ` Peter Zijlstra
  2013-11-04 19:18                                               ` Peter Zijlstra
@ 2013-11-04 20:53                                               ` Paul E. McKenney
  2013-11-05 14:05                                                 ` Will Deacon
  2013-11-06 12:39                                                 ` Peter Zijlstra
  1 sibling, 2 replies; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-04 20:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling, linux,
	schwidefsky, heiko.carstens

On Mon, Nov 04, 2013 at 08:11:27PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 04, 2013 at 08:27:32AM -0800, Paul E. McKenney wrote:
> > All this is leading me to suggest the following shortenings of names:
> > 
> > 	smp_load_with_acquire_semantics() -> smp_load_acquire()
> > 
> > 	smp_store_with_release_semantics() -> smp_store_release()
> > 
> > But names aside, the above gets rid of explicit barriers on TSO architectures,
> > allows ARM to avoid full DMB, and allows PowerPC to use lwsync instead of
> > the heavier-weight sync.
> 
> A little something like this? Completely guessed at the arm/arm64/ia64
> asm, but at least for those archs I found proper instructions (I hope).
> For x86, sparc and s390, which are TSO, we can do with a barrier(), and
> PPC, as said, can do with lwsync; all others fall back to smp_mb().
> 
> Should probably come with a proper changelog and an addition to _The_
> document.

Maybe something like this for the changelog?

	A number of situations currently require the heavyweight smp_mb(),
	even though there is no need to order prior stores against later
	loads.  Many architectures have much cheaper ways to handle these
	situations, but the Linux kernel currently has no portable way
	to make use of them.

	This commit therefore supplies smp_load_acquire() and
	smp_store_release() to remedy this situation.  The new
	smp_load_acquire() primitive orders the specified load against
	any subsequent reads or writes, while the new smp_store_release()
	primitive orders the specified store against any prior reads or
	writes.  These primitives allow array-based circular FIFOs to be
	implemented without an smp_mb(), and also allow a theoretical
	hole in rcu_assign_pointer() to be closed at no additional
	expense on most architectures.

	In addition, the RCU experience transitioning from explicit
	smp_read_barrier_depends() and smp_wmb() to rcu_dereference()
	and rcu_assign_pointer(), respectively, resulted in substantial
	improvements in readability.  It therefore seems likely that
	replacing other explicit barriers with smp_load_acquire() and
	smp_store_release() will provide similar benefits.  It appears
	that roughly half of the explicit barriers in core kernel code
	might be so replaced.

Some comments below.  I believe that opcodes need to be fixed for IA64.
I am unsure of the ifdefs and opcodes for arm64, but the ARM folks should
be able to tell us.

Other than that, for the rest:

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> ---
>  arch/alpha/include/asm/barrier.h      | 13 +++++++++++
>  arch/arc/include/asm/barrier.h        | 13 +++++++++++
>  arch/arm/include/asm/barrier.h        | 26 +++++++++++++++++++++
>  arch/arm64/include/asm/barrier.h      | 28 +++++++++++++++++++++++
>  arch/avr32/include/asm/barrier.h      | 12 ++++++++++
>  arch/blackfin/include/asm/barrier.h   | 13 +++++++++++
>  arch/cris/include/asm/barrier.h       | 13 +++++++++++
>  arch/frv/include/asm/barrier.h        | 13 +++++++++++
>  arch/h8300/include/asm/barrier.h      | 13 +++++++++++
>  arch/hexagon/include/asm/barrier.h    | 13 +++++++++++
>  arch/ia64/include/asm/barrier.h       | 43 +++++++++++++++++++++++++++++++++++
>  arch/m32r/include/asm/barrier.h       | 13 +++++++++++
>  arch/m68k/include/asm/barrier.h       | 13 +++++++++++
>  arch/metag/include/asm/barrier.h      | 13 +++++++++++
>  arch/microblaze/include/asm/barrier.h | 13 +++++++++++
>  arch/mips/include/asm/barrier.h       | 13 +++++++++++
>  arch/mn10300/include/asm/barrier.h    | 13 +++++++++++
>  arch/parisc/include/asm/barrier.h     | 13 +++++++++++
>  arch/powerpc/include/asm/barrier.h    | 15 ++++++++++++
>  arch/s390/include/asm/barrier.h       | 13 +++++++++++
>  arch/score/include/asm/barrier.h      | 13 +++++++++++
>  arch/sh/include/asm/barrier.h         | 13 +++++++++++
>  arch/sparc/include/asm/barrier_32.h   | 13 +++++++++++
>  arch/sparc/include/asm/barrier_64.h   | 13 +++++++++++
>  arch/tile/include/asm/barrier.h       | 13 +++++++++++
>  arch/unicore32/include/asm/barrier.h  | 13 +++++++++++
>  arch/x86/include/asm/barrier.h        | 13 +++++++++++
>  arch/xtensa/include/asm/barrier.h     | 13 +++++++++++
>  28 files changed, 423 insertions(+)
> 
> diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
> index ce8860a0b32d..464139feee97 100644
> --- a/arch/alpha/include/asm/barrier.h
> +++ b/arch/alpha/include/asm/barrier.h
> @@ -29,6 +29,19 @@ __asm__ __volatile__("mb": : :"memory")
>  #define smp_read_barrier_depends()	do { } while (0)
>  #endif
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

Yep, no real alternative to smp_mb() here.

>  #define set_mb(var, value) \
>  do { var = value; mb(); } while (0)
> 
> diff --git a/arch/arc/include/asm/barrier.h b/arch/arc/include/asm/barrier.h
> index f6cb7c4ffb35..a779da846fb5 100644
> --- a/arch/arc/include/asm/barrier.h
> +++ b/arch/arc/include/asm/barrier.h
> @@ -30,6 +30,19 @@
>  #define smp_wmb()       barrier()
>  #endif
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

Appears to be !SMP, so OK.

>  #define smp_mb__before_atomic_dec()	barrier()
>  #define smp_mb__after_atomic_dec()	barrier()
>  #define smp_mb__before_atomic_inc()	barrier()
> diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> index 60f15e274e6d..a804093d6891 100644
> --- a/arch/arm/include/asm/barrier.h
> +++ b/arch/arm/include/asm/barrier.h
> @@ -53,10 +53,36 @@
>  #define smp_mb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
>  #else
>  #define smp_mb()	dmb(ish)
>  #define smp_rmb()	smp_mb()
>  #define smp_wmb()	dmb(ishst)
> +

Seems like there should be some sort of #ifdef condition to distinguish
between these.  My guess is something like:

#if __LINUX_ARM_ARCH__ > 7

But I must defer to the ARM guys.  For all I know, they might prefer
arch/arm to stick with smp_mb() and have arch/arm64 do the ldar and stlr.

> +#define smp_store_release(p, v)						\
> +do {									\
> +	asm volatile ("stlr %w0 [%1]" : : "r" (v), "r" (&p) : "memory");\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1;						\
> +	asm volatile ("ldar %w0, [%1]"					\
> +			: "=r" (___p1) : "r" (&p) : "memory");		\
> +	return ___p1;							\
> +} while (0)
>  #endif
> 
>  #define read_barrier_depends()		do { } while(0)
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index d4a63338a53c..0da2d4ebb9a8 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -35,10 +35,38 @@
>  #define smp_mb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #else
> +
>  #define smp_mb()	asm volatile("dmb ish" : : : "memory")
>  #define smp_rmb()	asm volatile("dmb ishld" : : : "memory")
>  #define smp_wmb()	asm volatile("dmb ishst" : : : "memory")
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	asm volatile ("stlr %w0 [%1]" : : "r" (v), "r" (&p) : "memory");\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1;						\
> +	asm volatile ("ldar %w0, [%1]"					\
> +			: "=r" (___p1) : "r" (&p) : "memory");		\
> +	return ___p1;							\
> +} while (0)
>  #endif

Ditto on the instruction format.  The closest thing I see in the kernel
is "stlr %w1, %0" in arch_write_unlock() and arch_spin_unlock().

> 
>  #define read_barrier_depends()		do { } while(0)
> diff --git a/arch/avr32/include/asm/barrier.h b/arch/avr32/include/asm/barrier.h
> index 0961275373db..a0c48ad684f8 100644
> --- a/arch/avr32/include/asm/barrier.h
> +++ b/arch/avr32/include/asm/barrier.h
> @@ -25,5 +25,17 @@
>  # define smp_read_barrier_depends() do { } while(0)
>  #endif
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)

!SMP, so should be OK.

> 
>  #endif /* __ASM_AVR32_BARRIER_H */
> diff --git a/arch/blackfin/include/asm/barrier.h b/arch/blackfin/include/asm/barrier.h
> index ebb189507dd7..67889d9225d9 100644
> --- a/arch/blackfin/include/asm/barrier.h
> +++ b/arch/blackfin/include/asm/barrier.h
> @@ -45,4 +45,17 @@
>  #define set_mb(var, value) do { var = value; mb(); } while (0)
>  #define smp_read_barrier_depends()	read_barrier_depends()
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

Ditto.

>  #endif /* _BLACKFIN_BARRIER_H */
> diff --git a/arch/cris/include/asm/barrier.h b/arch/cris/include/asm/barrier.h
> index 198ad7fa6b25..34243dc44ef1 100644
> --- a/arch/cris/include/asm/barrier.h
> +++ b/arch/cris/include/asm/barrier.h
> @@ -22,4 +22,17 @@
>  #define smp_read_barrier_depends()     do { } while(0)
>  #endif
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

Ditto.

>  #endif /* __ASM_CRIS_BARRIER_H */
> diff --git a/arch/frv/include/asm/barrier.h b/arch/frv/include/asm/barrier.h
> index 06776ad9f5e9..92f89934d4ed 100644
> --- a/arch/frv/include/asm/barrier.h
> +++ b/arch/frv/include/asm/barrier.h
> @@ -26,4 +26,17 @@
>  #define set_mb(var, value) \
>  	do { var = (value); barrier(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

Ditto.

>  #endif /* _ASM_BARRIER_H */
> diff --git a/arch/h8300/include/asm/barrier.h b/arch/h8300/include/asm/barrier.h
> index 9e0aa9fc195d..516e9d379e25 100644
> --- a/arch/h8300/include/asm/barrier.h
> +++ b/arch/h8300/include/asm/barrier.h
> @@ -26,4 +26,17 @@
>  #define smp_read_barrier_depends()	do { } while(0)
>  #endif
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

And ditto again...

>  #endif /* _H8300_BARRIER_H */
> diff --git a/arch/hexagon/include/asm/barrier.h b/arch/hexagon/include/asm/barrier.h
> index 1041a8e70ce8..838a2ebe07a5 100644
> --- a/arch/hexagon/include/asm/barrier.h
> +++ b/arch/hexagon/include/asm/barrier.h
> @@ -38,4 +38,17 @@
>  #define set_mb(var, value) \
>  	do { var = value; mb(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

And again...

>  #endif /* _ASM_BARRIER_H */
> diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
> index 60576e06b6fb..4598d390fabb 100644
> --- a/arch/ia64/include/asm/barrier.h
> +++ b/arch/ia64/include/asm/barrier.h
> @@ -45,11 +45,54 @@
>  # define smp_rmb()	rmb()
>  # define smp_wmb()	wmb()
>  # define smp_read_barrier_depends()	read_barrier_depends()
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	switch (sizeof(p)) {						\
> +	case 4:								\
> +		asm volatile ("st4.acq [%0]=%1" 			\

This should be "st4.rel".

> +				:: "r" (&p), "r" (v) : "memory");	\
> +		break;							\
> +	case 8:								\
> +		asm volatile ("st8.acq [%0]=%1" 			\

And this should be "st8.rel"

> +				:: "r" (&p), "r" (v) : "memory"); 	\
> +		break;							\
> +	}								\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1;						\
> +	switch (sizeof(p)) {						\
> +	case 4:								\
> +		asm volatile ("ld4.rel %0=[%1]" 			\

And this should be "ld4.acq".

> +				: "=r"(___p1) : "r" (&p) : "memory"); 	\
> +		break;							\
> +	case 8:								\
> +		asm volatile ("ld8.rel %0=[%1]" 			\

And this should be "ld8.acq".

> +				: "=r"(___p1) : "r" (&p) : "memory"); 	\
> +		break;							\
> +	}								\
> +	return ___p1;							\
> +} while (0)

It appears that sizes 2 and 1 are also available, but 4 and 8 seem like
good places to start.
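
Should the smaller sizes ever be needed, the obvious extension would be
something along these lines for the store side (untested sketch, using the
corrected .rel completer and assuming st1.rel/st2.rel behave like their
4- and 8-byte counterparts), plus matching ld1.acq/ld2.acq cases on the
load side:

	case 1:								\
		asm volatile ("st1.rel [%0]=%1"				\
				:: "r" (&p), "r" (v) : "memory");	\
		break;							\
	case 2:								\
		asm volatile ("st2.rel [%0]=%1"				\
				:: "r" (&p), "r" (v) : "memory");	\
		break;							\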

>  #else
>  # define smp_mb()	barrier()
>  # define smp_rmb()	barrier()
>  # define smp_wmb()	barrier()
>  # define smp_read_barrier_depends()	do { } while(0)
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
>  #endif
> 
>  /*
> diff --git a/arch/m32r/include/asm/barrier.h b/arch/m32r/include/asm/barrier.h
> index 6976621efd3f..e5d42bcf90c5 100644
> --- a/arch/m32r/include/asm/barrier.h
> +++ b/arch/m32r/include/asm/barrier.h
> @@ -91,4 +91,17 @@
>  #define set_mb(var, value) do { var = value; barrier(); } while (0)
>  #endif
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

Another !SMP architecture, so looks good.

>  #endif /* _ASM_M32R_BARRIER_H */
> diff --git a/arch/m68k/include/asm/barrier.h b/arch/m68k/include/asm/barrier.h
> index 445ce22c23cb..eeb9ecf713cc 100644
> --- a/arch/m68k/include/asm/barrier.h
> +++ b/arch/m68k/include/asm/barrier.h
> @@ -17,4 +17,17 @@
>  #define smp_wmb()	barrier()
>  #define smp_read_barrier_depends()	((void)0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

Ditto.

>  #endif /* _M68K_BARRIER_H */
> diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
> index c90bfc6bf648..d8e6f2e4a27c 100644
> --- a/arch/metag/include/asm/barrier.h
> +++ b/arch/metag/include/asm/barrier.h
> @@ -82,4 +82,17 @@ static inline void fence(void)
>  #define smp_read_barrier_depends()     do { } while (0)
>  #define set_mb(var, value) do { var = value; smp_mb(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

This one is a bit unusual, but use of smp_mb() should be safe.

>  #endif /* _ASM_METAG_BARRIER_H */
> diff --git a/arch/microblaze/include/asm/barrier.h b/arch/microblaze/include/asm/barrier.h
> index df5be3e87044..a890702061c9 100644
> --- a/arch/microblaze/include/asm/barrier.h
> +++ b/arch/microblaze/include/asm/barrier.h
> @@ -24,4 +24,17 @@
>  #define smp_rmb()		rmb()
>  #define smp_wmb()		wmb()
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

!SMP only, so good.

>  #endif /* _ASM_MICROBLAZE_BARRIER_H */
> diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
> index 314ab5532019..e59bcd051f36 100644
> --- a/arch/mips/include/asm/barrier.h
> +++ b/arch/mips/include/asm/barrier.h
> @@ -180,4 +180,17 @@
>  #define nudge_writes() mb()
>  #endif
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

Interesting variety here as well.  Again, smp_mb() should be safe.

>  #endif /* __ASM_BARRIER_H */
> diff --git a/arch/mn10300/include/asm/barrier.h b/arch/mn10300/include/asm/barrier.h
> index 2bd97a5c8af7..0e6a0608d4a1 100644
> --- a/arch/mn10300/include/asm/barrier.h
> +++ b/arch/mn10300/include/asm/barrier.h
> @@ -34,4 +34,17 @@
>  #define read_barrier_depends()		do {} while (0)
>  #define smp_read_barrier_depends()	do {} while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

!SMP, so good.

>  #endif /* _ASM_BARRIER_H */
> diff --git a/arch/parisc/include/asm/barrier.h b/arch/parisc/include/asm/barrier.h
> index e77d834aa803..f1145a8594a0 100644
> --- a/arch/parisc/include/asm/barrier.h
> +++ b/arch/parisc/include/asm/barrier.h
> @@ -32,4 +32,17 @@
> 
>  #define set_mb(var, value)		do { var = value; mb(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

Ditto.

>  #endif /* __PARISC_BARRIER_H */
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index ae782254e731..b5cc36791f42 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -65,4 +65,19 @@
>  #define data_barrier(x)	\
>  	asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
> 
> +/* use smp_rmb() as that is either lwsync or a barrier() depending on SMP */
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_rmb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_rmb();							\
> +	return ___p1;							\
> +} while (0)
> +

I think that this actually does work, strange though it does look: on SMP,
smp_rmb() is (at least) lwsync, which orders everything except prior stores
against later loads, and that store->load ordering is exactly what
acquire/release semantics don't require.

>  #endif /* _ASM_POWERPC_BARRIER_H */
> diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
> index 16760eeb79b0..e8989c40e11c 100644
> --- a/arch/s390/include/asm/barrier.h
> +++ b/arch/s390/include/asm/barrier.h
> @@ -32,4 +32,17 @@
> 
>  #define set_mb(var, value)		do { var = value; mb(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	barrier();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	barrier();							\
> +	return ___p1;							\
> +} while (0)
> +

I believe that this is OK as well, but must defer to the s390
maintainers.

>  #endif /* __ASM_BARRIER_H */
> diff --git a/arch/score/include/asm/barrier.h b/arch/score/include/asm/barrier.h
> index 0eacb6471e6d..5f101ef8ade9 100644
> --- a/arch/score/include/asm/barrier.h
> +++ b/arch/score/include/asm/barrier.h
> @@ -13,4 +13,17 @@
> 
>  #define set_mb(var, value) 		do {var = value; wmb(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

!SMP, so good.

>  #endif /* _ASM_SCORE_BARRIER_H */
> diff --git a/arch/sh/include/asm/barrier.h b/arch/sh/include/asm/barrier.h
> index 72c103dae300..611128c2f636 100644
> --- a/arch/sh/include/asm/barrier.h
> +++ b/arch/sh/include/asm/barrier.h
> @@ -51,4 +51,17 @@
> 
>  #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

Use of smp_mb() should be safe here.

>  #endif /* __ASM_SH_BARRIER_H */
> diff --git a/arch/sparc/include/asm/barrier_32.h b/arch/sparc/include/asm/barrier_32.h
> index c1b76654ee76..f47f9d51f326 100644
> --- a/arch/sparc/include/asm/barrier_32.h
> +++ b/arch/sparc/include/asm/barrier_32.h
> @@ -12,4 +12,17 @@
>  #define smp_wmb()	__asm__ __volatile__("":::"memory")
>  #define smp_read_barrier_depends()	do { } while(0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

The surrounding code looks to be set up for !SMP.  I -thought- that there
were SMP 32-bit SPARC systems, but either way, smp_mb() should be safe.

>  #endif /* !(__SPARC_BARRIER_H) */
> diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
> index 95d45986f908..77cbe6982ca0 100644
> --- a/arch/sparc/include/asm/barrier_64.h
> +++ b/arch/sparc/include/asm/barrier_64.h
> @@ -53,4 +53,17 @@ do {	__asm__ __volatile__("ba,pt	%%xcc, 1f\n\t" \
> 
>  #define smp_read_barrier_depends()	do { } while(0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	barrier();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	barrier();							\
> +	return ___p1;							\
> +} while (0)
> +

SPARC64 is TSO, so looks good.

>  #endif /* !(__SPARC64_BARRIER_H) */
> diff --git a/arch/tile/include/asm/barrier.h b/arch/tile/include/asm/barrier.h
> index a9a73da5865d..4d5330d4fd31 100644
> --- a/arch/tile/include/asm/barrier.h
> +++ b/arch/tile/include/asm/barrier.h
> @@ -140,5 +140,18 @@ mb_incoherent(void)
>  #define set_mb(var, value) \
>  	do { var = value; mb(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

The __mb_incoherent() in the surrounding code looks scary, but smp_mb()
should suffice here as well as elsewhere.

>  #endif /* !__ASSEMBLY__ */
>  #endif /* _ASM_TILE_BARRIER_H */
> diff --git a/arch/unicore32/include/asm/barrier.h b/arch/unicore32/include/asm/barrier.h
> index a6620e5336b6..5471ff6aae10 100644
> --- a/arch/unicore32/include/asm/barrier.h
> +++ b/arch/unicore32/include/asm/barrier.h
> @@ -25,4 +25,17 @@
> 
>  #define set_mb(var, value)		do { var = value; smp_mb(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

!SMP, so good.

>  #endif /* __UNICORE_BARRIER_H__ */
> diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
> index c6cd358a1eec..a7fd8201ab09 100644
> --- a/arch/x86/include/asm/barrier.h
> +++ b/arch/x86/include/asm/barrier.h
> @@ -100,6 +100,19 @@
>  #define set_mb(var, value) do { var = value; barrier(); } while (0)
>  #endif
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	barrier();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	barrier();							\
> +	return ___p1;							\
> +} while (0)
> +

TSO, so good.

>  /*
>   * Stop RDTSC speculation. This is needed when you need to use RDTSC
>   * (or get_cycles or vread that possibly accesses the TSC) in a defined
> diff --git a/arch/xtensa/include/asm/barrier.h b/arch/xtensa/include/asm/barrier.h
> index ef021677d536..703d511add49 100644
> --- a/arch/xtensa/include/asm/barrier.h
> +++ b/arch/xtensa/include/asm/barrier.h
> @@ -26,4 +26,17 @@
> 
>  #define set_mb(var, value)	do { var = value; mb(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p, v)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +

The use of smp_mb() should be safe, so good.  Looks like xtensa orders
reads, but not writes -- interesting...

>  #endif /* _XTENSA_SYSTEM_H */
> 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-04 19:18                                               ` Peter Zijlstra
@ 2013-11-04 20:54                                                 ` Paul E. McKenney
  0 siblings, 0 replies; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-04 20:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Linus Torvalds, Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On Mon, Nov 04, 2013 at 08:18:11PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 04, 2013 at 08:11:27PM +0100, Peter Zijlstra wrote:
> > +#define smp_load_acquire(p, v)						\
> 
> I R idiot!! :-)

OK, I did miss this one as well...  :-/

							Thanx, Paul

> ---
>  arch/alpha/include/asm/barrier.h      | 13 +++++++++++
>  arch/arc/include/asm/barrier.h        | 13 +++++++++++
>  arch/arm/include/asm/barrier.h        | 26 +++++++++++++++++++++
>  arch/arm64/include/asm/barrier.h      | 28 +++++++++++++++++++++++
>  arch/avr32/include/asm/barrier.h      | 12 ++++++++++
>  arch/blackfin/include/asm/barrier.h   | 13 +++++++++++
>  arch/cris/include/asm/barrier.h       | 13 +++++++++++
>  arch/frv/include/asm/barrier.h        | 13 +++++++++++
>  arch/h8300/include/asm/barrier.h      | 13 +++++++++++
>  arch/hexagon/include/asm/barrier.h    | 13 +++++++++++
>  arch/ia64/include/asm/barrier.h       | 43 +++++++++++++++++++++++++++++++++++
>  arch/m32r/include/asm/barrier.h       | 13 +++++++++++
>  arch/m68k/include/asm/barrier.h       | 13 +++++++++++
>  arch/metag/include/asm/barrier.h      | 13 +++++++++++
>  arch/microblaze/include/asm/barrier.h | 13 +++++++++++
>  arch/mips/include/asm/barrier.h       | 13 +++++++++++
>  arch/mn10300/include/asm/barrier.h    | 13 +++++++++++
>  arch/parisc/include/asm/barrier.h     | 13 +++++++++++
>  arch/powerpc/include/asm/barrier.h    | 15 ++++++++++++
>  arch/s390/include/asm/barrier.h       | 13 +++++++++++
>  arch/score/include/asm/barrier.h      | 13 +++++++++++
>  arch/sh/include/asm/barrier.h         | 13 +++++++++++
>  arch/sparc/include/asm/barrier_32.h   | 13 +++++++++++
>  arch/sparc/include/asm/barrier_64.h   | 13 +++++++++++
>  arch/tile/include/asm/barrier.h       | 13 +++++++++++
>  arch/unicore32/include/asm/barrier.h  | 13 +++++++++++
>  arch/x86/include/asm/barrier.h        | 13 +++++++++++
>  arch/xtensa/include/asm/barrier.h     | 13 +++++++++++
>  28 files changed, 423 insertions(+)
> 
> diff --git a/arch/alpha/include/asm/barrier.h b/arch/alpha/include/asm/barrier.h
> index ce8860a0b32d..464139feee97 100644
> --- a/arch/alpha/include/asm/barrier.h
> +++ b/arch/alpha/include/asm/barrier.h
> @@ -29,6 +29,19 @@ __asm__ __volatile__("mb": : :"memory")
>  #define smp_read_barrier_depends()	do { } while (0)
>  #endif
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #define set_mb(var, value) \
>  do { var = value; mb(); } while (0)
> 
> diff --git a/arch/arc/include/asm/barrier.h b/arch/arc/include/asm/barrier.h
> index f6cb7c4ffb35..a779da846fb5 100644
> --- a/arch/arc/include/asm/barrier.h
> +++ b/arch/arc/include/asm/barrier.h
> @@ -30,6 +30,19 @@
>  #define smp_wmb()       barrier()
>  #endif
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #define smp_mb__before_atomic_dec()	barrier()
>  #define smp_mb__after_atomic_dec()	barrier()
>  #define smp_mb__before_atomic_inc()	barrier()
> diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> index 60f15e274e6d..4ada4720bdeb 100644
> --- a/arch/arm/include/asm/barrier.h
> +++ b/arch/arm/include/asm/barrier.h
> @@ -53,10 +53,36 @@
>  #define smp_mb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
>  #else
>  #define smp_mb()	dmb(ish)
>  #define smp_rmb()	smp_mb()
>  #define smp_wmb()	dmb(ishst)
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	asm volatile ("stlr %w0 [%1]" : : "r" (v), "r" (&p) : "memory");\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1;						\
> +	asm volatile ("ldar %w0, [%1]"					\
> +			: "=r" (___p1) : "r" (&p) : "memory");		\
> +	return ___p1;							\
> +} while (0)
>  #endif
> 
>  #define read_barrier_depends()		do { } while(0)
> diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> index d4a63338a53c..3dfddc0416f6 100644
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -35,10 +35,38 @@
>  #define smp_mb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #else
> +
>  #define smp_mb()	asm volatile("dmb ish" : : : "memory")
>  #define smp_rmb()	asm volatile("dmb ishld" : : : "memory")
>  #define smp_wmb()	asm volatile("dmb ishst" : : : "memory")
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	asm volatile ("stlr %w0 [%1]" : : "r" (v), "r" (&p) : "memory");\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1;						\
> +	asm volatile ("ldar %w0, [%1]" 					\
> +			: "=r" (___p1) : "r" (&p) : "memory"); 		\
> +	return ___p1;							\
> +} while (0)
>  #endif
> 
>  #define read_barrier_depends()		do { } while(0)
> diff --git a/arch/avr32/include/asm/barrier.h b/arch/avr32/include/asm/barrier.h
> index 0961275373db..8fd164648e71 100644
> --- a/arch/avr32/include/asm/barrier.h
> +++ b/arch/avr32/include/asm/barrier.h
> @@ -25,5 +25,17 @@
>  # define smp_read_barrier_depends() do { } while(0)
>  #endif
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> 
>  #endif /* __ASM_AVR32_BARRIER_H */
> diff --git a/arch/blackfin/include/asm/barrier.h b/arch/blackfin/include/asm/barrier.h
> index ebb189507dd7..c8b85bba843f 100644
> --- a/arch/blackfin/include/asm/barrier.h
> +++ b/arch/blackfin/include/asm/barrier.h
> @@ -45,4 +45,17 @@
>  #define set_mb(var, value) do { var = value; mb(); } while (0)
>  #define smp_read_barrier_depends()	read_barrier_depends()
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* _BLACKFIN_BARRIER_H */
> diff --git a/arch/cris/include/asm/barrier.h b/arch/cris/include/asm/barrier.h
> index 198ad7fa6b25..26f21f5d1d15 100644
> --- a/arch/cris/include/asm/barrier.h
> +++ b/arch/cris/include/asm/barrier.h
> @@ -22,4 +22,17 @@
>  #define smp_read_barrier_depends()     do { } while(0)
>  #endif
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* __ASM_CRIS_BARRIER_H */
> diff --git a/arch/frv/include/asm/barrier.h b/arch/frv/include/asm/barrier.h
> index 06776ad9f5e9..4569028382fa 100644
> --- a/arch/frv/include/asm/barrier.h
> +++ b/arch/frv/include/asm/barrier.h
> @@ -26,4 +26,17 @@
>  #define set_mb(var, value) \
>  	do { var = (value); barrier(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* _ASM_BARRIER_H */
> diff --git a/arch/h8300/include/asm/barrier.h b/arch/h8300/include/asm/barrier.h
> index 9e0aa9fc195d..45d36738814d 100644
> --- a/arch/h8300/include/asm/barrier.h
> +++ b/arch/h8300/include/asm/barrier.h
> @@ -26,4 +26,17 @@
>  #define smp_read_barrier_depends()	do { } while(0)
>  #endif
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* _H8300_BARRIER_H */
> diff --git a/arch/hexagon/include/asm/barrier.h b/arch/hexagon/include/asm/barrier.h
> index 1041a8e70ce8..d88d54bd2e6e 100644
> --- a/arch/hexagon/include/asm/barrier.h
> +++ b/arch/hexagon/include/asm/barrier.h
> @@ -38,4 +38,17 @@
>  #define set_mb(var, value) \
>  	do { var = value; mb(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* _ASM_BARRIER_H */
> diff --git a/arch/ia64/include/asm/barrier.h b/arch/ia64/include/asm/barrier.h
> index 60576e06b6fb..b7f1a8aa03af 100644
> --- a/arch/ia64/include/asm/barrier.h
> +++ b/arch/ia64/include/asm/barrier.h
> @@ -45,11 +45,54 @@
>  # define smp_rmb()	rmb()
>  # define smp_wmb()	wmb()
>  # define smp_read_barrier_depends()	read_barrier_depends()
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	switch (sizeof(p)) {						\
> +	case 4:								\
> +		asm volatile ("st4.acq [%0]=%1"				\
> +				:: "r" (&p), "r" (v) : "memory");	\
> +		break;							\
> +	case 8:								\
> +		asm volatile ("st8.acq [%0]=%1"				\
> +				:: "r" (&p), "r" (v) : "memory");	\
> +		break;							\
> +	}								\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1;						\
> +	switch (sizeof(p)) {						\
> +	case 4:								\
> +		asm volatile ("ld4.rel %0=[%1]"				\
> +				: "=r"(___p1) : "r" (&p) : "memory");	\
> +		break;							\
> +	case 8:								\
> +		asm volatile ("ld8.rel %0=[%1]"				\
> +				: "=r"(___p1) : "r" (&p) : "memory");	\
> +		break;							\
> +	}								\
> +	return ___p1;							\
> +} while (0)
>  #else
>  # define smp_mb()	barrier()
>  # define smp_rmb()	barrier()
>  # define smp_wmb()	barrier()
>  # define smp_read_barrier_depends()	do { } while(0)
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
>  #endif
> 
>  /*
> diff --git a/arch/m32r/include/asm/barrier.h b/arch/m32r/include/asm/barrier.h
> index 6976621efd3f..d78612289cb2 100644
> --- a/arch/m32r/include/asm/barrier.h
> +++ b/arch/m32r/include/asm/barrier.h
> @@ -91,4 +91,17 @@
>  #define set_mb(var, value) do { var = value; barrier(); } while (0)
>  #endif
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* _ASM_M32R_BARRIER_H */
> diff --git a/arch/m68k/include/asm/barrier.h b/arch/m68k/include/asm/barrier.h
> index 445ce22c23cb..1e63b11c424c 100644
> --- a/arch/m68k/include/asm/barrier.h
> +++ b/arch/m68k/include/asm/barrier.h
> @@ -17,4 +17,17 @@
>  #define smp_wmb()	barrier()
>  #define smp_read_barrier_depends()	((void)0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* _M68K_BARRIER_H */
> diff --git a/arch/metag/include/asm/barrier.h b/arch/metag/include/asm/barrier.h
> index c90bfc6bf648..9ffd0b167f07 100644
> --- a/arch/metag/include/asm/barrier.h
> +++ b/arch/metag/include/asm/barrier.h
> @@ -82,4 +82,17 @@ static inline void fence(void)
>  #define smp_read_barrier_depends()     do { } while (0)
>  #define set_mb(var, value) do { var = value; smp_mb(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* _ASM_METAG_BARRIER_H */
> diff --git a/arch/microblaze/include/asm/barrier.h b/arch/microblaze/include/asm/barrier.h
> index df5be3e87044..db0b5e205ce3 100644
> --- a/arch/microblaze/include/asm/barrier.h
> +++ b/arch/microblaze/include/asm/barrier.h
> @@ -24,4 +24,17 @@
>  #define smp_rmb()		rmb()
>  #define smp_wmb()		wmb()
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* _ASM_MICROBLAZE_BARRIER_H */
> diff --git a/arch/mips/include/asm/barrier.h b/arch/mips/include/asm/barrier.h
> index 314ab5532019..8031afcc7f64 100644
> --- a/arch/mips/include/asm/barrier.h
> +++ b/arch/mips/include/asm/barrier.h
> @@ -180,4 +180,17 @@
>  #define nudge_writes() mb()
>  #endif
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* __ASM_BARRIER_H */
> diff --git a/arch/mn10300/include/asm/barrier.h b/arch/mn10300/include/asm/barrier.h
> index 2bd97a5c8af7..e822ff76f498 100644
> --- a/arch/mn10300/include/asm/barrier.h
> +++ b/arch/mn10300/include/asm/barrier.h
> @@ -34,4 +34,17 @@
>  #define read_barrier_depends()		do {} while (0)
>  #define smp_read_barrier_depends()	do {} while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* _ASM_BARRIER_H */
> diff --git a/arch/parisc/include/asm/barrier.h b/arch/parisc/include/asm/barrier.h
> index e77d834aa803..58757747f873 100644
> --- a/arch/parisc/include/asm/barrier.h
> +++ b/arch/parisc/include/asm/barrier.h
> @@ -32,4 +32,17 @@
> 
>  #define set_mb(var, value)		do { var = value; mb(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* __PARISC_BARRIER_H */
> diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
> index ae782254e731..54922626b356 100644
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -65,4 +65,19 @@
>  #define data_barrier(x)	\
>  	asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
> 
> +/* use smp_rmb() as that is either lwsync or a barrier() depending on SMP */
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_rmb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_rmb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* _ASM_POWERPC_BARRIER_H */
> diff --git a/arch/s390/include/asm/barrier.h b/arch/s390/include/asm/barrier.h
> index 16760eeb79b0..babf928649a4 100644
> --- a/arch/s390/include/asm/barrier.h
> +++ b/arch/s390/include/asm/barrier.h
> @@ -32,4 +32,17 @@
> 
>  #define set_mb(var, value)		do { var = value; mb(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	barrier();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	barrier();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* __ASM_BARRIER_H */
> diff --git a/arch/score/include/asm/barrier.h b/arch/score/include/asm/barrier.h
> index 0eacb6471e6d..5905ea57a104 100644
> --- a/arch/score/include/asm/barrier.h
> +++ b/arch/score/include/asm/barrier.h
> @@ -13,4 +13,17 @@
> 
>  #define set_mb(var, value) 		do {var = value; wmb(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* _ASM_SCORE_BARRIER_H */
> diff --git a/arch/sh/include/asm/barrier.h b/arch/sh/include/asm/barrier.h
> index 72c103dae300..379f500023b6 100644
> --- a/arch/sh/include/asm/barrier.h
> +++ b/arch/sh/include/asm/barrier.h
> @@ -51,4 +51,17 @@
> 
>  #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* __ASM_SH_BARRIER_H */
> diff --git a/arch/sparc/include/asm/barrier_32.h b/arch/sparc/include/asm/barrier_32.h
> index c1b76654ee76..1649081d1b86 100644
> --- a/arch/sparc/include/asm/barrier_32.h
> +++ b/arch/sparc/include/asm/barrier_32.h
> @@ -12,4 +12,17 @@
>  #define smp_wmb()	__asm__ __volatile__("":::"memory")
>  #define smp_read_barrier_depends()	do { } while(0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* !(__SPARC_BARRIER_H) */
> diff --git a/arch/sparc/include/asm/barrier_64.h b/arch/sparc/include/asm/barrier_64.h
> index 95d45986f908..5e23ced0a29a 100644
> --- a/arch/sparc/include/asm/barrier_64.h
> +++ b/arch/sparc/include/asm/barrier_64.h
> @@ -53,4 +53,17 @@ do {	__asm__ __volatile__("ba,pt	%%xcc, 1f\n\t" \
> 
>  #define smp_read_barrier_depends()	do { } while(0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	barrier();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	barrier();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* !(__SPARC64_BARRIER_H) */
> diff --git a/arch/tile/include/asm/barrier.h b/arch/tile/include/asm/barrier.h
> index a9a73da5865d..1f08318db3c0 100644
> --- a/arch/tile/include/asm/barrier.h
> +++ b/arch/tile/include/asm/barrier.h
> @@ -140,5 +140,18 @@ mb_incoherent(void)
>  #define set_mb(var, value) \
>  	do { var = value; mb(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* !__ASSEMBLY__ */
>  #endif /* _ASM_TILE_BARRIER_H */
> diff --git a/arch/unicore32/include/asm/barrier.h b/arch/unicore32/include/asm/barrier.h
> index a6620e5336b6..fa8bf69d9a09 100644
> --- a/arch/unicore32/include/asm/barrier.h
> +++ b/arch/unicore32/include/asm/barrier.h
> @@ -25,4 +25,17 @@
> 
>  #define set_mb(var, value)		do { var = value; smp_mb(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* __UNICORE_BARRIER_H__ */
> diff --git a/arch/x86/include/asm/barrier.h b/arch/x86/include/asm/barrier.h
> index c6cd358a1eec..115ef72b3784 100644
> --- a/arch/x86/include/asm/barrier.h
> +++ b/arch/x86/include/asm/barrier.h
> @@ -100,6 +100,19 @@
>  #define set_mb(var, value) do { var = value; barrier(); } while (0)
>  #endif
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	barrier();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	barrier();							\
> +	return ___p1;							\
> +} while (0)
> +
>  /*
>   * Stop RDTSC speculation. This is needed when you need to use RDTSC
>   * (or get_cycles or vread that possibly accesses the TSC) in a defined
> diff --git a/arch/xtensa/include/asm/barrier.h b/arch/xtensa/include/asm/barrier.h
> index ef021677d536..e96a674c337a 100644
> --- a/arch/xtensa/include/asm/barrier.h
> +++ b/arch/xtensa/include/asm/barrier.h
> @@ -26,4 +26,17 @@
> 
>  #define set_mb(var, value)	do { var = value; mb(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	smp_mb();							\
> +	ACCESS_ONCE(p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +do {									\
> +	typeof(p) ___p1 = ACCESS_ONCE(p);				\
> +	smp_mb();							\
> +	return ___p1;							\
> +} while (0)
> +
>  #endif /* _XTENSA_SYSTEM_H */
> 


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-04 20:53                                               ` Paul E. McKenney
@ 2013-11-05 14:05                                                 ` Will Deacon
  2013-11-05 14:49                                                   ` Paul E. McKenney
  2013-11-05 18:49                                                   ` Peter Zijlstra
  2013-11-06 12:39                                                 ` Peter Zijlstra
  1 sibling, 2 replies; 120+ messages in thread
From: Will Deacon @ 2013-11-05 14:05 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Linus Torvalds, Victor Kaplansky, Oleg Nesterov,
	Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, linux, schwidefsky, heiko.carstens

On Mon, Nov 04, 2013 at 08:53:44PM +0000, Paul E. McKenney wrote:
> On Mon, Nov 04, 2013 at 08:11:27PM +0100, Peter Zijlstra wrote:
> Some comments below.  I believe that opcodes need to be fixed for IA64.
> I am unsure of the ifdefs and opcodes for arm64, but the ARM folks should
> be able to tell us.

[...]

> > diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> > index 60f15e274e6d..a804093d6891 100644
> > --- a/arch/arm/include/asm/barrier.h
> > +++ b/arch/arm/include/asm/barrier.h
> > @@ -53,10 +53,36 @@
> >  #define smp_mb()     barrier()
> >  #define smp_rmb()    barrier()
> >  #define smp_wmb()    barrier()
> > +
> > +#define smp_store_release(p, v)                                              \
> > +do {                                                                 \
> > +     smp_mb();                                                       \
> > +     ACCESS_ONCE(p) = (v);                                           \
> > +} while (0)
> > +
> > +#define smp_load_acquire(p, v)                                               \
> > +do {                                                                 \
> > +     typeof(p) ___p1 = ACCESS_ONCE(p);                               \
> > +     smp_mb();                                                       \
> > +     return ___p1;                                                   \
> > +} while (0)

What data sizes do these accessors operate on? Assuming that we want
single-copy atomicity (with respect to interrupts in the UP case), we
probably want a check to stop people passing in things like structs.

> >  #else
> >  #define smp_mb()     dmb(ish)
> >  #define smp_rmb()    smp_mb()
> >  #define smp_wmb()    dmb(ishst)
> > +
> 
> Seems like there should be some sort of #ifdef condition to distinguish
> between these.  My guess is something like:
> 
> #if __LINUX_ARM_ARCH__ > 7
> 
> But I must defer to the ARM guys.  For all I know, they might prefer
> arch/arm to stick with smp_mb() and have arch/arm64 do the ldar and stlr.

Yes. For arch/arm/, I'd rather we stick with the smp_mb() for the time
being. We don't (yet) have any 32-bit ARMv8 support, and the efforts towards
a single zImage could do without minor variations like this, not to mention
the usual backlash I get whenever introducing something that needs a
relatively recent binutils.

> > +#define smp_store_release(p, v)                                              \
> > +do {                                                                 \
> > +     asm volatile ("stlr %w0 [%1]" : : "r" (v), "r" (&p) : "memory");\
> > +} while (0)
> > +
> > +#define smp_load_acquire(p)                                          \
> > +do {                                                                 \
> > +     typeof(p) ___p1;                                                \
> > +     asm volatile ("ldar %w0, [%1]"                                  \
> > +                     : "=r" (___p1) : "r" (&p) : "memory");          \
> > +     return ___p1;                                                   \
> > +} while (0)
> >  #endif
> >
> >  #define read_barrier_depends()               do { } while(0)
> > diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
> > index d4a63338a53c..0da2d4ebb9a8 100644
> > --- a/arch/arm64/include/asm/barrier.h
> > +++ b/arch/arm64/include/asm/barrier.h
> > @@ -35,10 +35,38 @@
> >  #define smp_mb()     barrier()
> >  #define smp_rmb()    barrier()
> >  #define smp_wmb()    barrier()
> > +
> > +#define smp_store_release(p, v)                                              \
> > +do {                                                                 \
> > +     smp_mb();                                                       \
> > +     ACCESS_ONCE(p) = (v);                                           \
> > +} while (0)
> > +
> > +#define smp_load_acquire(p, v)                                               \
> > +do {                                                                 \
> > +     typeof(p) ___p1 = ACCESS_ONCE(p);                               \
> > +     smp_mb();                                                       \
> > +     return ___p1;                                                   \
> > +} while (0)
> > +
> >  #else
> > +
> >  #define smp_mb()     asm volatile("dmb ish" : : : "memory")
> >  #define smp_rmb()    asm volatile("dmb ishld" : : : "memory")
> >  #define smp_wmb()    asm volatile("dmb ishst" : : : "memory")
> > +
> > +#define smp_store_release(p, v)                                              \
> > +do {                                                                 \
> > +     asm volatile ("stlr %w0 [%1]" : : "r" (v), "r" (&p) : "memory");\

Missing comma between the operands. Also, that 'w' output modifier enforces
a 32-bit store (same early question about sizes). Finally, it might be more
efficient to use "=Q" for the addressing mode, rather than take the address
of p manually.

> > +} while (0)
> > +
> > +#define smp_load_acquire(p)                                          \
> > +do {                                                                 \
> > +     typeof(p) ___p1;                                                \
> > +     asm volatile ("ldar %w0, [%1]"                                  \
> > +                     : "=r" (___p1) : "r" (&p) : "memory");          \
> > +     return ___p1;                                                   \

Similar comments here wrt Q constraint.

Random other question: have you considered how these accessors should behave
when presented with __iomem pointers?

Will

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-05 14:05                                                 ` Will Deacon
@ 2013-11-05 14:49                                                   ` Paul E. McKenney
  2013-11-05 18:49                                                   ` Peter Zijlstra
  1 sibling, 0 replies; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-05 14:49 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Linus Torvalds, Victor Kaplansky, Oleg Nesterov,
	Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, linux, schwidefsky, heiko.carstens

On Tue, Nov 05, 2013 at 02:05:48PM +0000, Will Deacon wrote:
> On Mon, Nov 04, 2013 at 08:53:44PM +0000, Paul E. McKenney wrote:
> > On Mon, Nov 04, 2013 at 08:11:27PM +0100, Peter Zijlstra wrote:
> > Some comments below.  I believe that opcodes need to be fixed for IA64.
> > I am unsure of the ifdefs and opcodes for arm64, but the ARM folks should
> > be able to tell us.

[ . . . ]

> > > +} while (0)
> > > +
> > > +#define smp_load_acquire(p)                                          \
> > > +do {                                                                 \
> > > +     typeof(p) ___p1;                                                \
> > > +     asm volatile ("ldar %w0, [%1]"                                  \
> > > +                     : "=r" (___p1) : "r" (&p) : "memory");          \
> > > +     return ___p1;                                                   \
> 
> Similar comments here wrt Q constraint.
> 
> Random other question: have you considered how these accessors should behave
> when presented with __iomem pointers?

Should we have something to make sparse yell if not __kernel or some such?

								Thanx, Paul


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-05 14:05                                                 ` Will Deacon
  2013-11-05 14:49                                                   ` Paul E. McKenney
@ 2013-11-05 18:49                                                   ` Peter Zijlstra
  2013-11-06 11:00                                                     ` Will Deacon
  1 sibling, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-05 18:49 UTC (permalink / raw)
  To: Will Deacon
  Cc: Paul E. McKenney, Linus Torvalds, Victor Kaplansky,
	Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, linux, schwidefsky,
	heiko.carstens

On Tue, Nov 05, 2013 at 02:05:48PM +0000, Will Deacon wrote:
> > > +
> > > +#define smp_store_release(p, v)                                              \
> > > +do {                                                                 \
> > > +     smp_mb();                                                       \
> > > +     ACCESS_ONCE(p) = (v);                                           \
> > > +} while (0)
> > > +
> > > +#define smp_load_acquire(p, v)                                               \
> > > +do {                                                                 \
> > > +     typeof(p) ___p1 = ACCESS_ONCE(p);                               \
> > > +     smp_mb();                                                       \
> > > +     return ___p1;                                                   \
> > > +} while (0)
> 
> What data sizes do these accessors operate on? Assuming that we want
> single-copy atomicity (with respect to interrupts in the UP case), we
> probably want a check to stop people passing in things like structs.

Fair enough; I think we should restrict to native word sizes same as we
do for atomics.

Something like so perhaps:

#ifdef CONFIG_64BIT
#define __check_native_word(t)	(sizeof(t) == 4 || sizeof(t) == 8)
#else
#define __check_native_word(t)	(sizeof(t) == 4)
#endif

#define smp_store_release(p, v) 		\
do {						\
	BUILD_BUG_ON(!__check_native_word(p));	\
	smp_mb();				\
	ACCESS_ONCE(p) = (v);			\
} while (0)
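
Presumably the load side wants the same check; also note a do { } while (0)
can't actually return a value, so it probably needs to be a statement
expression instead -- something like this (untested):

#define smp_load_acquire(p)				\
({							\
	typeof(p) ___p1;				\
	BUILD_BUG_ON(!__check_native_word(p));		\
	___p1 = ACCESS_ONCE(p);				\
	smp_mb();					\
	___p1;						\
})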

> > > +#define smp_store_release(p, v)                                              \
> > > +do {                                                                 \
> > > +     asm volatile ("stlr %w0 [%1]" : : "r" (v), "r" (&p) : "memory");\
> 
> Missing comma between the operands. Also, that 'w' output modifier enforces
> a 32-bit store (same early question about sizes). Finally, it might be more
> efficient to use "=Q" for the addressing mode, rather than take the address
> of p manually.

so something like:

	asm volatile ("stlr %0, [%1]" : : "r" (v), "=Q" (p) : "memory");

?

My inline asm foo is horrid and I mostly get by with copy paste from a
semi similar existing form :/

> Random other question: have you considered how these accessors should behave
> when presented with __iomem pointers?

A what? ;-)

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-05 18:49                                                   ` Peter Zijlstra
@ 2013-11-06 11:00                                                     ` Will Deacon
  0 siblings, 0 replies; 120+ messages in thread
From: Will Deacon @ 2013-11-06 11:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Linus Torvalds, Victor Kaplansky,
	Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, linux, schwidefsky,
	heiko.carstens

On Tue, Nov 05, 2013 at 06:49:43PM +0000, Peter Zijlstra wrote:
> On Tue, Nov 05, 2013 at 02:05:48PM +0000, Will Deacon wrote:
> > > > +
> > > > +#define smp_store_release(p, v)                                              \
> > > > +do {                                                                 \
> > > > +     smp_mb();                                                       \
> > > > +     ACCESS_ONCE(p) = (v);                                           \
> > > > +} while (0)
> > > > +
> > > > +#define smp_load_acquire(p, v)                                               \
> > > > +do {                                                                 \
> > > > +     typeof(p) ___p1 = ACCESS_ONCE(p);                               \
> > > > +     smp_mb();                                                       \
> > > > +     return ___p1;                                                   \
> > > > +} while (0)
> > 
> > What data sizes do these accessors operate on? Assuming that we want
> > single-copy atomicity (with respect to interrupts in the UP case), we
> > probably want a check to stop people passing in things like structs.
> 
> Fair enough; I think we should restrict to native word sizes same as we
> do for atomics.
> 
> Something like so perhaps:
> 
> #ifdef CONFIG_64BIT
> #define __check_native_word(t)	(sizeof(t) == 4 || sizeof(t) == 8)

Ok, if we want to support 32-bit accesses on 64-bit machines, that will
complicate some of your assembly (more below).

> #else
> #define __check_native_word(t)	(sizeof(t) == 4)
> #endif
> 
> #define smp_store_release(p, v) 		\
> do {						\
> 	BUILD_BUG_ON(!__check_native_word(p));	\
> 	smp_mb();				\
> 	ACCESS_ONCE(p) = (v);			\
> } while (0)
> 
> > > > +#define smp_store_release(p, v)                                              \
> > > > +do {                                                                 \
> > > > +     asm volatile ("stlr %w0 [%1]" : : "r" (v), "r" (&p) : "memory");\
> > 
> > Missing comma between the operands. Also, that 'w' output modifier enforces
> > a 32-bit store (same early question about sizes). Finally, it might be more
> > efficient to use "=Q" for the addressing mode, rather than take the address
> > of p manually.
> 
> so something like:
> 
> 	asm volatile ("stlr %0, [%1]" : : "r" (v), "=Q" (p) : "memory");
> 
> ?
> 
> My inline asm foo is horrid and I mostly get by with copy paste from a
> semi similar existing form :/

Almost: you just need to drop the square brackets and make the memory
location an output operand:

	asm volatile("stlr %1, %0" : "=Q" (p) : "r" (v) : "memory");

however, for a 32-bit access, you need to use an output modifier:

	asm volatile("stlr %w1, %0" : "=Q" (p) : "r" (v) : "memory");

so I guess a switch on sizeof(p) is required.
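
Putting those together, I'd guess the store side ends up looking something
like this (completely untested sketch; the load side would want the same
switch, using ldar and the %w modifier for the 32-bit case):

#define smp_store_release(p, v)						\
do {									\
	switch (sizeof(p)) {						\
	case 4:								\
		asm volatile ("stlr %w1, %0"				\
				: "=Q" (p) : "r" (v) : "memory");	\
		break;							\
	case 8:								\
		asm volatile ("stlr %1, %0"				\
				: "=Q" (p) : "r" (v) : "memory");	\
		break;							\
	}								\
} while (0)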

> > Random other question: have you considered how these accessors should behave
> > when presented with __iomem pointers?
> 
> A what? ;-)

Then let's go with Paul's suggestion of mandating __kernel, or the like
(unless we need to worry about __user addresses for things like futexes?).

Will

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-04 20:53                                               ` Paul E. McKenney
  2013-11-05 14:05                                                 ` Will Deacon
@ 2013-11-06 12:39                                                 ` Peter Zijlstra
  2013-11-06 12:51                                                   ` Geert Uytterhoeven
  1 sibling, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-06 12:39 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Linus Torvalds, Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling, linux,
	schwidefsky, heiko.carstens, tony.luck


Subject: arch: Introduce smp_load_acquire(), smp_store_release()
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon, 4 Nov 2013 20:18:11 +0100

A number of situations currently require the heavyweight smp_mb(),
even though there is no need to order prior stores against later
loads.  Many architectures have much cheaper ways to handle these
situations, but the Linux kernel currently has no portable way
to make use of them.

This commit therefore supplies smp_load_acquire() and
smp_store_release() to remedy this situation.  The new
smp_load_acquire() primitive orders the specified load against
any subsequent reads or writes, while the new smp_store_release()
primitive orders the specified store against any prior reads or
writes.  These primitives allow array-based circular FIFOs to be
implemented without an smp_mb(), and also allow a theoretical
hole in rcu_assign_pointer() to be closed at no additional
expense on most architectures.

In addition, the RCU experience transitioning from explicit
smp_read_barrier_depends() and smp_wmb() to rcu_dereference()
and rcu_assign_pointer(), respectively, resulted in substantial
improvements in readability.  It therefore seems likely that
replacing other explicit barriers with smp_load_acquire() and
smp_store_release() will provide similar benefits.  It appears
that roughly half of the explicit barriers in core kernel code
might be so replaced.
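(As an illustration of the circular-FIFO case, not part of the patch itself: with a single producer and a single consumer, the pairing looks roughly like the sketch below.  The ring_buf structure and RING_SIZE are invented for the example.)

	#define RING_SIZE	256		/* must be a power of two */

	struct ring_buf {
		unsigned long	head;		/* written only by the producer */
		unsigned long	tail;		/* written only by the consumer */
		void		*slot[RING_SIZE];
	};

	/* producer */
	static int ring_push(struct ring_buf *rb, void *item)
	{
		unsigned long head = rb->head;
		/* pairs with the release store of ->tail in ring_pop() */
		unsigned long tail = smp_load_acquire(&rb->tail);

		if (head - tail >= RING_SIZE)
			return -1;			/* full */

		rb->slot[head & (RING_SIZE - 1)] = item;
		/* publish the slot: order the store above before the new head */
		smp_store_release(&rb->head, head + 1);
		return 0;
	}

	/* consumer */
	static void *ring_pop(struct ring_buf *rb)
	{
		unsigned long tail = rb->tail;
		/* pairs with the release store of ->head in ring_push() */
		unsigned long head = smp_load_acquire(&rb->head);
		void *item;

		if (tail == head)
			return NULL;			/* empty */

		item = rb->slot[tail & (RING_SIZE - 1)];
		/* free the slot: order the load above before the new tail */
		smp_store_release(&rb->tail, tail + 1);
		return item;
	}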

Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Michael Neuling <mikey@neuling.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Victor Kaplansky <VICTORK@il.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 Documentation/memory-barriers.txt     |  157 +++++++++++++++++-----------------
 arch/alpha/include/asm/barrier.h      |   15 +++
 arch/arc/include/asm/barrier.h        |   15 +++
 arch/arm/include/asm/barrier.h        |   15 +++
 arch/arm64/include/asm/barrier.h      |   50 ++++++++++
 arch/avr32/include/asm/barrier.h      |   14 +++
 arch/blackfin/include/asm/barrier.h   |   15 +++
 arch/cris/include/asm/barrier.h       |   15 +++
 arch/frv/include/asm/barrier.h        |   15 +++
 arch/h8300/include/asm/barrier.h      |   15 +++
 arch/hexagon/include/asm/barrier.h    |   15 +++
 arch/ia64/include/asm/barrier.h       |   49 ++++++++++
 arch/m32r/include/asm/barrier.h       |   15 +++
 arch/m68k/include/asm/barrier.h       |   15 +++
 arch/metag/include/asm/barrier.h      |   15 +++
 arch/microblaze/include/asm/barrier.h |   15 +++
 arch/mips/include/asm/barrier.h       |   15 +++
 arch/mn10300/include/asm/barrier.h    |   15 +++
 arch/parisc/include/asm/barrier.h     |   15 +++
 arch/powerpc/include/asm/barrier.h    |   21 ++++
 arch/s390/include/asm/barrier.h       |   15 +++
 arch/score/include/asm/barrier.h      |   15 +++
 arch/sh/include/asm/barrier.h         |   15 +++
 arch/sparc/include/asm/barrier_32.h   |   15 +++
 arch/sparc/include/asm/barrier_64.h   |   15 +++
 arch/tile/include/asm/barrier.h       |   15 +++
 arch/unicore32/include/asm/barrier.h  |   15 +++
 arch/x86/include/asm/barrier.h        |   15 +++
 arch/xtensa/include/asm/barrier.h     |   15 +++
 include/linux/compiler.h              |    9 +
 30 files changed, 581 insertions(+), 79 deletions(-)

--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -371,33 +371,35 @@ VARIETIES OF MEMORY BARRIER
 
 And a couple of implicit varieties:
 
- (5) LOCK operations.
+ (5) ACQUIRE operations.
 
      This acts as a one-way permeable barrier.  It guarantees that all memory
-     operations after the LOCK operation will appear to happen after the LOCK
-     operation with respect to the other components of the system.
+     operations after the ACQUIRE operation will appear to happen after the
+     ACQUIRE operation with respect to the other components of the system.
 
-     Memory operations that occur before a LOCK operation may appear to happen
-     after it completes.
+     Memory operations that occur before an ACQUIRE operation may appear to
+     happen after it completes.
 
-     A LOCK operation should almost always be paired with an UNLOCK operation.
+     An ACQUIRE operation should almost always be paired with a RELEASE
+     operation.
 
 
- (6) UNLOCK operations.
+ (6) RELEASE operations.
 
      This also acts as a one-way permeable barrier.  It guarantees that all
-     memory operations before the UNLOCK operation will appear to happen before
-     the UNLOCK operation with respect to the other components of the system.
+     memory operations before the RELEASE operation will appear to happen
+     before the RELEASE operation with respect to the other components of the
+     system.
 
-     Memory operations that occur after an UNLOCK operation may appear to
+     Memory operations that occur after a RELEASE operation may appear to
      happen before it completes.
 
-     LOCK and UNLOCK operations are guaranteed to appear with respect to each
-     other strictly in the order specified.
+     ACQUIRE and RELEASE operations are guaranteed to appear with respect to
+     each other strictly in the order specified.
 
-     The use of LOCK and UNLOCK operations generally precludes the need for
-     other sorts of memory barrier (but note the exceptions mentioned in the
-     subsection "MMIO write barrier").
+     The use of ACQUIRE and RELEASE operations generally precludes the need
+     for other sorts of memory barrier (but note the exceptions mentioned in
+     the subsection "MMIO write barrier").
 
 
 Memory barriers are only required where there's a possibility of interaction
@@ -1135,7 +1137,7 @@ CPU from reordering them.
 	clear_bit( ... );
 
      This prevents memory operations before the clear leaking to after it.  See
-     the subsection on "Locking Functions" with reference to UNLOCK operation
+     the subsection on "Locking Functions" with reference to RELEASE operation
      implications.
 
      See Documentation/atomic_ops.txt for more information.  See the "Atomic
@@ -1181,65 +1183,66 @@ LOCKING FUNCTIONS
  (*) R/W semaphores
  (*) RCU
 
-In all cases there are variants on "LOCK" operations and "UNLOCK" operations
+In all cases there are variants on "ACQUIRE" operations and "RELEASE" operations
 for each construct.  These operations all imply certain barriers:
 
- (1) LOCK operation implication:
+ (1) ACQUIRE operation implication:
 
-     Memory operations issued after the LOCK will be completed after the LOCK
-     operation has completed.
+     Memory operations issued after the ACQUIRE will be completed after the
+     ACQUIRE operation has completed.
 
-     Memory operations issued before the LOCK may be completed after the LOCK
-     operation has completed.
+     Memory operations issued before the ACQUIRE may be completed after the
+     ACQUIRE operation has completed.
 
- (2) UNLOCK operation implication:
+ (2) RELEASE operation implication:
 
-     Memory operations issued before the UNLOCK will be completed before the
-     UNLOCK operation has completed.
+     Memory operations issued before the RELEASE will be completed before the
+     RELEASE operation has completed.
 
-     Memory operations issued after the UNLOCK may be completed before the
-     UNLOCK operation has completed.
+     Memory operations issued after the RELEASE may be completed before the
+     RELEASE operation has completed.
 
- (3) LOCK vs LOCK implication:
+ (3) ACQUIRE vs ACQUIRE implication:
 
-     All LOCK operations issued before another LOCK operation will be completed
-     before that LOCK operation.
+     All ACQUIRE operations issued before another ACQUIRE operation will be
+     completed before that ACQUIRE operation.
 
- (4) LOCK vs UNLOCK implication:
+ (4) ACQUIRE vs RELEASE implication:
 
-     All LOCK operations issued before an UNLOCK operation will be completed
-     before the UNLOCK operation.
+     All ACQUIRE operations issued before a RELEASE operation will be
+     completed before the RELEASE operation.
 
-     All UNLOCK operations issued before a LOCK operation will be completed
-     before the LOCK operation.
+     All RELEASE operations issued before an ACQUIRE operation will be
+     completed before the ACQUIRE operation.
 
- (5) Failed conditional LOCK implication:
+ (5) Failed conditional ACQUIRE implication:
 
-     Certain variants of the LOCK operation may fail, either due to being
+     Certain variants of the ACQUIRE operation may fail, either due to being
      unable to get the lock immediately, or due to receiving an unblocked
      signal whilst asleep waiting for the lock to become available.  Failed
      locks do not imply any sort of barrier.
 
-Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is
-equivalent to a full barrier, but a LOCK followed by an UNLOCK is not.
+Therefore, from (1), (2) and (4) a RELEASE followed by an unconditional
+ACQUIRE is equivalent to a full barrier, but an ACQUIRE followed by a RELEASE
+is not.
 
 [!] Note: one of the consequences of LOCKs and UNLOCKs being only one-way
     barriers is that the effects of instructions outside of a critical section
     may seep into the inside of the critical section.
 
-A LOCK followed by an UNLOCK may not be assumed to be full memory barrier
-because it is possible for an access preceding the LOCK to happen after the
-LOCK, and an access following the UNLOCK to happen before the UNLOCK, and the
-two accesses can themselves then cross:
+An ACQUIRE followed by a RELEASE may not be assumed to be a full memory barrier
+because it is possible for an access preceding the ACQUIRE to happen after the
+ACQUIRE, and an access following the RELEASE to happen before the RELEASE, and
+the two accesses can themselves then cross:
 
 	*A = a;
-	LOCK
-	UNLOCK
+	ACQUIRE
+	RELEASE
 	*B = b;
 
 may occur as:
 
-	LOCK, STORE *B, STORE *A, UNLOCK
+	ACQUIRE, STORE *B, STORE *A, RELEASE
 
 Locks and semaphores may not provide any guarantee of ordering on UP compiled
 systems, and so cannot be counted on in such a situation to actually achieve
@@ -1253,33 +1256,33 @@ See also the section on "Inter-CPU locki
 
 	*A = a;
 	*B = b;
-	LOCK
+	ACQUIRE
 	*C = c;
 	*D = d;
-	UNLOCK
+	RELEASE
 	*E = e;
 	*F = f;
 
 The following sequence of events is acceptable:
 
-	LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK
+	ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE
 
 	[+] Note that {*F,*A} indicates a combined access.
 
 But none of the following are:
 
-	{*F,*A}, *B,	LOCK, *C, *D,	UNLOCK, *E
-	*A, *B, *C,	LOCK, *D,	UNLOCK, *E, *F
-	*A, *B,		LOCK, *C,	UNLOCK, *D, *E, *F
-	*B,		LOCK, *C, *D,	UNLOCK, {*F,*A}, *E
+	{*F,*A}, *B,	ACQUIRE, *C, *D,	RELEASE, *E
+	*A, *B, *C,	ACQUIRE, *D,		RELEASE, *E, *F
+	*A, *B,		ACQUIRE, *C,		RELEASE, *D, *E, *F
+	*B,		ACQUIRE, *C, *D,	RELEASE, {*F,*A}, *E
 
 
 
 INTERRUPT DISABLING FUNCTIONS
 -----------------------------
 
-Functions that disable interrupts (LOCK equivalent) and enable interrupts
-(UNLOCK equivalent) will act as compiler barriers only.  So if memory or I/O
+Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
+(RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
 barriers are required in such a situation, they must be provided from some
 other means.
 
@@ -1436,24 +1439,24 @@ Consider the following: the system has a
 	CPU 1				CPU 2
 	===============================	===============================
 	*A = a;				*E = e;
-	LOCK M				LOCK Q
+	ACQUIRE M			ACQUIRE Q
 	*B = b;				*F = f;
 	*C = c;				*G = g;
-	UNLOCK M			UNLOCK Q
+	RELEASE M			RELEASE Q
 	*D = d;				*H = h;
 
 Then there is no guarantee as to what order CPU 3 will see the accesses to *A
 through *H occur in, other than the constraints imposed by the separate locks
 on the separate CPUs. It might, for example, see:
 
-	*E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M
+	*E, ACQUIRE M, ACQUIRE Q, *G, *C, *F, *A, *B, RELEASE Q, *D, *H, RELEASE M
 
 But it won't see any of:
 
-	*B, *C or *D preceding LOCK M
-	*A, *B or *C following UNLOCK M
-	*F, *G or *H preceding LOCK Q
-	*E, *F or *G following UNLOCK Q
+	*B, *C or *D preceding ACQUIRE M
+	*A, *B or *C following RELEASE M
+	*F, *G or *H preceding ACQUIRE Q
+	*E, *F or *G following RELEASE Q
 
 
 However, if the following occurs:
@@ -1461,28 +1464,28 @@ through *H occur in, other than the cons
 	CPU 1				CPU 2
 	===============================	===============================
 	*A = a;
-	LOCK M		[1]
+	ACQUIRE M	[1]
 	*B = b;
 	*C = c;
-	UNLOCK M	[1]
+	RELEASE M	[1]
 	*D = d;				*E = e;
-					LOCK M		[2]
+					ACQUIRE M	[2]
 					*F = f;
 					*G = g;
-					UNLOCK M	[2]
+					RELEASE M	[2]
 					*H = h;
 
 CPU 3 might see:
 
-	*E, LOCK M [1], *C, *B, *A, UNLOCK M [1],
-		LOCK M [2], *H, *F, *G, UNLOCK M [2], *D
+	*E, ACQUIRE M [1], *C, *B, *A, RELEASE M [1],
+	    ACQUIRE M [2], *H, *F, *G, RELEASE M [2], *D
 
 But assuming CPU 1 gets the lock first, CPU 3 won't see any of:
 
-	*B, *C, *D, *F, *G or *H preceding LOCK M [1]
-	*A, *B or *C following UNLOCK M [1]
-	*F, *G or *H preceding LOCK M [2]
-	*A, *B, *C, *E, *F or *G following UNLOCK M [2]
+	*B, *C, *D, *F, *G or *H preceding ACQUIRE M [1]
+	*A, *B or *C following RELEASE M [1]
+	*F, *G or *H preceding ACQUIRE M [2]
+	*A, *B, *C, *E, *F or *G following RELEASE M [2]
 
 
 LOCKS VS I/O ACCESSES
@@ -1702,13 +1705,13 @@ about the state (old or new) implies an
 	test_and_clear_bit();
 	test_and_change_bit();
 
-These are used for such things as implementing LOCK-class and UNLOCK-class
+These are used for such things as implementing ACQUIRE-class and RELEASE-class
 operations and adjusting reference counters towards object destruction, and as
 such the implicit memory barrier effects are necessary.
 
 
 The following operations are potential problems as they do _not_ imply memory
-barriers, but might be used for implementing such things as UNLOCK-class
+barriers, but might be used for implementing such things as RELEASE-class
 operations:
 
 	atomic_set();
@@ -1750,9 +1753,9 @@ barriers are needed or not.
 	clear_bit_unlock();
 	__clear_bit_unlock();
 
-These implement LOCK-class and UNLOCK-class operations. These should be used in
-preference to other operations when implementing locking primitives, because
-their implementations can be optimised on many architectures.
+These implement ACQUIRE-class and RELEASE-class operations. These should be
+used in preference to other operations when implementing locking primitives,
+because their implementations can be optimised on many architectures.
 
 [!] Note that special memory barrier primitives are available for these
 situations because on some CPUs the atomic instructions used imply full memory
--- a/arch/alpha/include/asm/barrier.h
+++ b/arch/alpha/include/asm/barrier.h
@@ -29,6 +29,21 @@ __asm__ __volatile__("mb": : :"memory")
 #define smp_read_barrier_depends()	do { } while (0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #define set_mb(var, value) \
 do { var = value; mb(); } while (0)
 
--- a/arch/arc/include/asm/barrier.h
+++ b/arch/arc/include/asm/barrier.h
@@ -30,6 +30,21 @@
 #define smp_wmb()       barrier()
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #define smp_mb__before_atomic_dec()	barrier()
 #define smp_mb__after_atomic_dec()	barrier()
 #define smp_mb__before_atomic_inc()	barrier()
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -59,6 +59,21 @@
 #define smp_wmb()	dmb(ishst)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #define read_barrier_depends()		do { } while(0)
 #define smp_read_barrier_depends()	do { } while(0)
 
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -35,11 +35,61 @@
 #define smp_mb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
+
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #else
+
 #define smp_mb()	asm volatile("dmb ish" : : : "memory")
 #define smp_rmb()	asm volatile("dmb ishld" : : : "memory")
 #define smp_wmb()	asm volatile("dmb ishst" : : : "memory")
-#endif
+
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	switch (sizeof(*p)) {						\
+	case 4:								\
+		asm volatile ("stlr %w1, %0"				\
+				: "=Q" (*p) : "r" (v) : "memory");	\
+		break;							\
+	case 8:								\
+		asm volatile ("stlr %1, %0"				\
+				: "=Q" (*p) : "r" (v) : "memory");	\
+		break;							\
+	}								\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1;						\
+	compiletime_assert_atomic_type(*p);				\
+	switch (sizeof(*p)) {						\
+	case 4:								\
+		asm volatile ("ldar %w0, %1"				\
+			: "=r" (___p1) : "Q" (*p) : "memory");		\
+		break;							\
+	case 8:								\
+		asm volatile ("ldar %0, %1"				\
+			: "=r" (___p1) : "Q" (*p) : "memory");		\
+		break;							\
+	}								\
+	___p1;								\
+})
+
+#endif
 
 #define read_barrier_depends()		do { } while(0)
 #define smp_read_barrier_depends()	do { } while(0)
--- a/arch/avr32/include/asm/barrier.h
+++ b/arch/avr32/include/asm/barrier.h
@@ -25,5 +25,19 @@
 # define smp_read_barrier_depends() do { } while(0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
 
 #endif /* __ASM_AVR32_BARRIER_H */
--- a/arch/blackfin/include/asm/barrier.h
+++ b/arch/blackfin/include/asm/barrier.h
@@ -45,4 +45,19 @@
 #define set_mb(var, value) do { var = value; mb(); } while (0)
 #define smp_read_barrier_depends()	read_barrier_depends()
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* _BLACKFIN_BARRIER_H */
--- a/arch/cris/include/asm/barrier.h
+++ b/arch/cris/include/asm/barrier.h
@@ -22,4 +22,19 @@
 #define smp_read_barrier_depends()     do { } while(0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* __ASM_CRIS_BARRIER_H */
--- a/arch/frv/include/asm/barrier.h
+++ b/arch/frv/include/asm/barrier.h
@@ -26,4 +26,19 @@
 #define set_mb(var, value) \
 	do { var = (value); barrier(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* _ASM_BARRIER_H */
--- a/arch/h8300/include/asm/barrier.h
+++ b/arch/h8300/include/asm/barrier.h
@@ -26,4 +26,19 @@
 #define smp_read_barrier_depends()	do { } while(0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* _H8300_BARRIER_H */
--- a/arch/hexagon/include/asm/barrier.h
+++ b/arch/hexagon/include/asm/barrier.h
@@ -38,4 +38,19 @@
 #define set_mb(var, value) \
 	do { var = value; mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* _ASM_BARRIER_H */
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -45,11 +45,60 @@
 # define smp_rmb()	rmb()
 # define smp_wmb()	wmb()
 # define smp_read_barrier_depends()	read_barrier_depends()
+
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	switch (sizeof(*p)) {						\
+	case 4:								\
+		asm volatile ("st4.rel [%0]=%1"				\
+				:: "r" (p), "r" (v) : "memory");	\
+		break;							\
+	case 8:								\
+		asm volatile ("st8.rel [%0]=%1"				\
+				:: "r" (p), "r" (v) : "memory");	\
+		break;							\
+	}								\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1;						\
+	compiletime_assert_atomic_type(*p);				\
+	switch (sizeof(*p)) {						\
+	case 4:								\
+		asm volatile ("ld4.acq %0=[%1]"				\
+				: "=r" (___p1) : "r" (p) : "memory");	\
+		break;							\
+	case 8:								\
+		asm volatile ("ld8.acq %0=[%1]"				\
+				: "=r" (___p1) : "r" (p) : "memory");	\
+		break;							\
+	}								\
+	___p1;								\
+})
+
 #else
+
 # define smp_mb()	barrier()
 # define smp_rmb()	barrier()
 # define smp_wmb()	barrier()
 # define smp_read_barrier_depends()	do { } while(0)
+
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
 #endif
 
 /*
--- a/arch/m32r/include/asm/barrier.h
+++ b/arch/m32r/include/asm/barrier.h
@@ -91,4 +91,19 @@
 #define set_mb(var, value) do { var = value; barrier(); } while (0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* _ASM_M32R_BARRIER_H */
--- a/arch/m68k/include/asm/barrier.h
+++ b/arch/m68k/include/asm/barrier.h
@@ -17,4 +17,19 @@
 #define smp_wmb()	barrier()
 #define smp_read_barrier_depends()	((void)0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* _M68K_BARRIER_H */
--- a/arch/metag/include/asm/barrier.h
+++ b/arch/metag/include/asm/barrier.h
@@ -82,4 +82,19 @@ static inline void fence(void)
 #define smp_read_barrier_depends()     do { } while (0)
 #define set_mb(var, value) do { var = value; smp_mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* _ASM_METAG_BARRIER_H */
--- a/arch/microblaze/include/asm/barrier.h
+++ b/arch/microblaze/include/asm/barrier.h
@@ -24,4 +24,19 @@
 #define smp_rmb()		rmb()
 #define smp_wmb()		wmb()
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* _ASM_MICROBLAZE_BARRIER_H */
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -180,4 +180,19 @@
 #define nudge_writes() mb()
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* __ASM_BARRIER_H */
--- a/arch/mn10300/include/asm/barrier.h
+++ b/arch/mn10300/include/asm/barrier.h
@@ -34,4 +34,19 @@
 #define read_barrier_depends()		do {} while (0)
 #define smp_read_barrier_depends()	do {} while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* _ASM_BARRIER_H */
--- a/arch/parisc/include/asm/barrier.h
+++ b/arch/parisc/include/asm/barrier.h
@@ -32,4 +32,19 @@
 
 #define set_mb(var, value)		do { var = value; mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* __PARISC_BARRIER_H */
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -45,11 +45,15 @@
 #    define SMPWMB      eieio
 #endif
 
+#define __lwsync()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
+
 #define smp_mb()	mb()
-#define smp_rmb()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
+#define smp_rmb()	__lwsync()
 #define smp_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
 #define smp_read_barrier_depends()	read_barrier_depends()
 #else
+#define __lwsync()	barrier()
+
 #define smp_mb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
@@ -65,4 +69,19 @@
 #define data_barrier(x)	\
 	asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	__lwsync();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	__lwsync();							\
+	___p1;								\
+})
+
 #endif /* _ASM_POWERPC_BARRIER_H */
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -32,4 +32,19 @@
 
 #define set_mb(var, value)		do { var = value; mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	___p1;								\
+})
+
 #endif /* __ASM_BARRIER_H */
--- a/arch/score/include/asm/barrier.h
+++ b/arch/score/include/asm/barrier.h
@@ -13,4 +13,19 @@
 
 #define set_mb(var, value) 		do {var = value; wmb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* _ASM_SCORE_BARRIER_H */
--- a/arch/sh/include/asm/barrier.h
+++ b/arch/sh/include/asm/barrier.h
@@ -51,4 +51,19 @@
 
 #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* __ASM_SH_BARRIER_H */
--- a/arch/sparc/include/asm/barrier_32.h
+++ b/arch/sparc/include/asm/barrier_32.h
@@ -12,4 +12,19 @@
 #define smp_wmb()	__asm__ __volatile__("":::"memory")
 #define smp_read_barrier_depends()	do { } while(0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* !(__SPARC_BARRIER_H) */
--- a/arch/sparc/include/asm/barrier_64.h
+++ b/arch/sparc/include/asm/barrier_64.h
@@ -53,4 +53,19 @@ do {	__asm__ __volatile__("ba,pt	%%xcc,
 
 #define smp_read_barrier_depends()	do { } while(0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	___p1;								\
+})
+
 #endif /* !(__SPARC64_BARRIER_H) */
--- a/arch/tile/include/asm/barrier.h
+++ b/arch/tile/include/asm/barrier.h
@@ -140,5 +140,20 @@ mb_incoherent(void)
 #define set_mb(var, value) \
 	do { var = value; mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_TILE_BARRIER_H */
--- a/arch/unicore32/include/asm/barrier.h
+++ b/arch/unicore32/include/asm/barrier.h
@@ -25,4 +25,19 @@
 
 #define set_mb(var, value)		do { var = value; smp_mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* __UNICORE_BARRIER_H__ */
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -100,6 +100,21 @@
 #define set_mb(var, value) do { var = value; barrier(); } while (0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	___p1;								\
+})
+
 /*
  * Stop RDTSC speculation. This is needed when you need to use RDTSC
  * (or get_cycles or vread that possibly accesses the TSC) in a defined
--- a/arch/xtensa/include/asm/barrier.h
+++ b/arch/xtensa/include/asm/barrier.h
@@ -26,4 +26,19 @@
 
 #define set_mb(var, value)	do { var = value; mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* _XTENSA_SYSTEM_H */
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -298,6 +298,11 @@ void ftrace_likely_update(struct ftrace_
 # define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
 #endif
 
+/* Is this type a native word size -- useful for atomic operations */
+#ifndef __native_word
+# define __native_word(t) (sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
+#endif
+
 /* Compile time object size, -1 for unknown */
 #ifndef __compiletime_object_size
 # define __compiletime_object_size(obj) -1
@@ -337,6 +342,10 @@ void ftrace_likely_update(struct ftrace_
 #define compiletime_assert(condition, msg) \
 	_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
 
+#define compiletime_assert_atomic_type(t)				\
+	compiletime_assert(__native_word(t),				\
+		"Need native word sized stores/loads for atomicity.")
+
 /*
  * Prevent the compiler from merging or refetching accesses.  The compiler
  * is also forbidden from reordering successive instances of ACCESS_ONCE(),

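(To illustrate what compiletime_assert_atomic_type() buys: handing the new accessors something wider than a native word now fails the build with the message above rather than silently becoming a multi-word copy.  A made-up example:)

	struct foo {
		unsigned long lo;
		unsigned long hi;
	} *shared, item;

	/*
	 * sizeof(*shared) is two words, not a native word size, so this
	 * now triggers "Need native word sized stores/loads for atomicity."
	 * at compile time:
	 */
	smp_store_release(shared, item);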
^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-06 12:39                                                 ` Peter Zijlstra
@ 2013-11-06 12:51                                                   ` Geert Uytterhoeven
  2013-11-06 13:57                                                     ` Peter Zijlstra
  0 siblings, 1 reply; 120+ messages in thread
From: Geert Uytterhoeven @ 2013-11-06 12:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Linus Torvalds, Victor Kaplansky,
	Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Russell King,
	Martin Schwidefsky, Heiko Carstens, Tony Luck

On Wed, Nov 6, 2013 at 1:39 PM, Peter Zijlstra <peterz@infradead.org> wrote:
>  Documentation/memory-barriers.txt     |  157 +++++++++++++++++-----------------
>  arch/alpha/include/asm/barrier.h      |   15 +++
>  arch/arc/include/asm/barrier.h        |   15 +++
>  arch/arm/include/asm/barrier.h        |   15 +++
>  arch/arm64/include/asm/barrier.h      |   50 ++++++++++
>  arch/avr32/include/asm/barrier.h      |   14 +++
>  arch/blackfin/include/asm/barrier.h   |   15 +++
>  arch/cris/include/asm/barrier.h       |   15 +++
>  arch/frv/include/asm/barrier.h        |   15 +++
>  arch/h8300/include/asm/barrier.h      |   15 +++
>  arch/hexagon/include/asm/barrier.h    |   15 +++
>  arch/ia64/include/asm/barrier.h       |   49 ++++++++++
>  arch/m32r/include/asm/barrier.h       |   15 +++
>  arch/m68k/include/asm/barrier.h       |   15 +++
>  arch/metag/include/asm/barrier.h      |   15 +++
>  arch/microblaze/include/asm/barrier.h |   15 +++
>  arch/mips/include/asm/barrier.h       |   15 +++
>  arch/mn10300/include/asm/barrier.h    |   15 +++
>  arch/parisc/include/asm/barrier.h     |   15 +++
>  arch/powerpc/include/asm/barrier.h    |   21 ++++
>  arch/s390/include/asm/barrier.h       |   15 +++
>  arch/score/include/asm/barrier.h      |   15 +++
>  arch/sh/include/asm/barrier.h         |   15 +++
>  arch/sparc/include/asm/barrier_32.h   |   15 +++
>  arch/sparc/include/asm/barrier_64.h   |   15 +++
>  arch/tile/include/asm/barrier.h       |   15 +++
>  arch/unicore32/include/asm/barrier.h  |   15 +++
>  arch/x86/include/asm/barrier.h        |   15 +++
>  arch/xtensa/include/asm/barrier.h     |   15 +++
>  include/linux/compiler.h              |    9 +
>  30 files changed, 581 insertions(+), 79 deletions(-)

This is screaming for a default implementation in asm-generic.

Gr{oetje,eeting}s,

                        Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply	[flat|nested] 120+ messages in thread

* [tip:perf/core] tools/perf: Add required memory barriers
  2013-10-30 10:42                       ` Peter Zijlstra
  2013-10-30 11:48                         ` James Hogan
@ 2013-11-06 13:19                         ` tip-bot for Peter Zijlstra
  2013-11-06 13:50                           ` Vince Weaver
  1 sibling, 1 reply; 120+ messages in thread
From: tip-bot for Peter Zijlstra @ 2013-11-06 13:19 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, mathieu.desnoyers, anton, hpa, mingo, michael,
	peterz, paulmck, vince, fweisbec, benh, oleg, tglx, VICTORK,
	mikey

Commit-ID:  a94d342b9cb09edfe888ea972af0883b6a8d992b
Gitweb:     http://git.kernel.org/tip/a94d342b9cb09edfe888ea972af0883b6a8d992b
Author:     Peter Zijlstra <peterz@infradead.org>
AuthorDate: Wed, 30 Oct 2013 11:42:46 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 6 Nov 2013 12:34:26 +0100

tools/perf: Add required memory barriers

To match patch bf378d341e48 ("perf: Fix perf ring buffer memory
ordering") change userspace to also adhere to the ordering outlined.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Michael Neuling <mikey@neuling.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: james.hogan@imgtec.com
Cc: Vince Weaver <vince@deater.net>
Cc: Victor Kaplansky <VICTORK@il.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: Michael Ellerman <michael@ellerman.id.au>
Link: http://lkml.kernel.org/r/20131030104246.GH16117@laptop.programming.kicks-ass.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 tools/perf/perf.h        | 59 ++++++++++++++++++++++++++++++++++++++----------
 tools/perf/tests/rdpmc.c |  2 --
 tools/perf/util/evlist.h |  4 ++--
 3 files changed, 49 insertions(+), 16 deletions(-)

diff --git a/tools/perf/perf.h b/tools/perf/perf.h
index f61c230..6a587e84 100644
--- a/tools/perf/perf.h
+++ b/tools/perf/perf.h
@@ -4,6 +4,8 @@
 #include <asm/unistd.h>
 
 #if defined(__i386__)
+#define mb()		asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
+#define wmb()		asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
 #define rmb()		asm volatile("lock; addl $0,0(%%esp)" ::: "memory")
 #define cpu_relax()	asm volatile("rep; nop" ::: "memory");
 #define CPUINFO_PROC	"model name"
@@ -13,6 +15,8 @@
 #endif
 
 #if defined(__x86_64__)
+#define mb()		asm volatile("mfence" ::: "memory")
+#define wmb()		asm volatile("sfence" ::: "memory")
 #define rmb()		asm volatile("lfence" ::: "memory")
 #define cpu_relax()	asm volatile("rep; nop" ::: "memory");
 #define CPUINFO_PROC	"model name"
@@ -23,45 +27,61 @@
 
 #ifdef __powerpc__
 #include "../../arch/powerpc/include/uapi/asm/unistd.h"
+#define mb()		asm volatile ("sync" ::: "memory")
+#define wmb()		asm volatile ("sync" ::: "memory")
 #define rmb()		asm volatile ("sync" ::: "memory")
-#define cpu_relax()	asm volatile ("" ::: "memory");
 #define CPUINFO_PROC	"cpu"
 #endif
 
 #ifdef __s390__
+#define mb()		asm volatile("bcr 15,0" ::: "memory")
+#define wmb()		asm volatile("bcr 15,0" ::: "memory")
 #define rmb()		asm volatile("bcr 15,0" ::: "memory")
-#define cpu_relax()	asm volatile("" ::: "memory");
 #endif
 
 #ifdef __sh__
 #if defined(__SH4A__) || defined(__SH5__)
+# define mb()		asm volatile("synco" ::: "memory")
+# define wmb()		asm volatile("synco" ::: "memory")
 # define rmb()		asm volatile("synco" ::: "memory")
 #else
+# define mb()		asm volatile("" ::: "memory")
+# define wmb()		asm volatile("" ::: "memory")
 # define rmb()		asm volatile("" ::: "memory")
 #endif
-#define cpu_relax()	asm volatile("" ::: "memory")
 #define CPUINFO_PROC	"cpu type"
 #endif
 
 #ifdef __hppa__
+#define mb()		asm volatile("" ::: "memory")
+#define wmb()		asm volatile("" ::: "memory")
 #define rmb()		asm volatile("" ::: "memory")
-#define cpu_relax()	asm volatile("" ::: "memory");
 #define CPUINFO_PROC	"cpu"
 #endif
 
 #ifdef __sparc__
+#ifdef __LP64__
+#define mb()		asm volatile("ba,pt %%xcc, 1f\n"	\
+				     "membar #StoreLoad\n"	\
+				     "1:\n":::"memory")
+#else
+#define mb()		asm volatile("":::"memory")
+#endif
+#define wmb()		asm volatile("":::"memory")
 #define rmb()		asm volatile("":::"memory")
-#define cpu_relax()	asm volatile("":::"memory")
 #define CPUINFO_PROC	"cpu"
 #endif
 
 #ifdef __alpha__
+#define mb()		asm volatile("mb" ::: "memory")
+#define wmb()		asm volatile("wmb" ::: "memory")
 #define rmb()		asm volatile("mb" ::: "memory")
-#define cpu_relax()	asm volatile("" ::: "memory")
 #define CPUINFO_PROC	"cpu model"
 #endif
 
 #ifdef __ia64__
+#define mb()		asm volatile ("mf" ::: "memory")
+#define wmb()		asm volatile ("mf" ::: "memory")
 #define rmb()		asm volatile ("mf" ::: "memory")
 #define cpu_relax()	asm volatile ("hint @pause" ::: "memory")
 #define CPUINFO_PROC	"model name"
@@ -72,40 +92,55 @@
  * Use the __kuser_memory_barrier helper in the CPU helper page. See
  * arch/arm/kernel/entry-armv.S in the kernel source for details.
  */
+#define mb()		((void(*)(void))0xffff0fa0)()
+#define wmb()		((void(*)(void))0xffff0fa0)()
 #define rmb()		((void(*)(void))0xffff0fa0)()
-#define cpu_relax()	asm volatile("":::"memory")
 #define CPUINFO_PROC	"Processor"
 #endif
 
 #ifdef __aarch64__
-#define rmb()		asm volatile("dmb ld" ::: "memory")
+#define mb()		asm volatile("dmb ish" ::: "memory")
+#define wmb()		asm volatile("dmb ishst" ::: "memory")
+#define rmb()		asm volatile("dmb ishld" ::: "memory")
 #define cpu_relax()	asm volatile("yield" ::: "memory")
 #endif
 
 #ifdef __mips__
-#define rmb()		asm volatile(					\
+#define mb()		asm volatile(					\
 				".set	mips2\n\t"			\
 				"sync\n\t"				\
 				".set	mips0"				\
 				: /* no output */			\
 				: /* no input */			\
 				: "memory")
-#define cpu_relax()	asm volatile("" ::: "memory")
+#define wmb()	mb()
+#define rmb()	mb()
 #define CPUINFO_PROC	"cpu model"
 #endif
 
 #ifdef __arc__
+#define mb()		asm volatile("" ::: "memory")
+#define wmb()		asm volatile("" ::: "memory")
 #define rmb()		asm volatile("" ::: "memory")
-#define cpu_relax()	rmb()
 #define CPUINFO_PROC	"Processor"
 #endif
 
 #ifdef __metag__
+#define mb()		asm volatile("" ::: "memory")
+#define wmb()		asm volatile("" ::: "memory")
 #define rmb()		asm volatile("" ::: "memory")
-#define cpu_relax()	asm volatile("" ::: "memory")
 #define CPUINFO_PROC	"CPU"
 #endif
 
+#define barrier() asm volatile ("" ::: "memory")
+
+#ifndef cpu_relax
+#define cpu_relax() barrier()
+#endif
+
+#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
+
+
 #include <time.h>
 #include <unistd.h>
 #include <sys/types.h>
diff --git a/tools/perf/tests/rdpmc.c b/tools/perf/tests/rdpmc.c
index ff94886..46649c2 100644
--- a/tools/perf/tests/rdpmc.c
+++ b/tools/perf/tests/rdpmc.c
@@ -9,8 +9,6 @@
 
 #if defined(__x86_64__) || defined(__i386__)
 
-#define barrier() asm volatile("" ::: "memory")
-
 static u64 rdpmc(unsigned int counter)
 {
 	unsigned int low, high;
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index e99eaed..ecaa582 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -177,7 +177,7 @@ int perf_evlist__strerror_open(struct perf_evlist *evlist, int err, char *buf, s
 static inline unsigned int perf_mmap__read_head(struct perf_mmap *mm)
 {
 	struct perf_event_mmap_page *pc = mm->base;
-	int head = pc->data_head;
+	int head = ACCESS_ONCE(pc->data_head);
 	rmb();
 	return head;
 }
@@ -190,7 +190,7 @@ static inline void perf_mmap__write_tail(struct perf_mmap *md,
 	/*
 	 * ensure all reads are done before we write the tail out.
 	 */
-	/* mb(); */
+	mb();
 	pc->data_tail = tail;
 }
 

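(For reference, a consumer loop built on these two helpers then looks roughly like the sketch below.  Field names follow the tools/perf code of the time, the process callback stands in for whatever record handling the tool does, and copying of records that wrap around the ring edge is omitted.)

	static void consume_mmap(struct perf_mmap *md, size_t page_size,
				 void (*process)(union perf_event *event))
	{
		unsigned char *data = (unsigned char *)md->base + page_size;
		unsigned int old = md->prev;
		unsigned int head = perf_mmap__read_head(md);	/* ACCESS_ONCE() + rmb() */

		while (old != head) {
			union perf_event *event =
				(union perf_event *)&data[old & md->mask];

			process(event);		/* read the record contents here */
			old += event->header.size;
		}

		md->prev = old;
		perf_mmap__write_tail(md, old);	/* mb(), then store data_tail */
	}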
^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [tip:perf/core] tools/perf: Add required memory barriers
  2013-11-06 13:19                         ` [tip:perf/core] tools/perf: Add required memory barriers tip-bot for Peter Zijlstra
@ 2013-11-06 13:50                           ` Vince Weaver
  2013-11-06 14:00                             ` Peter Zijlstra
  0 siblings, 1 reply; 120+ messages in thread
From: Vince Weaver @ 2013-11-06 13:50 UTC (permalink / raw)
  To: mingo, hpa, anton, mathieu.desnoyers, linux-kernel, peterz,
	michael, paulmck, benh, fweisbec, VICTORK, tglx, oleg, mikey
  Cc: linux-tip-commits

On Wed, 6 Nov 2013, tip-bot for Peter Zijlstra wrote:

> Commit-ID:  a94d342b9cb09edfe888ea972af0883b6a8d992b
> Gitweb:     http://git.kernel.org/tip/a94d342b9cb09edfe888ea972af0883b6a8d992b
> Author:     Peter Zijlstra <peterz@infradead.org>
> AuthorDate: Wed, 30 Oct 2013 11:42:46 +0100
> Committer:  Ingo Molnar <mingo@kernel.org>
> CommitDate: Wed, 6 Nov 2013 12:34:26 +0100
> 
> tools/perf: Add required memory barriers
> 
> To match patch bf378d341e48 ("perf: Fix perf ring buffer memory
> ordering") change userspace to also adhere to the ordering outlined.

...

> +++ b/tools/perf/util/evlist.h
> @@ -177,7 +177,7 @@ int perf_evlist__strerror_open(struct perf_evlist *evlist, int err, char *buf, s
>  static inline unsigned int perf_mmap__read_head(struct perf_mmap *mm)
>  {
>  	struct perf_event_mmap_page *pc = mm->base;
> -	int head = pc->data_head;
> +	int head = ACCESS_ONCE(pc->data_head);
>  	rmb();
>  	return head;

so is this ACCESS_ONCE required now for proper access to the mmap buffer?

remember that there are users trying to use this outside of the kernel 
where we don't necessarily have access to internal kernel macros.  Some of 
these users aren't necessarily GPLv2 compatible either (PAPI for example 
is more or less BSD licensed) so just cutting and pasting chunks of 
internal kernel macros isn't always the best route either.
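(One license-neutral way around that is to write the single-access helper from the language rules rather than copying the kernel macro; a volatile cast is all ACCESS_ONCE() amounts to.  A sketch, with READ_ONCE_VOL being a made-up name:)

	/* read the location exactly once, same effect as the kernel's ACCESS_ONCE() */
	#define READ_ONCE_VOL(x)	(*(volatile __typeof__(x) *)&(x))

	__u64 head = READ_ONCE_VOL(pc->data_head);
	rmb();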

Vince


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-06 12:51                                                   ` Geert Uytterhoeven
@ 2013-11-06 13:57                                                     ` Peter Zijlstra
  2013-11-06 18:48                                                       ` Paul E. McKenney
  2013-11-07 11:17                                                       ` Will Deacon
  0 siblings, 2 replies; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-06 13:57 UTC (permalink / raw)
  To: Geert Uytterhoeven
  Cc: Paul E. McKenney, Linus Torvalds, Victor Kaplansky,
	Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Russell King,
	Martin Schwidefsky, Heiko Carstens, Tony Luck

On Wed, Nov 06, 2013 at 01:51:10PM +0100, Geert Uytterhoeven wrote:
> This is screaming for a default implementation in asm-generic.

Right you are... how about a little something like this?

There's a few archs I didn't fully merge with the generic one because of
weird nop implementations.

asm volatile ("nop" :: ) vs asm volatile ("nop" ::: "memory") and the
like. They probably can (and should) use the regular asm volatile
("nop") but I misplaced the toolchains for many of the weird archs so I
didn't attempt.

Also fixed a silly mistake in the return type definition for most
smp_load_acquire() implementations: typeof(p) vs typeof(*p).
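(The generic fallback in question is presumably just the smp_mb()-based pair that the previous patch duplicated per architecture, with the typeof(*p) fix applied, i.e. roughly:)

	#define smp_store_release(p, v)						\
	do {									\
		compiletime_assert_atomic_type(*p);				\
		smp_mb();							\
		ACCESS_ONCE(*p) = (v);						\
	} while (0)

	#define smp_load_acquire(p)						\
	({									\
		typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
		compiletime_assert_atomic_type(*p);				\
		smp_mb();							\
		___p1;								\
	})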

---
Subject: arch: Introduce smp_load_acquire(), smp_store_release()
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon, 4 Nov 2013 20:18:11 +0100

A number of situations currently require the heavyweight smp_mb(),
even though there is no need to order prior stores against later
loads.  Many architectures have much cheaper ways to handle these
situations, but the Linux kernel currently has no portable way
to make use of them.

This commit therefore supplies smp_load_acquire() and
smp_store_release() to remedy this situation.  The new
smp_load_acquire() primitive orders the specified load against
any subsequent reads or writes, while the new smp_store_release()
primitive orders the specifed store against any prior reads or
writes.  These primitives allow array-based circular FIFOs to be
implemented without an smp_mb(), and also allow a theoretical
hole in rcu_assign_pointer() to be closed at no additional
expense on most architectures.

In addition, the RCU experience transitioning from explicit
smp_read_barrier_depends() and smp_wmb() to rcu_dereference()
and rcu_assign_pointer(), respectively, resulted in substantial
improvements in readability.  It therefore seems likely that
replacing other explicit barriers with smp_load_acquire() and
smp_store_release() will provide similar benefits.  It appears
that roughly half of the explicit barriers in core kernel code
might be so replaced.


Cc: Michael Ellerman <michael@ellerman.id.au>
Cc: Michael Neuling <mikey@neuling.org>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Victor Kaplansky <VICTORK@il.ibm.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
---
 Documentation/memory-barriers.txt     |  157 +++++++++++++++++-----------------
 arch/alpha/include/asm/barrier.h      |   25 +----
 arch/arc/include/asm/Kbuild           |    1 
 arch/arc/include/asm/atomic.h         |    5 +
 arch/arc/include/asm/barrier.h        |   42 ---------
 arch/arm/include/asm/barrier.h        |   15 +++
 arch/arm64/include/asm/barrier.h      |   50 ++++++++++
 arch/avr32/include/asm/barrier.h      |   17 +--
 arch/blackfin/include/asm/barrier.h   |   18 ---
 arch/cris/include/asm/Kbuild          |    1 
 arch/cris/include/asm/barrier.h       |   25 -----
 arch/frv/include/asm/barrier.h        |    8 -
 arch/h8300/include/asm/barrier.h      |   21 ----
 arch/hexagon/include/asm/Kbuild       |    1 
 arch/hexagon/include/asm/barrier.h    |   41 --------
 arch/ia64/include/asm/barrier.h       |   49 ++++++++++
 arch/m32r/include/asm/barrier.h       |   80 -----------------
 arch/m68k/include/asm/barrier.h       |   14 ---
 arch/metag/include/asm/barrier.h      |   15 +++
 arch/microblaze/include/asm/Kbuild    |    1 
 arch/microblaze/include/asm/barrier.h |   27 -----
 arch/mips/include/asm/barrier.h       |   15 +++
 arch/mn10300/include/asm/Kbuild       |    1 
 arch/mn10300/include/asm/barrier.h    |   37 --------
 arch/parisc/include/asm/Kbuild        |    1 
 arch/parisc/include/asm/barrier.h     |   35 -------
 arch/powerpc/include/asm/barrier.h    |   21 ++++
 arch/s390/include/asm/barrier.h       |   15 +++
 arch/score/include/asm/Kbuild         |    1 
 arch/score/include/asm/barrier.h      |   16 ---
 arch/sh/include/asm/barrier.h         |   21 ----
 arch/sparc/include/asm/barrier_32.h   |   11 --
 arch/sparc/include/asm/barrier_64.h   |   15 +++
 arch/tile/include/asm/barrier.h       |   68 --------------
 arch/unicore32/include/asm/barrier.h  |   11 --
 arch/x86/include/asm/barrier.h        |   15 +++
 arch/xtensa/include/asm/barrier.h     |    9 -
 include/asm-generic/barrier.h         |   55 +++++++++--
 include/linux/compiler.h              |    9 +
 39 files changed, 375 insertions(+), 594 deletions(-)

--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -371,33 +371,35 @@ VARIETIES OF MEMORY BARRIER
 
 And a couple of implicit varieties:
 
- (5) LOCK operations.
+ (5) ACQUIRE operations.
 
      This acts as a one-way permeable barrier.  It guarantees that all memory
-     operations after the LOCK operation will appear to happen after the LOCK
-     operation with respect to the other components of the system.
+     operations after the ACQUIRE operation will appear to happen after the
+     ACQUIRE operation with respect to the other components of the system.
 
-     Memory operations that occur before a LOCK operation may appear to happen
-     after it completes.
+     Memory operations that occur before an ACQUIRE operation may appear to
+     happen after it completes.
 
-     A LOCK operation should almost always be paired with an UNLOCK operation.
+     An ACQUIRE operation should almost always be paired with a RELEASE
+     operation.
 
 
- (6) UNLOCK operations.
+ (6) RELEASE operations.
 
      This also acts as a one-way permeable barrier.  It guarantees that all
-     memory operations before the UNLOCK operation will appear to happen before
-     the UNLOCK operation with respect to the other components of the system.
+     memory operations before the RELEASE operation will appear to happen
+     before the RELEASE operation with respect to the other components of the
+     system.
 
-     Memory operations that occur after an UNLOCK operation may appear to
+     Memory operations that occur after a RELEASE operation may appear to
      happen before it completes.
 
-     LOCK and UNLOCK operations are guaranteed to appear with respect to each
-     other strictly in the order specified.
+     ACQUIRE and RELEASE operations are guaranteed to appear with respect to
+     each other strictly in the order specified.
 
-     The use of LOCK and UNLOCK operations generally precludes the need for
-     other sorts of memory barrier (but note the exceptions mentioned in the
-     subsection "MMIO write barrier").
+     The use of ACQUIRE and RELEASE operations generally precludes the need
+     for other sorts of memory barrier (but note the exceptions mentioned in
+     the subsection "MMIO write barrier").
 
 
 Memory barriers are only required where there's a possibility of interaction
@@ -1135,7 +1137,7 @@ CPU from reordering them.
 	clear_bit( ... );
 
      This prevents memory operations before the clear leaking to after it.  See
-     the subsection on "Locking Functions" with reference to UNLOCK operation
+     the subsection on "Locking Functions" with reference to RELEASE operation
      implications.
 
      See Documentation/atomic_ops.txt for more information.  See the "Atomic
@@ -1181,65 +1183,66 @@ LOCKING FUNCTIONS
  (*) R/W semaphores
  (*) RCU
 
-In all cases there are variants on "LOCK" operations and "UNLOCK" operations
+In all cases there are variants on "ACQUIRE" operations and "RELEASE" operations
 for each construct.  These operations all imply certain barriers:
 
- (1) LOCK operation implication:
+ (1) ACQUIRE operation implication:
 
-     Memory operations issued after the LOCK will be completed after the LOCK
-     operation has completed.
+     Memory operations issued after the ACQUIRE will be completed after the
+     ACQUIRE operation has completed.
 
-     Memory operations issued before the LOCK may be completed after the LOCK
-     operation has completed.
+     Memory operations issued before the ACQUIRE may be completed after the
+     ACQUIRE operation has completed.
 
- (2) UNLOCK operation implication:
+ (2) RELEASE operation implication:
 
-     Memory operations issued before the UNLOCK will be completed before the
-     UNLOCK operation has completed.
+     Memory operations issued before the RELEASE will be completed before the
+     RELEASE operation has completed.
 
-     Memory operations issued after the UNLOCK may be completed before the
-     UNLOCK operation has completed.
+     Memory operations issued after the RELEASE may be completed before the
+     RELEASE operation has completed.
 
- (3) LOCK vs LOCK implication:
+ (3) ACQUIRE vs ACQUIRE implication:
 
-     All LOCK operations issued before another LOCK operation will be completed
-     before that LOCK operation.
+     All ACQUIRE operations issued before another ACQUIRE operation will be
+     completed before that ACQUIRE operation.
 
- (4) LOCK vs UNLOCK implication:
+ (4) ACQUIRE vs RELEASE implication:
 
-     All LOCK operations issued before an UNLOCK operation will be completed
-     before the UNLOCK operation.
+     All ACQUIRE operations issued before a RELEASE operation will be
+     completed before the RELEASE operation.
 
-     All UNLOCK operations issued before a LOCK operation will be completed
-     before the LOCK operation.
+     All RELEASE operations issued before an ACQUIRE operation will be
+     completed before the ACQUIRE operation.
 
- (5) Failed conditional LOCK implication:
+ (5) Failed conditional ACQUIRE implication:
 
-     Certain variants of the LOCK operation may fail, either due to being
+     Certain variants of the ACQUIRE operation may fail, either due to being
      unable to get the lock immediately, or due to receiving an unblocked
      signal whilst asleep waiting for the lock to become available.  Failed
      locks do not imply any sort of barrier.
 
-Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is
-equivalent to a full barrier, but a LOCK followed by an UNLOCK is not.
+Therefore, from (1), (2) and (4) a RELEASE followed by an unconditional
+ACQUIRE is equivalent to a full barrier, but an ACQUIRE followed by a RELEASE
+is not.
 
 [!] Note: one of the consequences of LOCKs and UNLOCKs being only one-way
     barriers is that the effects of instructions outside of a critical section
     may seep into the inside of the critical section.
 
-A LOCK followed by an UNLOCK may not be assumed to be full memory barrier
-because it is possible for an access preceding the LOCK to happen after the
-LOCK, and an access following the UNLOCK to happen before the UNLOCK, and the
-two accesses can themselves then cross:
+An ACQUIRE followed by a RELEASE may not be assumed to be a full memory
+barrier because it is possible for an access preceding the ACQUIRE to happen
+after the ACQUIRE, and an access following the RELEASE to happen before the
+RELEASE, and the two accesses can themselves then cross:
 
 	*A = a;
-	LOCK
-	UNLOCK
+	ACQUIRE
+	RELEASE
 	*B = b;
 
 may occur as:
 
-	LOCK, STORE *B, STORE *A, UNLOCK
+	ACQUIRE, STORE *B, STORE *A, RELEASE
 
 Locks and semaphores may not provide any guarantee of ordering on UP compiled
 systems, and so cannot be counted on in such a situation to actually achieve
@@ -1253,33 +1256,33 @@ See also the section on "Inter-CPU locki
 
 	*A = a;
 	*B = b;
-	LOCK
+	ACQUIRE
 	*C = c;
 	*D = d;
-	UNLOCK
+	RELEASE
 	*E = e;
 	*F = f;
 
 The following sequence of events is acceptable:
 
-	LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK
+	ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE
 
 	[+] Note that {*F,*A} indicates a combined access.
 
 But none of the following are:
 
-	{*F,*A}, *B,	LOCK, *C, *D,	UNLOCK, *E
-	*A, *B, *C,	LOCK, *D,	UNLOCK, *E, *F
-	*A, *B,		LOCK, *C,	UNLOCK, *D, *E, *F
-	*B,		LOCK, *C, *D,	UNLOCK, {*F,*A}, *E
+	{*F,*A}, *B,	ACQUIRE, *C, *D,	RELEASE, *E
+	*A, *B, *C,	ACQUIRE, *D,		RELEASE, *E, *F
+	*A, *B,		ACQUIRE, *C,		RELEASE, *D, *E, *F
+	*B,		ACQUIRE, *C, *D,	RELEASE, {*F,*A}, *E
 
 
 
 INTERRUPT DISABLING FUNCTIONS
 -----------------------------
 
-Functions that disable interrupts (LOCK equivalent) and enable interrupts
-(UNLOCK equivalent) will act as compiler barriers only.  So if memory or I/O
+Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
+(RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
 barriers are required in such a situation, they must be provided from some
 other means.
 
@@ -1436,24 +1439,24 @@ Consider the following: the system has a
 	CPU 1				CPU 2
 	===============================	===============================
 	*A = a;				*E = e;
-	LOCK M				LOCK Q
+	ACQUIRE M			ACQUIRE Q
 	*B = b;				*F = f;
 	*C = c;				*G = g;
-	UNLOCK M			UNLOCK Q
+	RELEASE M			RELEASE Q
 	*D = d;				*H = h;
 
 Then there is no guarantee as to what order CPU 3 will see the accesses to *A
 through *H occur in, other than the constraints imposed by the separate locks
 on the separate CPUs. It might, for example, see:
 
-	*E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M
+	*E, ACQUIRE M, ACQUIRE Q, *G, *C, *F, *A, *B, RELEASE Q, *D, *H, RELEASE M
 
 But it won't see any of:
 
-	*B, *C or *D preceding LOCK M
-	*A, *B or *C following UNLOCK M
-	*F, *G or *H preceding LOCK Q
-	*E, *F or *G following UNLOCK Q
+	*B, *C or *D preceding ACQUIRE M
+	*A, *B or *C following RELEASE M
+	*F, *G or *H preceding ACQUIRE Q
+	*E, *F or *G following RELEASE Q
 
 
 However, if the following occurs:
@@ -1461,28 +1464,28 @@ through *H occur in, other than the cons
 	CPU 1				CPU 2
 	===============================	===============================
 	*A = a;
-	LOCK M		[1]
+	ACQUIRE M	[1]
 	*B = b;
 	*C = c;
-	UNLOCK M	[1]
+	RELEASE M	[1]
 	*D = d;				*E = e;
-					LOCK M		[2]
+					ACQUIRE M	[2]
 					*F = f;
 					*G = g;
-					UNLOCK M	[2]
+					RELEASE M	[2]
 					*H = h;
 
 CPU 3 might see:
 
-	*E, LOCK M [1], *C, *B, *A, UNLOCK M [1],
-		LOCK M [2], *H, *F, *G, UNLOCK M [2], *D
+	*E, ACQUIRE M [1], *C, *B, *A, RELEASE M [1],
+	    ACQUIRE M [2], *H, *F, *G, RELEASE M [2], *D
 
 But assuming CPU 1 gets the lock first, CPU 3 won't see any of:
 
-	*B, *C, *D, *F, *G or *H preceding LOCK M [1]
-	*A, *B or *C following UNLOCK M [1]
-	*F, *G or *H preceding LOCK M [2]
-	*A, *B, *C, *E, *F or *G following UNLOCK M [2]
+	*B, *C, *D, *F, *G or *H preceding ACQUIRE M [1]
+	*A, *B or *C following RELEASE M [1]
+	*F, *G or *H preceding ACQUIRE M [2]
+	*A, *B, *C, *E, *F or *G following RELEASE M [2]
 
 
 LOCKS VS I/O ACCESSES
@@ -1702,13 +1705,13 @@ about the state (old or new) implies an
 	test_and_clear_bit();
 	test_and_change_bit();
 
-These are used for such things as implementing LOCK-class and UNLOCK-class
+These are used for such things as implementing ACQUIRE-class and RELEASE-class
 operations and adjusting reference counters towards object destruction, and as
 such the implicit memory barrier effects are necessary.
 
 
 The following operations are potential problems as they do _not_ imply memory
-barriers, but might be used for implementing such things as UNLOCK-class
+barriers, but might be used for implementing such things as RELEASE-class
 operations:
 
 	atomic_set();
@@ -1750,9 +1753,9 @@ barriers are needed or not.
 	clear_bit_unlock();
 	__clear_bit_unlock();
 
-These implement LOCK-class and UNLOCK-class operations. These should be used in
-preference to other operations when implementing locking primitives, because
-their implementations can be optimised on many architectures.
+These implement ACQUIRE-class and RELEASE-class operations. These should be
+used in preference to other operations when implementing locking primitives,
+because their implementations can be optimised on many architectures.
 
 [!] Note that special memory barrier primitives are available for these
 situations because on some CPUs the atomic instructions used imply full memory
--- a/arch/alpha/include/asm/barrier.h
+++ b/arch/alpha/include/asm/barrier.h
@@ -3,33 +3,18 @@
 
 #include <asm/compiler.h>
 
-#define mb() \
-__asm__ __volatile__("mb": : :"memory")
+#define mb()	__asm__ __volatile__("mb": : :"memory")
+#define rmb()	__asm__ __volatile__("mb": : :"memory")
+#define wmb()	__asm__ __volatile__("wmb": : :"memory")
 
-#define rmb() \
-__asm__ __volatile__("mb": : :"memory")
-
-#define wmb() \
-__asm__ __volatile__("wmb": : :"memory")
-
-#define read_barrier_depends() \
-__asm__ __volatile__("mb": : :"memory")
+#define read_barrier_depends() __asm__ __volatile__("mb": : :"memory")
 
 #ifdef CONFIG_SMP
 #define __ASM_SMP_MB	"\tmb\n"
-#define smp_mb()	mb()
-#define smp_rmb()	rmb()
-#define smp_wmb()	wmb()
-#define smp_read_barrier_depends()	read_barrier_depends()
 #else
 #define __ASM_SMP_MB
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
-#define smp_read_barrier_depends()	do { } while (0)
 #endif
 
-#define set_mb(var, value) \
-do { var = value; mb(); } while (0)
+#include <asm-generic/barrier.h>
 
 #endif		/* __BARRIER_H */
--- a/arch/arc/include/asm/Kbuild
+++ b/arch/arc/include/asm/Kbuild
@@ -47,3 +47,4 @@ generic-y += user.h
 generic-y += vga.h
 generic-y += xor.h
 generic-y += preempt.h
+generic-y += barrier.h
--- a/arch/arc/include/asm/atomic.h
+++ b/arch/arc/include/asm/atomic.h
@@ -190,6 +190,11 @@ static inline void atomic_clear_mask(uns
 
 #endif /* !CONFIG_ARC_HAS_LLSC */
 
+#define smp_mb__before_atomic_dec()	barrier()
+#define smp_mb__after_atomic_dec()	barrier()
+#define smp_mb__before_atomic_inc()	barrier()
+#define smp_mb__after_atomic_inc()	barrier()
+
 /**
  * __atomic_add_unless - add unless the number is a given value
  * @v: pointer of type atomic_t
--- a/arch/arc/include/asm/barrier.h
+++ /dev/null
@@ -1,42 +0,0 @@
-/*
- * Copyright (C) 2004, 2007-2010, 2011-2012 Synopsys, Inc. (www.synopsys.com)
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- */
-
-#ifndef __ASM_BARRIER_H
-#define __ASM_BARRIER_H
-
-#ifndef __ASSEMBLY__
-
-/* TODO-vineetg: Need to see what this does, don't we need sync anywhere */
-#define mb() __asm__ __volatile__ ("" : : : "memory")
-#define rmb() mb()
-#define wmb() mb()
-#define set_mb(var, value)  do { var = value; mb(); } while (0)
-#define set_wmb(var, value) do { var = value; wmb(); } while (0)
-#define read_barrier_depends()  mb()
-
-/* TODO-vineetg verify the correctness of macros here */
-#ifdef CONFIG_SMP
-#define smp_mb()        mb()
-#define smp_rmb()       rmb()
-#define smp_wmb()       wmb()
-#else
-#define smp_mb()        barrier()
-#define smp_rmb()       barrier()
-#define smp_wmb()       barrier()
-#endif
-
-#define smp_mb__before_atomic_dec()	barrier()
-#define smp_mb__after_atomic_dec()	barrier()
-#define smp_mb__before_atomic_inc()	barrier()
-#define smp_mb__after_atomic_inc()	barrier()
-
-#define smp_read_barrier_depends()      do { } while (0)
-
-#endif
-
-#endif
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -59,6 +59,21 @@
 #define smp_wmb()	dmb(ishst)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #define read_barrier_depends()		do { } while(0)
 #define smp_read_barrier_depends()	do { } while(0)
 
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -35,11 +35,59 @@
 #define smp_mb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
+
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #else
+
 #define smp_mb()	asm volatile("dmb ish" : : : "memory")
 #define smp_rmb()	asm volatile("dmb ishld" : : : "memory")
 #define smp_wmb()	asm volatile("dmb ishst" : : : "memory")
-#endif
+
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	switch (sizeof(*p)) {						\
+	case 4:								\
+		asm volatile ("stlr %w1, [%0]"				\
+				: "=Q" (*p) : "r" (v) : "memory");	\
+		break;							\
+	case 8:								\
+		asm volatile ("stlr %1, [%0]"				\
+				: "=Q" (*p) : "r" (v) : "memory");	\
+		break;							\
+	}								\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1;						\
+	compiletime_assert_atomic_type(*p);				\
+	switch (sizeof(*p)) {						\
+	case 4:								\
+		asm volatile ("ldar %w0, [%1]"				\
+			: "=r" (___p1) : "Q" (*p) : "memory");		\
+		break;							\
+	case 8:								\
+		asm volatile ("ldar %0, [%1]"				\
+			: "=r" (___p1) : "Q" (*p) : "memory");		\
+		break;							\
+	}								\
+	___p1;								\
+})
 
 #define read_barrier_depends()		do { } while(0)
 #define smp_read_barrier_depends()	do { } while(0)
--- a/arch/avr32/include/asm/barrier.h
+++ b/arch/avr32/include/asm/barrier.h
@@ -8,22 +8,15 @@
 #ifndef __ASM_AVR32_BARRIER_H
 #define __ASM_AVR32_BARRIER_H
 
-#define nop()			asm volatile("nop")
-
-#define mb()			asm volatile("" : : : "memory")
-#define rmb()			mb()
-#define wmb()			asm volatile("sync 0" : : : "memory")
-#define read_barrier_depends()  do { } while(0)
-#define set_mb(var, value)      do { var = value; mb(); } while(0)
+/*
+ * Weirdest thing ever.. no full barrier, but it has a write barrier!
+ */
+#define wmb()	asm volatile("sync 0" : : : "memory")
 
 #ifdef CONFIG_SMP
 # error "The AVR32 port does not support SMP"
-#else
-# define smp_mb()		barrier()
-# define smp_rmb()		barrier()
-# define smp_wmb()		barrier()
-# define smp_read_barrier_depends() do { } while(0)
 #endif
 
+#include <asm-generic/barrier.h>
 
 #endif /* __ASM_AVR32_BARRIER_H */
--- a/arch/blackfin/include/asm/barrier.h
+++ b/arch/blackfin/include/asm/barrier.h
@@ -23,26 +23,10 @@
 # define rmb()	do { barrier(); smp_check_barrier(); } while (0)
 # define wmb()	do { barrier(); smp_mark_barrier(); } while (0)
 # define read_barrier_depends()	do { barrier(); smp_check_barrier(); } while (0)
-#else
-# define mb()	barrier()
-# define rmb()	barrier()
-# define wmb()	barrier()
-# define read_barrier_depends()	do { } while (0)
 #endif
 
-#else /* !CONFIG_SMP */
-
-#define mb()	barrier()
-#define rmb()	barrier()
-#define wmb()	barrier()
-#define read_barrier_depends()	do { } while (0)
-
 #endif /* !CONFIG_SMP */
 
-#define smp_mb()  mb()
-#define smp_rmb() rmb()
-#define smp_wmb() wmb()
-#define set_mb(var, value) do { var = value; mb(); } while (0)
-#define smp_read_barrier_depends()	read_barrier_depends()
+#include <asm-generic/barrier.h>
 
 #endif /* _BLACKFIN_BARRIER_H */
--- a/arch/cris/include/asm/Kbuild
+++ b/arch/cris/include/asm/Kbuild
@@ -12,3 +12,4 @@ generic-y += trace_clock.h
 generic-y += vga.h
 generic-y += xor.h
 generic-y += preempt.h
+generic-y += barrier.h
--- a/arch/cris/include/asm/barrier.h
+++ /dev/null
@@ -1,25 +0,0 @@
-#ifndef __ASM_CRIS_BARRIER_H
-#define __ASM_CRIS_BARRIER_H
-
-#define nop() __asm__ __volatile__ ("nop");
-
-#define barrier() __asm__ __volatile__("": : :"memory")
-#define mb() barrier()
-#define rmb() mb()
-#define wmb() mb()
-#define read_barrier_depends() do { } while(0)
-#define set_mb(var, value)  do { var = value; mb(); } while (0)
-
-#ifdef CONFIG_SMP
-#define smp_mb()        mb()
-#define smp_rmb()       rmb()
-#define smp_wmb()       wmb()
-#define smp_read_barrier_depends()     read_barrier_depends()
-#else
-#define smp_mb()        barrier()
-#define smp_rmb()       barrier()
-#define smp_wmb()       barrier()
-#define smp_read_barrier_depends()     do { } while(0)
-#endif
-
-#endif /* __ASM_CRIS_BARRIER_H */
--- a/arch/frv/include/asm/barrier.h
+++ b/arch/frv/include/asm/barrier.h
@@ -17,13 +17,7 @@
 #define mb()			asm volatile ("membar" : : :"memory")
 #define rmb()			asm volatile ("membar" : : :"memory")
 #define wmb()			asm volatile ("membar" : : :"memory")
-#define read_barrier_depends()	do { } while (0)
 
-#define smp_mb()			barrier()
-#define smp_rmb()			barrier()
-#define smp_wmb()			barrier()
-#define smp_read_barrier_depends()	do {} while(0)
-#define set_mb(var, value) \
-	do { var = (value); barrier(); } while (0)
+#include <asm-generic/barrier.h>
 
 #endif /* _ASM_BARRIER_H */
--- a/arch/h8300/include/asm/barrier.h
+++ b/arch/h8300/include/asm/barrier.h
@@ -3,27 +3,8 @@
 
 #define nop()  asm volatile ("nop"::)
 
-/*
- * Force strict CPU ordering.
- * Not really required on H8...
- */
-#define mb()   asm volatile (""   : : :"memory")
-#define rmb()  asm volatile (""   : : :"memory")
-#define wmb()  asm volatile (""   : : :"memory")
 #define set_mb(var, value) do { xchg(&var, value); } while (0)
 
-#define read_barrier_depends()	do { } while (0)
-
-#ifdef CONFIG_SMP
-#define smp_mb()	mb()
-#define smp_rmb()	rmb()
-#define smp_wmb()	wmb()
-#define smp_read_barrier_depends()	read_barrier_depends()
-#else
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
-#define smp_read_barrier_depends()	do { } while(0)
-#endif
+#include <asm-generic/barrier.h>
 
 #endif /* _H8300_BARRIER_H */
--- a/arch/hexagon/include/asm/Kbuild
+++ b/arch/hexagon/include/asm/Kbuild
@@ -54,3 +54,4 @@ generic-y += ucontext.h
 generic-y += unaligned.h
 generic-y += xor.h
 generic-y += preempt.h
+generic-y += barrier.h
--- a/arch/hexagon/include/asm/barrier.h
+++ /dev/null
@@ -1,41 +0,0 @@
-/*
- * Memory barrier definitions for the Hexagon architecture
- *
- * Copyright (c) 2010-2011, The Linux Foundation. All rights reserved.
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 and
- * only version 2 as published by the Free Software Foundation.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU General Public License for more details.
- *
- * You should have received a copy of the GNU General Public License
- * along with this program; if not, write to the Free Software
- * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
- * 02110-1301, USA.
- */
-
-#ifndef _ASM_BARRIER_H
-#define _ASM_BARRIER_H
-
-#define rmb()				barrier()
-#define read_barrier_depends()		barrier()
-#define wmb()				barrier()
-#define mb()				barrier()
-#define smp_rmb()			barrier()
-#define smp_read_barrier_depends()	barrier()
-#define smp_wmb()			barrier()
-#define smp_mb()			barrier()
-#define smp_mb__before_atomic_dec()	barrier()
-#define smp_mb__after_atomic_dec()	barrier()
-#define smp_mb__before_atomic_inc()	barrier()
-#define smp_mb__after_atomic_inc()	barrier()
-
-/*  Set a value and use a memory barrier.  Used by the scheduler somewhere.  */
-#define set_mb(var, value) \
-	do { var = value; mb(); } while (0)
-
-#endif /* _ASM_BARRIER_H */
--- a/arch/ia64/include/asm/barrier.h
+++ b/arch/ia64/include/asm/barrier.h
@@ -45,11 +45,60 @@
 # define smp_rmb()	rmb()
 # define smp_wmb()	wmb()
 # define smp_read_barrier_depends()	read_barrier_depends()
+
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	switch (sizeof(*p)) {						\
+	case 4:								\
+		asm volatile ("st4.rel [%0]=%1"				\
+				: "=r" (p) : "r" (v) : "memory");	\
+		break;							\
+	case 8:								\
+		asm volatile ("st8.rel [%0]=%1"				\
+				: "=r" (p) : "r" (v) : "memory");	\
+		break;							\
+	}								\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1;						\
+	compiletime_assert_atomic_type(*p);				\
+	switch (sizeof(*p)) {						\
+	case 4:								\
+		asm volatile ("ld4.acq %0=[%1]"				\
+				: "=r" (___p1) : "r" (p) : "memory");	\
+		break;							\
+	case 8:								\
+		asm volatile ("ld8.acq %0=[%1]"				\
+				: "=r" (___p1) : "r" (p) : "memory");	\
+		break;							\
+	}								\
+	___p1;								\
+})
+
 #else
+
 # define smp_mb()	barrier()
 # define smp_rmb()	barrier()
 # define smp_wmb()	barrier()
 # define smp_read_barrier_depends()	do { } while(0)
+
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
 #endif
 
 /*
--- a/arch/m32r/include/asm/barrier.h
+++ b/arch/m32r/include/asm/barrier.h
@@ -11,84 +11,6 @@
 
 #define nop()  __asm__ __volatile__ ("nop" : : )
 
-/*
- * Memory barrier.
- *
- * mb() prevents loads and stores being reordered across this point.
- * rmb() prevents loads being reordered across this point.
- * wmb() prevents stores being reordered across this point.
- */
-#define mb()   barrier()
-#define rmb()  mb()
-#define wmb()  mb()
-
-/**
- * read_barrier_depends - Flush all pending reads that subsequents reads
- * depend on.
- *
- * No data-dependent reads from memory-like regions are ever reordered
- * over this barrier.  All reads preceding this primitive are guaranteed
- * to access memory (but not necessarily other CPUs' caches) before any
- * reads following this primitive that depend on the data return by
- * any of the preceding reads.  This primitive is much lighter weight than
- * rmb() on most CPUs, and is never heavier weight than is
- * rmb().
- *
- * These ordering constraints are respected by both the local CPU
- * and the compiler.
- *
- * Ordering is not guaranteed by anything other than these primitives,
- * not even by data dependencies.  See the documentation for
- * memory_barrier() for examples and URLs to more information.
- *
- * For example, the following code would force ordering (the initial
- * value of "a" is zero, "b" is one, and "p" is "&a"):
- *
- * <programlisting>
- *      CPU 0                           CPU 1
- *
- *      b = 2;
- *      memory_barrier();
- *      p = &b;                         q = p;
- *                                      read_barrier_depends();
- *                                      d = *q;
- * </programlisting>
- *
- *
- * because the read of "*q" depends on the read of "p" and these
- * two reads are separated by a read_barrier_depends().  However,
- * the following code, with the same initial values for "a" and "b":
- *
- * <programlisting>
- *      CPU 0                           CPU 1
- *
- *      a = 2;
- *      memory_barrier();
- *      b = 3;                          y = b;
- *                                      read_barrier_depends();
- *                                      x = a;
- * </programlisting>
- *
- * does not enforce ordering, since there is no data dependency between
- * the read of "a" and the read of "b".  Therefore, on some CPUs, such
- * as Alpha, "y" could be set to 3 and "x" to 0.  Use rmb()
- * in cases like this where there are no data dependencies.
- **/
-
-#define read_barrier_depends()	do { } while (0)
-
-#ifdef CONFIG_SMP
-#define smp_mb()	mb()
-#define smp_rmb()	rmb()
-#define smp_wmb()	wmb()
-#define smp_read_barrier_depends()	read_barrier_depends()
-#define set_mb(var, value) do { (void) xchg(&var, value); } while (0)
-#else
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
-#define smp_read_barrier_depends()	do { } while (0)
-#define set_mb(var, value) do { var = value; barrier(); } while (0)
-#endif
+#include <asm-generic/barrier.h>
 
 #endif /* _ASM_M32R_BARRIER_H */
--- a/arch/m68k/include/asm/barrier.h
+++ b/arch/m68k/include/asm/barrier.h
@@ -1,20 +1,8 @@
 #ifndef _M68K_BARRIER_H
 #define _M68K_BARRIER_H
 
-/*
- * Force strict CPU ordering.
- * Not really required on m68k...
- */
 #define nop()		do { asm volatile ("nop"); barrier(); } while (0)
-#define mb()		barrier()
-#define rmb()		barrier()
-#define wmb()		barrier()
-#define read_barrier_depends()	((void)0)
-#define set_mb(var, value)	({ (var) = (value); wmb(); })
 
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
-#define smp_read_barrier_depends()	((void)0)
+#include <asm-generic/barrier.h>
 
 #endif /* _M68K_BARRIER_H */
--- a/arch/metag/include/asm/barrier.h
+++ b/arch/metag/include/asm/barrier.h
@@ -82,4 +82,19 @@ static inline void fence(void)
 #define smp_read_barrier_depends()     do { } while (0)
 #define set_mb(var, value) do { var = value; smp_mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* _ASM_METAG_BARRIER_H */
--- a/arch/microblaze/include/asm/Kbuild
+++ b/arch/microblaze/include/asm/Kbuild
@@ -4,3 +4,4 @@ generic-y += exec.h
 generic-y += trace_clock.h
 generic-y += syscalls.h
 generic-y += preempt.h
+generic-y += barrier.h
--- a/arch/microblaze/include/asm/barrier.h
+++ /dev/null
@@ -1,27 +0,0 @@
-/*
- * Copyright (C) 2006 Atmark Techno, Inc.
- *
- * This file is subject to the terms and conditions of the GNU General Public
- * License. See the file "COPYING" in the main directory of this archive
- * for more details.
- */
-
-#ifndef _ASM_MICROBLAZE_BARRIER_H
-#define _ASM_MICROBLAZE_BARRIER_H
-
-#define nop()                  asm volatile ("nop")
-
-#define smp_read_barrier_depends()	do {} while (0)
-#define read_barrier_depends()		do {} while (0)
-
-#define mb()			barrier()
-#define rmb()			mb()
-#define wmb()			mb()
-#define set_mb(var, value)	do { var = value; mb(); } while (0)
-#define set_wmb(var, value)	do { var = value; wmb(); } while (0)
-
-#define smp_mb()		mb()
-#define smp_rmb()		rmb()
-#define smp_wmb()		wmb()
-
-#endif /* _ASM_MICROBLAZE_BARRIER_H */
--- a/arch/mips/include/asm/barrier.h
+++ b/arch/mips/include/asm/barrier.h
@@ -180,4 +180,19 @@
 #define nudge_writes() mb()
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
+
 #endif /* __ASM_BARRIER_H */
--- a/arch/mn10300/include/asm/Kbuild
+++ b/arch/mn10300/include/asm/Kbuild
@@ -3,3 +3,4 @@ generic-y += clkdev.h
 generic-y += exec.h
 generic-y += trace_clock.h
 generic-y += preempt.h
+generic-y += barrier.h
--- a/arch/mn10300/include/asm/barrier.h
+++ /dev/null
@@ -1,37 +0,0 @@
-/* MN10300 memory barrier definitions
- *
- * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
- * Written by David Howells (dhowells@redhat.com)
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public Licence
- * as published by the Free Software Foundation; either version
- * 2 of the Licence, or (at your option) any later version.
- */
-#ifndef _ASM_BARRIER_H
-#define _ASM_BARRIER_H
-
-#define nop()	asm volatile ("nop")
-
-#define mb()	asm volatile ("": : :"memory")
-#define rmb()	mb()
-#define wmb()	asm volatile ("": : :"memory")
-
-#ifdef CONFIG_SMP
-#define smp_mb()	mb()
-#define smp_rmb()	rmb()
-#define smp_wmb()	wmb()
-#define set_mb(var, value)  do { xchg(&var, value); } while (0)
-#else  /* CONFIG_SMP */
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
-#define set_mb(var, value)  do { var = value;  mb(); } while (0)
-#endif /* CONFIG_SMP */
-
-#define set_wmb(var, value) do { var = value; wmb(); } while (0)
-
-#define read_barrier_depends()		do {} while (0)
-#define smp_read_barrier_depends()	do {} while (0)
-
-#endif /* _ASM_BARRIER_H */
--- a/arch/parisc/include/asm/Kbuild
+++ b/arch/parisc/include/asm/Kbuild
@@ -5,3 +5,4 @@ generic-y += word-at-a-time.h auxvec.h u
 	  poll.h xor.h clkdev.h exec.h
 generic-y += trace_clock.h
 generic-y += preempt.h
+generic-y += barrier.h
--- a/arch/parisc/include/asm/barrier.h
+++ /dev/null
@@ -1,35 +0,0 @@
-#ifndef __PARISC_BARRIER_H
-#define __PARISC_BARRIER_H
-
-/*
-** This is simply the barrier() macro from linux/kernel.h but when serial.c
-** uses tqueue.h uses smp_mb() defined using barrier(), linux/kernel.h
-** hasn't yet been included yet so it fails, thus repeating the macro here.
-**
-** PA-RISC architecture allows for weakly ordered memory accesses although
-** none of the processors use it. There is a strong ordered bit that is
-** set in the O-bit of the page directory entry. Operating systems that
-** can not tolerate out of order accesses should set this bit when mapping
-** pages. The O-bit of the PSW should also be set to 1 (I don't believe any
-** of the processor implemented the PSW O-bit). The PCX-W ERS states that
-** the TLB O-bit is not implemented so the page directory does not need to
-** have the O-bit set when mapping pages (section 3.1). This section also
-** states that the PSW Y, Z, G, and O bits are not implemented.
-** So it looks like nothing needs to be done for parisc-linux (yet).
-** (thanks to chada for the above comment -ggg)
-**
-** The __asm__ op below simple prevents gcc/ld from reordering
-** instructions across the mb() "call".
-*/
-#define mb()		__asm__ __volatile__("":::"memory")	/* barrier() */
-#define rmb()		mb()
-#define wmb()		mb()
-#define smp_mb()	mb()
-#define smp_rmb()	mb()
-#define smp_wmb()	mb()
-#define smp_read_barrier_depends()	do { } while(0)
-#define read_barrier_depends()		do { } while(0)
-
-#define set_mb(var, value)		do { var = value; mb(); } while (0)
-
-#endif /* __PARISC_BARRIER_H */
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -45,11 +45,15 @@
 #    define SMPWMB      eieio
 #endif
 
+#define __lwsync()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
+
 #define smp_mb()	mb()
-#define smp_rmb()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
+#define smp_rmb()	__lwsync()
 #define smp_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
 #define smp_read_barrier_depends()	read_barrier_depends()
 #else
+#define __lwsync()	barrier()
+
 #define smp_mb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
@@ -65,4 +69,19 @@
 #define data_barrier(x)	\
 	asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	__lwsync();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	__lwsync();							\
+	___p1;								\
+})
+
 #endif /* _ASM_POWERPC_BARRIER_H */
--- a/arch/s390/include/asm/barrier.h
+++ b/arch/s390/include/asm/barrier.h
@@ -32,4 +32,19 @@
 
 #define set_mb(var, value)		do { var = value; mb(); } while (0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	___p1;								\
+})
+
 #endif /* __ASM_BARRIER_H */
--- a/arch/score/include/asm/Kbuild
+++ b/arch/score/include/asm/Kbuild
@@ -5,3 +5,4 @@ generic-y += clkdev.h
 generic-y += trace_clock.h
 generic-y += xor.h
 generic-y += preempt.h
+generic-y += barrier.h
--- a/arch/score/include/asm/barrier.h
+++ /dev/null
@@ -1,16 +0,0 @@
-#ifndef _ASM_SCORE_BARRIER_H
-#define _ASM_SCORE_BARRIER_H
-
-#define mb()		barrier()
-#define rmb()		barrier()
-#define wmb()		barrier()
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
-
-#define read_barrier_depends()		do {} while (0)
-#define smp_read_barrier_depends()	do {} while (0)
-
-#define set_mb(var, value) 		do {var = value; wmb(); } while (0)
-
-#endif /* _ASM_SCORE_BARRIER_H */
--- a/arch/sh/include/asm/barrier.h
+++ b/arch/sh/include/asm/barrier.h
@@ -26,29 +26,14 @@
 #if defined(CONFIG_CPU_SH4A) || defined(CONFIG_CPU_SH5)
 #define mb()		__asm__ __volatile__ ("synco": : :"memory")
 #define rmb()		mb()
-#define wmb()		__asm__ __volatile__ ("synco": : :"memory")
+#define wmb()		mb()
 #define ctrl_barrier()	__icbi(PAGE_OFFSET)
-#define read_barrier_depends()	do { } while(0)
 #else
-#define mb()		__asm__ __volatile__ ("": : :"memory")
-#define rmb()		mb()
-#define wmb()		__asm__ __volatile__ ("": : :"memory")
 #define ctrl_barrier()	__asm__ __volatile__ ("nop;nop;nop;nop;nop;nop;nop;nop")
-#define read_barrier_depends()	do { } while(0)
-#endif
-
-#ifdef CONFIG_SMP
-#define smp_mb()	mb()
-#define smp_rmb()	rmb()
-#define smp_wmb()	wmb()
-#define smp_read_barrier_depends()	read_barrier_depends()
-#else
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
-#define smp_read_barrier_depends()	do { } while(0)
 #endif
 
 #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
 
+#include <asm-generic/barrier.h>
+
 #endif /* __ASM_SH_BARRIER_H */
--- a/arch/sparc/include/asm/barrier_32.h
+++ b/arch/sparc/include/asm/barrier_32.h
@@ -1,15 +1,6 @@
 #ifndef __SPARC_BARRIER_H
 #define __SPARC_BARRIER_H
 
-/* XXX Change this if we ever use a PSO mode kernel. */
-#define mb()	__asm__ __volatile__ ("" : : : "memory")
-#define rmb()	mb()
-#define wmb()	mb()
-#define read_barrier_depends()	do { } while(0)
-#define set_mb(__var, __value)  do { __var = __value; mb(); } while(0)
-#define smp_mb()	__asm__ __volatile__("":::"memory")
-#define smp_rmb()	__asm__ __volatile__("":::"memory")
-#define smp_wmb()	__asm__ __volatile__("":::"memory")
-#define smp_read_barrier_depends()	do { } while(0)
+#include <asm-generic/barrier.h>
 
 #endif /* !(__SPARC_BARRIER_H) */
--- a/arch/sparc/include/asm/barrier_64.h
+++ b/arch/sparc/include/asm/barrier_64.h
@@ -53,4 +53,19 @@ do {	__asm__ __volatile__("ba,pt	%%xcc,
 
 #define smp_read_barrier_depends()	do { } while(0)
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	___p1;								\
+})
+
 #endif /* !(__SPARC64_BARRIER_H) */
--- a/arch/tile/include/asm/barrier.h
+++ b/arch/tile/include/asm/barrier.h
@@ -22,59 +22,6 @@
 #include <arch/spr_def.h>
 #include <asm/timex.h>
 
-/*
- * read_barrier_depends - Flush all pending reads that subsequents reads
- * depend on.
- *
- * No data-dependent reads from memory-like regions are ever reordered
- * over this barrier.  All reads preceding this primitive are guaranteed
- * to access memory (but not necessarily other CPUs' caches) before any
- * reads following this primitive that depend on the data return by
- * any of the preceding reads.  This primitive is much lighter weight than
- * rmb() on most CPUs, and is never heavier weight than is
- * rmb().
- *
- * These ordering constraints are respected by both the local CPU
- * and the compiler.
- *
- * Ordering is not guaranteed by anything other than these primitives,
- * not even by data dependencies.  See the documentation for
- * memory_barrier() for examples and URLs to more information.
- *
- * For example, the following code would force ordering (the initial
- * value of "a" is zero, "b" is one, and "p" is "&a"):
- *
- * <programlisting>
- *	CPU 0				CPU 1
- *
- *	b = 2;
- *	memory_barrier();
- *	p = &b;				q = p;
- *					read_barrier_depends();
- *					d = *q;
- * </programlisting>
- *
- * because the read of "*q" depends on the read of "p" and these
- * two reads are separated by a read_barrier_depends().  However,
- * the following code, with the same initial values for "a" and "b":
- *
- * <programlisting>
- *	CPU 0				CPU 1
- *
- *	a = 2;
- *	memory_barrier();
- *	b = 3;				y = b;
- *					read_barrier_depends();
- *					x = a;
- * </programlisting>
- *
- * does not enforce ordering, since there is no data dependency between
- * the read of "a" and the read of "b".  Therefore, on some CPUs, such
- * as Alpha, "y" could be set to 3 and "x" to 0.  Use rmb()
- * in cases like this where there are no data dependencies.
- */
-#define read_barrier_depends()	do { } while (0)
-
 #define __sync()	__insn_mf()
 
 #include <hv/syscall_public.h>
@@ -125,20 +72,7 @@ mb_incoherent(void)
 #define mb()		fast_mb()
 #define iob()		fast_iob()
 
-#ifdef CONFIG_SMP
-#define smp_mb()	mb()
-#define smp_rmb()	rmb()
-#define smp_wmb()	wmb()
-#define smp_read_barrier_depends()	read_barrier_depends()
-#else
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
-#define smp_read_barrier_depends()	do { } while (0)
-#endif
-
-#define set_mb(var, value) \
-	do { var = value; mb(); } while (0)
+#include <asm-generic/barrier.h>
 
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_TILE_BARRIER_H */
--- a/arch/unicore32/include/asm/barrier.h
+++ b/arch/unicore32/include/asm/barrier.h
@@ -14,15 +14,6 @@
 #define dsb() __asm__ __volatile__ ("" : : : "memory")
 #define dmb() __asm__ __volatile__ ("" : : : "memory")
 
-#define mb()				barrier()
-#define rmb()				barrier()
-#define wmb()				barrier()
-#define smp_mb()			barrier()
-#define smp_rmb()			barrier()
-#define smp_wmb()			barrier()
-#define read_barrier_depends()		do { } while (0)
-#define smp_read_barrier_depends()	do { } while (0)
-
-#define set_mb(var, value)		do { var = value; smp_mb(); } while (0)
+#include <asm-generic/barrier.h>
 
 #endif /* __UNICORE_BARRIER_H__ */
--- a/arch/x86/include/asm/barrier.h
+++ b/arch/x86/include/asm/barrier.h
@@ -100,6 +100,21 @@
 #define set_mb(var, value) do { var = value; barrier(); } while (0)
 #endif
 
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	barrier();							\
+	___p1;								\
+})
+
 /*
  * Stop RDTSC speculation. This is needed when you need to use RDTSC
  * (or get_cycles or vread that possibly accesses the TSC) in a defined
--- a/arch/xtensa/include/asm/barrier.h
+++ b/arch/xtensa/include/asm/barrier.h
@@ -9,21 +9,14 @@
 #ifndef _XTENSA_SYSTEM_H
 #define _XTENSA_SYSTEM_H
 
-#define smp_read_barrier_depends() do { } while(0)
-#define read_barrier_depends() do { } while(0)
-
 #define mb()  ({ __asm__ __volatile__("memw" : : : "memory"); })
 #define rmb() barrier()
 #define wmb() mb()
 
 #ifdef CONFIG_SMP
 #error smp_* not defined
-#else
-#define smp_mb()	barrier()
-#define smp_rmb()	barrier()
-#define smp_wmb()	barrier()
 #endif
 
-#define set_mb(var, value)	do { var = value; mb(); } while (0)
+#include <asm-generic/barrier.h>
 
 #endif /* _XTENSA_SYSTEM_H */
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -1,4 +1,5 @@
-/* Generic barrier definitions, based on MN10300 definitions.
+/*
+ * Generic barrier definitions, based on MN10300 definitions.
  *
  * It should be possible to use these on really simple architectures,
  * but it serves more as a starting point for new ports.
@@ -16,35 +17,67 @@
 
 #ifndef __ASSEMBLY__
 
-#define nop() asm volatile ("nop")
+#include <asm/compiler.h>
+
+#ifndef nop
+#define nop()	asm volatile ("nop")
+#endif
 
 /*
- * Force strict CPU ordering.
- * And yes, this is required on UP too when we're talking
- * to devices.
+ * Force strict CPU ordering. And yes, this is required on UP too when we're
+ * talking to devices.
  *
- * This implementation only contains a compiler barrier.
+ * Fall back to compiler barriers if nothing better is provided.
  */
 
-#define mb()	asm volatile ("": : :"memory")
-#define rmb()	mb()
-#define wmb()	asm volatile ("": : :"memory")
+#ifndef mb
+#define mb()	barrier()
+#endif
+
+#ifndef rmb
+#define rmb()	barrier()
+#endif
+
+#ifndef wmb
+#define wmb()	barrier()
+#endif
+
+#ifndef read_barrier_depends
+#define read_barrier_depends()		do {} while (0)
+#endif
 
 #ifdef CONFIG_SMP
 #define smp_mb()	mb()
 #define smp_rmb()	rmb()
 #define smp_wmb()	wmb()
+#define smp_read_barrier_depends()	read_barrier_depends()
 #else
 #define smp_mb()	barrier()
 #define smp_rmb()	barrier()
 #define smp_wmb()	barrier()
+#define smp_read_barrier_depends()	do {} while (0)
 #endif
 
+#ifndef set_mb
 #define set_mb(var, value)  do { var = value;  mb(); } while (0)
+#endif
+
 #define set_wmb(var, value) do { var = value; wmb(); } while (0)
 
-#define read_barrier_depends()		do {} while (0)
-#define smp_read_barrier_depends()	do {} while (0)
+#define smp_store_release(p, v)						\
+do {									\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	ACCESS_ONCE(*p) = (v);						\
+} while (0)
+
+#define smp_load_acquire(p)						\
+({									\
+	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
+	compiletime_assert_atomic_type(*p);				\
+	smp_mb();							\
+	___p1;								\
+})
 
 #endif /* !__ASSEMBLY__ */
 #endif /* __ASM_GENERIC_BARRIER_H */
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -298,6 +298,11 @@ void ftrace_likely_update(struct ftrace_
 # define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
 #endif
 
+/* Is this type a native word size -- useful for atomic operations */
+#ifndef __native_word
+# define __native_word(t) (sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
+#endif
+
 /* Compile time object size, -1 for unknown */
 #ifndef __compiletime_object_size
 # define __compiletime_object_size(obj) -1
@@ -337,6 +342,10 @@ void ftrace_likely_update(struct ftrace_
 #define compiletime_assert(condition, msg) \
 	_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
 
+#define compiletime_assert_atomic_type(t)				\
+	compiletime_assert(__native_word(t),				\
+		"Need native word sized stores/loads for atomicity.")
+
 /*
  * Prevent the compiler from merging or refetching accesses.  The compiler
  * is also forbidden from reordering successive instances of ACCESS_ONCE(),

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [tip:perf/core] tools/perf: Add required memory barriers
  2013-11-06 13:50                           ` Vince Weaver
@ 2013-11-06 14:00                             ` Peter Zijlstra
  2013-11-06 14:28                               ` Peter Zijlstra
  2013-11-06 14:44                               ` Peter Zijlstra
  0 siblings, 2 replies; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-06 14:00 UTC (permalink / raw)
  To: Vince Weaver
  Cc: mingo, hpa, anton, mathieu.desnoyers, linux-kernel, michael,
	paulmck, benh, fweisbec, VICTORK, tglx, oleg, mikey,
	linux-tip-commits

On Wed, Nov 06, 2013 at 08:50:47AM -0500, Vince Weaver wrote:
> On Wed, 6 Nov 2013, tip-bot for Peter Zijlstra wrote:
> 
> > +++ b/tools/perf/util/evlist.h
> > @@ -177,7 +177,7 @@ int perf_evlist__strerror_open(struct perf_evlist *evlist, int err, char *buf, s
> >  static inline unsigned int perf_mmap__read_head(struct perf_mmap *mm)
> >  {
> >  	struct perf_event_mmap_page *pc = mm->base;
> > -	int head = pc->data_head;
> > +	int head = ACCESS_ONCE(pc->data_head);
> >  	rmb();
> >  	return head;
> 
> so is this ACCESS_ONCE required now for proper access to the mmap buffer?

Pretty much; otherwise your C compiler is allowed to mess it up.

> remember that there are users trying to use this outside of the kernel 
> where we don't necessarily have access to internal kernel macros.  Some of
> these users aren't necessarily GPLv2 compatible either (PAPI for example 
> is more or less BSD licensed) so just cutting and pasting chunks of 
> internal kernel macros isn't always the best route either.

Other license stuff is not my problem; that said I doubt there's much
copyright to claim on a volatile cast.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [tip:perf/core] tools/perf: Add required memory barriers
  2013-11-06 14:00                             ` Peter Zijlstra
@ 2013-11-06 14:28                               ` Peter Zijlstra
  2013-11-06 14:55                                 ` Vince Weaver
  2013-11-06 14:44                               ` Peter Zijlstra
  1 sibling, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-06 14:28 UTC (permalink / raw)
  To: Vince Weaver
  Cc: mingo, hpa, anton, mathieu.desnoyers, linux-kernel, michael,
	paulmck, benh, fweisbec, VICTORK, tglx, oleg, mikey,
	linux-tip-commits

On Wed, Nov 06, 2013 at 03:00:11PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 06, 2013 at 08:50:47AM -0500, Vince Weaver wrote:
> > On Wed, 6 Nov 2013, tip-bot for Peter Zijlstra wrote:
> > 
> > > +++ b/tools/perf/util/evlist.h
> > > @@ -177,7 +177,7 @@ int perf_evlist__strerror_open(struct perf_evlist *evlist, int err, char *buf, s
> > >  static inline unsigned int perf_mmap__read_head(struct perf_mmap *mm)
> > >  {
> > >  	struct perf_event_mmap_page *pc = mm->base;
> > > -	int head = pc->data_head;
> > > +	int head = ACCESS_ONCE(pc->data_head);
> > >  	rmb();
> > >  	return head;
> > 
> > so is this ACCESS_ONCE required now for proper access to the mmap buffer?
> 
> Pretty much; otherwise your C compiler is allowed to mess it up.
> 
> > remember that there are users trying to use this outside of the kernel 
> > where we don't necessarily have access to internal kernel macros.  Some of
> > these users aren't necessarily GPLv2 compatible either (PAPI for example 
> > is more or less BSD licensed) so just cutting and pasting chunks of 
> > internal kernel macros isn't always the best route either.
> 
> Other license stuff is not my problem; that said I doubt there's much
> copyright to claim on a volatile cast.

Also, does PAPI actually use the buffer then? I thought that was
strictly self monitoring.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [tip:perf/core] tools/perf: Add required memory barriers
  2013-11-06 14:00                             ` Peter Zijlstra
  2013-11-06 14:28                               ` Peter Zijlstra
@ 2013-11-06 14:44                               ` Peter Zijlstra
  2013-11-06 16:07                                 ` Peter Zijlstra
  1 sibling, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-06 14:44 UTC (permalink / raw)
  To: Vince Weaver
  Cc: mingo, hpa, anton, mathieu.desnoyers, linux-kernel, michael,
	paulmck, benh, fweisbec, VICTORK, tglx, oleg, mikey,
	linux-tip-commits

On Wed, Nov 06, 2013 at 03:00:11PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 06, 2013 at 08:50:47AM -0500, Vince Weaver wrote:
> > On Wed, 6 Nov 2013, tip-bot for Peter Zijlstra wrote:
> > 
> > > +++ b/tools/perf/util/evlist.h
> > > @@ -177,7 +177,7 @@ int perf_evlist__strerror_open(struct perf_evlist *evlist, int err, char *buf, s
> > >  static inline unsigned int perf_mmap__read_head(struct perf_mmap *mm)
> > >  {
> > >  	struct perf_event_mmap_page *pc = mm->base;
> > > -	int head = pc->data_head;
> > > +	int head = ACCESS_ONCE(pc->data_head);
> > >  	rmb();
> > >  	return head;
> > 
> > so is this ACCESS_ONCE required now for proper access to the mmap buffer?
> 
> Pretty much; otherwise your C compiler is allowed to mess it up.

long head = ((__atomic long)pc->data_head).load(memory_order_acquire);

coupled with:

((__atomic long)pc->data_tail).store(tail, memory_order_release);

might be the 'right' and proper C11 incantations to avoid having to
touch kernel macros; but would obviously require a recent compiler.

Barring that, I think we're stuck with:

long head = ACCESS_ONCE(pc->data_head);
smp_rmb();

...

smp_mb();
pc->data_tail = tail;

And using the right asm goo for the barriers. That said, all these asm
barriers should include a compiler barriers (memory clobber) which
_should_ avoid the worst compiler trickery -- although I don't think it
completely obviates the need for ACCESS_ONCE() -- uncertain there.
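
For completeness, the same thing spelled with the GCC/clang __atomic
builtins (which work on the plain __u64 fields and sidestep the _Atomic
type gymnastics) would look something like the sketch below; the helper
names are invented, only the perf_event_mmap_page fields are real:

    #include <stdint.h>
    #include <linux/perf_event.h>

    /* Acquire-load of the producer index: reads of the ring buffer
     * issued after this cannot be hoisted above it. */
    static inline uint64_t mmap_read_head(struct perf_event_mmap_page *pc)
    {
            return __atomic_load_n(&pc->data_head, __ATOMIC_ACQUIRE);
    }

    /* Release-store of the consumer index: all reads of the ring buffer
     * issued before this complete before the store becomes visible. */
    static inline void mmap_write_tail(struct perf_event_mmap_page *pc,
                                       uint64_t tail)
    {
            __atomic_store_n(&pc->data_tail, tail, __ATOMIC_RELEASE);
    }

That needs GCC >= 4.7 or a recent clang; anything older is back to the
ACCESS_ONCE() plus asm barrier combination above.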

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [tip:perf/core] tools/perf: Add required memory barriers
  2013-11-06 14:28                               ` Peter Zijlstra
@ 2013-11-06 14:55                                 ` Vince Weaver
  2013-11-06 15:10                                   ` Peter Zijlstra
  0 siblings, 1 reply; 120+ messages in thread
From: Vince Weaver @ 2013-11-06 14:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, hpa, anton, mathieu.desnoyers, linux-kernel, michael,
	paulmck, benh, fweisbec, VICTORK, tglx, oleg, mikey,
	linux-tip-commits

On Wed, 6 Nov 2013, Peter Zijlstra wrote:

> On Wed, Nov 06, 2013 at 03:00:11PM +0100, Peter Zijlstra wrote:
> > On Wed, Nov 06, 2013 at 08:50:47AM -0500, Vince Weaver wrote:
> > 
> > > remember that there are users trying to use this outside of the kernel 
> > > where we don't necessarily have access to internal kernel macros.  Some of 
> > > these users aren't necessarily GPLv2 compatible either (PAPI for example 
> > > is more or less BSD licensed) so just cutting and pasting chunks of 
> > > internal kernel macros isn't always the best route either.
> > 
> > Other license stuff is not my problem; that said I doubt there's much
> > copyright to claim on a volatile cast.

perhaps, but every time some internal kernel stuff goes into the visible 
API it makes it harder to use.  I don't think most other system calls leak 
kernel interfaces like this.

It's also an issue because people build PAPI with non-gcc compilers like 
Intel and clang so there's no guarantee kernel macro tricks are going to 
work for everyone.

Having perf in the kernel tree really makes it hard for you guys to keep a 
clean API/ABI it seems.

> Also, does PAPI actually use the buffer then? I thought that was
> strictly self monitoring.

PAPI has always supported a simplistic sampling mode, where you 
enable a one-shot overflow signal handler to be triggered after X events, 
and then read out the instruction pointer from the mmap buffer.  Then 
usually you re-enable for the next sample.  (You may dimly remember this 
because this usage style is very different than perf's so it breaks often 
w/o anyone but us noticing).

Not very effective and not taking full advantage of all perf_event has to 
support, but it's a cross-platform legacy interface supported easily by 
most implementations.

We've been planning to do a proper sampled perf_event interface but it's 
been hard trying to hammer out a clean interface that fits well with the 
existing PAPI design.

Vince

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [tip:perf/core] tools/perf: Add required memory barriers
  2013-11-06 14:55                                 ` Vince Weaver
@ 2013-11-06 15:10                                   ` Peter Zijlstra
  2013-11-06 15:23                                     ` Peter Zijlstra
  0 siblings, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-06 15:10 UTC (permalink / raw)
  To: Vince Weaver
  Cc: mingo, hpa, anton, mathieu.desnoyers, linux-kernel, michael,
	paulmck, benh, fweisbec, VICTORK, tglx, oleg, mikey,
	linux-tip-commits

On Wed, Nov 06, 2013 at 09:55:17AM -0500, Vince Weaver wrote:
> Having perf in the kernel tree really makes it hard for you guys to keep a
> clean API/ABI it seems.

Lock free buffers are 'fun'.. The ABI can be described as:

  read pc->data_head

  // ensure no other reads get before this point and ->data_head
  // doesn't get re-read hereafter.

  read data; using pc->data_tail, until the read head value.

  // ensure all reads are completed before issuing

  write pc->data_tail

How you want to implement that on your compiler/arch combination is up
to you. Like I said, C11 has the __atomic bits you can use to implement
this in proper C; barring that, you'll have to get creative and use
assembly one way or another.

On x86/sparc/s390, which have relatively strong memory models, it's fairly
easy; on powerpc/arm, which have much weaker models, it's more fun.


This isn't actually something that has changed; it's just that we
recently found some implementations thereof were buggy.

We provide an implementation in GNU C; if you want to use something else
you get to deal with that other compiler.
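
Written out, that is the usual single-producer/single-consumer loop; a
rough sketch of the reader side (drain_ring and the handle callback are
invented names, only data_head/data_tail and struct perf_event_header
come from the ABI, and it punts on records that wrap around the end of
the data area):

    #include <stdint.h>
    #include <linux/perf_event.h>

    /* data points at the ring data area, size is its power-of-two length */
    static void drain_ring(struct perf_event_mmap_page *pc,
                           unsigned char *data, uint64_t size,
                           void (*handle)(struct perf_event_header *))
    {
            /* only this thread writes data_tail, a plain load is fine */
            uint64_t tail = pc->data_tail;

            /* read pc->data_head; later buffer reads may not pass this */
            uint64_t head = __atomic_load_n(&pc->data_head,
                                            __ATOMIC_ACQUIRE);

            while (tail != head) {
                    struct perf_event_header *hdr =
                            (void *)&data[tail & (size - 1)];

                    handle(hdr);            /* read data using data_tail */
                    tail += hdr->size;
            }

            /* all reads completed; publish the new tail */
            __atomic_store_n(&pc->data_tail, tail, __ATOMIC_RELEASE);
    }

The perf_mmap__read_head() hunk quoted further up the thread is the head
half of exactly this, just with an explicit rmb() instead of the acquire
load.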

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [tip:perf/core] tools/perf: Add required memory barriers
  2013-11-06 15:10                                   ` Peter Zijlstra
@ 2013-11-06 15:23                                     ` Peter Zijlstra
  0 siblings, 0 replies; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-06 15:23 UTC (permalink / raw)
  To: Vince Weaver
  Cc: mingo, hpa, anton, mathieu.desnoyers, linux-kernel, michael,
	paulmck, benh, fweisbec, VICTORK, tglx, oleg, mikey,
	linux-tip-commits

On Wed, Nov 06, 2013 at 04:10:55PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 06, 2013 at 09:55:17AM -0500, Vince Weaver wrote:
> > Having perf in the kernel tree really makes it hard for you guys to keep a
> > clean API/ABI it seems.
> 
> Lock free buffers are 'fun'.. The ABI can be described as:
> 
>   read pc->data_head
> 
>   // ensure no other reads get before this point and ->data_head
>   // doesn't get re-read hereafter.

FWIW, this is where barrier() and ACCESS_ONCE() differ afaict: barrier()
only guarantees that if the ->data_head read re-appears it must be
re-issued, whereas ACCESS_ONCE(), by making the access volatile, prevents
that extra read from being inserted in the first place.

Then again, the compiler should not lower the read over a compiler
barrier anyway, so on that account the insertion of the second read
would also be invalid.

So you're _probably_ good without the ACCESS_ONCE, but please ask a
compiler person, not me.
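
To make that concrete, this is roughly what the two primitives boil down
to for a GCC-compatible compiler (paraphrased, not copied verbatim from
the kernel headers):

    /* Force a real load/store each time: the volatile-qualified access
     * may not be cached, duplicated or omitted by the compiler. */
    #define ACCESS_ONCE(x)  (*(volatile __typeof__(x) *)&(x))

    /* Compiler-only barrier: emits no instruction, but the "memory"
     * clobber stops the compiler moving memory accesses across it. */
    #define barrier()       __asm__ __volatile__("" : : : "memory")

    /* e.g.  head = ACCESS_ONCE(pc->data_head); barrier(); ... */

Whether the volatile is strictly required on top of the clobber is
exactly the uncertainty above.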



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [tip:perf/core] tools/perf: Add required memory barriers
  2013-11-06 14:44                               ` Peter Zijlstra
@ 2013-11-06 16:07                                 ` Peter Zijlstra
  2013-11-06 17:31                                   ` Vince Weaver
  0 siblings, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-06 16:07 UTC (permalink / raw)
  To: Vince Weaver
  Cc: mingo, hpa, anton, mathieu.desnoyers, linux-kernel, michael,
	paulmck, benh, fweisbec, VICTORK, tglx, oleg, mikey,
	linux-tip-commits

On Wed, Nov 06, 2013 at 03:44:56PM +0100, Peter Zijlstra wrote:
> long head = ((__atomic long)pc->data_head).load(memory_order_acquire);
> 
> coupled with:
> 
> ((__atomic long)pc->data_tail).store(tail, memory_order_release);
> 
> might be the 'right' and proper C11 incantations to avoid having to
> touch kernel macros; but would obviously require a recent compiler.
> 
> Barring that, I think we're stuck with:
> 
> long head = ACCESS_ONCE(pc->data_head);
> smp_rmb();
> 
> ...
> 
> smp_mb();
> pc->data_tail = tail;
> 
> And using the right asm goo for the barriers. That said, all these asm
> barriers should include a compiler barriers (memory clobber) which
> _should_ avoid the worst compiler trickery -- although I don't think it
> completely obviates the need for ACCESS_ONCE() -- uncertain there.

http://software.intel.com/en-us/articles/single-producer-single-consumer-queue/

There's one for icc on x86.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [tip:perf/core] tools/perf: Add required memory barriers
  2013-11-06 16:07                                 ` Peter Zijlstra
@ 2013-11-06 17:31                                   ` Vince Weaver
  2013-11-06 18:24                                     ` Peter Zijlstra
  0 siblings, 1 reply; 120+ messages in thread
From: Vince Weaver @ 2013-11-06 17:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, hpa, anton, mathieu.desnoyers, linux-kernel, michael,
	paulmck, benh, fweisbec, VICTORK, tglx, oleg, mikey,
	linux-tip-commits

On Wed, 6 Nov 2013, Peter Zijlstra wrote:

> On Wed, Nov 06, 2013 at 03:44:56PM +0100, Peter Zijlstra wrote:
> > long head = ((__atomic long)pc->data_head).load(memory_order_acquire);
> > 
> > coupled with:
> > 
> > ((__atomic long)pc->data_tail).store(tail, memory_order_release);
> > 
> > might be the 'right' and proper C11 incantations to avoid having to
> > touch kernel macros; but would obviously require a recent compiler.
> > 
> > Barring that, I think we're stuck with:
> > 
> > long head = ACCESS_ONCE(pc->data_head);
> > smp_rmb();
> > 
> > ...
> > 
> > smp_mb();
> > pc->data_tail = tail;
> > 
> > And using the right asm goo for the barriers. That said, all these asm
> > barriers should include a compiler barriers (memory clobber) which
> > _should_ avoid the worst compiler trickery -- although I don't think it
> > completely obviates the need for ACCESS_ONCE() -- uncertain there.
> 
> http://software.intel.com/en-us/articles/single-producer-single-consumer-queue/
> 
> There's one for icc on x86.
> 

I think the problem here is this really isn't a good interface.

Most users just want the most recent batch of samples.  Something like

    char buffer[4096];
    int count;

    do {
       count=perf_read_sample_buffer(buffer,4096);
       process_samples(buffer);
    } while(count);

where perf_read_sample_buffer() is a syscall that just copies the current 
valid samples to userspace.

Yes, this is inefficient (requires an extra copy of the values) but the 
kernel then could handle all the SMP/multithread/barrier/locking issues.

How much overhead is really introduced by making a copy?

Requiring the user of a kernel interface to have a deep knowledge of 
optimizing compilers, barriers, and CPU memory models is just asking for 
trouble.

Especially as this all needs to get documented in the manpage and I'm not 
sure that's possible in a sane fashion.

Vince



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [tip:perf/core] tools/perf: Add required memory barriers
  2013-11-06 17:31                                   ` Vince Weaver
@ 2013-11-06 18:24                                     ` Peter Zijlstra
  2013-11-07  8:21                                       ` Ingo Molnar
  0 siblings, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-06 18:24 UTC (permalink / raw)
  To: Vince Weaver
  Cc: mingo, hpa, anton, mathieu.desnoyers, linux-kernel, michael,
	paulmck, benh, fweisbec, VICTORK, tglx, oleg, mikey,
	linux-tip-commits

On Wed, Nov 06, 2013 at 12:31:53PM -0500, Vince Weaver wrote:
> On Wed, 6 Nov 2013, Peter Zijlstra wrote:
> 
> > On Wed, Nov 06, 2013 at 03:44:56PM +0100, Peter Zijlstra wrote:
> > > long head = ((__atomic long)pc->data_head).load(memory_order_acquire);
> > > 
> > > coupled with:
> > > 
> > > ((__atomic long)pc->data_tail).store(tail, memory_order_release);
> > > 
> > > might be the 'right' and proper C11 incantations to avoid having to
> > > touch kernel macros; but would obviously require a recent compiler.
> > > 
> > > Barring that, I think we're stuck with:
> > > 
> > > long head = ACCESS_ONCE(pc->data_head);
> > > smp_rmb();
> > > 
> > > ...
> > > 
> > > smp_mb();
> > > pc->data_tail = tail;
> > > 
> > > And using the right asm goo for the barriers. That said, all these asm
> > > barriers should include a compiler barriers (memory clobber) which
> > > _should_ avoid the worst compiler trickery -- although I don't think it
> > > completely obviates the need for ACCESS_ONCE() -- uncertain there.
> > 
> > http://software.intel.com/en-us/articles/single-producer-single-consumer-queue/
> > 
> > There's one for icc on x86.
> > 
> 
> I think the problem here is this really isn't a good interface.

It's _so_ common Intel put it on a website ;-) This is a fairly well
documented 'problem'.

> Most users just want the most recent batch of samples.  Something like
> 
>     char buffer[4096];
>     int count;
> 
>     do {
>        count=perf_read_sample_buffer(buffer,4096);
>        process_samples(buffer);
>     } while(count);
> 
> where perf_read_sample_buffer() is a syscall that just copies the current 
> valid samples to userspace.
> 
> Yes, this is inefficient (requires an extra copy of the values) but the 
> kernel then could handle all the SMP/multithread/barrier/locking issues.
> 
> How much overhead is really introduced by making a copy?

It would make the current perf-record like thing do 2 copies; one into
userspace, and one back into the kernel for write().

Also, we've (unfortunately) already used the read() implementation of
the perf-fd and I'm fairly sure people will not like adding a special
purpose read-like syscall just for this.

That said, I've no idea how expensive it is, not having actually done
it. I do know people were trying to get rid of the one copy we currently
already do.

> Requiring the user of a kernel interface to have a deep knowledge of 
> optimizing compilers, barriers, and CPU memory models is just asking for 
> trouble.

It shouldn't be all that hard to put this in a (lgpl) library others can
link to -- that way you can build it once (using GCC).

We'd basically need to lift the proposed smp_load_acquire() and
smp_store_release() into userspace for all relevant architectures and
then have something like:

unsigned long perf_read_sample_buffer(void *mmap, long mmap_size, void *dst, long len)
{
	struct perf_event_mmap_page *pc = mmap;
	void *data = mmap + page_size;
	unsigned long data_size = mmap_size - page_size; /* should be 2^n */
	unsigned long tail, head, size, copied = 0;

	tail = pc->data_tail;
	head = smp_load_acquire(&pc->data_head);

	size = (head - tail) & (data_size - 1);

	while (len && size) {
		unsigned long offset = tail & (data_size - 1);
		unsigned long bytes = min(len, data_size - offset);

		memcpy(dst, data + offset, bytes);

		dst += bytes;
		tail += bytes;
		copied += bytes;
		size -= bytes;
		len -= bytes;
	}

	smp_store_release(&pc->data_tail, tail);

	return copied;
}

And presto!
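
And an (untested) sketch of how a consumer might drive it; perf_fd is
assumed to be an already-open perf_event_open() fd with mmap output, and
process_samples() is whatever the caller does with the copied-out records:

#include <poll.h>
#include <sys/mman.h>
#include <unistd.h>

/* hypothetical consumer of the copied-out records */
extern void process_samples(const char *buf, unsigned long n);

static void drain_ring(int perf_fd, volatile int *running)
{
	long page_size = sysconf(_SC_PAGESIZE);
	long n_data_pages = 8;		/* data area must be 2^n pages */
	long mmap_size = (1 + n_data_pages) * page_size;
	void *base = mmap(NULL, mmap_size, PROT_READ | PROT_WRITE,
			  MAP_SHARED, perf_fd, 0);
	struct pollfd pfd = { .fd = perf_fd, .events = POLLIN };
	char buf[4096];
	unsigned long n;

	if (base == MAP_FAILED)
		return;

	while (*running) {
		n = perf_read_sample_buffer(base, mmap_size, buf, sizeof(buf));
		if (n)
			process_samples(buf, n);
		else
			poll(&pfd, 1, -1);	/* wait for more records */
	}

	munmap(base, mmap_size);
}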

> Especially as this all needs to get documented in the manpage and I'm not 
> sure that's possible in a sane fashion.

Given that this is a fairly well documented problem that shouldn't be
too hard.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-06 13:57                                                     ` Peter Zijlstra
@ 2013-11-06 18:48                                                       ` Paul E. McKenney
  2013-11-06 19:42                                                         ` Peter Zijlstra
  2013-11-07 11:17                                                       ` Will Deacon
  1 sibling, 1 reply; 120+ messages in thread
From: Paul E. McKenney @ 2013-11-06 18:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Geert Uytterhoeven, Linus Torvalds, Victor Kaplansky,
	Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Russell King,
	Martin Schwidefsky, Heiko Carstens, Tony Luck

On Wed, Nov 06, 2013 at 02:57:36PM +0100, Peter Zijlstra wrote:
> On Wed, Nov 06, 2013 at 01:51:10PM +0100, Geert Uytterhoeven wrote:
> > This is screaming for a default implementation in asm-generic.
> 
> Right you are... how about a little something like this?
> 
> There's a few archs I didn't fully merge with the generic one because of
> weird nop implementations.
> 
> asm volatile ("nop" :: ) vs asm volatile ("nop" ::: "memory") and the
> like. They probably can (and should) use the regular asm volatile
> ("nop") but I misplaced the toolchains for many of the weird archs so I
> didn't attempt.
> 
> Also fixed a silly mistake in the return type definition for most
> smp_load_acquire() implementations: typeof(p) vs typeof(*p).
> 
> ---
> Subject: arch: Introduce smp_load_acquire(), smp_store_release()
> From: Peter Zijlstra <peterz@infradead.org>
> Date: Mon, 4 Nov 2013 20:18:11 +0100
> 
> A number of situations currently require the heavyweight smp_mb(),
> even though there is no need to order prior stores against later
> loads.  Many architectures have much cheaper ways to handle these
> situations, but the Linux kernel currently has no portable way
> to make use of them.
> 
> This commit therefore supplies smp_load_acquire() and
> smp_store_release() to remedy this situation.  The new
> smp_load_acquire() primitive orders the specified load against
> any subsequent reads or writes, while the new smp_store_release()
> primitive orders the specified store against any prior reads or
> writes.  These primitives allow array-based circular FIFOs to be
> implemented without an smp_mb(), and also allow a theoretical
> hole in rcu_assign_pointer() to be closed at no additional
> expense on most architectures.
> 
> In addition, the RCU experience transitioning from explicit
> smp_read_barrier_depends() and smp_wmb() to rcu_dereference()
> and rcu_assign_pointer(), respectively resulted in substantial
> improvements in readability.  It therefore seems likely that
> replacing other explicit barriers with smp_load_acquire() and
> smp_store_release() will provide similar benefits.  It appears
> that roughly half of the explicit barriers in core kernel code
> might be so replaced.
> 
> 
> Cc: Michael Ellerman <michael@ellerman.id.au>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Victor Kaplansky <VICTORK@il.ibm.com>
> Cc: Oleg Nesterov <oleg@redhat.com>
> Cc: Anton Blanchard <anton@samba.org>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>

A few nits on Documentation/memory-barriers.txt and some pointless
comments elsewhere.  With the suggested Documentation/memory-barriers.txt
fixes:

Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

> ---
>  Documentation/memory-barriers.txt     |  157 +++++++++++++++++-----------------
>  arch/alpha/include/asm/barrier.h      |   25 +----
>  arch/arc/include/asm/Kbuild           |    1 
>  arch/arc/include/asm/atomic.h         |    5 +
>  arch/arc/include/asm/barrier.h        |   42 ---------
>  arch/arm/include/asm/barrier.h        |   15 +++
>  arch/arm64/include/asm/barrier.h      |   50 ++++++++++
>  arch/avr32/include/asm/barrier.h      |   17 +--
>  arch/blackfin/include/asm/barrier.h   |   18 ---
>  arch/cris/include/asm/Kbuild          |    1 
>  arch/cris/include/asm/barrier.h       |   25 -----
>  arch/frv/include/asm/barrier.h        |    8 -
>  arch/h8300/include/asm/barrier.h      |   21 ----
>  arch/hexagon/include/asm/Kbuild       |    1 
>  arch/hexagon/include/asm/barrier.h    |   41 --------
>  arch/ia64/include/asm/barrier.h       |   49 ++++++++++
>  arch/m32r/include/asm/barrier.h       |   80 -----------------
>  arch/m68k/include/asm/barrier.h       |   14 ---
>  arch/metag/include/asm/barrier.h      |   15 +++
>  arch/microblaze/include/asm/Kbuild    |    1 
>  arch/microblaze/include/asm/barrier.h |   27 -----
>  arch/mips/include/asm/barrier.h       |   15 +++
>  arch/mn10300/include/asm/Kbuild       |    1 
>  arch/mn10300/include/asm/barrier.h    |   37 --------
>  arch/parisc/include/asm/Kbuild        |    1 
>  arch/parisc/include/asm/barrier.h     |   35 -------
>  arch/powerpc/include/asm/barrier.h    |   21 ++++
>  arch/s390/include/asm/barrier.h       |   15 +++
>  arch/score/include/asm/Kbuild         |    1 
>  arch/score/include/asm/barrier.h      |   16 ---
>  arch/sh/include/asm/barrier.h         |   21 ----
>  arch/sparc/include/asm/barrier_32.h   |   11 --
>  arch/sparc/include/asm/barrier_64.h   |   15 +++
>  arch/tile/include/asm/barrier.h       |   68 --------------
>  arch/unicore32/include/asm/barrier.h  |   11 --
>  arch/x86/include/asm/barrier.h        |   15 +++
>  arch/xtensa/include/asm/barrier.h     |    9 -
>  include/asm-generic/barrier.h         |   55 +++++++++--
>  include/linux/compiler.h              |    9 +
>  39 files changed, 375 insertions(+), 594 deletions(-)
> 
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -371,33 +371,35 @@ VARIETIES OF MEMORY BARRIER
> 
>  And a couple of implicit varieties:
> 
> - (5) LOCK operations.
> + (5) ACQUIRE operations.
> 
>       This acts as a one-way permeable barrier.  It guarantees that all memory
> -     operations after the LOCK operation will appear to happen after the LOCK
> -     operation with respect to the other components of the system.
> +     operations after the ACQUIRE operation will appear to happen after the
> +     ACQUIRE operation with respect to the other components of the system.
	ACQUIRE operations include LOCK operations and smp_load_acquire()
	operations.

> 
> -     Memory operations that occur before a LOCK operation may appear to happen
> -     after it completes.
> +     Memory operations that occur before a ACQUIRE operation may appear to
> +     happen after it completes.
> 
> -     A LOCK operation should almost always be paired with an UNLOCK operation.
> +     A ACQUIRE operation should almost always be paired with an RELEASE
> +     operation.
> 
> 
> - (6) UNLOCK operations.
> + (6) RELEASE operations.
> 
>       This also acts as a one-way permeable barrier.  It guarantees that all
> -     memory operations before the UNLOCK operation will appear to happen before
> -     the UNLOCK operation with respect to the other components of the system.
> +     memory operations before the RELEASE operation will appear to happen
> +     before the RELEASE operation with respect to the other components of the
> +     system.  Release operations include UNLOCK operations and
	smp_store_release() operations.

> -     Memory operations that occur after an UNLOCK operation may appear to
> +     Memory operations that occur after an RELEASE operation may appear to
>       happen before it completes.
> 
> -     LOCK and UNLOCK operations are guaranteed to appear with respect to each
> -     other strictly in the order specified.
> +     ACQUIRE and RELEASE operations are guaranteed to appear with respect to
> +     each other strictly in the order specified.
> 
> -     The use of LOCK and UNLOCK operations generally precludes the need for
> -     other sorts of memory barrier (but note the exceptions mentioned in the
> -     subsection "MMIO write barrier").
> +     The use of ACQUIRE and RELEASE operations generally precludes the need
> +     for other sorts of memory barrier (but note the exceptions mentioned in
> +     the subsection "MMIO write barrier").
> 
> 
>  Memory barriers are only required where there's a possibility of interaction
> @@ -1135,7 +1137,7 @@ CPU from reordering them.
>  	clear_bit( ... );
> 
>       This prevents memory operations before the clear leaking to after it.  See
> -     the subsection on "Locking Functions" with reference to UNLOCK operation
> +     the subsection on "Locking Functions" with reference to RELEASE operation
>       implications.
> 
>       See Documentation/atomic_ops.txt for more information.  See the "Atomic
> @@ -1181,65 +1183,66 @@ LOCKING FUNCTIONS
>   (*) R/W semaphores
>   (*) RCU
> 
> -In all cases there are variants on "LOCK" operations and "UNLOCK" operations
> +In all cases there are variants on "ACQUIRE" operations and "RELEASE" operations
>  for each construct.  These operations all imply certain barriers:
> 
> - (1) LOCK operation implication:
> + (1) ACQUIRE operation implication:
> 
> -     Memory operations issued after the LOCK will be completed after the LOCK
> -     operation has completed.
> +     Memory operations issued after the ACQUIRE will be completed after the
> +     ACQUIRE operation has completed.
> 
> -     Memory operations issued before the LOCK may be completed after the LOCK
> -     operation has completed.
> +     Memory operations issued before the ACQUIRE may be completed after the
> +     ACQUIRE operation has completed.
> 
> - (2) UNLOCK operation implication:
> + (2) RELEASE operation implication:
> 
> -     Memory operations issued before the UNLOCK will be completed before the
> -     UNLOCK operation has completed.
> +     Memory operations issued before the RELEASE will be completed before the
> +     RELEASE operation has completed.
> 
> -     Memory operations issued after the UNLOCK may be completed before the
> -     UNLOCK operation has completed.
> +     Memory operations issued after the RELEASE may be completed before the
> +     RELEASE operation has completed.
> 
> - (3) LOCK vs LOCK implication:
> + (3) ACQUIRE vs ACQUIRE implication:
> 
> -     All LOCK operations issued before another LOCK operation will be completed
> -     before that LOCK operation.
> +     All ACQUIRE operations issued before another ACQUIRE operation will be
> +     completed before that ACQUIRE operation.
> 
> - (4) LOCK vs UNLOCK implication:
> + (4) ACQUIRE vs RELEASE implication:
> 
> -     All LOCK operations issued before an UNLOCK operation will be completed
> -     before the UNLOCK operation.
> +     All ACQUIRE operations issued before an RELEASE operation will be
> +     completed before the RELEASE operation.
> 
> -     All UNLOCK operations issued before a LOCK operation will be completed
> -     before the LOCK operation.
> +     All RELEASE operations issued before a ACQUIRE operation will be
> +     completed before the ACQUIRE operation.
> 
> - (5) Failed conditional LOCK implication:
> + (5) Failed conditional ACQUIRE implication:
> 
> -     Certain variants of the LOCK operation may fail, either due to being
> +     Certain variants of the ACQUIRE operation may fail, either due to being
>       unable to get the lock immediately, or due to receiving an unblocked
>       signal whilst asleep waiting for the lock to become available.  Failed
>       locks do not imply any sort of barrier.

I suggest adding "For example" to the beginning of the last sentence:

	For example, failed lock acquisitions do not imply any sort of
	barrier.

Otherwise, the transition from ACQUIRE to lock is strange.

> -Therefore, from (1), (2) and (4) an UNLOCK followed by an unconditional LOCK is
> -equivalent to a full barrier, but a LOCK followed by an UNLOCK is not.
> +Therefore, from (1), (2) and (4) an RELEASE followed by an unconditional
> +ACQUIRE is equivalent to a full barrier, but a ACQUIRE followed by an RELEASE
> +is not.
> 
>  [!] Note: one of the consequences of LOCKs and UNLOCKs being only one-way
>      barriers is that the effects of instructions outside of a critical section
>      may seep into the inside of the critical section.
> 
> -A LOCK followed by an UNLOCK may not be assumed to be full memory barrier
> -because it is possible for an access preceding the LOCK to happen after the
> -LOCK, and an access following the UNLOCK to happen before the UNLOCK, and the
> -two accesses can themselves then cross:
> +A ACQUIRE followed by an RELEASE may not be assumed to be full memory barrier
> +because it is possible for an access preceding the ACQUIRE to happen after the
> +ACQUIRE, and an access following the RELEASE to happen before the RELEASE, and
> +the two accesses can themselves then cross:
> 
>  	*A = a;
> -	LOCK
> -	UNLOCK
> +	ACQUIRE
> +	RELEASE
>  	*B = b;
> 
>  may occur as:
> 
> -	LOCK, STORE *B, STORE *A, UNLOCK
> +	ACQUIRE, STORE *B, STORE *A, RELEASE
> 
>  Locks and semaphores may not provide any guarantee of ordering on UP compiled
>  systems, and so cannot be counted on in such a situation to actually achieve
> @@ -1253,33 +1256,33 @@ See also the section on "Inter-CPU locki
> 
>  	*A = a;
>  	*B = b;
> -	LOCK
> +	ACQUIRE
>  	*C = c;
>  	*D = d;
> -	UNLOCK
> +	RELEASE
>  	*E = e;
>  	*F = f;
> 
>  The following sequence of events is acceptable:
> 
> -	LOCK, {*F,*A}, *E, {*C,*D}, *B, UNLOCK
> +	ACQUIRE, {*F,*A}, *E, {*C,*D}, *B, RELEASE
> 
>  	[+] Note that {*F,*A} indicates a combined access.
> 
>  But none of the following are:
> 
> -	{*F,*A}, *B,	LOCK, *C, *D,	UNLOCK, *E
> -	*A, *B, *C,	LOCK, *D,	UNLOCK, *E, *F
> -	*A, *B,		LOCK, *C,	UNLOCK, *D, *E, *F
> -	*B,		LOCK, *C, *D,	UNLOCK, {*F,*A}, *E
> +	{*F,*A}, *B,	ACQUIRE, *C, *D,	RELEASE, *E
> +	*A, *B, *C,	ACQUIRE, *D,		RELEASE, *E, *F
> +	*A, *B,		ACQUIRE, *C,		RELEASE, *D, *E, *F
> +	*B,		ACQUIRE, *C, *D,	RELEASE, {*F,*A}, *E
> 
> 
> 
>  INTERRUPT DISABLING FUNCTIONS
>  -----------------------------
> 
> -Functions that disable interrupts (LOCK equivalent) and enable interrupts
> -(UNLOCK equivalent) will act as compiler barriers only.  So if memory or I/O
> +Functions that disable interrupts (ACQUIRE equivalent) and enable interrupts
> +(RELEASE equivalent) will act as compiler barriers only.  So if memory or I/O
>  barriers are required in such a situation, they must be provided from some
>  other means.
> 
> @@ -1436,24 +1439,24 @@ Consider the following: the system has a
>  	CPU 1				CPU 2
>  	===============================	===============================
>  	*A = a;				*E = e;
> -	LOCK M				LOCK Q
> +	ACQUIRE M			ACQUIRE Q
>  	*B = b;				*F = f;
>  	*C = c;				*G = g;
> -	UNLOCK M			UNLOCK Q
> +	RELEASE M			RELEASE Q
>  	*D = d;				*H = h;
> 
>  Then there is no guarantee as to what order CPU 3 will see the accesses to *A
>  through *H occur in, other than the constraints imposed by the separate locks
>  on the separate CPUs. It might, for example, see:
> 
> -	*E, LOCK M, LOCK Q, *G, *C, *F, *A, *B, UNLOCK Q, *D, *H, UNLOCK M
> +	*E, ACQUIRE M, ACQUIRE Q, *G, *C, *F, *A, *B, RELEASE Q, *D, *H, RELEASE M
> 
>  But it won't see any of:
> 
> -	*B, *C or *D preceding LOCK M
> -	*A, *B or *C following UNLOCK M
> -	*F, *G or *H preceding LOCK Q
> -	*E, *F or *G following UNLOCK Q
> +	*B, *C or *D preceding ACQUIRE M
> +	*A, *B or *C following RELEASE M
> +	*F, *G or *H preceding ACQUIRE Q
> +	*E, *F or *G following RELEASE Q
> 
> 
>  However, if the following occurs:
> @@ -1461,28 +1464,28 @@ through *H occur in, other than the cons
>  	CPU 1				CPU 2
>  	===============================	===============================
>  	*A = a;
> -	LOCK M		[1]
> +	ACQUIRE M	[1]
>  	*B = b;
>  	*C = c;
> -	UNLOCK M	[1]
> +	RELEASE M	[1]
>  	*D = d;				*E = e;
> -					LOCK M		[2]
> +					ACQUIRE M	[2]
>  					*F = f;
>  					*G = g;
> -					UNLOCK M	[2]
> +					RELEASE M	[2]
>  					*H = h;
> 
>  CPU 3 might see:
> 
> -	*E, LOCK M [1], *C, *B, *A, UNLOCK M [1],
> -		LOCK M [2], *H, *F, *G, UNLOCK M [2], *D
> +	*E, ACQUIRE M [1], *C, *B, *A, RELEASE M [1],
> +	    ACQUIRE M [2], *H, *F, *G, RELEASE M [2], *D
> 
>  But assuming CPU 1 gets the lock first, CPU 3 won't see any of:
> 
> -	*B, *C, *D, *F, *G or *H preceding LOCK M [1]
> -	*A, *B or *C following UNLOCK M [1]
> -	*F, *G or *H preceding LOCK M [2]
> -	*A, *B, *C, *E, *F or *G following UNLOCK M [2]
> +	*B, *C, *D, *F, *G or *H preceding ACQUIRE M [1]
> +	*A, *B or *C following RELEASE M [1]
> +	*F, *G or *H preceding ACQUIRE M [2]
> +	*A, *B, *C, *E, *F or *G following RELEASE M [2]
> 
> 
>  LOCKS VS I/O ACCESSES
> @@ -1702,13 +1705,13 @@ about the state (old or new) implies an
>  	test_and_clear_bit();
>  	test_and_change_bit();
> 
> -These are used for such things as implementing LOCK-class and UNLOCK-class
> +These are used for such things as implementing ACQUIRE-class and RELEASE-class
>  operations and adjusting reference counters towards object destruction, and as
>  such the implicit memory barrier effects are necessary.
> 
> 
>  The following operations are potential problems as they do _not_ imply memory
> -barriers, but might be used for implementing such things as UNLOCK-class
> +barriers, but might be used for implementing such things as RELEASE-class
>  operations:
> 
>  	atomic_set();
> @@ -1750,9 +1753,9 @@ barriers are needed or not.
>  	clear_bit_unlock();
>  	__clear_bit_unlock();
> 
> -These implement LOCK-class and UNLOCK-class operations. These should be used in
> -preference to other operations when implementing locking primitives, because
> -their implementations can be optimised on many architectures.
> +These implement ACQUIRE-class and RELEASE-class operations. These should be
> +used in preference to other operations when implementing locking primitives,
> +because their implementations can be optimised on many architectures.
> 
>  [!] Note that special memory barrier primitives are available for these
>  situations because on some CPUs the atomic instructions used imply full memory
> --- a/arch/alpha/include/asm/barrier.h
> +++ b/arch/alpha/include/asm/barrier.h
> @@ -3,33 +3,18 @@
> 
>  #include <asm/compiler.h>
> 
> -#define mb() \
> -__asm__ __volatile__("mb": : :"memory")
> +#define mb()	__asm__ __volatile__("mb": : :"memory")
> +#define rmb()	__asm__ __volatile__("mb": : :"memory")
> +#define wmb()	__asm__ __volatile__("wmb": : :"memory")
> 
> -#define rmb() \
> -__asm__ __volatile__("mb": : :"memory")
> -
> -#define wmb() \
> -__asm__ __volatile__("wmb": : :"memory")
> -
> -#define read_barrier_depends() \
> -__asm__ __volatile__("mb": : :"memory")
> +#define read_barrier_depends() __asm__ __volatile__("mb": : :"memory")
> 
>  #ifdef CONFIG_SMP
>  #define __ASM_SMP_MB	"\tmb\n"
> -#define smp_mb()	mb()
> -#define smp_rmb()	rmb()
> -#define smp_wmb()	wmb()
> -#define smp_read_barrier_depends()	read_barrier_depends()
>  #else
>  #define __ASM_SMP_MB
> -#define smp_mb()	barrier()
> -#define smp_rmb()	barrier()
> -#define smp_wmb()	barrier()
> -#define smp_read_barrier_depends()	do { } while (0)
>  #endif
> 
> -#define set_mb(var, value) \
> -do { var = value; mb(); } while (0)
> +#include <asm-generic/barrier.h>
> 
>  #endif		/* __BARRIER_H */
> --- a/arch/arc/include/asm/Kbuild
> +++ b/arch/arc/include/asm/Kbuild
> @@ -47,3 +47,4 @@ generic-y += user.h
>  generic-y += vga.h
>  generic-y += xor.h
>  generic-y += preempt.h
> +generic-y += barrier.h
> --- a/arch/arc/include/asm/atomic.h
> +++ b/arch/arc/include/asm/atomic.h
> @@ -190,6 +190,11 @@ static inline void atomic_clear_mask(uns
> 
>  #endif /* !CONFIG_ARC_HAS_LLSC */
> 
> +#define smp_mb__before_atomic_dec()	barrier()
> +#define smp_mb__after_atomic_dec()	barrier()
> +#define smp_mb__before_atomic_inc()	barrier()
> +#define smp_mb__after_atomic_inc()	barrier()
> +
>  /**
>   * __atomic_add_unless - add unless the number is a given value
>   * @v: pointer of type atomic_t
> --- a/arch/arc/include/asm/barrier.h
> +++ /dev/null
> @@ -1,42 +0,0 @@
> -/*
> - * Copyright (C) 2004, 2007-2010, 2011-2012 Synopsys, Inc. (www.synopsys.com)
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 as
> - * published by the Free Software Foundation.
> - */
> -
> -#ifndef __ASM_BARRIER_H
> -#define __ASM_BARRIER_H
> -
> -#ifndef __ASSEMBLY__
> -
> -/* TODO-vineetg: Need to see what this does, don't we need sync anywhere */
> -#define mb() __asm__ __volatile__ ("" : : : "memory")
> -#define rmb() mb()
> -#define wmb() mb()
> -#define set_mb(var, value)  do { var = value; mb(); } while (0)
> -#define set_wmb(var, value) do { var = value; wmb(); } while (0)
> -#define read_barrier_depends()  mb()
> -
> -/* TODO-vineetg verify the correctness of macros here */
> -#ifdef CONFIG_SMP
> -#define smp_mb()        mb()
> -#define smp_rmb()       rmb()
> -#define smp_wmb()       wmb()
> -#else
> -#define smp_mb()        barrier()
> -#define smp_rmb()       barrier()
> -#define smp_wmb()       barrier()
> -#endif
> -
> -#define smp_mb__before_atomic_dec()	barrier()
> -#define smp_mb__after_atomic_dec()	barrier()
> -#define smp_mb__before_atomic_inc()	barrier()
> -#define smp_mb__after_atomic_inc()	barrier()
> -
> -#define smp_read_barrier_depends()      do { } while (0)
> -
> -#endif
> -
> -#endif

I do like this take-no-prisoners approach!  ;-)

> --- a/arch/arm/include/asm/barrier.h
> +++ b/arch/arm/include/asm/barrier.h
> @@ -59,6 +59,21 @@
>  #define smp_wmb()	dmb(ishst)
>  #endif
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	___p1;								\
> +})
> +
>  #define read_barrier_depends()		do { } while(0)
>  #define smp_read_barrier_depends()	do { } while(0)
> 
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -35,11 +35,59 @@
>  #define smp_mb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	___p1;								\
> +})
> +
>  #else
> +
>  #define smp_mb()	asm volatile("dmb ish" : : : "memory")
>  #define smp_rmb()	asm volatile("dmb ishld" : : : "memory")
>  #define smp_wmb()	asm volatile("dmb ishst" : : : "memory")
> -#endif
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	switch (sizeof(*p)) {						\
> +	case 4:								\
> +		asm volatile ("stlr %w1, [%0]"				\
> +				: "=Q" (*p) : "r" (v) : "memory");	\
> +		break;							\
> +	case 8:								\
> +		asm volatile ("stlr %1, [%0]"				\
> +				: "=Q" (*p) : "r" (v) : "memory");	\
> +		break;							\
> +	}								\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1;						\
> +	compiletime_assert_atomic_type(*p);				\
> +	switch (sizeof(*p)) {						\
> +	case 4:								\
> +		asm volatile ("ldar %w0, [%1]"				\
> +			: "=r" (___p1) : "Q" (*p) : "memory");		\
> +		break;							\
> +	case 8:								\
> +		asm volatile ("ldar %0, [%1]"				\
> +			: "=r" (___p1) : "Q" (*p) : "memory");		\
> +		break;							\
> +	}								\
> +	___p1;								\
> +})
> 
>  #define read_barrier_depends()		do { } while(0)
>  #define smp_read_barrier_depends()	do { } while(0)
> --- a/arch/avr32/include/asm/barrier.h
> +++ b/arch/avr32/include/asm/barrier.h
> @@ -8,22 +8,15 @@
>  #ifndef __ASM_AVR32_BARRIER_H
>  #define __ASM_AVR32_BARRIER_H
> 
> -#define nop()			asm volatile("nop")
> -
> -#define mb()			asm volatile("" : : : "memory")
> -#define rmb()			mb()
> -#define wmb()			asm volatile("sync 0" : : : "memory")
> -#define read_barrier_depends()  do { } while(0)
> -#define set_mb(var, value)      do { var = value; mb(); } while(0)
> +/*
> + * Weirdest thing ever.. no full barrier, but it has a write barrier!
> + */
> +#define wmb()	asm volatile("sync 0" : : : "memory")

Doesn't this mean that asm-generic/barrier.h needs to check for
definitions?  Ah, I see below that you added these checks.

>  #ifdef CONFIG_SMP
>  # error "The AVR32 port does not support SMP"
> -#else
> -# define smp_mb()		barrier()
> -# define smp_rmb()		barrier()
> -# define smp_wmb()		barrier()
> -# define smp_read_barrier_depends() do { } while(0)
>  #endif
> 
> +#include <asm-generic/barrier.h>
> 
>  #endif /* __ASM_AVR32_BARRIER_H */
> --- a/arch/blackfin/include/asm/barrier.h
> +++ b/arch/blackfin/include/asm/barrier.h
> @@ -23,26 +23,10 @@
>  # define rmb()	do { barrier(); smp_check_barrier(); } while (0)
>  # define wmb()	do { barrier(); smp_mark_barrier(); } while (0)
>  # define read_barrier_depends()	do { barrier(); smp_check_barrier(); } while (0)
> -#else
> -# define mb()	barrier()
> -# define rmb()	barrier()
> -# define wmb()	barrier()
> -# define read_barrier_depends()	do { } while (0)
>  #endif
> 
> -#else /* !CONFIG_SMP */
> -
> -#define mb()	barrier()
> -#define rmb()	barrier()
> -#define wmb()	barrier()
> -#define read_barrier_depends()	do { } while (0)
> -
>  #endif /* !CONFIG_SMP */
> 
> -#define smp_mb()  mb()
> -#define smp_rmb() rmb()
> -#define smp_wmb() wmb()
> -#define set_mb(var, value) do { var = value; mb(); } while (0)
> -#define smp_read_barrier_depends()	read_barrier_depends()
> +#include <asm-generic/barrier.h>
> 
>  #endif /* _BLACKFIN_BARRIER_H */
> --- a/arch/cris/include/asm/Kbuild
> +++ b/arch/cris/include/asm/Kbuild
> @@ -12,3 +12,4 @@ generic-y += trace_clock.h
>  generic-y += vga.h
>  generic-y += xor.h
>  generic-y += preempt.h
> +generic-y += barrier.h
> --- a/arch/cris/include/asm/barrier.h
> +++ /dev/null
> @@ -1,25 +0,0 @@
> -#ifndef __ASM_CRIS_BARRIER_H
> -#define __ASM_CRIS_BARRIER_H
> -
> -#define nop() __asm__ __volatile__ ("nop");
> -
> -#define barrier() __asm__ __volatile__("": : :"memory")
> -#define mb() barrier()
> -#define rmb() mb()
> -#define wmb() mb()
> -#define read_barrier_depends() do { } while(0)
> -#define set_mb(var, value)  do { var = value; mb(); } while (0)
> -
> -#ifdef CONFIG_SMP
> -#define smp_mb()        mb()
> -#define smp_rmb()       rmb()
> -#define smp_wmb()       wmb()
> -#define smp_read_barrier_depends()     read_barrier_depends()
> -#else
> -#define smp_mb()        barrier()
> -#define smp_rmb()       barrier()
> -#define smp_wmb()       barrier()
> -#define smp_read_barrier_depends()     do { } while(0)
> -#endif
> -
> -#endif /* __ASM_CRIS_BARRIER_H */
> --- a/arch/frv/include/asm/barrier.h
> +++ b/arch/frv/include/asm/barrier.h
> @@ -17,13 +17,7 @@
>  #define mb()			asm volatile ("membar" : : :"memory")
>  #define rmb()			asm volatile ("membar" : : :"memory")
>  #define wmb()			asm volatile ("membar" : : :"memory")
> -#define read_barrier_depends()	do { } while (0)
> 
> -#define smp_mb()			barrier()
> -#define smp_rmb()			barrier()
> -#define smp_wmb()			barrier()
> -#define smp_read_barrier_depends()	do {} while(0)
> -#define set_mb(var, value) \
> -	do { var = (value); barrier(); } while (0)
> +#include <asm-generic/barrier.h>
> 
>  #endif /* _ASM_BARRIER_H */
> --- a/arch/h8300/include/asm/barrier.h
> +++ b/arch/h8300/include/asm/barrier.h
> @@ -3,27 +3,8 @@
> 
>  #define nop()  asm volatile ("nop"::)
> 
> -/*
> - * Force strict CPU ordering.
> - * Not really required on H8...
> - */
> -#define mb()   asm volatile (""   : : :"memory")
> -#define rmb()  asm volatile (""   : : :"memory")
> -#define wmb()  asm volatile (""   : : :"memory")
>  #define set_mb(var, value) do { xchg(&var, value); } while (0)
> 
> -#define read_barrier_depends()	do { } while (0)
> -
> -#ifdef CONFIG_SMP
> -#define smp_mb()	mb()
> -#define smp_rmb()	rmb()
> -#define smp_wmb()	wmb()
> -#define smp_read_barrier_depends()	read_barrier_depends()
> -#else
> -#define smp_mb()	barrier()
> -#define smp_rmb()	barrier()
> -#define smp_wmb()	barrier()
> -#define smp_read_barrier_depends()	do { } while(0)
> -#endif
> +#include <asm-generic/barrier.h>
> 
>  #endif /* _H8300_BARRIER_H */
> --- a/arch/hexagon/include/asm/Kbuild
> +++ b/arch/hexagon/include/asm/Kbuild
> @@ -54,3 +54,4 @@ generic-y += ucontext.h
>  generic-y += unaligned.h
>  generic-y += xor.h
>  generic-y += preempt.h
> +generic-y += barrier.h
> --- a/arch/hexagon/include/asm/barrier.h
> +++ /dev/null
> @@ -1,41 +0,0 @@
> -/*
> - * Memory barrier definitions for the Hexagon architecture
> - *
> - * Copyright (c) 2010-2011, The Linux Foundation. All rights reserved.
> - *
> - * This program is free software; you can redistribute it and/or modify
> - * it under the terms of the GNU General Public License version 2 and
> - * only version 2 as published by the Free Software Foundation.
> - *
> - * This program is distributed in the hope that it will be useful,
> - * but WITHOUT ANY WARRANTY; without even the implied warranty of
> - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> - * GNU General Public License for more details.
> - *
> - * You should have received a copy of the GNU General Public License
> - * along with this program; if not, write to the Free Software
> - * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA
> - * 02110-1301, USA.
> - */
> -
> -#ifndef _ASM_BARRIER_H
> -#define _ASM_BARRIER_H
> -
> -#define rmb()				barrier()
> -#define read_barrier_depends()		barrier()
> -#define wmb()				barrier()
> -#define mb()				barrier()
> -#define smp_rmb()			barrier()
> -#define smp_read_barrier_depends()	barrier()
> -#define smp_wmb()			barrier()
> -#define smp_mb()			barrier()
> -#define smp_mb__before_atomic_dec()	barrier()
> -#define smp_mb__after_atomic_dec()	barrier()
> -#define smp_mb__before_atomic_inc()	barrier()
> -#define smp_mb__after_atomic_inc()	barrier()
> -
> -/*  Set a value and use a memory barrier.  Used by the scheduler somewhere.  */
> -#define set_mb(var, value) \
> -	do { var = value; mb(); } while (0)
> -
> -#endif /* _ASM_BARRIER_H */
> --- a/arch/ia64/include/asm/barrier.h
> +++ b/arch/ia64/include/asm/barrier.h
> @@ -45,11 +45,60 @@
>  # define smp_rmb()	rmb()
>  # define smp_wmb()	wmb()
>  # define smp_read_barrier_depends()	read_barrier_depends()
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	switch (sizeof(*p)) {						\
> +	case 4:								\
> +		asm volatile ("st4.rel [%0]=%1"				\
> +				: "=r" (p) : "r" (v) : "memory");	\
> +		break;							\
> +	case 8:								\
> +		asm volatile ("st8.rel [%0]=%1"				\
> +				: "=r" (p) : "r" (v) : "memory");	\
> +		break;							\
> +	}								\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1;						\
> +	compiletime_assert_atomic_type(*p);				\
> +	switch (sizeof(*p)) {						\
> +	case 4:								\
> +		asm volatile ("ld4.acq %0=[%1]"				\
> +				: "=r" (___p1) : "r" (p) : "memory");	\
> +		break;							\
> +	case 8:								\
> +		asm volatile ("ld8.acq %0=[%1]"				\
> +				: "=r" (___p1) : "r" (p) : "memory");	\
> +		break;							\
> +	}								\
> +	___p1;								\
> +})
> +
>  #else
> +
>  # define smp_mb()	barrier()
>  # define smp_rmb()	barrier()
>  # define smp_wmb()	barrier()
>  # define smp_read_barrier_depends()	do { } while(0)
> +
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	___p1;								\
> +})
>  #endif
> 
>  /*
> --- a/arch/m32r/include/asm/barrier.h
> +++ b/arch/m32r/include/asm/barrier.h
> @@ -11,84 +11,6 @@
> 
>  #define nop()  __asm__ __volatile__ ("nop" : : )
> 
> -/*
> - * Memory barrier.
> - *
> - * mb() prevents loads and stores being reordered across this point.
> - * rmb() prevents loads being reordered across this point.
> - * wmb() prevents stores being reordered across this point.
> - */
> -#define mb()   barrier()
> -#define rmb()  mb()
> -#define wmb()  mb()
> -
> -/**
> - * read_barrier_depends - Flush all pending reads that subsequents reads
> - * depend on.
> - *
> - * No data-dependent reads from memory-like regions are ever reordered
> - * over this barrier.  All reads preceding this primitive are guaranteed
> - * to access memory (but not necessarily other CPUs' caches) before any
> - * reads following this primitive that depend on the data return by
> - * any of the preceding reads.  This primitive is much lighter weight than
> - * rmb() on most CPUs, and is never heavier weight than is
> - * rmb().
> - *
> - * These ordering constraints are respected by both the local CPU
> - * and the compiler.
> - *
> - * Ordering is not guaranteed by anything other than these primitives,
> - * not even by data dependencies.  See the documentation for
> - * memory_barrier() for examples and URLs to more information.
> - *
> - * For example, the following code would force ordering (the initial
> - * value of "a" is zero, "b" is one, and "p" is "&a"):
> - *
> - * <programlisting>
> - *      CPU 0                           CPU 1
> - *
> - *      b = 2;
> - *      memory_barrier();
> - *      p = &b;                         q = p;
> - *                                      read_barrier_depends();
> - *                                      d = *q;
> - * </programlisting>
> - *
> - *
> - * because the read of "*q" depends on the read of "p" and these
> - * two reads are separated by a read_barrier_depends().  However,
> - * the following code, with the same initial values for "a" and "b":
> - *
> - * <programlisting>
> - *      CPU 0                           CPU 1
> - *
> - *      a = 2;
> - *      memory_barrier();
> - *      b = 3;                          y = b;
> - *                                      read_barrier_depends();
> - *                                      x = a;
> - * </programlisting>
> - *
> - * does not enforce ordering, since there is no data dependency between
> - * the read of "a" and the read of "b".  Therefore, on some CPUs, such
> - * as Alpha, "y" could be set to 3 and "x" to 0.  Use rmb()
> - * in cases like this where there are no data dependencies.
> - **/
> -
> -#define read_barrier_depends()	do { } while (0)
> -
> -#ifdef CONFIG_SMP
> -#define smp_mb()	mb()
> -#define smp_rmb()	rmb()
> -#define smp_wmb()	wmb()
> -#define smp_read_barrier_depends()	read_barrier_depends()
> -#define set_mb(var, value) do { (void) xchg(&var, value); } while (0)
> -#else
> -#define smp_mb()	barrier()
> -#define smp_rmb()	barrier()
> -#define smp_wmb()	barrier()
> -#define smp_read_barrier_depends()	do { } while (0)
> -#define set_mb(var, value) do { var = value; barrier(); } while (0)
> -#endif
> +#include <asm-generic/barrier.h>
> 
>  #endif /* _ASM_M32R_BARRIER_H */
> --- a/arch/m68k/include/asm/barrier.h
> +++ b/arch/m68k/include/asm/barrier.h
> @@ -1,20 +1,8 @@
>  #ifndef _M68K_BARRIER_H
>  #define _M68K_BARRIER_H
> 
> -/*
> - * Force strict CPU ordering.
> - * Not really required on m68k...
> - */
>  #define nop()		do { asm volatile ("nop"); barrier(); } while (0)
> -#define mb()		barrier()
> -#define rmb()		barrier()
> -#define wmb()		barrier()
> -#define read_barrier_depends()	((void)0)
> -#define set_mb(var, value)	({ (var) = (value); wmb(); })
> 
> -#define smp_mb()	barrier()
> -#define smp_rmb()	barrier()
> -#define smp_wmb()	barrier()
> -#define smp_read_barrier_depends()	((void)0)
> +#include <asm-generic/barrier.h>
> 
>  #endif /* _M68K_BARRIER_H */
> --- a/arch/metag/include/asm/barrier.h
> +++ b/arch/metag/include/asm/barrier.h
> @@ -82,4 +82,19 @@ static inline void fence(void)
>  #define smp_read_barrier_depends()     do { } while (0)
>  #define set_mb(var, value) do { var = value; smp_mb(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	___p1;								\
> +})
> +
>  #endif /* _ASM_METAG_BARRIER_H */
> --- a/arch/microblaze/include/asm/Kbuild
> +++ b/arch/microblaze/include/asm/Kbuild
> @@ -4,3 +4,4 @@ generic-y += exec.h
>  generic-y += trace_clock.h
>  generic-y += syscalls.h
>  generic-y += preempt.h
> +generic-y += barrier.h
> --- a/arch/microblaze/include/asm/barrier.h
> +++ /dev/null
> @@ -1,27 +0,0 @@
> -/*
> - * Copyright (C) 2006 Atmark Techno, Inc.
> - *
> - * This file is subject to the terms and conditions of the GNU General Public
> - * License. See the file "COPYING" in the main directory of this archive
> - * for more details.
> - */
> -
> -#ifndef _ASM_MICROBLAZE_BARRIER_H
> -#define _ASM_MICROBLAZE_BARRIER_H
> -
> -#define nop()                  asm volatile ("nop")
> -
> -#define smp_read_barrier_depends()	do {} while (0)
> -#define read_barrier_depends()		do {} while (0)
> -
> -#define mb()			barrier()
> -#define rmb()			mb()
> -#define wmb()			mb()
> -#define set_mb(var, value)	do { var = value; mb(); } while (0)
> -#define set_wmb(var, value)	do { var = value; wmb(); } while (0)
> -
> -#define smp_mb()		mb()
> -#define smp_rmb()		rmb()
> -#define smp_wmb()		wmb()
> -
> -#endif /* _ASM_MICROBLAZE_BARRIER_H */
> --- a/arch/mips/include/asm/barrier.h
> +++ b/arch/mips/include/asm/barrier.h
> @@ -180,4 +180,19 @@
>  #define nudge_writes() mb()
>  #endif
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	___p1;								\
> +})
> +
>  #endif /* __ASM_BARRIER_H */
> --- a/arch/mn10300/include/asm/Kbuild
> +++ b/arch/mn10300/include/asm/Kbuild
> @@ -3,3 +3,4 @@ generic-y += clkdev.h
>  generic-y += exec.h
>  generic-y += trace_clock.h
>  generic-y += preempt.h
> +generic-y += barrier.h
> --- a/arch/mn10300/include/asm/barrier.h
> +++ /dev/null
> @@ -1,37 +0,0 @@
> -/* MN10300 memory barrier definitions
> - *
> - * Copyright (C) 2007 Red Hat, Inc. All Rights Reserved.
> - * Written by David Howells (dhowells@redhat.com)
> - *
> - * This program is free software; you can redistribute it and/or
> - * modify it under the terms of the GNU General Public Licence
> - * as published by the Free Software Foundation; either version
> - * 2 of the Licence, or (at your option) any later version.
> - */
> -#ifndef _ASM_BARRIER_H
> -#define _ASM_BARRIER_H
> -
> -#define nop()	asm volatile ("nop")
> -
> -#define mb()	asm volatile ("": : :"memory")
> -#define rmb()	mb()
> -#define wmb()	asm volatile ("": : :"memory")
> -
> -#ifdef CONFIG_SMP
> -#define smp_mb()	mb()
> -#define smp_rmb()	rmb()
> -#define smp_wmb()	wmb()
> -#define set_mb(var, value)  do { xchg(&var, value); } while (0)
> -#else  /* CONFIG_SMP */
> -#define smp_mb()	barrier()
> -#define smp_rmb()	barrier()
> -#define smp_wmb()	barrier()
> -#define set_mb(var, value)  do { var = value;  mb(); } while (0)
> -#endif /* CONFIG_SMP */
> -
> -#define set_wmb(var, value) do { var = value; wmb(); } while (0)
> -
> -#define read_barrier_depends()		do {} while (0)
> -#define smp_read_barrier_depends()	do {} while (0)
> -
> -#endif /* _ASM_BARRIER_H */
> --- a/arch/parisc/include/asm/Kbuild
> +++ b/arch/parisc/include/asm/Kbuild
> @@ -5,3 +5,4 @@ generic-y += word-at-a-time.h auxvec.h u
>  	  poll.h xor.h clkdev.h exec.h
>  generic-y += trace_clock.h
>  generic-y += preempt.h
> +generic-y += barrier.h
> --- a/arch/parisc/include/asm/barrier.h
> +++ /dev/null
> @@ -1,35 +0,0 @@
> -#ifndef __PARISC_BARRIER_H
> -#define __PARISC_BARRIER_H
> -
> -/*
> -** This is simply the barrier() macro from linux/kernel.h but when serial.c
> -** uses tqueue.h uses smp_mb() defined using barrier(), linux/kernel.h
> -** hasn't yet been included yet so it fails, thus repeating the macro here.
> -**
> -** PA-RISC architecture allows for weakly ordered memory accesses although
> -** none of the processors use it. There is a strong ordered bit that is
> -** set in the O-bit of the page directory entry. Operating systems that
> -** can not tolerate out of order accesses should set this bit when mapping
> -** pages. The O-bit of the PSW should also be set to 1 (I don't believe any
> -** of the processor implemented the PSW O-bit). The PCX-W ERS states that
> -** the TLB O-bit is not implemented so the page directory does not need to
> -** have the O-bit set when mapping pages (section 3.1). This section also
> -** states that the PSW Y, Z, G, and O bits are not implemented.
> -** So it looks like nothing needs to be done for parisc-linux (yet).
> -** (thanks to chada for the above comment -ggg)
> -**
> -** The __asm__ op below simple prevents gcc/ld from reordering
> -** instructions across the mb() "call".
> -*/
> -#define mb()		__asm__ __volatile__("":::"memory")	/* barrier() */
> -#define rmb()		mb()
> -#define wmb()		mb()
> -#define smp_mb()	mb()
> -#define smp_rmb()	mb()
> -#define smp_wmb()	mb()
> -#define smp_read_barrier_depends()	do { } while(0)
> -#define read_barrier_depends()		do { } while(0)
> -
> -#define set_mb(var, value)		do { var = value; mb(); } while (0)
> -
> -#endif /* __PARISC_BARRIER_H */
> --- a/arch/powerpc/include/asm/barrier.h
> +++ b/arch/powerpc/include/asm/barrier.h
> @@ -45,11 +45,15 @@
>  #    define SMPWMB      eieio
>  #endif
> 
> +#define __lwsync()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> +
>  #define smp_mb()	mb()
> -#define smp_rmb()	__asm__ __volatile__ (stringify_in_c(LWSYNC) : : :"memory")
> +#define smp_rmb()	__lwsync()
>  #define smp_wmb()	__asm__ __volatile__ (stringify_in_c(SMPWMB) : : :"memory")
>  #define smp_read_barrier_depends()	read_barrier_depends()
>  #else
> +#define __lwsync()	barrier()
> +
>  #define smp_mb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
> @@ -65,4 +69,19 @@
>  #define data_barrier(x)	\
>  	asm volatile("twi 0,%0,0; isync" : : "r" (x) : "memory");
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	__lwsync();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	__lwsync();							\
> +	___p1;								\
> +})
> +
>  #endif /* _ASM_POWERPC_BARRIER_H */
> --- a/arch/s390/include/asm/barrier.h
> +++ b/arch/s390/include/asm/barrier.h
> @@ -32,4 +32,19 @@
> 
>  #define set_mb(var, value)		do { var = value; mb(); } while (0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	barrier();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	barrier();							\
> +	___p1;								\
> +})
> +
>  #endif /* __ASM_BARRIER_H */
> --- a/arch/score/include/asm/Kbuild
> +++ b/arch/score/include/asm/Kbuild
> @@ -5,3 +5,4 @@ generic-y += clkdev.h
>  generic-y += trace_clock.h
>  generic-y += xor.h
>  generic-y += preempt.h
> +generic-y += barrier.h
> --- a/arch/score/include/asm/barrier.h
> +++ /dev/null
> @@ -1,16 +0,0 @@
> -#ifndef _ASM_SCORE_BARRIER_H
> -#define _ASM_SCORE_BARRIER_H
> -
> -#define mb()		barrier()
> -#define rmb()		barrier()
> -#define wmb()		barrier()
> -#define smp_mb()	barrier()
> -#define smp_rmb()	barrier()
> -#define smp_wmb()	barrier()
> -
> -#define read_barrier_depends()		do {} while (0)
> -#define smp_read_barrier_depends()	do {} while (0)
> -
> -#define set_mb(var, value) 		do {var = value; wmb(); } while (0)
> -
> -#endif /* _ASM_SCORE_BARRIER_H */
> --- a/arch/sh/include/asm/barrier.h
> +++ b/arch/sh/include/asm/barrier.h
> @@ -26,29 +26,14 @@
>  #if defined(CONFIG_CPU_SH4A) || defined(CONFIG_CPU_SH5)
>  #define mb()		__asm__ __volatile__ ("synco": : :"memory")
>  #define rmb()		mb()
> -#define wmb()		__asm__ __volatile__ ("synco": : :"memory")
> +#define wmb()		mb()
>  #define ctrl_barrier()	__icbi(PAGE_OFFSET)
> -#define read_barrier_depends()	do { } while(0)
>  #else
> -#define mb()		__asm__ __volatile__ ("": : :"memory")
> -#define rmb()		mb()
> -#define wmb()		__asm__ __volatile__ ("": : :"memory")
>  #define ctrl_barrier()	__asm__ __volatile__ ("nop;nop;nop;nop;nop;nop;nop;nop")
> -#define read_barrier_depends()	do { } while(0)
> -#endif
> -
> -#ifdef CONFIG_SMP
> -#define smp_mb()	mb()
> -#define smp_rmb()	rmb()
> -#define smp_wmb()	wmb()
> -#define smp_read_barrier_depends()	read_barrier_depends()
> -#else
> -#define smp_mb()	barrier()
> -#define smp_rmb()	barrier()
> -#define smp_wmb()	barrier()
> -#define smp_read_barrier_depends()	do { } while(0)
>  #endif
> 
>  #define set_mb(var, value) do { (void)xchg(&var, value); } while (0)
> 
> +#include <asm-generic/barrier.h>
> +
>  #endif /* __ASM_SH_BARRIER_H */
> --- a/arch/sparc/include/asm/barrier_32.h
> +++ b/arch/sparc/include/asm/barrier_32.h
> @@ -1,15 +1,6 @@
>  #ifndef __SPARC_BARRIER_H
>  #define __SPARC_BARRIER_H
> 
> -/* XXX Change this if we ever use a PSO mode kernel. */
> -#define mb()	__asm__ __volatile__ ("" : : : "memory")
> -#define rmb()	mb()
> -#define wmb()	mb()
> -#define read_barrier_depends()	do { } while(0)
> -#define set_mb(__var, __value)  do { __var = __value; mb(); } while(0)
> -#define smp_mb()	__asm__ __volatile__("":::"memory")
> -#define smp_rmb()	__asm__ __volatile__("":::"memory")
> -#define smp_wmb()	__asm__ __volatile__("":::"memory")
> -#define smp_read_barrier_depends()	do { } while(0)
> +#include <asm-generic/barrier.h>
> 
>  #endif /* !(__SPARC_BARRIER_H) */
> --- a/arch/sparc/include/asm/barrier_64.h
> +++ b/arch/sparc/include/asm/barrier_64.h
> @@ -53,4 +53,19 @@ do {	__asm__ __volatile__("ba,pt	%%xcc,
> 
>  #define smp_read_barrier_depends()	do { } while(0)
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	barrier();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	barrier();							\
> +	___p1;								\
> +})
> +
>  #endif /* !(__SPARC64_BARRIER_H) */
> --- a/arch/tile/include/asm/barrier.h
> +++ b/arch/tile/include/asm/barrier.h
> @@ -22,59 +22,6 @@
>  #include <arch/spr_def.h>
>  #include <asm/timex.h>
> 
> -/*
> - * read_barrier_depends - Flush all pending reads that subsequents reads
> - * depend on.
> - *
> - * No data-dependent reads from memory-like regions are ever reordered
> - * over this barrier.  All reads preceding this primitive are guaranteed
> - * to access memory (but not necessarily other CPUs' caches) before any
> - * reads following this primitive that depend on the data return by
> - * any of the preceding reads.  This primitive is much lighter weight than
> - * rmb() on most CPUs, and is never heavier weight than is
> - * rmb().
> - *
> - * These ordering constraints are respected by both the local CPU
> - * and the compiler.
> - *
> - * Ordering is not guaranteed by anything other than these primitives,
> - * not even by data dependencies.  See the documentation for
> - * memory_barrier() for examples and URLs to more information.
> - *
> - * For example, the following code would force ordering (the initial
> - * value of "a" is zero, "b" is one, and "p" is "&a"):
> - *
> - * <programlisting>
> - *	CPU 0				CPU 1
> - *
> - *	b = 2;
> - *	memory_barrier();
> - *	p = &b;				q = p;
> - *					read_barrier_depends();
> - *					d = *q;
> - * </programlisting>
> - *
> - * because the read of "*q" depends on the read of "p" and these
> - * two reads are separated by a read_barrier_depends().  However,
> - * the following code, with the same initial values for "a" and "b":
> - *
> - * <programlisting>
> - *	CPU 0				CPU 1
> - *
> - *	a = 2;
> - *	memory_barrier();
> - *	b = 3;				y = b;
> - *					read_barrier_depends();
> - *					x = a;
> - * </programlisting>
> - *
> - * does not enforce ordering, since there is no data dependency between
> - * the read of "a" and the read of "b".  Therefore, on some CPUs, such
> - * as Alpha, "y" could be set to 3 and "x" to 0.  Use rmb()
> - * in cases like this where there are no data dependencies.
> - */
> -#define read_barrier_depends()	do { } while (0)
> -
>  #define __sync()	__insn_mf()
> 
>  #include <hv/syscall_public.h>
> @@ -125,20 +72,7 @@ mb_incoherent(void)
>  #define mb()		fast_mb()
>  #define iob()		fast_iob()
> 
> -#ifdef CONFIG_SMP
> -#define smp_mb()	mb()
> -#define smp_rmb()	rmb()
> -#define smp_wmb()	wmb()
> -#define smp_read_barrier_depends()	read_barrier_depends()
> -#else
> -#define smp_mb()	barrier()
> -#define smp_rmb()	barrier()
> -#define smp_wmb()	barrier()
> -#define smp_read_barrier_depends()	do { } while (0)
> -#endif
> -
> -#define set_mb(var, value) \
> -	do { var = value; mb(); } while (0)
> +#include <asm-generic/barrier.h>
> 
>  #endif /* !__ASSEMBLY__ */
>  #endif /* _ASM_TILE_BARRIER_H */
> --- a/arch/unicore32/include/asm/barrier.h
> +++ b/arch/unicore32/include/asm/barrier.h
> @@ -14,15 +14,6 @@
>  #define dsb() __asm__ __volatile__ ("" : : : "memory")
>  #define dmb() __asm__ __volatile__ ("" : : : "memory")
> 
> -#define mb()				barrier()
> -#define rmb()				barrier()
> -#define wmb()				barrier()
> -#define smp_mb()			barrier()
> -#define smp_rmb()			barrier()
> -#define smp_wmb()			barrier()
> -#define read_barrier_depends()		do { } while (0)
> -#define smp_read_barrier_depends()	do { } while (0)
> -
> -#define set_mb(var, value)		do { var = value; smp_mb(); } while (0)
> +#include <asm-generic/barrier.h>
> 
>  #endif /* __UNICORE_BARRIER_H__ */
> --- a/arch/x86/include/asm/barrier.h
> +++ b/arch/x86/include/asm/barrier.h
> @@ -100,6 +100,21 @@
>  #define set_mb(var, value) do { var = value; barrier(); } while (0)
>  #endif
> 
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	barrier();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	barrier();							\
> +	___p1;								\
> +})
> +
>  /*
>   * Stop RDTSC speculation. This is needed when you need to use RDTSC
>   * (or get_cycles or vread that possibly accesses the TSC) in a defined
> --- a/arch/xtensa/include/asm/barrier.h
> +++ b/arch/xtensa/include/asm/barrier.h
> @@ -9,21 +9,14 @@
>  #ifndef _XTENSA_SYSTEM_H
>  #define _XTENSA_SYSTEM_H
> 
> -#define smp_read_barrier_depends() do { } while(0)
> -#define read_barrier_depends() do { } while(0)
> -
>  #define mb()  ({ __asm__ __volatile__("memw" : : : "memory"); })
>  #define rmb() barrier()
>  #define wmb() mb()
> 
>  #ifdef CONFIG_SMP
>  #error smp_* not defined
> -#else
> -#define smp_mb()	barrier()
> -#define smp_rmb()	barrier()
> -#define smp_wmb()	barrier()
>  #endif
> 
> -#define set_mb(var, value)	do { var = value; mb(); } while (0)
> +#include <asm-generic/barrier.h>
> 
>  #endif /* _XTENSA_SYSTEM_H */
> --- a/include/asm-generic/barrier.h
> +++ b/include/asm-generic/barrier.h
> @@ -1,4 +1,5 @@
> -/* Generic barrier definitions, based on MN10300 definitions.
> +/*
> + * Generic barrier definitions, based on MN10300 definitions.
>   *
>   * It should be possible to use these on really simple architectures,
>   * but it serves more as a starting point for new ports.
> @@ -16,35 +17,67 @@
> 
>  #ifndef __ASSEMBLY__
> 
> -#define nop() asm volatile ("nop")
> +#include <asm/compiler.h>
> +
> +#ifndef nop
> +#define nop()	asm volatile ("nop")
> +#endif
> 
>  /*
> - * Force strict CPU ordering.
> - * And yes, this is required on UP too when we're talking
> - * to devices.
> + * Force strict CPU ordering. And yes, this is required on UP too when we're
> + * talking to devices.
>   *
> - * This implementation only contains a compiler barrier.
> + * Fall back to compiler barriers if nothing better is provided.
>   */
> 
> -#define mb()	asm volatile ("": : :"memory")
> -#define rmb()	mb()
> -#define wmb()	asm volatile ("": : :"memory")
> +#ifndef mb
> +#define mb()	barrier()
> +#endif
> +
> +#ifndef rmb
> +#define rmb()	barrier()
> +#endif
> +
> +#ifndef wmb
> +#define wmb()	barrier()
> +#endif
> +
> +#ifndef read_barrier_depends
> +#define read_barrier_depends()		do {} while (0)
> +#endif
> 
>  #ifdef CONFIG_SMP
>  #define smp_mb()	mb()
>  #define smp_rmb()	rmb()
>  #define smp_wmb()	wmb()
> +#define smp_read_barrier_depends()	read_barrier_depends()
>  #else
>  #define smp_mb()	barrier()
>  #define smp_rmb()	barrier()
>  #define smp_wmb()	barrier()
> +#define smp_read_barrier_depends()	do {} while (0)
>  #endif
> 
> +#ifndef set_mb
>  #define set_mb(var, value)  do { var = value;  mb(); } while (0)
> +#endif
> +
>  #define set_wmb(var, value) do { var = value; wmb(); } while (0)
> 
> -#define read_barrier_depends()		do {} while (0)
> -#define smp_read_barrier_depends()	do {} while (0)
> +#define smp_store_release(p, v)						\
> +do {									\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	ACCESS_ONCE(*p) = (v);						\
> +} while (0)
> +
> +#define smp_load_acquire(p)						\
> +({									\
> +	typeof(*p) ___p1 = ACCESS_ONCE(*p);				\
> +	compiletime_assert_atomic_type(*p);				\
> +	smp_mb();							\
> +	___p1;								\
> +})
> 
>  #endif /* !__ASSEMBLY__ */
>  #endif /* __ASM_GENERIC_BARRIER_H */
> --- a/include/linux/compiler.h
> +++ b/include/linux/compiler.h
> @@ -298,6 +298,11 @@ void ftrace_likely_update(struct ftrace_
>  # define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
>  #endif
> 
> +/* Is this type a native word size -- useful for atomic operations */
> +#ifndef __native_word
> +# define __native_word(t) (sizeof(t) == sizeof(int) || sizeof(t) == sizeof(long))
> +#endif
> +
>  /* Compile time object size, -1 for unknown */
>  #ifndef __compiletime_object_size
>  # define __compiletime_object_size(obj) -1
> @@ -337,6 +342,10 @@ void ftrace_likely_update(struct ftrace_
>  #define compiletime_assert(condition, msg) \
>  	_compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)
> 
> +#define compiletime_assert_atomic_type(t)				\
> +	compiletime_assert(__native_word(t),				\
> +		"Need native word sized stores/loads for atomicity.")
> +
>  /*
>   * Prevent the compiler from merging or refetching accesses.  The compiler
>   * is also forbidden from reordering successive instances of ACCESS_ONCE(),
> 
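
As a rough illustration of how the two new primitives are meant to be used (a
sketch with made-up names, not code from the patch): a release store on the
producer side pairs with an acquire load on the consumer side, replacing an
open-coded smp_wmb()/smp_rmb() pair in the classic message-passing pattern.

/*
 * Sketch only.  The release orders the data write before the flag write;
 * the acquire orders the flag read before the data read, so a consumer
 * that observes flag == 1 is guaranteed to also observe data == 42.
 */
int data, flag;

void producer(void)
{
	data = 42;
	smp_store_release(&flag, 1);
}

int consumer(void)
{
	if (smp_load_acquire(&flag))
		return data;
	return -1;	/* not published yet */
}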


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-06 18:48                                                       ` Paul E. McKenney
@ 2013-11-06 19:42                                                         ` Peter Zijlstra
  0 siblings, 0 replies; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-06 19:42 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Geert Uytterhoeven, Linus Torvalds, Victor Kaplansky,
	Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Russell King,
	Martin Schwidefsky, Heiko Carstens, Tony Luck

On Wed, Nov 06, 2013 at 10:48:48AM -0800, Paul E. McKenney wrote:
> A few nits on Documentation/memory-barriers.txt and some pointless
> comments elsewhere.  With the suggested Documentation/memory-barriers.txt
> fixes:
> 
> Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Thanks, I think I'll cut the thing into a number of smaller patches with
an identical end result. Will (hopefully) post a full new series tomorrow
somewhere.

I was thinking like:
 1 - aggressively employ asm-generic/barrier.h
 2 - Reformulate _The_ document to ACQUIRE/RELEASE
 3 - add the new store/load thingies

That should hopefully be slightly easier to look at.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [tip:perf/core] tools/perf: Add required memory barriers
  2013-11-06 18:24                                     ` Peter Zijlstra
@ 2013-11-07  8:21                                       ` Ingo Molnar
  2013-11-07 14:27                                         ` Vince Weaver
  2013-11-11 16:24                                         ` Peter Zijlstra
  0 siblings, 2 replies; 120+ messages in thread
From: Ingo Molnar @ 2013-11-07  8:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vince Weaver, hpa, anton, mathieu.desnoyers, linux-kernel,
	michael, paulmck, benh, fweisbec, VICTORK, tglx, oleg, mikey,
	linux-tip-commits, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, David Ahern


* Peter Zijlstra <peterz@infradead.org> wrote:

> > Requiring the user of a kernel interface to have a deep knowledge of 
> > optimizing compilers, barriers, and CPU memory models is just asking 
> > for trouble.
> 
> It shouldn't be all that hard to put this in a (lgpl) library others can 
> link to -- that way you can build it once (using GCC).

I'd suggest to expose it via a new perf syscall, using vsyscall methods to 
not have to enter the kernel for the pure user-space bits. It should also 
have a real usecase in tools/perf/ so that it's constantly tested, with 
matching 'perf test' entries, etc.

I don't want a library that is external and under-tested: for example 
quite a few of the PAPI breakages were found very late, after a new kernel 
has been released - that's the big disadvantage of librarization and 
decentralization. The decentralized library model might work if all you 
want to create is a second-class also-ran GUI, but it just doesn't work 
very well for actively developed kernel code.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-06 13:57                                                     ` Peter Zijlstra
  2013-11-06 18:48                                                       ` Paul E. McKenney
@ 2013-11-07 11:17                                                       ` Will Deacon
  2013-11-07 13:36                                                         ` Peter Zijlstra
  1 sibling, 1 reply; 120+ messages in thread
From: Will Deacon @ 2013-11-07 11:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Geert Uytterhoeven, Paul E. McKenney, Linus Torvalds,
	Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling,
	Russell King, Martin Schwidefsky, Heiko Carstens, Tony Luck

Hi Peter,

Couple of minor fixes on the arm64 side...

On Wed, Nov 06, 2013 at 01:57:36PM +0000, Peter Zijlstra wrote:
> --- a/arch/arm64/include/asm/barrier.h
> +++ b/arch/arm64/include/asm/barrier.h
> @@ -35,11 +35,59 @@
>  #define smp_mb()       barrier()
>  #define smp_rmb()      barrier()
>  #define smp_wmb()      barrier()
> +
> +#define smp_store_release(p, v)                                                \
> +do {                                                                   \
> +       compiletime_assert_atomic_type(*p);                             \
> +       smp_mb();                                                       \
> +       ACCESS_ONCE(*p) = (v);                                          \
> +} while (0)
> +
> +#define smp_load_acquire(p)                                            \
> +({                                                                     \
> +       typeof(*p) ___p1 = ACCESS_ONCE(*p);                             \
> +       compiletime_assert_atomic_type(*p);                             \
> +       smp_mb();                                                       \
> +       ___p1;                                                          \
> +})
> +
>  #else
> +
>  #define smp_mb()       asm volatile("dmb ish" : : : "memory")
>  #define smp_rmb()      asm volatile("dmb ishld" : : : "memory")
>  #define smp_wmb()      asm volatile("dmb ishst" : : : "memory")
> -#endif

Why are you getting rid of this #endif?

> +#define smp_store_release(p, v)                                                \
> +do {                                                                   \
> +       compiletime_assert_atomic_type(*p);                             \
> +       switch (sizeof(*p)) {                                           \
> +       case 4:                                                         \
> +               asm volatile ("stlr %w1, [%0]"                          \
> +                               : "=Q" (*p) : "r" (v) : "memory");      \
> +               break;                                                  \
> +       case 8:                                                         \
> +               asm volatile ("stlr %1, [%0]"                           \
> +                               : "=Q" (*p) : "r" (v) : "memory");      \
> +               break;                                                  \
> +       }                                                               \
> +} while (0)
> +
> +#define smp_load_acquire(p)                                            \
> +({                                                                     \
> +       typeof(*p) ___p1;                                               \
> +       compiletime_assert_atomic_type(*p);                             \
> +       switch (sizeof(*p)) {                                           \
> +       case 4:                                                         \
> +               asm volatile ("ldar %w0, [%1]"                          \
> +                       : "=r" (___p1) : "Q" (*p) : "memory");          \
> +               break;                                                  \
> +       case 8:                                                         \
> +               asm volatile ("ldar %0, [%1]"                           \
> +                       : "=r" (___p1) : "Q" (*p) : "memory");          \
> +               break;                                                  \
> +       }                                                               \
> +       ___p1;                                                          \
> +})

You don't need the square brackets when using the "Q" constraint (otherwise
it will expand to something like [[x0]], which gas won't accept).

With those changes, for the general idea and arm/arm64 parts:

  Acked-by: Will Deacon <will.deacon@arm.com>

Will

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-07 11:17                                                       ` Will Deacon
@ 2013-11-07 13:36                                                         ` Peter Zijlstra
  0 siblings, 0 replies; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-07 13:36 UTC (permalink / raw)
  To: Will Deacon
  Cc: Geert Uytterhoeven, Paul E. McKenney, Linus Torvalds,
	Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling,
	Russell King, Martin Schwidefsky, Heiko Carstens, Tony Luck

On Thu, Nov 07, 2013 at 11:17:41AM +0000, Will Deacon wrote:
> Hi Peter,
> 
> Couple of minor fixes on the arm64 side...
> 
> On Wed, Nov 06, 2013 at 01:57:36PM +0000, Peter Zijlstra wrote:
> > --- a/arch/arm64/include/asm/barrier.h
> > +++ b/arch/arm64/include/asm/barrier.h
> > @@ -35,11 +35,59 @@
> >  #define smp_mb()       barrier()
> >  #define smp_rmb()      barrier()
> >  #define smp_wmb()      barrier()
> > +
> > +#define smp_store_release(p, v)                                                \
> > +do {                                                                   \
> > +       compiletime_assert_atomic_type(*p);                             \
> > +       smp_mb();                                                       \
> > +       ACCESS_ONCE(*p) = (v);                                          \
> > +} while (0)
> > +
> > +#define smp_load_acquire(p)                                            \
> > +({                                                                     \
> > +       typeof(*p) ___p1 = ACCESS_ONCE(*p);                             \
> > +       compiletime_assert_atomic_type(*p);                             \
> > +       smp_mb();                                                       \
> > +       ___p1;                                                          \
> > +})
> > +
> >  #else
> > +
> >  #define smp_mb()       asm volatile("dmb ish" : : : "memory")
> >  #define smp_rmb()      asm volatile("dmb ishld" : : : "memory")
> >  #define smp_wmb()      asm volatile("dmb ishst" : : : "memory")
> > -#endif
> 
> Why are you getting rid of this #endif?

oops..

> > +#define smp_store_release(p, v)                                                \
> > +do {                                                                   \
> > +       compiletime_assert_atomic_type(*p);                             \
> > +       switch (sizeof(*p)) {                                           \
> > +       case 4:                                                         \
> > +               asm volatile ("stlr %w1, [%0]"                          \
> > +                               : "=Q" (*p) : "r" (v) : "memory");      \
> > +               break;                                                  \
> > +       case 8:                                                         \
> > +               asm volatile ("stlr %1, [%0]"                           \
> > +                               : "=Q" (*p) : "r" (v) : "memory");      \
> > +               break;                                                  \
> > +       }                                                               \
> > +} while (0)
> > +
> > +#define smp_load_acquire(p)                                            \
> > +({                                                                     \
> > +       typeof(*p) ___p1;                                               \
> > +       compiletime_assert_atomic_type(*p);                             \
> > +       switch (sizeof(*p)) {                                           \
> > +       case 4:                                                         \
> > +               asm volatile ("ldar %w0, [%1]"                          \
> > +                       : "=r" (___p1) : "Q" (*p) : "memory");          \
> > +               break;                                                  \
> > +       case 8:                                                         \
> > +               asm volatile ("ldar %0, [%1]"                           \
> > +                       : "=r" (___p1) : "Q" (*p) : "memory");          \
> > +               break;                                                  \
> > +       }                                                               \
> > +       ___p1;                                                          \
> > +})
> 
> You don't need the square brackets when using the "Q" constraint (otherwise
> it will expand to something like [[x0]], which gas won't accept).
> 
> With those changes, for the general idea and arm/arm64 parts:
> 
>   Acked-by: Will Deacon <will.deacon@arm.com>

Thanks, I did that split-up I talked about yesterday; I was going to
compile the patches for all archs I have a compiler for before posting again.



---
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -62,11 +62,11 @@ do {									\
 	compiletime_assert_atomic_type(*p);				\
 	switch (sizeof(*p)) {						\
 	case 4:								\
-		asm volatile ("stlr %w1, [%0]"				\
+		asm volatile ("stlr %w1, %0"				\
 				: "=Q" (*p) : "r" (v) : "memory");	\
 		break;							\
 	case 8:								\
-		asm volatile ("stlr %1, [%0]"				\
+		asm volatile ("stlr %1, %0"				\
 				: "=Q" (*p) : "r" (v) : "memory");	\
 		break;							\
 	}								\
@@ -78,17 +78,19 @@ do {									\
 	compiletime_assert_atomic_type(*p);				\
 	switch (sizeof(*p)) {						\
 	case 4:								\
-		asm volatile ("ldar %w0, [%1]"				\
+		asm volatile ("ldar %w0, %1"				\
 			: "=r" (___p1) : "Q" (*p) : "memory");		\
 		break;							\
 	case 8:								\
-		asm volatile ("ldar %0, [%1]"				\
+		asm volatile ("ldar %0, %1"				\
 			: "=r" (___p1) : "Q" (*p) : "memory");		\
 		break;							\
 	}								\
 	___p1;								\
 })
 
+#endif
+
 #define read_barrier_depends()		do { } while(0)
 #define smp_read_barrier_depends()	do { } while(0)
 

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [tip:perf/core] tools/perf: Add required memory barriers
  2013-11-07  8:21                                       ` Ingo Molnar
@ 2013-11-07 14:27                                         ` Vince Weaver
  2013-11-07 15:55                                           ` Ingo Molnar
  2013-11-11 16:24                                         ` Peter Zijlstra
  1 sibling, 1 reply; 120+ messages in thread
From: Vince Weaver @ 2013-11-07 14:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, hpa, anton, mathieu.desnoyers, linux-kernel,
	michael, paulmck, benh, fweisbec, VICTORK, tglx, oleg, mikey,
	linux-tip-commits, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, David Ahern

On Thu, 7 Nov 2013, Ingo Molnar wrote:

> I don't want a library that is external and under-tested: for example 
> quite a few of the PAPI breakages were found very late, after a new kernel 
> has been released - that's the big disadvantage of librarization and 
> decentralization. The decentralized library model might work if all you 
> want to create is a second-class also-ran GUI, but it just doesn't work 
> very well for actively developed kernel code.

I would argue that PAPI's problem was that it was trying to use 
perf_event_open(), which is a complex, poorly documented kernel interface 
with a lot of code churn.  Usually it's the kernel's job not to break 
user-space, not the other way around.

It's too late on the decentralized library issue.  PAPI has to support 
kernels going back to 2.6.32 so it's going to have its own copy of the 
mmap parsing code even if a new syscall gets introduced.

There are a lot of tools out there now that open-code a perf_event 
interface.  I don't think it's really possible to say "anyone using
the syscall without using our kernel library is unsupported".


This current issue with the locking doesn't really matter much because as 
far as I can tell it's an obscure potential corner case that no one has 
seen in practice yet.  So the easiest solution might just be to ignore the 
whole issue, which is a lot easier than trying to write a custom portable 
cross-platform license-agnostic memory barrier library.

We do try to keep the PAPI mmap reading code as close to perf's as 
possible though, just because we know you aren't going to notice or care if 
you break it for other users.

Vince

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [tip:perf/core] tools/perf: Add required memory barriers
  2013-11-07 14:27                                         ` Vince Weaver
@ 2013-11-07 15:55                                           ` Ingo Molnar
  0 siblings, 0 replies; 120+ messages in thread
From: Ingo Molnar @ 2013-11-07 15:55 UTC (permalink / raw)
  To: Vince Weaver
  Cc: Peter Zijlstra, hpa, anton, mathieu.desnoyers, linux-kernel,
	michael, paulmck, benh, fweisbec, VICTORK, tglx, oleg, mikey,
	linux-tip-commits, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, David Ahern


* Vince Weaver <vince@deater.net> wrote:

> On Thu, 7 Nov 2013, Ingo Molnar wrote:
> 
> > I don't want a library that is external and under-tested: for example 
> > quite a few of the PAPI breakages were found very late, after a new 
> > kernel has been released - that's the big disadvantage of 
> > librarization and decentralization. The decentralized library model 
> > might work if all you want to create is a second-class also-ran GUI, 
> > but it just doesn't work very well for actively developed kernel code.
> 
> I would argue that PAPI's problem was because it was trying to use 
> perf_event_open() which is a complex, poorly documented kernel interface 
> with a lot of code churn.  Usually it's the job of the kernel not to 
> break user-space, not the other way around.

As Linus said at the Kernel Summit: breakages that don't get reported 
or don't get noticed essentially don't exist. We can only fix what gets 
reported in time.

> It's too late on the decentralized library issue.  PAPI has to support 
> kernels going back to 2.6.32 so it's going to have its own copy of the 
> mmap parsing code even if a new syscall gets introduced.
> 
> There are a lot of tools out there now that open-code a perf_event 
> interface.  I don't think it's really possible to say "anyone using the 
> syscall without using our kernel library is unsupported".

I'm not saying that at all - but you appear to expect perfect kernel code 
and fixes done before you report them: that's an impossible expectation.

It's your choice to live outside the space that we readily test and it's 
your choice to not test your bits with a new kernel in time. Others do not 
test your code with -rc kernels, they don't report the bugs, so some bugs 
that affect the PAPI library go unnoticed. Yet you try to blame it on us, 
which is really backwards ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [RFC] arch: Introduce new TSO memory barrier smp_tmb()
  2013-11-04 11:22                                         ` Peter Zijlstra
  2013-11-04 16:27                                           ` Paul E. McKenney
@ 2013-11-07 23:50                                           ` Mathieu Desnoyers
  1 sibling, 0 replies; 120+ messages in thread
From: Mathieu Desnoyers @ 2013-11-07 23:50 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Linus Torvalds, Victor Kaplansky,
	Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Michael Ellerman,
	Michael Neuling

* Peter Zijlstra (peterz@infradead.org) wrote:

[...]

Hi Peter,

Looking at this simplified version of perf's ring buffer
synchronization, I get concerned about the following issue:

> /*
>  * One important detail is that the kbuf part and the kbuf_writer() are
>  * strictly per cpu and we can thus rely on program order for those.
>  *
>  * Only the userspace consumer can possibly run on another cpu, and thus we
>  * need to ensure data consistency for those.
>  */
> 
> struct buffer {
>         u64 size;
>         u64 tail;
>         u64 head;
>         void *data;
> };
> 
> struct buffer *kbuf, *ubuf;
> 
> /*
>  * If there's space in the buffer; store the data @buf; otherwise
>  * discard it.
>  */
> void kbuf_write(int sz, void *buf)
> {
> 	u64 tail, head, offset;
> 
> 	do {
> 		tail = ACCESS_ONCE(ubuf->tail);
> 		offset = head = kbuf->head;
> 		if (CIRC_SPACE(head, tail, kbuf->size) < sz) {
> 			/* discard @buf */
> 			return;
> 		}
> 		head += sz;
> 	} while (local_cmpxchg(&kbuf->head, offset, head) != offset)
> 

Let's suppose we have a thread executing kbuf_write(), interrupted by an
IRQ or NMI right after a successful local_cmpxchg() (space reservation
in the buffer). If the nested execution context also calls kbuf_write(),
it will update ubuf->head (below) with the second reserved space, and
only after that will control return to the original thread context,
which continues executing kbuf_write() and overwrites ubuf->head with
the prior-to-last reserved offset.

All this probably works OK most of the time, when the event flow
guarantees that a following event will fix things up, but there
appears to be a risk of losing events near the end of the trace when
those last events are emitted from nested execution contexts.

Thoughts ?
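
(One possible way out, sketched below with made-up names rather than the
actual perf code: publish ubuf->head only from the outermost writer, gated
by a strictly per-cpu nesting count, and re-read kbuf->head at that point so
the freshest reserved offset, including the nested ones, is what becomes
visible.)

/*
 * Sketch only: kbuf_write() would increment 'nest' before reserving space
 * and call kbuf_publish() when done; 'nest' is strictly per cpu, like kbuf.
 */
static int nest;

void kbuf_publish(void)
{
	u64 head;

	if (--nest)			/* nested writer: outermost caller publishes */
		return;

	head = ACCESS_ONCE(kbuf->head);	/* freshest reserved offset */

	/*
	 * Order the copied data before the head update that makes it
	 * visible to userspace (barrier B in the code under discussion).
	 */
	smp_wmb();
	ubuf->head = head;

	/*
	 * Not shown: a nested writer can still slip in between the
	 * ACCESS_ONCE() read and the store above, so a re-check of
	 * kbuf->head and a retry are needed to fully close the race.
	 */
}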

Thanks,

Mathieu

>         /*
>          * Ensure that if we see the userspace tail (ubuf->tail) such
>          * that there is space to write @buf without overwriting data
>          * userspace hasn't seen yet, we won't in fact store data before
>          * that read completes.
>          */
> 
>         smp_mb(); /* A, matches with D */
> 
>         memcpy(kbuf->data + offset, buf, sz);
> 
>         /*
>          * Ensure that we write all the @buf data before we update the
>          * userspace visible ubuf->head pointer.
>          */
>         smp_wmb(); /* B, matches with C */
> 
>         ubuf->head = kbuf->head;
> }
> 
> /*
>  * Consume the buffer data and update the tail pointer to indicate to
>  * kernel space there's 'free' space.
>  */
> void ubuf_read(void)
> {
>         u64 head, tail;
> 
>         tail = ACCESS_ONCE(ubuf->tail);
>         head = ACCESS_ONCE(ubuf->head);
> 
>         /*
>          * Ensure we read the buffer boundaries before the actual buffer
>          * data...
>          */
>         smp_rmb(); /* C, matches with B */
> 
>         while (tail != head) {
>                 obj = ubuf->data + tail;
>                 /* process obj */
>                 tail += obj->size;
>                 tail %= ubuf->size;
>         }
> 
>         /*
>          * Ensure all data reads are complete before we issue the
>          * ubuf->tail update; once that update hits, kbuf_write() can
>          * observe and overwrite data.
>          */
>         smp_mb(); /* D, matches with A */
> 
>         ubuf->tail = tail;
> }

-- 
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [tip:perf/core] tools/perf: Add required memory barriers
  2013-11-07  8:21                                       ` Ingo Molnar
  2013-11-07 14:27                                         ` Vince Weaver
@ 2013-11-11 16:24                                         ` Peter Zijlstra
  2013-11-11 21:10                                           ` Ingo Molnar
  1 sibling, 1 reply; 120+ messages in thread
From: Peter Zijlstra @ 2013-11-11 16:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Vince Weaver, hpa, anton, mathieu.desnoyers, linux-kernel,
	michael, paulmck, benh, fweisbec, VICTORK, tglx, oleg, mikey,
	linux-tip-commits, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, David Ahern

On Thu, Nov 07, 2013 at 09:21:22AM +0100, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > Requiring the user of a kernel interface to have a deep knowledge of 
> > > optimizing compilers, barriers, and CPU memory models is just asking 
> > > for trouble.
> > 
> > It shouldn't be all that hard to put this in a (lgpl) library others can 
> > link to -- that way you can build it once (using GCC).
> 
> I'd suggest to expose it via a new perf syscall, using vsyscall methods to 
> not have to enter the kernel for the pure user-space bits. It should also 
> have a real usecase in tools/perf/ so that it's constantly tested, with 
> matching 'perf test' entries, etc.

Oh man, I've never poked at the entire vsyscall stuff before; let alone
done it for ARM, ARM64, PPC64, etc.

Keeping it in userspace like we have is so much easier.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [tip:perf/core] tools/perf: Add required memory barriers
  2013-11-11 16:24                                         ` Peter Zijlstra
@ 2013-11-11 21:10                                           ` Ingo Molnar
  0 siblings, 0 replies; 120+ messages in thread
From: Ingo Molnar @ 2013-11-11 21:10 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vince Weaver, hpa, anton, mathieu.desnoyers, linux-kernel,
	michael, paulmck, benh, fweisbec, VICTORK, tglx, oleg, mikey,
	linux-tip-commits, Arnaldo Carvalho de Melo, Jiri Olsa,
	Namhyung Kim, David Ahern


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, Nov 07, 2013 at 09:21:22AM +0100, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > > Requiring the user of a kernel interface to have a deep knowledge of 
> > > > optimizing compilers, barriers, and CPU memory models is just asking 
> > > > for trouble.
> > > 
> > > It shouldn't be all that hard to put this in a (lgpl) library others can 
> > > link to -- that way you can build it once (using GCC).
> > 
> > I'd suggest to expose it via a new perf syscall, using vsyscall methods to 
> > not have to enter the kernel for the pure user-space bits. It should also 
> > have a real usecase in tools/perf/ so that it's constantly tested, with 
> > matching 'perf test' entries, etc.
> 
> Oh man, I've never poked at the entire vsyscall stuff before; let alone
> done it for ARM, ARM64, PPC64 etc..
> 
> Keeping it in userspace like we have is so much easier.

... and so much more broken in fantastic ways, right? ;-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
  2014-05-09 12:20   ` Mikulas Patocka
@ 2014-05-09 13:47     ` Paul E. McKenney
  0 siblings, 0 replies; 120+ messages in thread
From: Paul E. McKenney @ 2014-05-09 13:47 UTC (permalink / raw)
  To: Mikulas Patocka; +Cc: Victor Kaplansky, Peter Zijlstra, linux-kernel

On Fri, May 09, 2014 at 08:20:25AM -0400, Mikulas Patocka wrote:
> 
> 
> On Fri, 9 May 2014, Victor Kaplansky wrote:
> 
> > Mikulas Patocka <mpatocka@redhat.com> wrote on 05/08/2014 11:46:53 PM:
> > 
> > > > > BTW, it is why you also don't need ACCESS_ONCE() around @tail, but only
> > > > > around
> > > > > @head read.
> > > >
> > > > Agreed, the ACCESS_ONCE() around tail is superfluous since we're the one
> > > > updating tail, so there's no problem with the value changing
> > > > unexpectedly.
> > >
> > > You need ACCESS_ONCE even if you are the only process writing the value.
> > > Because without ACCESS_ONCE, the compiler may perform store tearing and
> > > split the store into several smaller stores. Search the file
> > > "Documentation/memory-barriers.txt" for the term "store tearing", it shows
> > > an example where one instruction storing 32-bit value may be split to two
> > > instructions, each storing 16-bit value.
> > >
> > > Mikulas
> > 
> > AFAIR, I was talking about redundant ACCESS_ONCE() around @tail *read* in
> > consumer code. As for ACCESS_ONCE() around @tail write in consumer code,
> > I see your point, but I don't think that volatile imposed by ACCESS_ONCE()
> > is appropriate, since:
> > 
> >     - compiler can generate several stores despite volatile if @tail
> >     is bigger in size than native machine data size, e.g. 64-bit on
> >     a 32-bit CPU.
> 
> That's true - so you should define data_head and data_tail as "unsigned 
> long", not "__u64".
> 
> >     - volatile imposed by ACCESS_ONCE() does nothing to prevent CPU from
> >     reordering, splitting or merging accesses. It can only mediate
> >     communication problems between processes running on same CPU.
> 
> That's why you need an smp barrier in addition to ACCESS_ONCE. You need both 
> - the smp barrier (to prevent the CPU from reordering) and ACCESS_ONCE (to 
> prevent the compiler from splitting the write into smaller memory accesses).

IIRC the ring-buffer code uses the fact that one element remains
empty to make clever double use of a memory barrier.

> Since Linux 3.14, there are new macros smp_store_release and 
> smp_load_acquire that combine ACCESS_ONCE and a memory barrier, so you can 
> use them. (They call compiletime_assert_atomic_type to make sure that you 
> don't use them on types that are not atomic, such as long long on 32-bit 
> architectures.)

These are indeed useful and often simpler to use than raw barriers.

							Thanx, Paul

> > What you really want is to guarantee *atomicity* of @tail write on consumer
> > side.
> > 
> > -- Victor
> 
> Mikulas


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
       [not found] ` <OF667059AA.7F151BCC-ONC2257CD3.0036CFEB-C2257CD3.003BBF01@il.ibm.com>
@ 2014-05-09 12:20   ` Mikulas Patocka
  2014-05-09 13:47     ` Paul E. McKenney
  0 siblings, 1 reply; 120+ messages in thread
From: Mikulas Patocka @ 2014-05-09 12:20 UTC (permalink / raw)
  To: Victor Kaplansky; +Cc: Peter Zijlstra, Paul E. McKenney, linux-kernel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2216 bytes --]



On Fri, 9 May 2014, Victor Kaplansky wrote:

> Mikulas Patocka <mpatocka@redhat.com> wrote on 05/08/2014 11:46:53 PM:
> 
> > > > BTW, it is why you also don't need ACCESS_ONCE() around @tail, but only
> > > > around
> > > > @head read.
> > >
> > > Agreed, the ACCESS_ONCE() around tail is superfluous since we're the one
> > > updating tail, so there's no problem with the value changing
> > > unexpectedly.
> >
> > You need ACCESS_ONCE even if you are the only process writing the value.
> > Because without ACCESS_ONCE, the compiler may perform store tearing and
> > split the store into several smaller stores. Search the file
> > "Documentation/memory-barriers.txt" for the term "store tearing", it shows
> > an example where one instruction storing 32-bit value may be split to two
> > instructions, each storing 16-bit value.
> >
> > Mikulas
> 
> AFAIR, I was talking about redundant ACCESS_ONCE() around @tail *read* in
> consumer code. As for ACCESS_ONCE() around @tail write in consumer code,
> I see your point, but I don't think that volatile imposed by ACCESS_ONCE()
> is appropriate, since:
> 
>     - compiler can generate several stores despite volatile if @tail
>     is bigger in size than native machine data size, e.g. 64-bit on
>     a 32-bit CPU.

That's true - so you should define data_head and data_tail as "unsigned 
long", not "__u64".

>     - volatile imposed by ACCESS_ONCE() does nothing to prevent CPU from
>     reordering, splitting or merging accesses. It can only mediate
>     communication problems between processes running on same CPU.

That's why you need an smp barrier in addition to ACCESS_ONCE. You need both 
- the smp barrier (to prevent the CPU from reordering) and ACCESS_ONCE (to 
prevent the compiler from splitting the write into smaller memory accesses).


Since Linux 3.14, there are new macros smp_store_release and 
smp_load_acquire that combine ACCESS_ONCE and a memory barrier, so you can 
use them. (They call compiletime_assert_atomic_type to make sure that you 
don't use them on types that are not atomic, such as long long on 32-bit 
architectures.)

> What you really want is to guarantee *atomicity* of @tail write on consumer
> side.
> 
> -- Victor

Mikulas
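
Concretely, using the field names of the simplified kbuf/ubuf example from
earlier in the thread, the consumer side could then be written roughly as
below (a sketch, not a patch; it assumes the head/tail fields are native
machine words, and process_one_record() is a made-up helper that consumes
one record and returns the new tail):

/*
 * Sketch only.  smp_load_acquire() stands in for the ACCESS_ONCE() +
 * smp_rmb() pair on the @head read (barrier C), and smp_store_release()
 * stands in for the smp_mb() (barrier D) + plain store on the @tail
 * update, while also guaranteeing a single full-width store.
 */
void ubuf_read(void)
{
	u64 head, tail;

	tail = ubuf->tail;			/* we are the only writer of @tail */
	head = smp_load_acquire(&ubuf->head);	/* C, pairs with the producer's barrier B */

	while (tail != head)
		tail = process_one_record(ubuf, tail);

	smp_store_release(&ubuf->tail, tail);	/* D, pairs with the producer's barrier A */
}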

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: perf events ring buffer memory barrier on powerpc
@ 2014-05-08 20:46 Mikulas Patocka
       [not found] ` <OF667059AA.7F151BCC-ONC2257CD3.0036CFEB-C2257CD3.003BBF01@il.ibm.com>
  0 siblings, 1 reply; 120+ messages in thread
From: Mikulas Patocka @ 2014-05-08 20:46 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: Victor Kaplansky, Paul E. McKenney, linux-kernel


[ I found this in the lkml archive ]

> On Wed, Oct 30, 2013 at 04:52:05PM +0200, Victor Kaplansky wrote:
>
> > Peter Zijlstra <peterz@infradead.org> wrote on 10/30/2013 01:25:26 PM:
> >
> > > Also, I'm not entirely sure on C, that too seems like a dependency, we
> > > simply cannot read the buffer @tail before we've read the tail itself,
> > > now can we? Similarly we cannot compare tail to head without having the
> > > head read completed.
> >
> > No, this one we cannot omit, because our problem on consumer side is not
> > with @tail, which is written exclusively by consumer, but with @head.
>
> Ah indeed, my argument was flawed in that @head is the important part.
> But we still do a comparison of @tail against @head before we do further
> reads.
>
> Although I suppose speculative reads are allowed -- they don't have the
> destructive behaviour speculative writes have -- and thus we could in
> fact get reorder issues.
>
> But since it is still a dependent load in that we do that @tail vs @head
> comparison before doing other loads, wouldn't a read_barrier_depends()
> be sufficient? Or do we still need a complete rmb?
>
> > BTW, it is why you also don't need ACCESS_ONCE() around @tail, but only
> > around
> > @head read.
>
> Agreed, the ACCESS_ONCE() around tail is superfluous since we're the one
> updating tail, so there's no problem with the value changing
> unexpectedly.

You need ACCESS_ONCE even if you are the only process writing the value. 
Because without ACCESS_ONCE, the compiler may perform store tearing and 
split the store into several smaller stores. Search the file 
"Documentation/memory-barriers.txt" for the term "store tearing", it shows 
an example where one instruction storing 32-bit value may be split to two 
instructions, each storing 16-bit value.

Mikulas
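
In code, the point is simply the difference between these two forms of the
consumer's @tail update (a sketch, reusing the ubuf/tail names from the
simplified example discussed elsewhere in this thread):

	/*
	 * A plain store: the compiler is allowed to split this into
	 * several narrower stores ("store tearing"), so the producer
	 * could observe a half-updated tail.
	 */
	ubuf->tail = tail;

	/*
	 * A volatile access: forces a single store of the full value,
	 * provided the type is a native machine word (hence the
	 * "unsigned long" vs. __u64 point raised elsewhere in the thread).
	 */
	ACCESS_ONCE(ubuf->tail) = tail;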


^ permalink raw reply	[flat|nested] 120+ messages in thread

end of thread, other threads:[~2014-05-09 13:47 UTC | newest]

Thread overview: 120+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-10-22 23:54 perf events ring buffer memory barrier on powerpc Michael Neuling
2013-10-23  7:39 ` Victor Kaplansky
2013-10-23 14:19 ` Frederic Weisbecker
2013-10-23 14:25   ` Frederic Weisbecker
2013-10-25 17:37   ` Peter Zijlstra
2013-10-25 20:31     ` Michael Neuling
2013-10-27  9:00     ` Victor Kaplansky
2013-10-28  9:22       ` Peter Zijlstra
2013-10-28 10:02     ` Frederic Weisbecker
2013-10-28 12:38       ` Victor Kaplansky
2013-10-28 13:26         ` Peter Zijlstra
2013-10-28 16:34           ` Paul E. McKenney
2013-10-28 20:17             ` Oleg Nesterov
2013-10-28 20:58               ` Victor Kaplansky
2013-10-29 10:21                 ` Peter Zijlstra
2013-10-29 10:30                   ` Peter Zijlstra
2013-10-29 10:35                     ` Peter Zijlstra
2013-10-29 20:15                       ` Oleg Nesterov
2013-10-29 19:27                     ` Vince Weaver
2013-10-30 10:42                       ` Peter Zijlstra
2013-10-30 11:48                         ` James Hogan
2013-10-30 12:48                           ` Peter Zijlstra
2013-11-06 13:19                         ` [tip:perf/core] tools/perf: Add required memory barriers tip-bot for Peter Zijlstra
2013-11-06 13:50                           ` Vince Weaver
2013-11-06 14:00                             ` Peter Zijlstra
2013-11-06 14:28                               ` Peter Zijlstra
2013-11-06 14:55                                 ` Vince Weaver
2013-11-06 15:10                                   ` Peter Zijlstra
2013-11-06 15:23                                     ` Peter Zijlstra
2013-11-06 14:44                               ` Peter Zijlstra
2013-11-06 16:07                                 ` Peter Zijlstra
2013-11-06 17:31                                   ` Vince Weaver
2013-11-06 18:24                                     ` Peter Zijlstra
2013-11-07  8:21                                       ` Ingo Molnar
2013-11-07 14:27                                         ` Vince Weaver
2013-11-07 15:55                                           ` Ingo Molnar
2013-11-11 16:24                                         ` Peter Zijlstra
2013-11-11 21:10                                           ` Ingo Molnar
2013-10-29 21:23                     ` perf events ring buffer memory barrier on powerpc Michael Neuling
2013-10-30  9:27                 ` Paul E. McKenney
2013-10-30 11:25                   ` Peter Zijlstra
2013-10-30 14:52                     ` Victor Kaplansky
2013-10-30 15:39                       ` Peter Zijlstra
2013-10-30 17:14                         ` Victor Kaplansky
2013-10-30 17:44                           ` Peter Zijlstra
2013-10-31  6:16                       ` Paul E. McKenney
2013-11-01 13:12                         ` Victor Kaplansky
2013-11-02 16:36                           ` Paul E. McKenney
2013-11-02 17:26                             ` Paul E. McKenney
2013-10-31  6:40                     ` Paul E. McKenney
2013-11-01 14:25                       ` Victor Kaplansky
2013-11-02 17:28                         ` Paul E. McKenney
2013-11-01 14:56                       ` Peter Zijlstra
2013-11-02 17:32                         ` Paul E. McKenney
2013-11-03 14:40                           ` Paul E. McKenney
2013-11-03 15:17                             ` [RFC] arch: Introduce new TSO memory barrier smp_tmb() Peter Zijlstra
2013-11-03 18:08                               ` Linus Torvalds
2013-11-03 20:01                                 ` Peter Zijlstra
2013-11-03 22:42                                   ` Paul E. McKenney
2013-11-03 23:34                                     ` Linus Torvalds
2013-11-04 10:51                                       ` Paul E. McKenney
2013-11-04 11:22                                         ` Peter Zijlstra
2013-11-04 16:27                                           ` Paul E. McKenney
2013-11-04 16:48                                             ` Peter Zijlstra
2013-11-04 19:11                                             ` Peter Zijlstra
2013-11-04 19:18                                               ` Peter Zijlstra
2013-11-04 20:54                                                 ` Paul E. McKenney
2013-11-04 20:53                                               ` Paul E. McKenney
2013-11-05 14:05                                                 ` Will Deacon
2013-11-05 14:49                                                   ` Paul E. McKenney
2013-11-05 18:49                                                   ` Peter Zijlstra
2013-11-06 11:00                                                     ` Will Deacon
2013-11-06 12:39                                                 ` Peter Zijlstra
2013-11-06 12:51                                                   ` Geert Uytterhoeven
2013-11-06 13:57                                                     ` Peter Zijlstra
2013-11-06 18:48                                                       ` Paul E. McKenney
2013-11-06 19:42                                                         ` Peter Zijlstra
2013-11-07 11:17                                                       ` Will Deacon
2013-11-07 13:36                                                         ` Peter Zijlstra
2013-11-07 23:50                                           ` Mathieu Desnoyers
2013-11-04 11:05                                       ` Will Deacon
2013-11-04 16:34                                         ` Paul E. McKenney
2013-11-03 20:59                               ` Benjamin Herrenschmidt
2013-11-03 22:43                                 ` Paul E. McKenney
2013-11-03 17:07                             ` perf events ring buffer memory barrier on powerpc Will Deacon
2013-11-03 22:47                               ` Paul E. McKenney
2013-11-04  9:57                                 ` Will Deacon
2013-11-04 10:52                                   ` Paul E. McKenney
2013-11-01 16:11                       ` Peter Zijlstra
2013-11-02 17:46                         ` Paul E. McKenney
2013-11-01 16:18                       ` Peter Zijlstra
2013-11-02 17:49                         ` Paul E. McKenney
2013-10-30 13:28                   ` Victor Kaplansky
2013-10-30 15:51                     ` Peter Zijlstra
2013-10-30 18:29                       ` Peter Zijlstra
2013-10-30 19:11                         ` Peter Zijlstra
2013-10-31  4:33                       ` Paul E. McKenney
2013-10-31  4:32                     ` Paul E. McKenney
2013-10-31  9:04                       ` Peter Zijlstra
2013-10-31 15:07                         ` Paul E. McKenney
2013-10-31 15:19                           ` Peter Zijlstra
2013-11-01  9:28                             ` Paul E. McKenney
2013-11-01 10:30                               ` Peter Zijlstra
2013-11-02 15:20                                 ` Paul E. McKenney
2013-11-04  9:07                                   ` Peter Zijlstra
2013-11-04 10:00                                     ` Paul E. McKenney
2013-10-31  9:59                       ` Victor Kaplansky
2013-10-31 12:28                         ` David Laight
2013-10-31 12:55                           ` Victor Kaplansky
2013-10-31 15:25                         ` Paul E. McKenney
2013-11-01 16:06                           ` Victor Kaplansky
2013-11-01 16:25                             ` David Laight
2013-11-01 16:30                               ` Victor Kaplansky
2013-11-03 20:57                                 ` Benjamin Herrenschmidt
2013-11-02 15:46                             ` Paul E. McKenney
2013-10-28 19:09           ` Oleg Nesterov
2013-10-29 14:06     ` [tip:perf/urgent] perf: Fix perf ring buffer memory ordering tip-bot for Peter Zijlstra
2014-05-08 20:46 perf events ring buffer memory barrier on powerpc Mikulas Patocka
     [not found] ` <OF667059AA.7F151BCC-ONC2257CD3.0036CFEB-C2257CD3.003BBF01@il.ibm.com>
2014-05-09 12:20   ` Mikulas Patocka
2014-05-09 13:47     ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).