* Re: perf events ring buffer memory barrier on powerpc
@ 2014-05-08 20:46 Mikulas Patocka
0 siblings, 1 reply; 74+ messages in thread
From: Mikulas Patocka @ 2014-05-08 20:46 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Victor Kaplansky, Paul E. McKenney, linux-kernel
[ I found this in the lkml archive ]
> On Wed, Oct 30, 2013 at 04:52:05PM +0200, Victor Kaplansky wrote:
>
> > Peter Zijlstra <peterz@infradead.org> wrote on 10/30/2013 01:25:26 PM:
> >
> > > Also, I'm not entirely sure on C, that too seems like a dependency, we
> > > simply cannot read the buffer @tail before we've read the tail itself,
> > > now can we? Similarly we cannot compare tail to head without having the
> > > head read completed.
> >
> > No, this one we cannot omit, because our problem on consumer side is not
> > with @tail, which is written exclusively by consumer, but with @head.
>
> Ah indeed, my argument was flawed in that @head is the important part.
> But we still do a comparison of @tail against @head before we do further
> reads.
>
> Although I suppose speculative reads are allowed -- they don't have the
> destructive behaviour speculative writes have -- and thus we could in
> fact get reorder issues.
>
> But since it is still a dependent load in that we do that @tail vs @head
> comparison before doing other loads, wouldn't a read_barrier_depends()
> be sufficient? Or do we still need a complete rmb?
>
> > BTW, it is why you also don't need ACCESS_ONCE() around @tail, but only
> > around @head read.
>
> Agreed, the ACCESS_ONCE() around tail is superfluous since we're the one
> updating tail, so there's no problem with the value changing
> unexpectedly.
You need ACCESS_ONCE even if you are the only process writing the value,
because without ACCESS_ONCE the compiler may perform store tearing and
split the store into several smaller stores. Search the file
"Documentation/memory-barriers.txt" for the term "store tearing"; it shows
an example where one instruction storing a 32-bit value may be split into
two instructions, each storing a 16-bit value.
Mikulas
^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc
@ 2014-05-09 12:20 Mikulas Patocka
From: Mikulas Patocka @ 2014-05-09 12:20 UTC (permalink / raw)
To: Victor Kaplansky; +Cc: Peter Zijlstra, Paul E. McKenney, linux-kernel

On Fri, 9 May 2014, Victor Kaplansky wrote:

> Mikulas Patocka <mpatocka@redhat.com> wrote on 05/08/2014 11:46:53 PM:
>
> > > > > BTW, it is why you also don't need ACCESS_ONCE() around @tail, but only
> > > > > around @head read.
> > > >
> > > > Agreed, the ACCESS_ONCE() around tail is superfluous since we're the one
> > > > updating tail, so there's no problem with the value changing
> > > > unexpectedly.
> > >
> > > You need ACCESS_ONCE even if you are the only process writing the value,
> > > because without ACCESS_ONCE the compiler may perform store tearing and
> > > split the store into several smaller stores. Search the file
> > > "Documentation/memory-barriers.txt" for the term "store tearing"; it shows
> > > an example where one instruction storing a 32-bit value may be split into
> > > two instructions, each storing a 16-bit value.
> > >
> > > Mikulas
>
> AFAIR, I was talking about the redundant ACCESS_ONCE() around the @tail
> *read* in consumer code. As for ACCESS_ONCE() around the @tail write in
> consumer code, I see your point, but I don't think the volatile imposed by
> ACCESS_ONCE() is appropriate, since:
>
> - the compiler can generate several stores despite volatile if @tail
>   is bigger in size than the native machine data size, e.g. 64-bit on
>   a 32-bit CPU.

That's true - so you should define data_head and data_tail as "unsigned
long", not "__u64".

> - the volatile imposed by ACCESS_ONCE() does nothing to prevent the CPU
>   from reordering, splitting or merging accesses. It can only mediate
>   communication problems between processes running on the same CPU.

That's why you need an smp barrier in addition to ACCESS_ONCE. You need
both - the smp barrier (to prevent the CPU from reordering) and
ACCESS_ONCE (to prevent the compiler from splitting the write into smaller
memory accesses).

Since Linux 3.14, there are new macros smp_store_release() and
smp_load_acquire() that combine ACCESS_ONCE and a memory barrier, so you
can use them. (They call compiletime_assert_atomic_type() to make sure
that you don't use them on types that are not atomic, such as long long on
32-bit architectures.)

> What you really want is to guarantee *atomicity* of the @tail write on
> the consumer side.
>
> -- Victor

Mikulas
* Re: perf events ring buffer memory barrier on powerpc
@ 2014-05-09 13:47 Paul E. McKenney
From: Paul E. McKenney @ 2014-05-09 13:47 UTC (permalink / raw)
To: Mikulas Patocka; +Cc: Victor Kaplansky, Peter Zijlstra, linux-kernel

On Fri, May 09, 2014 at 08:20:25AM -0400, Mikulas Patocka wrote:
> On Fri, 9 May 2014, Victor Kaplansky wrote:
>
> > - the volatile imposed by ACCESS_ONCE() does nothing to prevent the CPU
> >   from reordering, splitting or merging accesses. It can only mediate
> >   communication problems between processes running on the same CPU.
>
> That's why you need an smp barrier in addition to ACCESS_ONCE. You need
> both - the smp barrier (to prevent the CPU from reordering) and
> ACCESS_ONCE (to prevent the compiler from splitting the write into
> smaller memory accesses).

IIRC the ring-buffer code uses the fact that one element remains empty
to make clever double use of a memory barrier.

> Since Linux 3.14, there are new macros smp_store_release() and
> smp_load_acquire() that combine ACCESS_ONCE and a memory barrier, so you
> can use them. (They call compiletime_assert_atomic_type() to make sure
> that you don't use them on types that are not atomic, such as long long
> on 32-bit architectures.)

These are indeed useful and often simpler to use than raw barriers.

							Thanx, Paul

> > What you really want is to guarantee *atomicity* of the @tail write on
> > the consumer side.
> >
> > -- Victor
>
> Mikulas
* perf events ring buffer memory barrier on powerpc
@ 2013-10-22 23:54 Michael Neuling
From: Michael Neuling @ 2013-10-22 23:54 UTC (permalink / raw)
To: Frederic Weisbecker, benh, anton, linux-kernel, Linux PPC dev,
    Victor Kaplansky, Mathieu Desnoyers, michael

Frederic,

In the perf ring buffer code we have this in perf_output_get_handle():

	if (!local_dec_and_test(&rb->nest))
		goto out;

	/*
	 * Publish the known good head. Rely on the full barrier implied
	 * by atomic_dec_and_test() order the rb->head read and this
	 * write.
	 */
	rb->user_page->data_head = head;

The comment says atomic_dec_and_test() but the code is
local_dec_and_test().

On powerpc, local_dec_and_test() doesn't have a memory barrier but
atomic_dec_and_test() does. Is the comment wrong, or is
local_dec_and_test() supposed to imply a memory barrier too that we have
wrongly implemented in powerpc?

My guess is that local_dec_and_test() is correct but we need to add an
explicit memory barrier like below:

(Kudos to Victor Kaplansky for finding this)

Mikey

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index cd55144..95768c6 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -87,10 +87,10 @@ again:
 		goto out;
 
 	/*
-	 * Publish the known good head. Rely on the full barrier implied
-	 * by atomic_dec_and_test() order the rb->head read and this
-	 * write.
+	 * Publish the known good head. We need a memory barrier to order the
+	 * order the rb->head read and this write.
 	 */
+	smp_mb ();
 	rb->user_page->data_head = head;
 
 	/*
* Re: perf events ring buffer memory barrier on powerpc
@ 2013-10-23  7:39 Victor Kaplansky
From: Victor Kaplansky @ 2013-10-23 7:39 UTC (permalink / raw)
To: Michael Neuling
Cc: anton, benh, Frederic Weisbecker, linux-kernel, Linux PPC dev,
    Mathieu Desnoyers, michael

See below.

Michael Neuling <mikey@neuling.org> wrote on 10/23/2013 02:54:54 AM:
>
> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
> index cd55144..95768c6 100644
> --- a/kernel/events/ring_buffer.c
> +++ b/kernel/events/ring_buffer.c
> @@ -87,10 +87,10 @@ again:
>  		goto out;
>  
>  	/*
> -	 * Publish the known good head. Rely on the full barrier implied
> -	 * by atomic_dec_and_test() order the rb->head read and this
> -	 * write.
> +	 * Publish the known good head. We need a memory barrier to order the
> +	 * order the rb->head read and this write.
>  	 */
> +	smp_mb ();
>  	rb->user_page->data_head = head;
>  
>  	/*

1. As far as I understand, smp_mb() is superfluous in this case;
smp_wmb() should be enough. (Same for the space between the name of the
function and the open parenthesis :-).)

2. Again, as far as I understand from ./Documentation/atomic_ops.txt, it
is a mistake in architecture-independent code to rely on memory barriers
in atomic operations, all the more so in "local" operations.

3. The solution above is sub-optimal on architectures where a memory
barrier is part of "local", since we are going to execute two consecutive
barriers. So, maybe, it would be better to use smp_mb__after_atomic_dec().

4. I'm not sure, but I think there is another, unrelated potential
problem in the function perf_output_put_handle() - the write to
"data_head" - kernel/events/ring_buffer.c:

 77         /*
 78          * Publish the known good head. Rely on the full barrier implied
 79          * by atomic_dec_and_test() order the rb->head read and this
 80          * write.
 81          */
 82         rb->user_page->data_head = head;

As data_head is 64-bit wide, the update should be done by an
atomic64_set().

Regards,
-- Victor
* Re: perf events ring buffer memory barrier on powerpc
@ 2013-10-23 14:19 Frederic Weisbecker
From: Frederic Weisbecker @ 2013-10-23 14:19 UTC (permalink / raw)
To: Michael Neuling, Peter Zijlstra
Cc: benh, anton, linux-kernel, Linux PPC dev, Victor Kaplansky,
    Mathieu Desnoyers, michael

On Wed, Oct 23, 2013 at 10:54:54AM +1100, Michael Neuling wrote:
> Frederic,
>
> In the perf ring buffer code we have this in perf_output_get_handle():
>
> 	if (!local_dec_and_test(&rb->nest))
> 		goto out;
>
> 	/*
> 	 * Publish the known good head. Rely on the full barrier implied
> 	 * by atomic_dec_and_test() order the rb->head read and this
> 	 * write.
> 	 */
> 	rb->user_page->data_head = head;
>
> The comment says atomic_dec_and_test() but the code is
> local_dec_and_test().
>
> On powerpc, local_dec_and_test() doesn't have a memory barrier but
> atomic_dec_and_test() does. Is the comment wrong, or is
> local_dec_and_test() supposed to imply a memory barrier too that we
> have wrongly implemented in powerpc?
>
> My guess is that local_dec_and_test() is correct but we need to add an
> explicit memory barrier like below:
>
> (Kudos to Victor Kaplansky for finding this)
>
> Mikey

<snip the patch>

I'm adding Peter in Cc since he wrote that code.

I agree that local_dec_and_test() doesn't need to imply an smp barrier.
All it has to provide as a guarantee is atomicity against local
concurrent operations (interrupts, preemption, ...).

Now I'm a bit confused about this barrier.

I think we want this ordering:

    Kernel                             User

    READ rb->user_page->data_tail      READ rb->user_page->data_head
    smp_mb()                           smp_mb()
    WRITE rb data                      READ rb data
    smp_mb()                           smp_mb()
    rb->user_page->data_head           WRITE rb->user_page->data_tail

So yeah, we want a barrier between the data being published and the
user-visible data_head. But this ordering concerns a wider layout than
just rb->head and rb->user_page->data_head.

And BTW I can see an smp_rmb() after we read rb->user_page->data_tail.
This is probably the first kernel barrier in my example above (not sure
if rmb() alone is enough though).
* Re: perf events ring buffer memory barrier on powerpc
@ 2013-10-23 14:25 Frederic Weisbecker
From: Frederic Weisbecker @ 2013-10-23 14:25 UTC (permalink / raw)
To: Michael Neuling, Peter Zijlstra
Cc: Benjamin Herrenschmidt, Anton Blanchard, LKML, Linux PPC dev,
    Victor Kaplansky, Mathieu Desnoyers, michael

2013/10/23 Frederic Weisbecker <fweisbec@gmail.com>:
<snip>
> I think we want this ordering:
>
>     Kernel                             User
>
>     READ rb->user_page->data_tail      READ rb->user_page->data_head
>     smp_mb()                           smp_mb()
>     WRITE rb data                      READ rb data
>     smp_mb()                           smp_mb()
>     rb->user_page->data_head           WRITE rb->user_page->data_tail
      ^^ I meant a write above for data_head.
* Re: perf events ring buffer memory barrier on powerpc
@ 2013-10-25 17:37 Peter Zijlstra
From: Peter Zijlstra @ 2013-10-25 17:37 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Michael Neuling, benh, anton, linux-kernel, Linux PPC dev,
    Victor Kaplansky, Mathieu Desnoyers, michael

On Wed, Oct 23, 2013 at 03:19:51PM +0100, Frederic Weisbecker wrote:
> On Wed, Oct 23, 2013 at 10:54:54AM +1100, Michael Neuling wrote:
> > Frederic,
> >
> > The comment says atomic_dec_and_test() but the code is
> > local_dec_and_test().
> >
> > On powerpc, local_dec_and_test() doesn't have a memory barrier but
> > atomic_dec_and_test() does. Is the comment wrong, or is
> > local_dec_and_test() supposed to imply a memory barrier too that we
> > have wrongly implemented in powerpc?

My bad; I converted from atomic to local without actually thinking it
seems. Your implementation of the local primitives is fine.

<snip the proposed smp_mb() patch>

Right; so that would indeed be what the comment suggests it should be.
However I think the comment is now actively wrong too :-) Since on the
kernel side the buffer is strictly per-cpu, we don't need memory barriers
there.

> I think we want this ordering:
>
>     Kernel                             User
>
>     READ rb->user_page->data_tail      READ rb->user_page->data_head
>     smp_mb()                           smp_mb()
>     WRITE rb data                      READ rb data
>     smp_mb()                           smp_mb()
>     rb->user_page->data_head           WRITE rb->user_page->data_tail

I would argue for:

    READ ->data_tail                   READ ->data_head
    smp_rmb()	(A)                     smp_rmb()	(C)
    WRITE $data                        READ $data
    smp_wmb()	(B)                     smp_mb()	(D)
    STORE ->data_head                  WRITE ->data_tail

Where A pairs with D, and B pairs with C.

I don't think A needs to be a full barrier because we won't in fact
write data until we see the store from userspace. So we simply don't
issue the data WRITE until we observe it.

OTOH, D needs to be a full barrier since it separates the data READ
from the tail WRITE.

For B a WMB is sufficient since it separates two WRITEs, and for C an
RMB is sufficient since it separates two READs.

---
 kernel/events/ring_buffer.c | 29 ++++++++++++++++++++++++++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index cd55144270b5..c91274ef4e23 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -87,10 +87,31 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
 		goto out;
 
 	/*
-	 * Publish the known good head. Rely on the full barrier implied
-	 * by atomic_dec_and_test() order the rb->head read and this
-	 * write.
+	 * Since the mmap() consumer (userspace) can run on a different CPU:
+	 *
+	 *   kernel                             user
+	 *
+	 *   READ ->data_tail                   READ ->data_head
+	 *   smp_rmb()	(A)                     smp_rmb()	(C)
+	 *   WRITE $data                        READ $data
+	 *   smp_wmb()	(B)                     smp_mb()	(D)
+	 *   STORE ->data_head                  WRITE ->data_tail
+	 *
+	 * Where A pairs with D, and B pairs with C.
+	 *
+	 * I don't think A needs to be a full barrier because we won't in fact
+	 * write data until we see the store from userspace. So we simply don't
+	 * issue the data WRITE until we observe it.
+	 *
+	 * OTOH, D needs to be a full barrier since it separates the data READ
+	 * from the tail WRITE.
+	 *
+	 * For B a WMB is sufficient since it separates two WRITEs, and for C
+	 * an RMB is sufficient since it separates two READs.
+	 *
+	 * See perf_output_begin().
 	 */
+	smp_wmb();
 	rb->user_page->data_head = head;
 
 	/*
@@ -154,6 +175,8 @@ int perf_output_begin(struct perf_output_handle *handle,
 	 * Userspace could choose to issue a mb() before updating the
 	 * tail pointer. So that all reads will be completed before the
 	 * write is issued.
+	 *
+	 * See perf_output_put_handle().
 	 */
 	tail = ACCESS_ONCE(rb->user_page->data_tail);
 	smp_rmb();
* Re: perf events ring buffer memory barrier on powerpc
@ 2013-10-25 20:31 Michael Neuling
From: Michael Neuling @ 2013-10-25 20:31 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Frederic Weisbecker, benh, anton, linux-kernel, Linux PPC dev,
    Victor Kaplansky, Mathieu Desnoyers, michael

> I would argue for:
>
>     READ ->data_tail                   READ ->data_head
>     smp_rmb()	(A)                     smp_rmb()	(C)
>     WRITE $data                        READ $data
>     smp_wmb()	(B)                     smp_mb()	(D)
>     STORE ->data_head                  WRITE ->data_tail
>
> Where A pairs with D, and B pairs with C.
>
> I don't think A needs to be a full barrier because we won't in fact
> write data until we see the store from userspace. So we simply don't
> issue the data WRITE until we observe it.
>
> OTOH, D needs to be a full barrier since it separates the data READ
> from the tail WRITE.
>
> For B a WMB is sufficient since it separates two WRITEs, and for C an
> RMB is sufficient since it separates two READs.

FWIW the testing Victor did confirms WMB is good enough on powerpc.

Thanks,
Mikey

<snip the quoted patch>
* Re: perf events ring buffer memory barrier on powerpc
@ 2013-10-27  9:00 Victor Kaplansky
From: Victor Kaplansky @ 2013-10-27 9:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: anton, benh, Frederic Weisbecker, linux-kernel, Linux PPC dev,
    Mathieu Desnoyers, michael, Michael Neuling

Peter Zijlstra <peterz@infradead.org> wrote on 10/25/2013 07:37:49 PM:

> I would argue for:
>
>     READ ->data_tail                   READ ->data_head
>     smp_rmb()	(A)                     smp_rmb()	(C)
>     WRITE $data                        READ $data
>     smp_wmb()	(B)                     smp_mb()	(D)
>     STORE ->data_head                  WRITE ->data_tail
>
> Where A pairs with D, and B pairs with C.

1. I agree. My only concern is that architectures which do use atomic
operations with memory barriers will now issue two consecutive barriers,
which is sub-optimal.

2. I think the comment in "include/linux/perf_event.h" describing
"data_head" and "data_tail" for user space needs an update as well.
Current version -

	/*
	 * Control data for the mmap() data buffer.
	 *
	 * User-space reading the @data_head value should issue an rmb(), on
	 * SMP capable platforms, after reading this value -- see
	 * perf_event_wakeup().
	 *
	 * When the mapping is PROT_WRITE the @data_tail value should be
	 * written by userspace to reflect the last read data. In this case
	 * the kernel will not over-write unread data.
	 */
	__u64   data_head;		/* head in the data section */
	__u64   data_tail;		/* user-space written tail */

- says nothing about the need for a memory barrier before the "data_tail"
write.

-- Victor
* Re: perf events ring buffer memory barrier on powerpc
@ 2013-10-28  9:22 Peter Zijlstra
From: Peter Zijlstra @ 2013-10-28 9:22 UTC (permalink / raw)
To: Victor Kaplansky
Cc: anton, benh, Frederic Weisbecker, linux-kernel, Linux PPC dev,
    Mathieu Desnoyers, michael, Michael Neuling

On Sun, Oct 27, 2013 at 11:00:33AM +0200, Victor Kaplansky wrote:
> Peter Zijlstra <peterz@infradead.org> wrote on 10/25/2013 07:37:49 PM:
>
> > I would argue for:
> >
> >     READ ->data_tail                   READ ->data_head
> >     smp_rmb()	(A)                     smp_rmb()	(C)
> >     WRITE $data                        READ $data
> >     smp_wmb()	(B)                     smp_mb()	(D)
> >     STORE ->data_head                  WRITE ->data_tail
> >
> > Where A pairs with D, and B pairs with C.
>
> 1. I agree. My only concern is that architectures which do use atomic
> operations with memory barriers will now issue two consecutive barriers,
> which is sub-optimal.

Yeah, although that would be fairly easy for the CPUs themselves to
optimize; not sure they actually do this though. But we don't really have
much choice aside from introducing things like:

	smp_wmb__after_local_$op

and I'm fairly sure people won't like adding a ton of conditional barriers
like that either.

> 2. I think the comment in "include/linux/perf_event.h" describing
> "data_head" and "data_tail" for user space needs an update as well.

Oh, indeed. Thanks; I'll update that too!
* Re: perf events ring buffer memory barrier on powerpc
@ 2013-10-28 10:02 Frederic Weisbecker
From: Frederic Weisbecker @ 2013-10-28 10:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Michael Neuling, Benjamin Herrenschmidt, Anton Blanchard, LKML,
    Linux PPC dev, Victor Kaplansky, Mathieu Desnoyers, Michael Ellerman

2013/10/25 Peter Zijlstra <peterz@infradead.org>:
> On Wed, Oct 23, 2013 at 03:19:51PM +0100, Frederic Weisbecker wrote:
>
> I would argue for:
>
>     READ ->data_tail                   READ ->data_head
>     smp_rmb()	(A)                     smp_rmb()	(C)
>     WRITE $data                        READ $data
>     smp_wmb()	(B)                     smp_mb()	(D)
>     STORE ->data_head                  WRITE ->data_tail
>
> Where A pairs with D, and B pairs with C.
>
> I don't think A needs to be a full barrier because we won't in fact
> write data until we see the store from userspace. So we simply don't
> issue the data WRITE until we observe it.
>
> OTOH, D needs to be a full barrier since it separates the data READ
> from the tail WRITE.
>
> For B a WMB is sufficient since it separates two WRITEs, and for C an
> RMB is sufficient since it separates two READs.

Hmm, I need to defer to you on that; I'm not yet comfortable with picking
specific barrier flavours when both writes and reads are involved on the
same side :)
* Re: perf events ring buffer memory barrier on powerpc
@ 2013-10-28 12:38 Victor Kaplansky
From: Victor Kaplansky @ 2013-10-28 12:38 UTC (permalink / raw)
To: Frederic Weisbecker
Cc: Anton Blanchard, Benjamin Herrenschmidt, LKML, Linux PPC dev,
    Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Peter Zijlstra

> From: Frederic Weisbecker <fweisbec@gmail.com>
>
> 2013/10/25 Peter Zijlstra <peterz@infradead.org>:
> > I would argue for:
> >
> >     READ ->data_tail                   READ ->data_head
> >     smp_rmb()	(A)                     smp_rmb()	(C)
> >     WRITE $data                        READ $data
> >     smp_wmb()	(B)                     smp_mb()	(D)
> >     STORE ->data_head                  WRITE ->data_tail
> >
> > Where A pairs with D, and B pairs with C.
<snip>
>
> Hmm, I need to defer to you on that; I'm not yet comfortable with picking
> specific barrier flavours when both writes and reads are involved on the
> same side :)

I think you have a point :) IMO, memory barrier (A) is superfluous. On the
producer side we need to ensure that "WRITE $data" is not committed to
memory before "READ ->data_tail" has seen a new value, if the old one
indicated that there is not enough space for a new entry. All this is
already guaranteed by the control-flow dependency on a single CPU - writes
will not be committed to memory if the read value of "data_tail" doesn't
indicate enough free space in the ring buffer.

Likewise, on the consumer side, we can make use of the natural data
dependency and memory-ordering guarantee for a single CPU and try to
replace "smp_mb" by a more light-weight "smp_rmb":

    READ ->data_tail                   READ ->data_head
    // ...                             smp_rmb()	(C)
    WRITE $data                        READ $data
    smp_wmb()	(B)                     smp_rmb()	(D)
                                       READ $header_size
    STORE ->data_head                  WRITE ->data_tail = $old_data_tail +
                                                           $header_size

We ensure that all $data is read before "data_tail" is written by doing
"READ $header_size" after all other data is read, and we rely on the
natural data dependency between the "data_tail" write and the
"header_size" read.

-- Victor
* Re: perf events ring buffer memory barrier on powerpc
@ 2013-10-28 13:26 Peter Zijlstra
From: Peter Zijlstra @ 2013-10-28 13:26 UTC (permalink / raw)
To: Victor Kaplansky
Cc: Frederic Weisbecker, Anton Blanchard, Benjamin Herrenschmidt, LKML,
    Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling,
    Paul McKenney, Oleg Nesterov

On Mon, Oct 28, 2013 at 02:38:29PM +0200, Victor Kaplansky wrote:
> > 2013/10/25 Peter Zijlstra <peterz@infradead.org>:
> > > I would argue for
> > >
> > >     READ ->data_tail                   READ ->data_head
> > >     smp_rmb()	(A)                     smp_rmb()	(C)
> > >     WRITE $data                        READ $data
> > >     smp_wmb()	(B)                     smp_mb()	(D)
> > >     STORE ->data_head                  WRITE ->data_tail
> > >
> > >     Where A pairs with D, and B pairs with C.

<snip>

> I think you have a point :) IMO, memory barrier (A) is superfluous. On
> the producer side we need to ensure that "WRITE $data" is not committed
> to memory before "READ ->data_tail" has seen a new value, if the old one
> indicated that there is not enough space for a new entry. All this is
> already guaranteed by the control-flow dependency on a single CPU -
> writes will not be committed to memory if the read value of "data_tail"
> doesn't indicate enough free space in the ring buffer.
>
> Likewise, on the consumer side, we can make use of the natural data
> dependency and memory-ordering guarantee for a single CPU and try to
> replace "smp_mb" by a more light-weight "smp_rmb":
>
>     READ ->data_tail                   READ ->data_head
>     // ...                             smp_rmb()	(C)
>     WRITE $data                        READ $data
>     smp_wmb()	(B)                     smp_rmb()	(D)
>                                        READ $header_size
>     STORE ->data_head                  WRITE ->data_tail = $old_data_tail +
>                                                            $header_size
>
> We ensure that all $data is read before "data_tail" is written by doing
> "READ $header_size" after all other data is read, and we rely on the
> natural data dependency between the "data_tail" write and the
> "header_size" read.

I'm not entirely sure I get the $header_size trickery; need to think
more on that. But yes, I did consider the other one.

However, I had trouble having no pairing barrier for (D). ISTR something
like Alpha being able to miss the update (for a long while) if you don't
issue the RMB.

Lets add Paul and Oleg to the thread; this is getting far more 'fun'
than it should be ;-)

For completeness; below the patch as I had queued it.

---
Subject: perf: Fix perf ring buffer memory ordering
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon Oct 28 13:55:29 CET 2013

The PPC64 people noticed a missing memory barrier and crufty old
comments in the perf ring buffer code. So update all the comments and
add the missing barrier.

When the architecture implements local_t using atomic_long_t there
will be double barriers issued; but short of introducing more
conditional barrier primitives this is the best we can do.

Cc: anton@samba.org
Cc: benh@kernel.crashing.org
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Cc: michael@ellerman.id.au
Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Reported-by: Victor Kaplansky <victork@il.ibm.com>
Tested-by: Victor Kaplansky <victork@il.ibm.com>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/20131025173749.GG19466@laptop.lan
---
 include/uapi/linux/perf_event.h | 12 +++++++-----
 kernel/events/ring_buffer.c     | 29 ++++++++++++++++++++++++++---
 2 files changed, 33 insertions(+), 8 deletions(-)

Index: linux-2.6/include/uapi/linux/perf_event.h
===================================================================
--- linux-2.6.orig/include/uapi/linux/perf_event.h
+++ linux-2.6/include/uapi/linux/perf_event.h
@@ -479,13 +479,15 @@ struct perf_event_mmap_page {
 	/*
 	 * Control data for the mmap() data buffer.
 	 *
-	 * User-space reading the @data_head value should issue an rmb(), on
-	 * SMP capable platforms, after reading this value -- see
-	 * perf_event_wakeup().
+	 * User-space reading the @data_head value should issue an smp_rmb(),
+	 * after reading this value.
 	 *
 	 * When the mapping is PROT_WRITE the @data_tail value should be
-	 * written by userspace to reflect the last read data. In this case
-	 * the kernel will not over-write unread data.
+	 * written by userspace to reflect the last read data, after issueing
+	 * an smp_mb() to separate the data read from the ->data_tail store.
+	 * In this case the kernel will not over-write unread data.
+	 *
+	 * See perf_output_put_handle() for the data ordering.
*/ __u64 data_head; /* head in the data section */ __u64 data_tail; /* user-space written tail */ Index: linux-2.6/kernel/events/ring_buffer.c =================================================================== --- linux-2.6.orig/kernel/events/ring_buffer.c +++ linux-2.6/kernel/events/ring_buffer.c @@ -87,10 +87,31 @@ static void perf_output_put_handle(struc goto out; /* - * Publish the known good head. Rely on the full barrier implied - * by atomic_dec_and_test() order the rb->head read and this - * write. + * Since the mmap() consumer (userspace) can run on a different CPU: + * + * kernel user + * + * READ ->data_tail READ ->data_head + * smp_rmb() (A) smp_rmb() (C) + * WRITE $data READ $data + * smp_wmb() (B) smp_mb() (D) + * STORE ->data_head WRITE ->data_tail + * + * Where A pairs with D, and B pairs with C. + * + * I don't think A needs to be a full barrier because we won't in fact + * write data until we see the store from userspace. So we simply don't + * issue the data WRITE until we observe it. + * + * OTOH, D needs to be a full barrier since it separates the data READ + * from the tail WRITE. + * + * For B a WMB is sufficient since it separates two WRITEs, and for C + * an RMB is sufficient since it separates two READs. + * + * See perf_output_begin(). */ + smp_wmb(); rb->user_page->data_head = head; /* @@ -154,6 +175,8 @@ int perf_output_begin(struct perf_output * Userspace could choose to issue a mb() before updating the * tail pointer. So that all reads will be completed before the * write is issued. + * + * See perf_output_put_handle(). */ tail = ACCESS_ONCE(rb->user_page->data_tail); smp_rmb(); ^ permalink raw reply [flat|nested] 74+ messages in thread
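The four-barrier table in the queued patch can be modelled as a small C11 program, one function per side. This is a toy sketch, not the kernel implementation: the buffer, the record type and both function names are invented, and C11 fences approximate smp_mb()/smp_wmb()/smp_rmb() at the points labelled (A)-(D) in the comment above.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define BUF_SIZE 128   /* invented; the real ring is sized at mmap() time */

/* Toy stand-in for struct perf_event_mmap_page plus its data area;
 * only the field names data_head/data_tail follow the real ABI. */
struct mmap_page {
	_Atomic unsigned long data_head;  /* written by the kernel   */
	_Atomic unsigned long data_tail;  /* written by user space   */
	long buf[BUF_SIZE];
};

/* Kernel side: publish one record, mirroring barriers (A) and (B). */
static int kernel_output(struct mmap_page *pg, long record)
{
	unsigned long tail = atomic_load_explicit(&pg->data_tail, memory_order_relaxed);
	unsigned long head = atomic_load_explicit(&pg->data_head, memory_order_relaxed);

	atomic_thread_fence(memory_order_seq_cst);       /* (A) smp_mb()  */
	if (head - tail >= BUF_SIZE)
		return -1;                               /* no space      */

	pg->buf[head % BUF_SIZE] = record;               /* WRITE $data   */
	atomic_thread_fence(memory_order_release);       /* (B) smp_wmb() */
	atomic_store_explicit(&pg->data_head, head + 1, memory_order_relaxed);
	return 0;
}

/* User side: consume one record, mirroring barriers (C) and (D). */
static int user_consume(struct mmap_page *pg, long *out)
{
	unsigned long head = atomic_load_explicit(&pg->data_head, memory_order_relaxed);
	unsigned long tail = atomic_load_explicit(&pg->data_tail, memory_order_relaxed);

	if (tail == head)
		return -1;                               /* empty         */

	atomic_thread_fence(memory_order_acquire);       /* (C) smp_rmb() */
	*out = pg->buf[tail % BUF_SIZE];                 /* READ $data    */
	atomic_thread_fence(memory_order_seq_cst);       /* (D) smp_mb()  */
	atomic_store_explicit(&pg->data_tail, tail + 1, memory_order_relaxed);
	return 0;
}
```

Here (A) pairs with (D) and (B) pairs with (C), as the patch comment states; the later messages in the thread discuss whether (A) can be relaxed on the strength of the control dependency alone.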
* Re: perf events ring buffer memory barrier on powerpc 2013-10-28 13:26 ` Peter Zijlstra @ 2013-10-28 16:34 ` Paul E. McKenney 2013-10-28 20:17 ` Oleg Nesterov 2013-10-28 19:09 ` Oleg Nesterov 1 sibling, 1 reply; 74+ messages in thread From: Paul E. McKenney @ 2013-10-28 16:34 UTC (permalink / raw) To: Peter Zijlstra Cc: Victor Kaplansky, Frederic Weisbecker, Anton Blanchard, Benjamin Herrenschmidt, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Oleg Nesterov On Mon, Oct 28, 2013 at 02:26:34PM +0100, Peter Zijlstra wrote: > On Mon, Oct 28, 2013 at 02:38:29PM +0200, Victor Kaplansky wrote: > > > 2013/10/25 Peter Zijlstra <peterz@infradead.org>: > > > > On Wed, Oct 23, 2013 at 03:19:51PM +0100, Frederic Weisbecker wrote: > > > > I would argue for > > > > > > > > READ ->data_tail READ ->data_head > > > > smp_rmb() (A) smp_rmb() (C) > > > > WRITE $data READ $data > > > > smp_wmb() (B) smp_mb() (D) > > > > STORE ->data_head WRITE ->data_tail > > > > > > > > Where A pairs with D, and B pairs with C. > > > > > > > > I don't think A needs to be a full barrier because we won't in fact > > > > write data until we see the store from userspace. So we simply don't > > > > issue the data WRITE until we observe it. > > > > > > > > OTOH, D needs to be a full barrier since it separates the data READ from > > > > the tail WRITE. > > > > > > > > For B a WMB is sufficient since it separates two WRITEs, and for C an > > > > RMB is sufficient since it separates two READs. > > <snip> > > > I think you have a point :) IMO, memory barrier (A) is superfluous. > > At producer side we need to ensure that "WRITE $data" is not committed > > to memory before "READ ->data_tail" had seen a new value and if the > > old one indicated that there is no enough space for a new entry. 
All > > this is already guaranteed by control flow dependancy on single CPU - > > writes will not be committed to the memory if read value of > > "data_tail" doesn't specify enough free space in the ring buffer. > > > > Likewise, on consumer side, we can make use of natural data dependency and > > memory ordering guarantee for single CPU and try to replace "smp_mb" by > > a more light-weight "smp_rmb": > > > > READ ->data_tail READ ->data_head > > // ... smp_rmb() (C) > > WRITE $data READ $data > > smp_wmb() (B) smp_rmb() (D) > > READ $header_size > > STORE ->data_head WRITE ->data_tail = $old_data_tail + > > $header_size > > > > We ensure that all $data is read before "data_tail" is written by > > doing "READ $header_size" after all other data is read and we rely on > > natural data dependancy between "data_tail" write and "header_size" > > read. > > I'm not entirely sure I get the $header_size trickery; need to think > more on that. But yes, I did consider the other one. However, I had > trouble having no pairing barrier for (D). > > ISTR something like Alpha being able to miss the update (for a long > while) if you don't issue the RMB. > > Lets add Paul and Oleg to the thread; this is getting far more 'fun' > that it should be ;-) > > For completeness; below the patch as I had queued it. > --- > Subject: perf: Fix perf ring buffer memory ordering > From: Peter Zijlstra <peterz@infradead.org> > Date: Mon Oct 28 13:55:29 CET 2013 > > The PPC64 people noticed a missing memory barrier and crufty old > comments in the perf ring buffer code. So update all the comments and > add the missing barrier. > > When the architecture implements local_t using atomic_long_t there > will be double barriers issued; but short of introducing more > conditional barrier primitives this is the best we can do. 
> > Cc: anton@samba.org > Cc: benh@kernel.crashing.org > Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> > Cc: michael@ellerman.id.au > Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> > Cc: Michael Neuling <mikey@neuling.org> > Cc: Frederic Weisbecker <fweisbec@gmail.com> > Reported-by: Victor Kaplansky <victork@il.ibm.com> > Tested-by: Victor Kaplansky <victork@il.ibm.com> > Signed-off-by: Peter Zijlstra <peterz@infradead.org> > Link: http://lkml.kernel.org/r/20131025173749.GG19466@laptop.lan > --- > include/uapi/linux/perf_event.h | 12 +++++++----- > kernel/events/ring_buffer.c | 29 ++++++++++++++++++++++++++--- > 2 files changed, 33 insertions(+), 8 deletions(-) > > Index: linux-2.6/include/uapi/linux/perf_event.h > =================================================================== > --- linux-2.6.orig/include/uapi/linux/perf_event.h > +++ linux-2.6/include/uapi/linux/perf_event.h > @@ -479,13 +479,15 @@ struct perf_event_mmap_page { > /* > * Control data for the mmap() data buffer. > * > - * User-space reading the @data_head value should issue an rmb(), on > - * SMP capable platforms, after reading this value -- see > - * perf_event_wakeup(). > + * User-space reading the @data_head value should issue an smp_rmb(), > + * after reading this value. > * > * When the mapping is PROT_WRITE the @data_tail value should be > - * written by userspace to reflect the last read data. In this case > - * the kernel will not over-write unread data. > + * written by userspace to reflect the last read data, after issueing > + * an smp_mb() to separate the data read from the ->data_tail store. > + * In this case the kernel will not over-write unread data. > + * > + * See perf_output_put_handle() for the data ordering. 
> */ > __u64 data_head; /* head in the data section */ > __u64 data_tail; /* user-space written tail */ > Index: linux-2.6/kernel/events/ring_buffer.c > =================================================================== > --- linux-2.6.orig/kernel/events/ring_buffer.c > +++ linux-2.6/kernel/events/ring_buffer.c > @@ -87,10 +87,31 @@ static void perf_output_put_handle(struc > goto out; > > /* > - * Publish the known good head. Rely on the full barrier implied > - * by atomic_dec_and_test() order the rb->head read and this > - * write. > + * Since the mmap() consumer (userspace) can run on a different CPU: > + * > + * kernel user > + * > + * READ ->data_tail READ ->data_head > + * smp_rmb() (A) smp_rmb() (C) Given that both of the kernel's subsequent operations are stores/writes, doesn't (A) need to be smp_mb()? Thanx, Paul > + * WRITE $data READ $data > + * smp_wmb() (B) smp_mb() (D) > + * STORE ->data_head WRITE ->data_tail > + * > + * Where A pairs with D, and B pairs with C. > + * > + * I don't think A needs to be a full barrier because we won't in fact > + * write data until we see the store from userspace. So we simply don't > + * issue the data WRITE until we observe it. > + * > + * OTOH, D needs to be a full barrier since it separates the data READ > + * from the tail WRITE. > + * > + * For B a WMB is sufficient since it separates two WRITEs, and for C > + * an RMB is sufficient since it separates two READs. > + * > + * See perf_output_begin(). > */ > + smp_wmb(); > rb->user_page->data_head = head; > > /* > @@ -154,6 +175,8 @@ int perf_output_begin(struct perf_output > * Userspace could choose to issue a mb() before updating the > * tail pointer. So that all reads will be completed before the > * write is issued. > + * > + * See perf_output_put_handle(). > */ > tail = ACCESS_ONCE(rb->user_page->data_tail); > smp_rmb(); > ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc
  2013-10-28 16:34       ` Paul E. McKenney
@ 2013-10-28 20:17         ` Oleg Nesterov
  2013-10-28 20:58           ` Victor Kaplansky
  0 siblings, 1 reply; 74+ messages in thread
From: Oleg Nesterov @ 2013-10-28 20:17 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Peter Zijlstra, Victor Kaplansky, Frederic Weisbecker,
    Anton Blanchard, Benjamin Herrenschmidt, LKML, Linux PPC dev,
    Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On 10/28, Paul E. McKenney wrote:
>
> On Mon, Oct 28, 2013 at 02:26:34PM +0100, Peter Zijlstra wrote:
> > --- linux-2.6.orig/kernel/events/ring_buffer.c
> > +++ linux-2.6/kernel/events/ring_buffer.c
> > @@ -87,10 +87,31 @@ static void perf_output_put_handle(struc
> >  	goto out;
> >
> >  	/*
> > -	 * Publish the known good head. Rely on the full barrier implied
> > -	 * by atomic_dec_and_test() order the rb->head read and this
> > -	 * write.
> > +	 * Since the mmap() consumer (userspace) can run on a different CPU:
> > +	 *
> > +	 *   kernel                             user
> > +	 *
> > +	 *   READ ->data_tail                   READ ->data_head
> > +	 *   smp_rmb()	(A)                     smp_rmb()	(C)
>
> Given that both of the kernel's subsequent operations are stores/writes,
> doesn't (A) need to be smp_mb()?

Yes, this is my understanding^Wfeeling too, but I have to admit that
I can't really explain to myself why _exactly_ we need mb() here.

And let me copy-and-paste the artificial example from my previous email,

	bool	BUSY;
	data_t	DATA;

	bool try_to_get(data_t *data)
	{
		if (!BUSY)
			return false;

		rmb();

		*data = DATA;
		mb();
		BUSY = false;

		return true;
	}

	bool try_to_put(data_t *data)
	{
		if (BUSY)
			return false;

		mb();	// XXXXXXXX: do we really need it? I think yes.

		DATA = *data;
		wmb();
		BUSY = true;

		return true;
	}

(just in case, the code above obviously assumes that _get or _put can't
race with itself, but they can race with each other).

Could you confirm that try_to_put() actually needs mb() between
LOAD(BUSY) and STORE(DATA) ?
I am sure it actually needs, but I will appreciate it if you can explain why. IOW, how it is possible that without mb() try_to_put() can overwrite DATA before try_to_get() completes its "*data = DATA" in this particular case. Perhaps this can happen if, say, reader and writer share a level of cache or something like this... Oleg. ^ permalink raw reply [flat|nested] 74+ messages in thread
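Oleg's artificial example can be restated as a self-contained C11 sketch, with fences standing in for the kernel's mb()/rmb()/wmb(); the XXXXXXXX barrier he is asking about is marked. The data_t typedef is invented for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef int data_t;            /* stand-in for Oleg's data_t */

static _Atomic bool BUSY;      /* zero-initialized: false    */
static data_t DATA;

static bool try_to_get(data_t *data)
{
	if (!atomic_load_explicit(&BUSY, memory_order_relaxed))
		return false;

	atomic_thread_fence(memory_order_acquire);   /* rmb() */
	*data = DATA;
	atomic_thread_fence(memory_order_seq_cst);   /* mb()  */
	atomic_store_explicit(&BUSY, false, memory_order_relaxed);
	return true;
}

static bool try_to_put(data_t *data)
{
	if (atomic_load_explicit(&BUSY, memory_order_relaxed))
		return false;

	/* XXXXXXXX: the full barrier under discussion -- load->store
	 * ordering so DATA is not overwritten before a concurrent
	 * try_to_get() has finished reading it. */
	atomic_thread_fence(memory_order_seq_cst);   /* mb()  */

	DATA = *data;
	atomic_thread_fence(memory_order_release);   /* wmb() */
	atomic_store_explicit(&BUSY, true, memory_order_relaxed);
	return true;
}
```

Each function is assumed never to race with itself, only with the other, matching Oleg's caveat; the load-to-store ordering at XXXXXXXX is exactly what rmb()/wmb() alone cannot provide.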
* Re: perf events ring buffer memory barrier on powerpc 2013-10-28 20:17 ` Oleg Nesterov @ 2013-10-28 20:58 ` Victor Kaplansky 2013-10-29 10:21 ` Peter Zijlstra 2013-10-30 9:27 ` Paul E. McKenney 0 siblings, 2 replies; 74+ messages in thread From: Victor Kaplansky @ 2013-10-28 20:58 UTC (permalink / raw) To: Oleg Nesterov Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Paul E. McKenney, Peter Zijlstra Oleg Nesterov <oleg@redhat.com> wrote on 10/28/2013 10:17:35 PM: > mb(); // XXXXXXXX: do we really need it? I think yes. Oh, it is hard to argue with feelings. Also, it is easy to be on conservative side and put the barrier here just in case. But I still insist that the barrier is redundant in your example. -- Victor ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-28 20:58 ` Victor Kaplansky @ 2013-10-29 10:21 ` Peter Zijlstra 2013-10-29 10:30 ` Peter Zijlstra 2013-10-30 9:27 ` Paul E. McKenney 1 sibling, 1 reply; 74+ messages in thread From: Peter Zijlstra @ 2013-10-29 10:21 UTC (permalink / raw) To: Victor Kaplansky Cc: Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Paul E. McKenney On Mon, Oct 28, 2013 at 10:58:58PM +0200, Victor Kaplansky wrote: > Oleg Nesterov <oleg@redhat.com> wrote on 10/28/2013 10:17:35 PM: > > > mb(); // XXXXXXXX: do we really need it? I think yes. > > Oh, it is hard to argue with feelings. Also, it is easy to be on > conservative side and put the barrier here just in case. I'll make it a full mb for now and too am curious to see the end of this discussion explaining things ;-) ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-29 10:21 ` Peter Zijlstra @ 2013-10-29 10:30 ` Peter Zijlstra 2013-10-29 10:35 ` Peter Zijlstra ` (2 more replies) 0 siblings, 3 replies; 74+ messages in thread From: Peter Zijlstra @ 2013-10-29 10:30 UTC (permalink / raw) To: Victor Kaplansky Cc: Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Paul E. McKenney On Tue, Oct 29, 2013 at 11:21:31AM +0100, Peter Zijlstra wrote: > On Mon, Oct 28, 2013 at 10:58:58PM +0200, Victor Kaplansky wrote: > > Oleg Nesterov <oleg@redhat.com> wrote on 10/28/2013 10:17:35 PM: > > > > > mb(); // XXXXXXXX: do we really need it? I think yes. > > > > Oh, it is hard to argue with feelings. Also, it is easy to be on > > conservative side and put the barrier here just in case. > > I'll make it a full mb for now and too am curious to see the end of this > discussion explaining things ;-) That is, I've now got this queued: --- Subject: perf: Fix perf ring buffer memory ordering From: Peter Zijlstra <peterz@infradead.org> Date: Mon Oct 28 13:55:29 CET 2013 The PPC64 people noticed a missing memory barrier and crufty old comments in the perf ring buffer code. So update all the comments and add the missing barrier. When the architecture implements local_t using atomic_long_t there will be double barriers issued; but short of introducing more conditional barrier primitives this is the best we can do. 
Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> Cc: michael@ellerman.id.au Cc: Paul McKenney <paulmck@linux.vnet.ibm.com> Cc: Michael Neuling <mikey@neuling.org> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: anton@samba.org Cc: benh@kernel.crashing.org Reported-by: Victor Kaplansky <victork@il.ibm.com> Tested-by: Victor Kaplansky <victork@il.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20131025173749.GG19466@laptop.lan --- include/uapi/linux/perf_event.h | 12 +++++++----- kernel/events/ring_buffer.c | 31 +++++++++++++++++++++++++++---- 2 files changed, 34 insertions(+), 9 deletions(-) Index: linux-2.6/include/uapi/linux/perf_event.h =================================================================== --- linux-2.6.orig/include/uapi/linux/perf_event.h +++ linux-2.6/include/uapi/linux/perf_event.h @@ -479,13 +479,15 @@ struct perf_event_mmap_page { /* * Control data for the mmap() data buffer. * - * User-space reading the @data_head value should issue an rmb(), on - * SMP capable platforms, after reading this value -- see - * perf_event_wakeup(). + * User-space reading the @data_head value should issue an smp_rmb(), + * after reading this value. * * When the mapping is PROT_WRITE the @data_tail value should be - * written by userspace to reflect the last read data. In this case - * the kernel will not over-write unread data. + * written by userspace to reflect the last read data, after issueing + * an smp_mb() to separate the data read from the ->data_tail store. + * In this case the kernel will not over-write unread data. + * + * See perf_output_put_handle() for the data ordering. 
*/ __u64 data_head; /* head in the data section */ __u64 data_tail; /* user-space written tail */ Index: linux-2.6/kernel/events/ring_buffer.c =================================================================== --- linux-2.6.orig/kernel/events/ring_buffer.c +++ linux-2.6/kernel/events/ring_buffer.c @@ -87,10 +87,31 @@ static void perf_output_put_handle(struc goto out; /* - * Publish the known good head. Rely on the full barrier implied - * by atomic_dec_and_test() order the rb->head read and this - * write. + * Since the mmap() consumer (userspace) can run on a different CPU: + * + * kernel user + * + * READ ->data_tail READ ->data_head + * smp_mb() (A) smp_rmb() (C) + * WRITE $data READ $data + * smp_wmb() (B) smp_mb() (D) + * STORE ->data_head WRITE ->data_tail + * + * Where A pairs with D, and B pairs with C. + * + * I don't think A needs to be a full barrier because we won't in fact + * write data until we see the store from userspace. So we simply don't + * issue the data WRITE until we observe it. Be conservative for now. + * + * OTOH, D needs to be a full barrier since it separates the data READ + * from the tail WRITE. + * + * For B a WMB is sufficient since it separates two WRITEs, and for C + * an RMB is sufficient since it separates two READs. + * + * See perf_output_begin(). */ + smp_wmb(); rb->user_page->data_head = head; /* @@ -154,9 +175,11 @@ int perf_output_begin(struct perf_output * Userspace could choose to issue a mb() before updating the * tail pointer. So that all reads will be completed before the * write is issued. + * + * See perf_output_put_handle(). */ tail = ACCESS_ONCE(rb->user_page->data_tail); - smp_rmb(); + smp_mb(); offset = head = local_read(&rb->head); head += size; if (unlikely(!perf_output_space(rb, tail, offset, head))) ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-29 10:30 ` Peter Zijlstra @ 2013-10-29 10:35 ` Peter Zijlstra 2013-10-29 20:15 ` Oleg Nesterov 2013-10-29 19:27 ` Vince Weaver 2013-10-29 21:23 ` Michael Neuling 2 siblings, 1 reply; 74+ messages in thread From: Peter Zijlstra @ 2013-10-29 10:35 UTC (permalink / raw) To: Victor Kaplansky Cc: Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Paul E. McKenney On Tue, Oct 29, 2013 at 11:30:57AM +0100, Peter Zijlstra wrote: > @@ -154,9 +175,11 @@ int perf_output_begin(struct perf_output > * Userspace could choose to issue a mb() before updating the > * tail pointer. So that all reads will be completed before the > * write is issued. > + * > + * See perf_output_put_handle(). > */ > tail = ACCESS_ONCE(rb->user_page->data_tail); > - smp_rmb(); > + smp_mb(); > offset = head = local_read(&rb->head); > head += size; > if (unlikely(!perf_output_space(rb, tail, offset, head))) That said; it would be very nice to be able to remove this barrier. This is in every event write path :/ ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-29 10:35 ` Peter Zijlstra @ 2013-10-29 20:15 ` Oleg Nesterov 0 siblings, 0 replies; 74+ messages in thread From: Oleg Nesterov @ 2013-10-29 20:15 UTC (permalink / raw) To: Peter Zijlstra Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Paul E. McKenney On 10/29, Peter Zijlstra wrote: > > On Tue, Oct 29, 2013 at 11:30:57AM +0100, Peter Zijlstra wrote: > > @@ -154,9 +175,11 @@ int perf_output_begin(struct perf_output > > * Userspace could choose to issue a mb() before updating the > > * tail pointer. So that all reads will be completed before the > > * write is issued. > > + * > > + * See perf_output_put_handle(). > > */ > > tail = ACCESS_ONCE(rb->user_page->data_tail); > > - smp_rmb(); > > + smp_mb(); > > offset = head = local_read(&rb->head); > > head += size; > > if (unlikely(!perf_output_space(rb, tail, offset, head))) > > That said; it would be very nice to be able to remove this barrier. This > is in every event write path :/ Yes.. And I'm afraid very much that I simply confused you. Perhaps Victor is right and we do not need this mb(). So I am waiting for the end of this story too. And btw I do not understand why we need it (or smp_rmb) right after ACCESS_ONCE(data_tail). Oleg. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-29 10:30 ` Peter Zijlstra 2013-10-29 10:35 ` Peter Zijlstra @ 2013-10-29 19:27 ` Vince Weaver 2013-10-30 10:42 ` Peter Zijlstra 2013-10-29 21:23 ` Michael Neuling 2 siblings, 1 reply; 74+ messages in thread From: Vince Weaver @ 2013-10-29 19:27 UTC (permalink / raw) To: Peter Zijlstra Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Paul E. McKenney On Tue, 29 Oct 2013, Peter Zijlstra wrote: > On Tue, Oct 29, 2013 at 11:21:31AM +0100, Peter Zijlstra wrote: > --- linux-2.6.orig/include/uapi/linux/perf_event.h > +++ linux-2.6/include/uapi/linux/perf_event.h > @@ -479,13 +479,15 @@ struct perf_event_mmap_page { > /* > * Control data for the mmap() data buffer. > * > - * User-space reading the @data_head value should issue an rmb(), on > - * SMP capable platforms, after reading this value -- see > - * perf_event_wakeup(). > + * User-space reading the @data_head value should issue an smp_rmb(), > + * after reading this value. so where's the patch fixing perf to use the new recommendations? Is this purely a performance thing or a correctness change? A change like this a bit of a pain, especially as userspace doesn't really have nice access to smb_mb() defines so a lot of cut-and-pasting will ensue for everyone who's trying to parse the mmap buffer. Vince ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-29 19:27 ` Vince Weaver @ 2013-10-30 10:42 ` Peter Zijlstra 2013-10-30 11:48 ` James Hogan 0 siblings, 1 reply; 74+ messages in thread From: Peter Zijlstra @ 2013-10-30 10:42 UTC (permalink / raw) To: Vince Weaver Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Paul E. McKenney, james.hogan On Tue, Oct 29, 2013 at 03:27:05PM -0400, Vince Weaver wrote: > On Tue, 29 Oct 2013, Peter Zijlstra wrote: > > > On Tue, Oct 29, 2013 at 11:21:31AM +0100, Peter Zijlstra wrote: > > --- linux-2.6.orig/include/uapi/linux/perf_event.h > > +++ linux-2.6/include/uapi/linux/perf_event.h > > @@ -479,13 +479,15 @@ struct perf_event_mmap_page { > > /* > > * Control data for the mmap() data buffer. > > * > > - * User-space reading the @data_head value should issue an rmb(), on > > - * SMP capable platforms, after reading this value -- see > > - * perf_event_wakeup(). > > + * User-space reading the @data_head value should issue an smp_rmb(), > > + * after reading this value. > > so where's the patch fixing perf to use the new recommendations? Fair enough, thanks for reminding me about that. See below. > Is this purely a performance thing or a correctness change? Correctness, although I suppose on most archs you'd be hard pushed to notice. > A change like this a bit of a pain, especially as userspace doesn't really > have nice access to smb_mb() defines so a lot of cut-and-pasting will > ensue for everyone who's trying to parse the mmap buffer. Agreed; we should maybe push for a user visible asm/barrier.h or so. --- Subject: perf, tool: Add required memory barriers To match patch bf378d341e48 ("perf: Fix perf ring buffer memory ordering") change userspace to also adhere to the ordering outlined. 
Most barrier implementations were gleaned from arch/*/include/asm/barrier.h and with the exception of metag I'm fairly sure they're correct. Cc: James Hogan <james.hogan@imgtec.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> --- tools/perf/perf.h | 39 +++++++++++++++++++++++++++++++++++++-- tools/perf/util/evlist.h | 2 +- 2 files changed, 38 insertions(+), 3 deletions(-) diff --git a/tools/perf/perf.h b/tools/perf/perf.h index f61c230beec4..1b8a0a2a63d4 100644 --- a/tools/perf/perf.h +++ b/tools/perf/perf.h @@ -4,6 +4,8 @@ #include <asm/unistd.h> #if defined(__i386__) +#define mb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory") +#define wmb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory") #define rmb() asm volatile("lock; addl $0,0(%%esp)" ::: "memory") #define cpu_relax() asm volatile("rep; nop" ::: "memory"); #define CPUINFO_PROC "model name" @@ -13,6 +15,8 @@ #endif #if defined(__x86_64__) +#define mb() asm volatile("mfence" ::: "memory") +#define wmb() asm volatile("sfence" ::: "memory") #define rmb() asm volatile("lfence" ::: "memory") #define cpu_relax() asm volatile("rep; nop" ::: "memory"); #define CPUINFO_PROC "model name" @@ -23,20 +27,28 @@ #ifdef __powerpc__ #include "../../arch/powerpc/include/uapi/asm/unistd.h" +#define mb() asm volatile ("sync" ::: "memory") +#define wmb() asm volatile ("sync" ::: "memory") #define rmb() asm volatile ("sync" ::: "memory") #define cpu_relax() asm volatile ("" ::: "memory"); #define CPUINFO_PROC "cpu" #endif #ifdef __s390__ +#define mb() asm volatile("bcr 15,0" ::: "memory") +#define wmb() asm volatile("bcr 15,0" ::: "memory") #define rmb() asm volatile("bcr 15,0" ::: "memory") #define cpu_relax() asm volatile("" ::: "memory"); #endif #ifdef __sh__ #if defined(__SH4A__) || defined(__SH5__) +# define mb() asm volatile("synco" ::: "memory") +# define wmb() asm volatile("synco" ::: "memory") # define rmb() asm volatile("synco" ::: "memory") #else +# define mb() asm volatile("" ::: "memory") +# define 
wmb() asm volatile("" ::: "memory") # define rmb() asm volatile("" ::: "memory") #endif #define cpu_relax() asm volatile("" ::: "memory") @@ -44,24 +56,38 @@ #endif #ifdef __hppa__ +#define mb() asm volatile("" ::: "memory") +#define wmb() asm volatile("" ::: "memory") #define rmb() asm volatile("" ::: "memory") #define cpu_relax() asm volatile("" ::: "memory"); #define CPUINFO_PROC "cpu" #endif #ifdef __sparc__ +#ifdef __LP64__ +#define mb() asm volatile("ba,pt %%xcc, 1f\n" \ + "membar #StoreLoad\n" \ + "1:\n"":::"memory") +#else +#define mb() asm volatile("":::"memory") +#endif +#define wmb() asm volatile("":::"memory") #define rmb() asm volatile("":::"memory") #define cpu_relax() asm volatile("":::"memory") #define CPUINFO_PROC "cpu" #endif #ifdef __alpha__ +#define mb() asm volatile("mb" ::: "memory") +#define wmb() asm volatile("wmb" ::: "memory") #define rmb() asm volatile("mb" ::: "memory") #define cpu_relax() asm volatile("" ::: "memory") #define CPUINFO_PROC "cpu model" #endif #ifdef __ia64__ +#define mb() asm volatile ("mf" ::: "memory") +#define wmb() asm volatile ("mf" ::: "memory") #define rmb() asm volatile ("mf" ::: "memory") #define cpu_relax() asm volatile ("hint @pause" ::: "memory") #define CPUINFO_PROC "model name" @@ -72,35 +98,44 @@ * Use the __kuser_memory_barrier helper in the CPU helper page. See * arch/arm/kernel/entry-armv.S in the kernel source for details. 
 */
+#define mb()		((void(*)(void))0xffff0fa0)()
+#define wmb()		((void(*)(void))0xffff0fa0)()
 #define rmb()		((void(*)(void))0xffff0fa0)()
 #define cpu_relax()	asm volatile("":::"memory")
 #define CPUINFO_PROC	"Processor"
 #endif

 #ifdef __aarch64__
-#define rmb()		asm volatile("dmb ld" ::: "memory")
+#define mb()		asm volatile("dmb ish" ::: "memory")
+#define wmb()		asm volatile("dmb ishst" ::: "memory")
+#define rmb()		asm volatile("dmb ishld" ::: "memory")
 #define cpu_relax()	asm volatile("yield" ::: "memory")
 #endif

 #ifdef __mips__
-#define rmb()		asm volatile(					\
+#define mb()		asm volatile(					\
 				".set	mips2\n\t"			\
 				"sync\n\t"				\
 				".set	mips0"				\
 				: /* no output */			\
 				: /* no input */			\
 				: "memory")
+#define wmb()	mb()
+#define rmb()	mb()
 #define cpu_relax()	asm volatile("" ::: "memory")
 #define CPUINFO_PROC	"cpu model"
 #endif

 #ifdef __arc__
+#define mb()		asm volatile("" ::: "memory")
+#define wmb()		asm volatile("" ::: "memory")
 #define rmb()		asm volatile("" ::: "memory")
 #define cpu_relax()	rmb()
 #define CPUINFO_PROC	"Processor"
 #endif

 #ifdef __metag__
+/* XXX no clue */
 #define rmb()		asm volatile("" ::: "memory")
 #define cpu_relax()	asm volatile("" ::: "memory")
 #define CPUINFO_PROC	"CPU"

diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 6e8acc9abe38..8ab1b5ae4a0e 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -189,7 +189,7 @@ static inline void perf_mmap__write_tail(struct perf_mmap *md,
 	/*
 	 * ensure all reads are done before we write the tail out.
 	 */
-	/* mb(); */
+	mb();
 	pc->data_tail = tail;
 }

^ permalink raw reply related	[flat|nested] 74+ messages in thread
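Put together, a user-space consumer following the documented protocol (read ->data_head, rmb(), copy the data, mb(), store ->data_tail) might look like the sketch below. The struct is a minimal mock of the real perf_event_mmap_page (only the two ABI field names are real), and portable GCC __atomic builtins stand in for the per-architecture asm macros added by the patch.

```c
#include <assert.h>
#include <stddef.h>

#define DATA_SZ 64   /* invented; the real data area is sized at mmap() time */

struct perf_page {
	volatile unsigned long data_head;   /* kernel-written */
	volatile unsigned long data_tail;   /* user-written   */
	unsigned char data[DATA_SZ];
};

/* Portable stand-ins for the per-arch macros in tools/perf/perf.h. */
#define rmb() __atomic_thread_fence(__ATOMIC_ACQUIRE)
#define mb()  __atomic_thread_fence(__ATOMIC_SEQ_CST)

/* Drain all currently published bytes into dst; returns count copied. */
static unsigned long mmap_read(struct perf_page *pg,
			       unsigned char *dst, unsigned long max)
{
	unsigned long head = pg->data_head;  /* ACCESS_ONCE via volatile */
	rmb();                               /* (C): data reads after the head read */

	unsigned long tail = pg->data_tail;
	unsigned long n = 0;

	while (tail != head && n < max)
		dst[n++] = pg->data[tail++ % DATA_SZ];

	mb();                                /* (D): reads complete before the tail store */
	pg->data_tail = tail;
	return n;
}
```

The mb() before the tail store is the load-to-store ordering that a plain wmb() cannot provide, which is why the thread settles on a full barrier there.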
* Re: perf events ring buffer memory barrier on powerpc 2013-10-30 10:42 ` Peter Zijlstra @ 2013-10-30 11:48 ` James Hogan 2013-10-30 12:48 ` Peter Zijlstra 0 siblings, 1 reply; 74+ messages in thread From: James Hogan @ 2013-10-30 11:48 UTC (permalink / raw) To: Peter Zijlstra Cc: Vince Weaver, Victor Kaplansky, Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Paul E. McKenney, linux-metag Hi Peter, On 30/10/13 10:42, Peter Zijlstra wrote: > Subject: perf, tool: Add required memory barriers > > To match patch bf378d341e48 ("perf: Fix perf ring buffer memory > ordering") change userspace to also adhere to the ordering outlined. > > Most barrier implementations were gleaned from > arch/*/include/asm/barrier.h and with the exception of metag I'm fairly > sure they're correct. Yeh... Short answer: For Meta you're probably best off assuming CONFIG_METAG_SMP_WRITE_REORDERING=n and just using compiler barriers. Long answer: The issue with write reordering between Meta's hardware threads beyond the cache is only with a particular SoC, and SMP is not used in production on it. It is possible to make the LINSYSEVENT_WR_COMBINE_FLUSH register writable to userspace (it's in a non-mappable region already) but even then the write to that register needs odd placement to be effective (before the shmem write rather than after - which isn't a place any existing barriers are guaranteed to be placed). I'm fairly confident we get away with it in the kernel, and userland normally just uses linked load/store instructions for atomicity which works fine. Cheers James ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-30 11:48 ` James Hogan @ 2013-10-30 12:48 ` Peter Zijlstra 0 siblings, 0 replies; 74+ messages in thread From: Peter Zijlstra @ 2013-10-30 12:48 UTC (permalink / raw) To: James Hogan Cc: Vince Weaver, Victor Kaplansky, Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Paul E. McKenney, linux-metag On Wed, Oct 30, 2013 at 11:48:44AM +0000, James Hogan wrote: > Hi Peter, > > On 30/10/13 10:42, Peter Zijlstra wrote: > > Subject: perf, tool: Add required memory barriers > > > > To match patch bf378d341e48 ("perf: Fix perf ring buffer memory > > ordering") change userspace to also adhere to the ordering outlined. > > > > Most barrier implementations were gleaned from > > arch/*/include/asm/barrier.h and with the exception of metag I'm fairly > > sure they're correct. > > Yeh... > > Short answer: > For Meta you're probably best off assuming > CONFIG_METAG_SMP_WRITE_REORDERING=n and just using compiler barriers. Thanks, fixed it that way. > Long answer: > The issue with write reordering between Meta's hardware threads beyond > the cache is only with a particular SoC, and SMP is not used in > production on it. > It is possible to make the LINSYSEVENT_WR_COMBINE_FLUSH register > writable to userspace (it's in a non-mappable region already) but even > then the write to that register needs odd placement to be effective > (before the shmem write rather than after - which isn't a place any > existing barriers are guaranteed to be placed). I'm fairly confident we > get away with it in the kernel, and userland normally just uses linked > load/store instructions for atomicity which works fine. Urgh.. sounds like way 'fun' for you ;-) ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-29 10:30 ` Peter Zijlstra 2013-10-29 10:35 ` Peter Zijlstra 2013-10-29 19:27 ` Vince Weaver @ 2013-10-29 21:23 ` Michael Neuling 2 siblings, 0 replies; 74+ messages in thread From: Michael Neuling @ 2013-10-29 21:23 UTC (permalink / raw) To: Peter Zijlstra Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Paul E. McKenney Peter Zijlstra <peterz@infradead.org> wrote: > On Tue, Oct 29, 2013 at 11:21:31AM +0100, Peter Zijlstra wrote: > > On Mon, Oct 28, 2013 at 10:58:58PM +0200, Victor Kaplansky wrote: > > > Oleg Nesterov <oleg@redhat.com> wrote on 10/28/2013 10:17:35 PM: > > > > > > > mb(); // XXXXXXXX: do we really need it? I think yes. > > > > > > Oh, it is hard to argue with feelings. Also, it is easy to be on > > > conservative side and put the barrier here just in case. > > > > I'll make it a full mb for now and too am curious to see the end of this > > discussion explaining things ;-) > > That is, I've now got this queued: Can we also CC stable@kernel.org? This has been around for a while. Mikey > > --- > Subject: perf: Fix perf ring buffer memory ordering > From: Peter Zijlstra <peterz@infradead.org> > Date: Mon Oct 28 13:55:29 CET 2013 > > The PPC64 people noticed a missing memory barrier and crufty old > comments in the perf ring buffer code. So update all the comments and > add the missing barrier. > > When the architecture implements local_t using atomic_long_t there > will be double barriers issued; but short of introducing more > conditional barrier primitives this is the best we can do. 
>
> Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> Cc: michael@ellerman.id.au
> Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
> Cc: Michael Neuling <mikey@neuling.org>
> Cc: Frederic Weisbecker <fweisbec@gmail.com>
> Cc: anton@samba.org
> Cc: benh@kernel.crashing.org
> Reported-by: Victor Kaplansky <victork@il.ibm.com>
> Tested-by: Victor Kaplansky <victork@il.ibm.com>
> Signed-off-by: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/20131025173749.GG19466@laptop.lan
> ---
>  include/uapi/linux/perf_event.h |   12 +++++++-----
>  kernel/events/ring_buffer.c     |   31 +++++++++++++++++++++++++++----
>  2 files changed, 34 insertions(+), 9 deletions(-)
>
> Index: linux-2.6/include/uapi/linux/perf_event.h
> ===================================================================
> --- linux-2.6.orig/include/uapi/linux/perf_event.h
> +++ linux-2.6/include/uapi/linux/perf_event.h
> @@ -479,13 +479,15 @@ struct perf_event_mmap_page {
>  	/*
>  	 * Control data for the mmap() data buffer.
>  	 *
> -	 * User-space reading the @data_head value should issue an rmb(), on
> -	 * SMP capable platforms, after reading this value -- see
> -	 * perf_event_wakeup().
> +	 * User-space reading the @data_head value should issue an smp_rmb(),
> +	 * after reading this value.
>  	 *
>  	 * When the mapping is PROT_WRITE the @data_tail value should be
> -	 * written by userspace to reflect the last read data. In this case
> -	 * the kernel will not over-write unread data.
> +	 * written by userspace to reflect the last read data, after issuing
> +	 * an smp_mb() to separate the data read from the ->data_tail store.
> +	 * In this case the kernel will not over-write unread data.
> +	 *
> +	 * See perf_output_put_handle() for the data ordering.
>  	 */
>  	__u64   data_head;	/* head in the data section */
>  	__u64	data_tail;	/* user-space written tail */
>
> Index: linux-2.6/kernel/events/ring_buffer.c
> ===================================================================
> --- linux-2.6.orig/kernel/events/ring_buffer.c
> +++ linux-2.6/kernel/events/ring_buffer.c
> @@ -87,10 +87,31 @@ static void perf_output_put_handle(struc
>  		goto out;
>
>  	/*
> -	 * Publish the known good head. Rely on the full barrier implied
> -	 * by atomic_dec_and_test() order the rb->head read and this
> -	 * write.
> +	 * Since the mmap() consumer (userspace) can run on a different CPU:
> +	 *
> +	 *   kernel			user
> +	 *
> +	 *   READ ->data_tail		READ ->data_head
> +	 *   smp_mb()	(A)		smp_rmb()	(C)
> +	 *   WRITE $data		READ $data
> +	 *   smp_wmb()	(B)		smp_mb()	(D)
> +	 *   STORE ->data_head		WRITE ->data_tail
> +	 *
> +	 * Where A pairs with D, and B pairs with C.
> +	 *
> +	 * I don't think A needs to be a full barrier because we won't in fact
> +	 * write data until we see the store from userspace. So we simply don't
> +	 * issue the data WRITE until we observe it. Be conservative for now.
> +	 *
> +	 * OTOH, D needs to be a full barrier since it separates the data READ
> +	 * from the tail WRITE.
> +	 *
> +	 * For B a WMB is sufficient since it separates two WRITEs, and for C
> +	 * an RMB is sufficient since it separates two READs.
> +	 *
> +	 * See perf_output_begin().
>  	 */
> +	smp_wmb();
>  	rb->user_page->data_head = head;
>
>  	/*
> @@ -154,9 +175,11 @@ int perf_output_begin(struct perf_output
>  	 * Userspace could choose to issue a mb() before updating the
>  	 * tail pointer. So that all reads will be completed before the
>  	 * write is issued.
> +	 *
> +	 * See perf_output_put_handle().
>  	 */
>  	tail = ACCESS_ONCE(rb->user_page->data_tail);
> -	smp_rmb();
> +	smp_mb();
>  	offset = head = local_read(&rb->head);
>  	head += size;
>  	if (unlikely(!perf_output_space(rb, tail, offset, head)))
>

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-28 20:58 ` Victor Kaplansky 2013-10-29 10:21 ` Peter Zijlstra @ 2013-10-30 9:27 ` Paul E. McKenney 2013-10-30 11:25 ` Peter Zijlstra 2013-10-30 13:28 ` Victor Kaplansky 1 sibling, 2 replies; 74+ messages in thread From: Paul E. McKenney @ 2013-10-30 9:27 UTC (permalink / raw) To: Victor Kaplansky Cc: Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Peter Zijlstra On Mon, Oct 28, 2013 at 10:58:58PM +0200, Victor Kaplansky wrote: > Oleg Nesterov <oleg@redhat.com> wrote on 10/28/2013 10:17:35 PM: > > > mb(); // XXXXXXXX: do we really need it? I think yes. > > Oh, it is hard to argue with feelings. Also, it is easy to be on > conservative side and put the barrier here just in case. > But I still insist that the barrier is redundant in your example. If you were to back up that insistence with a description of the orderings you are relying on, why other orderings are not important, and how the important orderings are enforced, I might be tempted to pay attention to your opinion. Thanx, Paul ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc
  2013-10-30  9:27		` Paul E. McKenney
@ 2013-10-30 11:25		  ` Peter Zijlstra
  2013-10-30 14:52		    ` Victor Kaplansky
  2013-10-31  6:40		    ` Paul E. McKenney
  2013-10-30 13:28		  ` Victor Kaplansky
  1 sibling, 2 replies; 74+ messages in thread
From: Peter Zijlstra @ 2013-10-30 11:25 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On Wed, Oct 30, 2013 at 02:27:25AM -0700, Paul E. McKenney wrote:
> On Mon, Oct 28, 2013 at 10:58:58PM +0200, Victor Kaplansky wrote:
> > Oleg Nesterov <oleg@redhat.com> wrote on 10/28/2013 10:17:35 PM:
> >
> > > 	mb();	// XXXXXXXX: do we really need it? I think yes.
> >
> > Oh, it is hard to argue with feelings. Also, it is easy to be on
> > conservative side and put the barrier here just in case.
> > But I still insist that the barrier is redundant in your example.
>
> If you were to back up that insistence with a description of the orderings
> you are relying on, why other orderings are not important, and how the
> important orderings are enforced, I might be tempted to pay attention
> to your opinion.

OK, so let me try.. a slightly less convoluted version of the code in
kernel/events/ring_buffer.c coupled with a userspace consumer would look
something like the below.

One important detail is that the kbuf part and the kbuf_write() are
strictly per cpu and we can thus rely on implicit ordering for those.

Only the userspace consumer can possibly run on another cpu, and thus we
need to ensure data consistency for those.

struct buffer {
	u64 size;
	u64 tail;
	u64 head;
	void *data;
};

struct buffer *kbuf, *ubuf;

/*
 * Determine there's space in the buffer to store data at @offset to
 * @head without overwriting data at @tail.
 */
bool space(u64 tail, u64 offset, u64 head)
{
	offset = (offset - tail) % kbuf->size;
	head   = (head   - tail) % kbuf->size;

	return (s64)(head - offset) >= 0;
}

/*
 * If there's space in the buffer; store the data @buf; otherwise
 * discard it.
 */
void kbuf_write(int sz, void *buf)
{
	u64 tail = ACCESS_ONCE(ubuf->tail); /* last location userspace read */
	u64 offset = kbuf->head; /* we already know where we last wrote */
	u64 head = offset + sz;

	if (!space(tail, offset, head)) {
		/* discard @buf */
		return;
	}

	/*
	 * Ensure that if we see the userspace tail (ubuf->tail) such
	 * that there is space to write @buf without overwriting data
	 * userspace hasn't seen yet, we won't in fact store data before
	 * that read completes.
	 */

	smp_mb(); /* A, matches with D */

	write(kbuf->data + offset, buf, sz);
	kbuf->head = head % kbuf->size;

	/*
	 * Ensure that we write all the @buf data before we update the
	 * userspace visible ubuf->head pointer.
	 */
	smp_wmb(); /* B, matches with C */

	ubuf->head = kbuf->head;
}

/*
 * Consume the buffer data and update the tail pointer to indicate to
 * kernel space there's 'free' space.
 */
void ubuf_read(void)
{
	u64 head, tail;

	tail = ACCESS_ONCE(ubuf->tail);
	head = ACCESS_ONCE(ubuf->head);

	/*
	 * Ensure we read the buffer boundaries before the actual buffer
	 * data...
	 */
	smp_rmb(); /* C, matches with B */

	while (tail != head) {
		obj = ubuf->data + tail;
		/* process obj */
		tail += obj->size;
		tail %= ubuf->size;
	}

	/*
	 * Ensure all data reads are complete before we issue the
	 * ubuf->tail update; once that update hits, kbuf_write() can
	 * observe and overwrite data.
	 */
	smp_mb(); /* D, matches with A */

	ubuf->tail = tail;
}

Now the whole crux of the question is if we need barrier A at all, since
the STORES issued by the @buf writes are dependent on the ubuf->tail
read. If the read shows no available space, we simply will not issue
those writes -- therefore we could argue we can avoid the memory
barrier.

However, that leaves D unpaired and me confused.

We must have D because otherwise the CPU could reorder that write before
the previous reads and the kernel could start overwriting data we're
still reading.. which seems like a bad deal.

Also, I'm not entirely sure on C, that too seems like a dependency, we
simply cannot read the buffer @tail before we've read the tail itself,
now can we? Similarly we cannot compare tail to head without having the
head read completed.

Could we replace A and C with an smp_read_barrier_depends()?

^ permalink raw reply	[flat|nested] 74+ messages in thread
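One way to read the A/B/C/D scheme above is through C11 acquire/release semantics. The sketch below is a hypothetical rephrasing of the `kbuf_write()`/`ubuf_read()` pair, not the kernel code itself: it assumes a single producer and single consumer, uses free-running indices rather than the modulo form above, and lets a release store of `head` play the role of barrier B, an acquire load of `head` play C, and full fences play A and D:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

enum { RING_SIZE = 8 };	/* power of two, for the modulo below */

struct ring {
	_Atomic uint64_t head;	/* written only by the producer */
	_Atomic uint64_t tail;	/* written only by the consumer */
	int data[RING_SIZE];
};

static int ring_write(struct ring *r, int val)
{
	uint64_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	uint64_t head = atomic_load_explicit(&r->head, memory_order_relaxed);

	if (head - tail >= RING_SIZE)
		return 0;				/* full: discard */

	atomic_thread_fence(memory_order_seq_cst);	/* A, matches D */
	r->data[head % RING_SIZE] = val;
	/* release acts as B: data writes ordered before the head update */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return 1;
}

static int ring_read(struct ring *r, int *val)
{
	/* acquire acts as C: head read ordered before the data reads */
	uint64_t head = atomic_load_explicit(&r->head, memory_order_acquire);
	uint64_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);

	if (tail == head)
		return 0;				/* empty */

	*val = r->data[tail % RING_SIZE];
	atomic_thread_fence(memory_order_seq_cst);	/* D, matches A */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_relaxed);
	return 1;
}
```

The open question in the thread, whether A can be weaker than a full fence, maps directly onto whether the producer's tail load may be relaxed here.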
* Re: perf events ring buffer memory barrier on powerpc 2013-10-30 11:25 ` Peter Zijlstra @ 2013-10-30 14:52 ` Victor Kaplansky 2013-10-30 15:39 ` Peter Zijlstra 2013-10-31 6:16 ` Paul E. McKenney 2013-10-31 6:40 ` Paul E. McKenney 1 sibling, 2 replies; 74+ messages in thread From: Victor Kaplansky @ 2013-10-30 14:52 UTC (permalink / raw) To: Peter Zijlstra Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Oleg Nesterov, Paul E. McKenney Peter Zijlstra <peterz@infradead.org> wrote on 10/30/2013 01:25:26 PM: > Also, I'm not entirely sure on C, that too seems like a dependency, we > simply cannot read the buffer @tail before we've read the tail itself, > now can we? Similarly we cannot compare tail to head without having the > head read completed. No, this one we cannot omit, because our problem on consumer side is not with @tail, which is written exclusively by consumer, but with @head. BTW, it is why you also don't need ACCESS_ONCE() around @tail, but only around @head read. -- Victor ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-30 14:52 ` Victor Kaplansky @ 2013-10-30 15:39 ` Peter Zijlstra 2013-10-30 17:14 ` Victor Kaplansky 2013-10-31 6:16 ` Paul E. McKenney 1 sibling, 1 reply; 74+ messages in thread From: Peter Zijlstra @ 2013-10-30 15:39 UTC (permalink / raw) To: Victor Kaplansky Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Oleg Nesterov, Paul E. McKenney On Wed, Oct 30, 2013 at 04:52:05PM +0200, Victor Kaplansky wrote: > Peter Zijlstra <peterz@infradead.org> wrote on 10/30/2013 01:25:26 PM: > > > Also, I'm not entirely sure on C, that too seems like a dependency, we > > simply cannot read the buffer @tail before we've read the tail itself, > > now can we? Similarly we cannot compare tail to head without having the > > head read completed. > > No, this one we cannot omit, because our problem on consumer side is not > with @tail, which is written exclusively by consumer, but with @head. Ah indeed, my argument was flawed in that @head is the important part. But we still do a comparison of @tail against @head before we do further reads. Although I suppose speculative reads are allowed -- they don't have the destructive behaviour speculative writes have -- and thus we could in fact get reorder issues. But since it is still a dependent load in that we do that @tail vs @head comparison before doing other loads, wouldn't a read_barrier_depends() be sufficient? Or do we still need a complete rmb? > BTW, it is why you also don't need ACCESS_ONCE() around @tail, but only > around > @head read. Agreed, the ACCESS_ONCE() around tail is superfluous since we're the one updating tail, so there's no problem with the value changing unexpectedly. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc
  2013-10-30 15:39		` Peter Zijlstra
@ 2013-10-30 17:14		  ` Victor Kaplansky
  2013-10-30 17:44		    ` Peter Zijlstra
  0 siblings, 1 reply; 74+ messages in thread
From: Victor Kaplansky @ 2013-10-30 17:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Paul E. McKenney

Peter Zijlstra <peterz@infradead.org> wrote on 10/30/2013 05:39:31 PM:

> Although I suppose speculative reads are allowed -- they don't have the
> destructive behaviour speculative writes have -- and thus we could in
> fact get reorder issues.

I agree.

> But since it is still a dependent load in that we do that @tail vs @head
> comparison before doing other loads, wouldn't a read_barrier_depends()
> be sufficient? Or do we still need a complete rmb?

We need a complete rmb() here IMO. I think there is a fundamental
difference between loads and stores in this aspect. Loads are allowed to
be hoisted by the compiler or executed speculatively by HW. To prevent
the load "*(ubuf->data + tail)" from being hoisted beyond the "ubuf->head"
load you would need something like this:

void
ubuf_read(void)
{
	u64 head, tail;

	tail = ubuf->tail;
	head = ACCESS_ONCE(ubuf->head);

	/*
	 * Ensure we read the buffer boundaries before the actual buffer
	 * data...
	 */

	while (tail != head) {
		smp_read_barrier_depends();	/* for Alpha */
		obj = *(ubuf->data + head - 128);
		/* process obj */
		tail += obj->size;
		tail %= ubuf->size;
	}

	/*
	 * Ensure all data reads are complete before we issue the
	 * ubuf->tail update; once that update hits, kbuf_write() can
	 * observe and overwrite data.
	 */
	smp_mb();	/* D, matches with A */

	ubuf->tail = tail;
}

(note that "head" is part of the address calculation of the obj load now).

But, even in this demo example, some "smp_read_barrier_depends()" before
"obj = *(ubuf->data + head - 128);" is required for architectures like
Alpha. Though, on more sane architectures, "smp_read_barrier_depends()"
will be translated to nothing.

Regards,
-- Victor

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-30 17:14 ` Victor Kaplansky @ 2013-10-30 17:44 ` Peter Zijlstra 0 siblings, 0 replies; 74+ messages in thread From: Peter Zijlstra @ 2013-10-30 17:44 UTC (permalink / raw) To: Victor Kaplansky Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Oleg Nesterov, Paul E. McKenney On Wed, Oct 30, 2013 at 07:14:29PM +0200, Victor Kaplansky wrote: > We need a complete rmb() here IMO. I think there is a fundamental > difference between load and stores in this aspect. Load are allowed to > be hoisted by compiler or executed speculatively by HW. To prevent > load "*(ubuf->data + tail)" to be hoisted beyond "ubuf->head" load you > would need something like this: Indeed, we could compute and load ->data + tail the moment we've completed the tail load but before we've completed the head load and done the comparison. So yes, full rmb() it is. > void > ubuf_read(void) > { > u64 head, tail; > > tail = ubuf->tail; > head = ACCESS_ONCE(ubuf->head); > > /* > * Ensure we read the buffer boundaries before the actual buffer > * data... > */ > > while (tail != head) { > smp_read_barrier_depends(); /* for Alpha */ > obj = *(ubuf->data + head - 128); > /* process obj */ > tail += obj->size; > tail %= ubuf->size; > } > > /* > * Ensure all data reads are complete before we issue the > * ubuf->tail update; once that update hits, kbuf_write() can > * observe and overwrite data. > */ > smp_mb(); /* D, matches with A */ > > ubuf->tail = tail; > } > > (note that "head" is part of address calculation of obj load now). Right, explicit dependent loads; I was hoping the conditional in between might be enough, but as argued above it is not. The above cannot work in our case though, we must use tail to find the obj since we have variable size objects. 
> But, even in this demo example some "smp_read_barrier_depends()" before > "obj = *(ubuf->data + head - 100);" is required for architectures > like Alpha. Though, on more sane architectures "smp_read_barrier_depends()" > will be translated to nothing. Sure.. I know all about that. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc
  2013-10-30 14:52		` Victor Kaplansky
  2013-10-30 15:39		  ` Peter Zijlstra
@ 2013-10-31  6:16		  ` Paul E. McKenney
  2013-11-01 13:12		    ` Victor Kaplansky
  1 sibling, 1 reply; 74+ messages in thread
From: Paul E. McKenney @ 2013-10-31  6:16 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: Peter Zijlstra, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Oleg Nesterov

On Wed, Oct 30, 2013 at 04:52:05PM +0200, Victor Kaplansky wrote:
> Peter Zijlstra <peterz@infradead.org> wrote on 10/30/2013 01:25:26 PM:
>
> > Also, I'm not entirely sure on C, that too seems like a dependency, we
> > simply cannot read the buffer @tail before we've read the tail itself,
> > now can we? Similarly we cannot compare tail to head without having the
> > head read completed.
>
> No, this one we cannot omit, because our problem on consumer side is not
> with @tail, which is written exclusively by consumer, but with @head.
>
> BTW, it is why you also don't need ACCESS_ONCE() around @tail, but only
> around @head read.

If you omit the ACCESS_ONCE() calls around @tail, the compiler is within
its rights to combine adjacent operations and also to invent loads and
stores, for example, in cases of register pressure.  It is also within
its rights to do piece-at-a-time loads and stores, which might sound
unlikely, but which has actually happened when the compiler figures
out exactly what is to be stored at compile time, especially on hardware
that only allows small immediate values.

So the ACCESS_ONCE() calls are not optional, the current contents of
Documentation/circular-buffers.txt notwithstanding.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 74+ messages in thread
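For reference, the ACCESS_ONCE() under discussion was defined in include/linux/compiler.h as a volatile cast; the helper functions around it below are illustrative only:

```c
#include <assert.h>

/*
 * The kernel's ACCESS_ONCE() of that era (include/linux/compiler.h):
 * a volatile cast that forces the compiler to emit exactly one
 * full-width load or store, ruling out the invented, combined, or
 * torn accesses Paul describes.
 */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

/*
 * Take one stable snapshot of a shared index (e.g. @head): every later
 * use of the returned value refers to this single load, rather than to
 * possibly-refetched reads of the shared location.
 */
static unsigned long snapshot(unsigned long *p)
{
	return ACCESS_ONCE(*p);
}

/* Publish a new index value with a single full-width store. */
static void publish(unsigned long *p, unsigned long v)
{
	ACCESS_ONCE(*p) = v;
}
```

None of this adds ordering against other locations; it only constrains what the compiler may do with this one access, which is exactly the distinction the thread is circling.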
* Re: perf events ring buffer memory barrier on powerpc
  2013-10-31  6:16		` Paul E. McKenney
@ 2013-11-01 13:12		  ` Victor Kaplansky
  2013-11-02 16:36		    ` Paul E. McKenney
  0 siblings, 1 reply; 74+ messages in thread
From: Victor Kaplansky @ 2013-11-01 13:12 UTC (permalink / raw)
  To: paulmck
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Peter Zijlstra

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/31/2013
08:16:02 AM:

> > BTW, it is why you also don't need ACCESS_ONCE() around @tail, but only
> > around @head read.

Just to be sure that we are talking about the same code - I was
considering ACCESS_ONCE() around @tail in point AAA in the following
example from Documentation/circular-buffers.txt for CONSUMER:

	unsigned long head = ACCESS_ONCE(buffer->head);
	unsigned long tail = buffer->tail;	/* AAA */

	if (CIRC_CNT(head, tail, buffer->size) >= 1) {
		/* read index before reading contents at that index */
		smp_read_barrier_depends();

		/* extract one item from the buffer */
		struct item *item = buffer[tail];

		consume_item(item);

		smp_mb(); /* finish reading descriptor before incrementing
			     tail */

		buffer->tail = (tail + 1) & (buffer->size - 1);	/* BBB */
	}

> If you omit the ACCESS_ONCE() calls around @tail, the compiler is within
> its rights to combine adjacent operations and also to invent loads and
> stores, for example, in cases of register pressure.

Right. And I was completely aware of these possible transformations when
I said that ACCESS_ONCE() around @tail in point AAA is redundant. Moved,
or even completely dismissed, reads of @tail in consumer code are not a
problem at all, since @tail is written exclusively by the CONSUMER side.

> It is also within
> its rights to do piece-at-a-time loads and stores, which might sound
> unlikely, but which has actually happened when the compiler figures
> out exactly what is to be stored at compile time, especially on hardware
> that only allows small immediate values.

As for writes to @tail, the ACCESS_ONCE() around @tail at point AAA
doesn't prevent in any way an imaginary super-optimizing compiler
from moving around the store to @tail (which appears in the code at point
BBB).

It is why ACCESS_ONCE() at point AAA is completely redundant.

-- Victor

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-11-01 13:12 ` Victor Kaplansky @ 2013-11-02 16:36 ` Paul E. McKenney 2013-11-02 17:26 ` Paul E. McKenney 0 siblings, 1 reply; 74+ messages in thread From: Paul E. McKenney @ 2013-11-02 16:36 UTC (permalink / raw) To: Victor Kaplansky Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Oleg Nesterov, Peter Zijlstra On Fri, Nov 01, 2013 at 03:12:58PM +0200, Victor Kaplansky wrote: > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/31/2013 > 08:16:02 AM: > > > > BTW, it is why you also don't need ACCESS_ONCE() around @tail, but only > > > around > > > @head read. > > Just to be sure, that we are talking about the same code - I was > considering > ACCESS_ONCE() around @tail in point AAA in the following example from > Documentation/circular-buffers.txt for CONSUMER: > > unsigned long head = ACCESS_ONCE(buffer->head); > unsigned long tail = buffer->tail; /* AAA */ > > if (CIRC_CNT(head, tail, buffer->size) >= 1) { > /* read index before reading contents at that index */ > smp_read_barrier_depends(); > > /* extract one item from the buffer */ > struct item *item = buffer[tail]; > > consume_item(item); > > smp_mb(); /* finish reading descriptor before incrementing > tail */ > > buffer->tail = (tail + 1) & (buffer->size - 1); /* BBB */ > } Hmmm... I believe that we need to go back to the original code in Documentation/circular-buffers.txt. I do so at the bottom of this email. > > If you omit the ACCESS_ONCE() calls around @tail, the compiler is within > > its rights to combine adjacent operations and also to invent loads and > > stores, for example, in cases of register pressure. > > Right. And I was completely aware about these possible transformations when > said that ACCESS_ONCE() around @tail in point AAA is redundant. 
Moved, or > even > completely dismissed reads of @tail in consumer code, are not a problem at > all, > since @tail is written exclusively by CONSUMER side. I believe that the lack of ACCESS_ONCE() around the consumer's store to buffer->tail is at least a documentation problem. In the original consumer code, it is trapped between an smp_mb() and a spin_unlock(), but it is updating something that is read without synchronization by some other thread. > > It is also within > > its rights to do piece-at-a-time loads and stores, which might sound > > unlikely, but which can actually has happened when the compiler figures > > out exactly what is to be stored at compile time, especially on hardware > > that only allows small immediate values. > > As for writes to @tail, the ACCESS_ONCE around @tail at point AAA, > doesn't prevent in any way an imaginary super-optimizing compiler > from moving around the store to @tail (which appears in the code at point > BBB). > > It is why ACCESS_ONCE at point AAA is completely redundant. Agreed, it is under the lock that guards modifications, so AAA does not need ACCESS_ONCE(). 
OK, here is the producer from Documentation/circular-buffers.txt, with
some comments added:

	spin_lock(&producer_lock);

	unsigned long head = buffer->head;
	unsigned long tail = ACCESS_ONCE(buffer->tail);	/* PT */

	if (CIRC_SPACE(head, tail, buffer->size) >= 1) {
		/* insert one item into the buffer */
		struct item *item = buffer[head];

		produce_item(item);	/* PD */

		smp_wmb(); /* commit the item before incrementing the head */

		buffer->head = (head + 1) & (buffer->size - 1);	/* PH */

		/* wake_up() will make sure that the head is committed before
		 * waking anyone up */
		wake_up(consumer);
	}

	spin_unlock(&producer_lock);

And here is the consumer, also from Documentation/circular-buffers.txt:

	spin_lock(&consumer_lock);

	unsigned long head = ACCESS_ONCE(buffer->head);	/* CH */
	unsigned long tail = buffer->tail;

	if (CIRC_CNT(head, tail, buffer->size) >= 1) {
		/* read index before reading contents at that index */
		smp_read_barrier_depends();

		/* extract one item from the buffer */
		struct item *item = buffer[tail];	/* CD */

		consume_item(item);

		smp_mb(); /* finish reading descriptor before incrementing
			     tail */

		buffer->tail = (tail + 1) & (buffer->size - 1);	/* CT */
	}

	spin_unlock(&consumer_lock);

Here are the ordering requirements as I see them:

1.	The producer is not allowed to clobber a location that the
	consumer is in the process of reading from.

2.	The consumer is not allowed to read from a location that the
	producer has not yet completed writing to.

#1 is helped out by the fact that there is always an empty element in
the array, so that the producer will need to produce twice in a row to
catch up to where the consumer is currently consuming.  #2 has no such
benefit: The consumer can consume an item that has just now been
produced.

#1 requires that CD is ordered before CT in a way that pairs with the
ordering of PT and PD.
There is of course no effective ordering between PT and PD within a
given call to the producer, but we only need the ordering between the
read from PT for one call to the producer and the PD of the -next- call
to the producer, courtesy of the fact that there is always one empty
cell in the array.  Therefore, the required ordering between PT of one
call and PD of the next is provided by the unlock-lock pair.  The
ordering of CD and CT is of course provided by the smp_mb().  (And yes,
I was missing the unlock-lock pair earlier.  In my defense, you did
leave this unlock-lock pair out of your example.)

So ordering requirement #1 is handled by the original, but only if you
leave the locking in place.  The producer's smp_wmb() does not
necessarily order prior loads against subsequent stores, and the
wake_up() only guarantees ordering if something was actually awakened.
As noted earlier, the "if" does not necessarily provide ordering.

On to ordering requirement #2.  This requires that CH and CD are
ordered in a way that pairs with the ordering between PD and PH.  PD
and PH are both writes, so the smp_wmb() does the trick there.

The consumer side is a bit strange.  On DEC Alpha,
smp_read_barrier_depends() turns into smp_mb(), so that case is covered
(though by accident).  On other architectures, smp_read_barrier_depends()
generates no code, and there is no data dependency between the CH and
CD.  The dependency is instead between the read from ->tail and the
write, and as you noted, ->tail is written by the consumer, not the
producer.

But my battery is dying, so more later, including ACCESS_ONCE().

							Thanx, Paul

^ permalink raw reply	[flat|nested] 74+ messages in thread
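The "always an empty element" property that requirement #1 leans on comes straight from how the index helpers are defined. For reference, CIRC_CNT() and CIRC_SPACE() from include/linux/circ_buf.h are shown below (the size must be a power of two for the masking to work):

```c
#include <assert.h>

/*
 * From include/linux/circ_buf.h.  CIRC_CNT() is the number of filled
 * slots; CIRC_SPACE() is the number of free slots, computed as the
 * count with head advanced by one -- which is exactly what always
 * keeps one slot empty and lets the indices distinguish a full buffer
 * from an empty one.
 */
#define CIRC_CNT(head, tail, size)   (((head) - (tail)) & ((size) - 1))
#define CIRC_SPACE(head, tail, size) CIRC_CNT((tail), ((head) + 1), (size))
```

The masking also handles index wrap-around for free, since the subtraction is taken modulo the power-of-two size.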
* Re: perf events ring buffer memory barrier on powerpc
  2013-11-02 16:36                 ` Paul E. McKenney
@ 2013-11-02 17:26                   ` Paul E. McKenney
  0 siblings, 0 replies; 74+ messages in thread
From: Paul E. McKenney @ 2013-11-02 17:26 UTC (permalink / raw)
  To: Victor Kaplansky
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Peter Zijlstra, lfomicki,
	dhowells, mbatty

[ Adding David Howells, Lech Fomicki, and Mark Batty on CC for their
  thoughts given previous discussions. ]

On Sat, Nov 02, 2013 at 09:36:18AM -0700, Paul E. McKenney wrote:
> On Fri, Nov 01, 2013 at 03:12:58PM +0200, Victor Kaplansky wrote:
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/31/2013
> > 08:16:02 AM:
> >
> > > > BTW, it is why you also don't need ACCESS_ONCE() around @tail, but only
> > > > around
> > > > @head read.
> >
> > Just to be sure, that we are talking about the same code - I was
> > considering
> > ACCESS_ONCE() around @tail in point AAA in the following example from
> > Documentation/circular-buffers.txt for CONSUMER:
> >
> >         unsigned long head = ACCESS_ONCE(buffer->head);
> >         unsigned long tail = buffer->tail; /* AAA */
> >
> >         if (CIRC_CNT(head, tail, buffer->size) >= 1) {
> >                 /* read index before reading contents at that index */
> >                 smp_read_barrier_depends();
> >
> >                 /* extract one item from the buffer */
> >                 struct item *item = buffer[tail];
> >
> >                 consume_item(item);
> >
> >                 smp_mb(); /* finish reading descriptor before incrementing
> >                 tail */
> >
> >                 buffer->tail = (tail + 1) & (buffer->size - 1); /* BBB */
> >         }
>
> Hmmm...  I believe that we need to go back to the original code in
> Documentation/circular-buffers.txt.  I do so at the bottom of this email.
> > > > If you omit the ACCESS_ONCE() calls around @tail, the compiler is within > > > its rights to combine adjacent operations and also to invent loads and > > > stores, for example, in cases of register pressure. > > > > Right. And I was completely aware about these possible transformations when > > said that ACCESS_ONCE() around @tail in point AAA is redundant. Moved, or > > even > > completely dismissed reads of @tail in consumer code, are not a problem at > > all, > > since @tail is written exclusively by CONSUMER side. > > I believe that the lack of ACCESS_ONCE() around the consumer's store > to buffer->tail is at least a documentation problem. In the original > consumer code, it is trapped between an smp_mb() and a spin_unlock(), > but it is updating something that is read without synchronization by > some other thread. > > > > It is also within > > > its rights to do piece-at-a-time loads and stores, which might sound > > > unlikely, but which can actually has happened when the compiler figures > > > out exactly what is to be stored at compile time, especially on hardware > > > that only allows small immediate values. > > > > As for writes to @tail, the ACCESS_ONCE around @tail at point AAA, > > doesn't prevent in any way an imaginary super-optimizing compiler > > from moving around the store to @tail (which appears in the code at point > > BBB). > > > > It is why ACCESS_ONCE at point AAA is completely redundant. > > Agreed, it is under the lock that guards modifications, so AAA does not > need ACCESS_ONCE(). > > OK, here is the producer from Documentation/circular-buffers.txt, with > some comments added: > > spin_lock(&producer_lock); > > unsigned long head = buffer->head; The above is updated only under producer_lock, which we hold, so no ACCESS_ONCE() is needed for buffer->head. 
>	unsigned long tail = ACCESS_ONCE(buffer->tail); /* PT */
>
>	if (CIRC_SPACE(head, tail, buffer->size) >= 1) {
>		/* insert one item into the buffer */
>		struct item *item = buffer[head];
>
>		produce_item(item); /* PD */
>
>		smp_wmb(); /* commit the item before incrementing the head */
>
>		buffer->head = (head + 1) & (buffer->size - 1); /* PH */

The above needs to be something like:

		ACCESS_ONCE(buffer->head) = (head + 1) & (buffer->size - 1);

This is because we are writing to a shared variable that might be being
read concurrently.

>		/* wake_up() will make sure that the head is committed before
>		 * waking anyone up */
>		wake_up(consumer);
>	}
>
>	spin_unlock(&producer_lock);
>
> And here is the consumer, also from Documentation/circular-buffers.txt:
>
>	spin_lock(&consumer_lock);
>
>	unsigned long head = ACCESS_ONCE(buffer->head); /* CH */
>	unsigned long tail = buffer->tail;

The above is updated only under consumer_lock, which we hold, so no
ACCESS_ONCE() is needed for buffer->tail.

>
>	if (CIRC_CNT(head, tail, buffer->size) >= 1) {
>		/* read index before reading contents at that index */
>		smp_read_barrier_depends();
>
>		/* extract one item from the buffer */
>		struct item *item = buffer[tail]; /* CD */
>
>		consume_item(item);
>
>		smp_mb(); /* finish reading descriptor before incrementing tail */
>
>		buffer->tail = (tail + 1) & (buffer->size - 1); /* CT */

And here, for no-execution-cost documentation, if nothing else:

		ACCESS_ONCE(buffer->tail) = (tail + 1) & (buffer->size - 1);

>	}
>
>	spin_unlock(&consumer_lock);
>
> Here are the ordering requirements as I see them:
>
> 1.	The producer is not allowed to clobber a location that the
>	consumer is in the process of reading from.
>
> 2.	The consumer is not allowed to read from a location that the
>	producer has not yet completed writing to.
> > #1 is helped out by the fact that there is always an empty element in > the array, so that the producer will need to produce twice in a row > to catch up to where the consumer is currently consuming. #2 has no > such benefit: The consumer can consume an item that has just now been > produced. > > #1 requires that CD is ordered before CT in a way that pairs with the > ordering of PT and PD. There is of course no effective ordering between > PT and PD within a given call to the producer, but we only need the > ordering between the read from PT for one call to the producer and the > PD of the -next- call to the producer, courtesy of the fact that there > is always one empty cell in the array. Therefore, the required ordering > between PT of one call and PD of the next is provided by the unlock-lock > pair. The ordering of CD and CT is of course provided by the smp_mb(). > (And yes, I was missing the unlock-lock pair earlier. In my defense, > you did leave this unlock-lock pair out of your example.) > > So ordering requirement #1 is handled by the original, but only if you > leave the locking in place. The producer's smp_wmb() does not necessarily > order prior loads against subsequent stores, and the wake_up() only > guarantees ordering if something was actually awakened. As noted earlier, > the "if" does not necessarily provide ordering. > > On to ordering requirement #2. > > This requires that CH and CD is ordered in a way that pairs with ordering > between PD and PH. PD and PH are both writes, so the smp_wmb() does > the trick there. The consumer side is a bit strange. On DEC Alpha, > smp_read_barrier_dependes() turns into smp_mb(), so that case is covered > (though by accident). On other architectures, smp_read_barrier_depends() > generates no code, and there is no data dependency between the CH and CD. > The dependency is instead between the read from ->tail and the write, Sigh. 
Make that "The dependency is instead between the read from ->tail and
the read from the array."

> and as you noted, ->tail is written by the consumer, not the producer.

And non-dependent reads -can- be speculated, so the
smp_read_barrier_depends() needs to be at least an smp_rmb().  Again,
don't take my word for it, try it with either ppcmem or real weakly
ordered hardware.

I am not 100% confident of the patch below, but am getting there.
If a change is really needed, it must of course be propagated to the
uses within the Linux kernel.

							Thanx, Paul

> But my battery is dying, so more later, including ACCESS_ONCE().

documentation: Fix circular-buffer example.

The code sample in Documentation/circular-buffers.txt appears to have a
few ordering bugs.  This patch therefore applies the needed fixes.

Reported-by: Lech Fomicki <lfomicki@poczta.fm>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/Documentation/circular-buffers.txt b/Documentation/circular-buffers.txt
index 8117e5bf6065..a36bed3db4ee 100644
--- a/Documentation/circular-buffers.txt
+++ b/Documentation/circular-buffers.txt
@@ -170,7 +170,7 @@ The producer will look something like this:
 
 	smp_wmb(); /* commit the item before incrementing the head */
 
-	buffer->head = (head + 1) & (buffer->size - 1);
+	ACCESS_ONCE(buffer->head) = (head + 1) & (buffer->size - 1);
 
 	/* wake_up() will make sure that the head is committed before
 	 * waking anyone up */
@@ -183,9 +183,14 @@ This will instruct the CPU that the contents of the new item must be written
 before the head index makes it available to the consumer and then instructs the
 CPU that the revised head index must be written before the consumer is woken.
 
-Note that wake_up() doesn't have to be the exact mechanism used, but whatever
-is used must guarantee a (write) memory barrier between the update of the head
-index and the change of state of the consumer, if a change of state occurs.
+Note that wake_up() does not guarantee any sort of barrier unless something
+is actually awakened.  We therefore cannot rely on it for ordering.  However,
+there is always one element of the array left empty.  Therefore, the
+producer must produce two elements before it could possibly corrupt the
+element currently being read by the consumer.  Therefore, the unlock-lock
+pair between consecutive invocations of the consumer provides the necessary
+ordering between the read of the index indicating that the consumer has
+vacated a given element and the write by the producer to that same element.
 
 
 THE CONSUMER
@@ -200,7 +205,7 @@ The consumer will look something like this:
 
 	if (CIRC_CNT(head, tail, buffer->size) >= 1) {
 		/* read index before reading contents at that index */
-		smp_read_barrier_depends();
+		smp_rmb();
 
 		/* extract one item from the buffer */
 		struct item *item = buffer[tail];
@@ -209,7 +214,7 @@ The consumer will look something like this:
 
 		smp_mb(); /* finish reading descriptor before incrementing tail */
 
-		buffer->tail = (tail + 1) & (buffer->size - 1);
+		ACCESS_ONCE(buffer->tail) = (tail + 1) & (buffer->size - 1);
 	}
 
 	spin_unlock(&consumer_lock);
@@ -223,7 +228,10 @@ Note the use of ACCESS_ONCE() in both algorithms to read the opposition index.
 This prevents the compiler from discarding and reloading its cached value -
 which some compilers will do across smp_read_barrier_depends().  This isn't
 strictly needed if you can be sure that the opposition index will _only_ be
-used the once.
+used the once.  Similarly, ACCESS_ONCE() is used in both algorithms to
+write the thread's index.  This documents the fact that we are writing
+to something that can be read concurrently and also prevents the compiler
+from tearing the store.

===============

^ permalink raw reply related	[flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc
  2013-10-30 11:25                 ` Peter Zijlstra
  2013-10-30 14:52                   ` Victor Kaplansky
@ 2013-10-31  6:40                   ` Paul E. McKenney
  2013-11-01 14:25                     ` Victor Kaplansky
                                        ` (3 more replies)
  1 sibling, 4 replies; 74+ messages in thread
From: Paul E. McKenney @ 2013-10-31  6:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Ellerman, Michael Neuling

On Wed, Oct 30, 2013 at 12:25:26PM +0100, Peter Zijlstra wrote:
> On Wed, Oct 30, 2013 at 02:27:25AM -0700, Paul E. McKenney wrote:
> > On Mon, Oct 28, 2013 at 10:58:58PM +0200, Victor Kaplansky wrote:
> > > Oleg Nesterov <oleg@redhat.com> wrote on 10/28/2013 10:17:35 PM:
> > >
> > > > mb();	// XXXXXXXX: do we really need it? I think yes.
> > >
> > > Oh, it is hard to argue with feelings.  Also, it is easy to be on
> > > conservative side and put the barrier here just in case.
> > > But I still insist that the barrier is redundant in your example.
> >
> > If you were to back up that insistence with a description of the orderings
> > you are relying on, why other orderings are not important, and how the
> > important orderings are enforced, I might be tempted to pay attention
> > to your opinion.
>
> OK, so let me try.. a slightly less convoluted version of the code in
> kernel/events/ring_buffer.c coupled with a userspace consumer would look
> something like the below.
>
> One important detail is that the kbuf part and the kbuf_writer() are
> strictly per cpu and we can thus rely on implicit ordering for those.
>
> Only the userspace consumer can possibly run on another cpu, and thus we
> need to ensure data consistency for those.
>
> struct buffer {
>	u64 size;
>	u64 tail;
>	u64 head;
>	void *data;
> };
>
> struct buffer *kbuf, *ubuf;
>
> /*
>  * Determine there's space in the buffer to store data at @offset to
>  * @head without overwriting data at @tail.
>  */
> bool space(u64 tail, u64 offset, u64 head)
> {
>	offset = (offset - tail) % kbuf->size;
>	head = (head - tail) % kbuf->size;
>
>	return (s64)(head - offset) >= 0;
> }
>
> /*
>  * If there's space in the buffer; store the data @buf; otherwise
>  * discard it.
>  */
> void kbuf_write(int sz, void *buf)
> {
>	u64 tail = ACCESS_ONCE(ubuf->tail); /* last location userspace read */
>	u64 offset = kbuf->head; /* we already know where we last wrote */
>	u64 head = offset + sz;
>
>	if (!space(tail, offset, head)) {
>		/* discard @buf */
>		return;
>	}
>
>	/*
>	 * Ensure that if we see the userspace tail (ubuf->tail) such
>	 * that there is space to write @buf without overwriting data
>	 * userspace hasn't seen yet, we won't in fact store data before
>	 * that read completes.
>	 */
>
>	smp_mb(); /* A, matches with D */
>
>	write(kbuf->data + offset, buf, sz);
>	kbuf->head = head % kbuf->size;
>
>	/*
>	 * Ensure that we write all the @buf data before we update the
>	 * userspace visible ubuf->head pointer.
>	 */
>	smp_wmb(); /* B, matches with C */
>
>	ubuf->head = kbuf->head;
> }
>
> /*
>  * Consume the buffer data and update the tail pointer to indicate to
>  * kernel space there's 'free' space.
>  */
> void ubuf_read(void)
> {
>	u64 head, tail;
>
>	tail = ACCESS_ONCE(ubuf->tail);
>	head = ACCESS_ONCE(ubuf->head);
>
>	/*
>	 * Ensure we read the buffer boundaries before the actual buffer
>	 * data...
>	 */
>	smp_rmb(); /* C, matches with B */
>
>	while (tail != head) {
>		obj = ubuf->data + tail;
>		/* process obj */
>		tail += obj->size;
>		tail %= ubuf->size;
>	}
>
>	/*
>	 * Ensure all data reads are complete before we issue the
>	 * ubuf->tail update; once that update hits, kbuf_write() can
>	 * observe and overwrite data.
>	 */
>	smp_mb(); /* D, matches with A */
>
>	ubuf->tail = tail;
> }
>
>
> Now the whole crux of the question is if we need barrier A at all, since
> the STORES issued by the @buf writes are dependent on the ubuf->tail
> read.
The dependency you are talking about is via the "if" statement?
Even C/C++11 is not required to respect control dependencies.

This one is a bit annoying.  The x86 TSO means that you really only
need barrier(), ARM (recent ARM, anyway) and Power could use a weaker
barrier, and so on -- but smp_mb() emits a full barrier.

Perhaps a new smp_tmb() for TSO semantics, where reads are ordered
before reads, writes before writes, and reads before writes, but not
writes before reads?  Another approach would be to define a per-arch
barrier for this particular case.

> If the read shows no available space, we simply will not issue those
> writes -- therefore we could argue we can avoid the memory barrier.

Proving that means iterating through the permitted combinations of
compilers and architectures...  There is always hand-coded assembly
language, I suppose.

> However, that leaves D unpaired and me confused. We must have D because
> otherwise the CPU could reorder that write into the reads previous and
> the kernel could start overwriting data we're still reading.. which
> seems like a bad deal.

Yep.  If you were hand-coding only for x86 and s390, D would pair with
the required barrier() asm.

> Also, I'm not entirely sure on C, that too seems like a dependency, we
> simply cannot read the buffer @tail before we've read the tail itself,
> now can we? Similarly we cannot compare tail to head without having the
> head read completed.
>
> Could we replace A and C with an smp_read_barrier_depends()?

C, yes, given that you have ACCESS_ONCE() on the fetch from ->tail
and that the value fetch from ->tail feeds into the address used for
the "obj =" assignment.  A, not so much -- again, compilers are not
required to respect control dependencies.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc
  2013-10-31  6:40                   ` Paul E. McKenney
@ 2013-11-01 14:25                     ` Victor Kaplansky
  2013-11-02 17:28                       ` Paul E. McKenney
  0 siblings, 1 reply; 74+ messages in thread
From: Victor Kaplansky @ 2013-11-01 14:25 UTC (permalink / raw)
  To: paulmck
  Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Peter Zijlstra

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/31/2013
08:40:15 AM:

> > void ubuf_read(void)
> > {
> >	u64 head, tail;
> >
> >	tail = ACCESS_ONCE(ubuf->tail);
> >	head = ACCESS_ONCE(ubuf->head);
> >
> >	/*
> >	 * Ensure we read the buffer boundaries before the actual buffer
> >	 * data...
> >	 */
> >	smp_rmb(); /* C, matches with B */
> >
> >	while (tail != head) {
> >		obj = ubuf->data + tail;
> >		/* process obj */
> >		tail += obj->size;
> >		tail %= ubuf->size;
> >	}
> >
> >	/*
> >	 * Ensure all data reads are complete before we issue the
> >	 * ubuf->tail update; once that update hits, kbuf_write() can
> >	 * observe and overwrite data.
> >	 */
> >	smp_mb(); /* D, matches with A */
> >
> >	ubuf->tail = tail;
> > }

> > Could we replace A and C with an smp_read_barrier_depends()?
>
> C, yes, given that you have ACCESS_ONCE() on the fetch from ->tail
> and that the value fetch from ->tail feeds into the address used for
> the "obj =" assignment.

No! You must have a full smp_rmb() at C.  The race on the reader side
is not between the fetch of @tail and the read from the address pointed
to by @tail.  The real race here is between the fetch of @head and the
read of obj from the memory pointed to by @tail.

Regards,
-- Victor

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-11-01 14:25 ` Victor Kaplansky @ 2013-11-02 17:28 ` Paul E. McKenney 0 siblings, 0 replies; 74+ messages in thread From: Paul E. McKenney @ 2013-11-02 17:28 UTC (permalink / raw) To: Victor Kaplansky Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Oleg Nesterov, Peter Zijlstra On Fri, Nov 01, 2013 at 04:25:42PM +0200, Victor Kaplansky wrote: > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/31/2013 > 08:40:15 AM: > > > > void ubuf_read(void) > > > { > > > u64 head, tail; > > > > > > tail = ACCESS_ONCE(ubuf->tail); > > > head = ACCESS_ONCE(ubuf->head); > > > > > > /* > > > * Ensure we read the buffer boundaries before the actual buffer > > > * data... > > > */ > > > smp_rmb(); /* C, matches with B */ > > > > > > while (tail != head) { > > > obj = ubuf->data + tail; > > > /* process obj */ > > > tail += obj->size; > > > tail %= ubuf->size; > > > } > > > > > > /* > > > * Ensure all data reads are complete before we issue the > > > * ubuf->tail update; once that update hits, kbuf_write() can > > > * observe and overwrite data. > > > */ > > > smp_mb(); /* D, matches with A */ > > > > > > ubuf->tail = tail; > > > } > > > > Could we replace A and C with an smp_read_barrier_depends()? > > > > C, yes, given that you have ACCESS_ONCE() on the fetch from ->tail > > and that the value fetch from ->tail feeds into the address used for > > the "obj =" assignment. > > No! You must to have a full smp_rmb() at C. The race on the reader side > is not between fetch of @tail and read from address pointed by @tail. > The real race here is between a fetch of @head and read of obj from > memory pointed by @tail. I believe you are in fact correct, good catch. Thanx, Paul ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-31 6:40 ` Paul E. McKenney 2013-11-01 14:25 ` Victor Kaplansky @ 2013-11-01 14:56 ` Peter Zijlstra 2013-11-02 17:32 ` Paul E. McKenney 2013-11-01 16:11 ` Peter Zijlstra 2013-11-01 16:18 ` Peter Zijlstra 3 siblings, 1 reply; 74+ messages in thread From: Peter Zijlstra @ 2013-11-01 14:56 UTC (permalink / raw) To: Paul E. McKenney Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote: > > Now the whole crux of the question is if we need barrier A at all, since > > the STORES issued by the @buf writes are dependent on the ubuf->tail > > read. > > The dependency you are talking about is via the "if" statement? > Even C/C++11 is not required to respect control dependencies. > > This one is a bit annoying. The x86 TSO means that you really only > need barrier(), ARM (recent ARM, anyway) and Power could use a weaker > barrier, and so on -- but smp_mb() emits a full barrier. > > Perhaps a new smp_tmb() for TSO semantics, where reads are ordered > before reads, writes before writes, and reads before writes, but not > writes before reads? Another approach would be to define a per-arch > barrier for this particular case. I suppose we can only introduce new barrier primitives if there's more than 1 use-case. > > If the read shows no available space, we simply will not issue those > > writes -- therefore we could argue we can avoid the memory barrier. > > Proving that means iterating through the permitted combinations of > compilers and architectures... There is always hand-coded assembly > language, I suppose. I'm starting to think that while the C/C++ language spec says they can wreck the world by doing these silly optimization, real world users will push back for breaking their existing code. 
I'm fairly sure the GCC people _will_ get shouted at _loudly_ when they
break the kernel by doing crazy shit like that.

Given it's near impossible to write a correct program in C/C++ and
tagging the entire kernel with __atomic is equally not going to happen,
I think we must find a practical solution.

Either that, or we really need to consider forking the language and
compiler :-(

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-11-01 14:56 ` Peter Zijlstra @ 2013-11-02 17:32 ` Paul E. McKenney 2013-11-03 14:40 ` Paul E. McKenney 0 siblings, 1 reply; 74+ messages in thread From: Paul E. McKenney @ 2013-11-02 17:32 UTC (permalink / raw) To: Peter Zijlstra Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling On Fri, Nov 01, 2013 at 03:56:34PM +0100, Peter Zijlstra wrote: > On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote: > > > Now the whole crux of the question is if we need barrier A at all, since > > > the STORES issued by the @buf writes are dependent on the ubuf->tail > > > read. > > > > The dependency you are talking about is via the "if" statement? > > Even C/C++11 is not required to respect control dependencies. > > > > This one is a bit annoying. The x86 TSO means that you really only > > need barrier(), ARM (recent ARM, anyway) and Power could use a weaker > > barrier, and so on -- but smp_mb() emits a full barrier. > > > > Perhaps a new smp_tmb() for TSO semantics, where reads are ordered > > before reads, writes before writes, and reads before writes, but not > > writes before reads? Another approach would be to define a per-arch > > barrier for this particular case. > > I suppose we can only introduce new barrier primitives if there's more > than 1 use-case. There probably are others. > > > If the read shows no available space, we simply will not issue those > > > writes -- therefore we could argue we can avoid the memory barrier. > > > > Proving that means iterating through the permitted combinations of > > compilers and architectures... There is always hand-coded assembly > > language, I suppose. 
> I'm starting to think that while the C/C++ language spec says they can
> wreck the world by doing these silly optimizations, real world users will
> push back for breaking their existing code.
>
> I'm fairly sure the GCC people _will_ get shouted at _loudly_ when they
> break the kernel by doing crazy shit like that.
>
> Given it's near impossible to write a correct program in C/C++ and
> tagging the entire kernel with __atomic is equally not going to happen,
> I think we must find a practical solution.
>
> Either that, or we really need to consider forking the language and
> compiler :-(

Depends on how much benefit the optimizations provide.  If they provide
little or no benefit, I am with you, otherwise we will need to bite some
bullet or another.  Keep in mind that there is a lot of code in the
kernel that runs sequentially (e.g., due to being fully protected by
locks), and aggressive optimizations for that sort of code are harmless.

Can't say I know the answer at the moment, though.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-11-02 17:32 ` Paul E. McKenney @ 2013-11-03 14:40 ` Paul E. McKenney 2013-11-03 17:07 ` Will Deacon 0 siblings, 1 reply; 74+ messages in thread From: Paul E. McKenney @ 2013-11-03 14:40 UTC (permalink / raw) To: Peter Zijlstra Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling On Sat, Nov 02, 2013 at 10:32:39AM -0700, Paul E. McKenney wrote: > On Fri, Nov 01, 2013 at 03:56:34PM +0100, Peter Zijlstra wrote: > > On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote: > > > > Now the whole crux of the question is if we need barrier A at all, since > > > > the STORES issued by the @buf writes are dependent on the ubuf->tail > > > > read. > > > > > > The dependency you are talking about is via the "if" statement? > > > Even C/C++11 is not required to respect control dependencies. > > > > > > This one is a bit annoying. The x86 TSO means that you really only > > > need barrier(), ARM (recent ARM, anyway) and Power could use a weaker > > > barrier, and so on -- but smp_mb() emits a full barrier. > > > > > > Perhaps a new smp_tmb() for TSO semantics, where reads are ordered > > > before reads, writes before writes, and reads before writes, but not > > > writes before reads? Another approach would be to define a per-arch > > > barrier for this particular case. > > > > I suppose we can only introduce new barrier primitives if there's more > > than 1 use-case. > > There probably are others. If there was an smp_tmb(), I would likely use it in rcu_assign_pointer(). There are some corner cases that can happen with the current smp_wmb() that would be prevented by smp_tmb(). 
These corner cases are a bit strange, as follows:

	struct foo *gp;

	void P0(void)
	{
		struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

		if (!p)
			return;
		ACCESS_ONCE(p->a) = 0;
		BUG_ON(ACCESS_ONCE(p->a));
		rcu_assign_pointer(gp, p);
	}

	void P1(void)
	{
		struct foo *p = rcu_dereference(gp);

		if (!p)
			return;
		ACCESS_ONCE(p->a) = 1;
	}

With smp_wmb(), the BUG_ON() can occur because smp_wmb() does not
prevent the CPU from reordering the read in the BUG_ON() with the
rcu_assign_pointer().  With smp_tmb(), it could not.

Now, I am not too worried about this because I cannot think of any use
for code like that in P0() and P1().  But if there was an smp_tmb(),
it would be cleaner to make the BUG_ON() impossible.

							Thanx, Paul

> > > > If the read shows no available space, we simply will not issue those
> > > > writes -- therefore we could argue we can avoid the memory barrier.
> > >
> > > Proving that means iterating through the permitted combinations of
> > > compilers and architectures...  There is always hand-coded assembly
> > > language, I suppose.
> >
> > I'm starting to think that while the C/C++ language spec says they can
> > wreck the world by doing these silly optimizations, real world users will
> > push back for breaking their existing code.
> >
> > I'm fairly sure the GCC people _will_ get shouted at _loudly_ when they
> > break the kernel by doing crazy shit like that.
> >
> > Given it's near impossible to write a correct program in C/C++ and
> > tagging the entire kernel with __atomic is equally not going to happen,
> > I think we must find a practical solution.
> >
> > Either that, or we really need to consider forking the language and
> > compiler :-(
>
> Depends on how much benefit the optimizations provide.  If they provide
> little or no benefit, I am with you, otherwise we will need to bite some
> bullet or another.
> Keep in mind that there is a lot of code in the
> kernel that runs sequentially (e.g., due to being fully protected by
> locks), and aggressive optimizations for that sort of code are harmless.
>
> Can't say I know the answer at the moment, though.
>
> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc
  2013-11-03 14:40                             ` Paul E. McKenney
@ 2013-11-03 17:07                               ` Will Deacon
  2013-11-03 22:47                                 ` Paul E. McKenney
  0 siblings, 1 reply; 74+ messages in thread
From: Will Deacon @ 2013-11-03 17:07 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Victor Kaplansky, Oleg Nesterov, Anton Blanchard,
	Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev

On Sun, Nov 03, 2013 at 02:40:17PM +0000, Paul E. McKenney wrote:
> On Sat, Nov 02, 2013 at 10:32:39AM -0700, Paul E. McKenney wrote:
> > On Fri, Nov 01, 2013 at 03:56:34PM +0100, Peter Zijlstra wrote:
> > > On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote:
> > > > > Now the whole crux of the question is if we need barrier A at all, since
> > > > > the STORES issued by the @buf writes are dependent on the ubuf->tail
> > > > > read.
> > > >
> > > > The dependency you are talking about is via the "if" statement?
> > > > Even C/C++11 is not required to respect control dependencies.
> > > >
> > > > This one is a bit annoying.  The x86 TSO means that you really only
> > > > need barrier(), ARM (recent ARM, anyway) and Power could use a weaker
> > > > barrier, and so on -- but smp_mb() emits a full barrier.
> > > >
> > > > Perhaps a new smp_tmb() for TSO semantics, where reads are ordered
> > > > before reads, writes before writes, and reads before writes, but not
> > > > writes before reads?  Another approach would be to define a per-arch
> > > > barrier for this particular case.
> > >
> > > I suppose we can only introduce new barrier primitives if there's more
> > > than 1 use-case.

Which barrier did you have in mind when you refer to `recent ARM' above?  It
seems to me like you'd need a combination of dmb ishld and dmb ishst, since
the former doesn't order writes before writes.

Will

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-11-03 17:07 ` Will Deacon @ 2013-11-03 22:47 ` Paul E. McKenney 2013-11-04 9:57 ` Will Deacon 0 siblings, 1 reply; 74+ messages in thread From: Paul E. McKenney @ 2013-11-03 22:47 UTC (permalink / raw) To: Will Deacon Cc: Peter Zijlstra, Victor Kaplansky, Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev On Sun, Nov 03, 2013 at 05:07:59PM +0000, Will Deacon wrote: > On Sun, Nov 03, 2013 at 02:40:17PM +0000, Paul E. McKenney wrote: > > On Sat, Nov 02, 2013 at 10:32:39AM -0700, Paul E. McKenney wrote: > > > On Fri, Nov 01, 2013 at 03:56:34PM +0100, Peter Zijlstra wrote: > > > > On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote: > > > > > > Now the whole crux of the question is if we need barrier A at all, since > > > > > > the STORES issued by the @buf writes are dependent on the ubuf->tail > > > > > > read. > > > > > > > > > > The dependency you are talking about is via the "if" statement? > > > > > Even C/C++11 is not required to respect control dependencies. > > > > > > > > > > This one is a bit annoying. The x86 TSO means that you really only > > > > > need barrier(), ARM (recent ARM, anyway) and Power could use a weaker > > > > > barrier, and so on -- but smp_mb() emits a full barrier. > > > > > > > > > > Perhaps a new smp_tmb() for TSO semantics, where reads are ordered > > > > > before reads, writes before writes, and reads before writes, but not > > > > > writes before reads? Another approach would be to define a per-arch > > > > > barrier for this particular case. > > > > > > > > I suppose we can only introduce new barrier primitives if there's more > > > > than 1 use-case. > > Which barrier did you have in mind when you refer to `recent ARM' above? It > seems to me like you'd need a combination if dmb ishld and dmb ishst, since > the former doesn't order writes before writes. 
I heard a rumor that ARM had recently added a new dmb variant that acted similarly to PowerPC's lwsync, and it was on my list to follow up. Given your response, I am guessing that there is no truth to this rumor... Thanx, Paul ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-11-03 22:47 ` Paul E. McKenney @ 2013-11-04 9:57 ` Will Deacon 2013-11-04 10:52 ` Paul E. McKenney 0 siblings, 1 reply; 74+ messages in thread From: Will Deacon @ 2013-11-04 9:57 UTC (permalink / raw) To: Paul E. McKenney Cc: Peter Zijlstra, Victor Kaplansky, Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML Hi Paul, On Sun, Nov 03, 2013 at 10:47:12PM +0000, Paul E. McKenney wrote: > On Sun, Nov 03, 2013 at 05:07:59PM +0000, Will Deacon wrote: > > On Sun, Nov 03, 2013 at 02:40:17PM +0000, Paul E. McKenney wrote: > > > On Sat, Nov 02, 2013 at 10:32:39AM -0700, Paul E. McKenney wrote: > > > > On Fri, Nov 01, 2013 at 03:56:34PM +0100, Peter Zijlstra wrote: > > > > > On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote: > > > > > > > Now the whole crux of the question is if we need barrier A at all, since > > > > > > > the STORES issued by the @buf writes are dependent on the ubuf->tail > > > > > > > read. > > > > > > > > > > > > The dependency you are talking about is via the "if" statement? > > > > > > Even C/C++11 is not required to respect control dependencies. > > > > > > > > > > > > This one is a bit annoying. The x86 TSO means that you really only > > > > > > need barrier(), ARM (recent ARM, anyway) and Power could use a weaker > > > > > > barrier, and so on -- but smp_mb() emits a full barrier. > > > > > > > > > > > > Perhaps a new smp_tmb() for TSO semantics, where reads are ordered > > > > > > before reads, writes before writes, and reads before writes, but not > > > > > > writes before reads? Another approach would be to define a per-arch > > > > > > barrier for this particular case. > > > > > > > > > > I suppose we can only introduce new barrier primitives if there's more > > > > > than 1 use-case. > > > > Which barrier did you have in mind when you refer to `recent ARM' above? 
It > > seems to me like you'd need a combination if dmb ishld and dmb ishst, since > > the former doesn't order writes before writes. > > I heard a rumor that ARM had recently added a new dmb variant that acted > similarly to PowerPC's lwsync, and it was on my list to follow up. > > Given your response, I am guessing that there is no truth to this rumor... I think you're talking about the -ld option to dmb, which was introduced in ARMv8. That option orders loads against loads and stores, but doesn't order writes against writes. So you could do: dmb ishld dmb ishst but it's questionable whether that performs better than a dmb ish. Will ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-11-04 9:57 ` Will Deacon @ 2013-11-04 10:52 ` Paul E. McKenney 0 siblings, 0 replies; 74+ messages in thread From: Paul E. McKenney @ 2013-11-04 10:52 UTC (permalink / raw) To: Will Deacon Cc: Peter Zijlstra, Victor Kaplansky, Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML On Mon, Nov 04, 2013 at 09:57:17AM +0000, Will Deacon wrote: > Hi Paul, > > On Sun, Nov 03, 2013 at 10:47:12PM +0000, Paul E. McKenney wrote: > > On Sun, Nov 03, 2013 at 05:07:59PM +0000, Will Deacon wrote: > > > On Sun, Nov 03, 2013 at 02:40:17PM +0000, Paul E. McKenney wrote: > > > > On Sat, Nov 02, 2013 at 10:32:39AM -0700, Paul E. McKenney wrote: > > > > > On Fri, Nov 01, 2013 at 03:56:34PM +0100, Peter Zijlstra wrote: > > > > > > On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote: > > > > > > > > Now the whole crux of the question is if we need barrier A at all, since > > > > > > > > the STORES issued by the @buf writes are dependent on the ubuf->tail > > > > > > > > read. > > > > > > > > > > > > > > The dependency you are talking about is via the "if" statement? > > > > > > > Even C/C++11 is not required to respect control dependencies. > > > > > > > > > > > > > > This one is a bit annoying. The x86 TSO means that you really only > > > > > > > need barrier(), ARM (recent ARM, anyway) and Power could use a weaker > > > > > > > barrier, and so on -- but smp_mb() emits a full barrier. > > > > > > > > > > > > > > Perhaps a new smp_tmb() for TSO semantics, where reads are ordered > > > > > > > before reads, writes before writes, and reads before writes, but not > > > > > > > writes before reads? Another approach would be to define a per-arch > > > > > > > barrier for this particular case. > > > > > > > > > > > > I suppose we can only introduce new barrier primitives if there's more > > > > > > than 1 use-case. 
> > > > > > Which barrier did you have in mind when you refer to `recent ARM' above? It > > > seems to me like you'd need a combination if dmb ishld and dmb ishst, since > > > the former doesn't order writes before writes. > > > > I heard a rumor that ARM had recently added a new dmb variant that acted > > similarly to PowerPC's lwsync, and it was on my list to follow up. > > > > Given your response, I am guessing that there is no truth to this rumor... > > I think you're talking about the -ld option to dmb, which was introduced in > ARMv8. That option orders loads against loads and stores, but doesn't order > writes against writes. So you could do: > > dmb ishld > dmb ishst > > but it's questionable whether that performs better than a dmb ish. If Linus's smp_store_with_release_semantics() approach works out, ARM should be able to use its shiny new ldar and stlr instructions. Thanx, Paul ^ permalink raw reply [flat|nested] 74+ messages in thread
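[Editorial note: the smp_store_with_release_semantics() idea Paul mentions corresponds to what later became the kernel's smp_store_release()/smp_load_acquire() pair, and maps onto C11 release/acquire atomics, which ARMv8 compilers implement with exactly the stlr/ldar instructions named above. A hedged userspace sketch of the pattern, with illustrative names that are not the eventual kernel API:]

```c
#include <stdatomic.h>

/* Illustrative publish/consume pair: the release store keeps the payload
 * write ordered before the flag store; the acquire load keeps the payload
 * read ordered after the flag load. On ARMv8 these can compile to
 * stlr/ldar, with no separate dmb needed. */
static _Atomic int ready;
static int payload;

static void publish(int v)
{
	payload = v;                                            /* data first */
	atomic_store_explicit(&ready, 1, memory_order_release); /* then flag */
}

static int try_consume(void)
{
	if (atomic_load_explicit(&ready, memory_order_acquire))
		return payload;  /* payload read ordered after flag load */
	return -1;               /* nothing published yet */
}
```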
* Re: perf events ring buffer memory barrier on powerpc 2013-10-31 6:40 ` Paul E. McKenney 2013-11-01 14:25 ` Victor Kaplansky 2013-11-01 14:56 ` Peter Zijlstra @ 2013-11-01 16:11 ` Peter Zijlstra 2013-11-02 17:46 ` Paul E. McKenney 2013-11-01 16:18 ` Peter Zijlstra 3 siblings, 1 reply; 74+ messages in thread From: Peter Zijlstra @ 2013-11-01 16:11 UTC (permalink / raw) To: Paul E. McKenney Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote: > > void kbuf_write(int sz, void *buf) > > { > > u64 tail = ACCESS_ONCE(ubuf->tail); /* last location userspace read */ > > u64 offset = kbuf->head; /* we already know where we last wrote */ > > u64 head = offset + sz; > > > > if (!space(tail, offset, head)) { > > /* discard @buf */ > > return; > > } > > > > /* > > * Ensure that if we see the userspace tail (ubuf->tail) such > > * that there is space to write @buf without overwriting data > > * userspace hasn't seen yet, we won't in fact store data before > > * that read completes. > > */ > > > > smp_mb(); /* A, matches with D */ > > > > write(kbuf->data + offset, buf, sz); > > kbuf->head = head % kbuf->size; > > > > /* > > * Ensure that we write all the @buf data before we update the > > * userspace visible ubuf->head pointer. > > */ > > smp_wmb(); /* B, matches with C */ > > > > ubuf->head = kbuf->head; > > } > > Now the whole crux of the question is if we need barrier A at all, since > > the STORES issued by the @buf writes are dependent on the ubuf->tail > > read. > > The dependency you are talking about is via the "if" statement? > Even C/C++11 is not required to respect control dependencies. 
But surely we must be able to make it so; otherwise you'd never be able to write: void *ptr = obj1; void foo(void) { /* create obj2, obj3 */ smp_wmb(); /* ensure the objs are complete */ /* expose either obj2 or obj3 */ if (x) ptr = obj2; else ptr = obj3; /* free the unused one */ if (x) free(obj3); else free(obj2); } Earlier you said that 'volatile' or '__atomic' avoids speculative writes; so would: volatile void *ptr = obj1; Make the compiler respect control dependencies again? If so, could we somehow mark that !space() condition volatile? Currently the above would be considered a valid pattern. But you're saying it's not, because the compiler is free to expose both obj2 and obj3 (for however short a time) and thus the free of the 'unused' object is incorrect and can cause use-after-free. In fact, how can we be sure that: void *ptr = NULL; void bar(void) { void *obj = malloc(...); /* fill obj */ if (!err) rcu_assign_pointer(ptr, obj); else free(obj); } Does not get 'optimized' into: void bar(void) { void *obj = malloc(...); void *old_ptr = ptr; /* fill obj */ rcu_assign_pointer(ptr, obj); if (err) { /* because runtime profile data says this is unlikely */ ptr = old_ptr; free(obj); } } We _MUST_ be able to rely on control flow, otherwise we might as well all go back to writing kernels in asm. ^ permalink raw reply [flat|nested] 74+ messages in thread
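[Editorial note: for readers following the barrier labels, the kbuf_write() producer quoted earlier in this subthread pairs its barriers A and B with barriers D and C on the consumer side: read head, barrier C, read data, barrier D, publish tail. A userspace sketch of that matching consumer, with __sync_synchronize() standing in for the kernel barriers and all names assumed for illustration:]

```c
#include <string.h>

/* Userspace stand-ins for the kernel primitives used in the thread. */
#define smp_rmb()      __sync_synchronize()
#define smp_mb()       __sync_synchronize()
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

#define BUF_SIZE 64UL

static struct { unsigned long head, tail; } ubuf;
static unsigned char kbuf_data[BUF_SIZE];

/* Hypothetical consumer matching kbuf_write()'s barriers A and B. */
static int kbuf_read(void *buf, unsigned long sz)
{
	unsigned long tail = ubuf.tail;               /* we own tail */
	unsigned long head = ACCESS_ONCE(ubuf.head);

	if (head == tail)
		return 0;                             /* nothing to read */

	smp_rmb();  /* C, matches B: read head before reading the data */

	memcpy(buf, kbuf_data + (tail % BUF_SIZE), sz);

	smp_mb();   /* D, matches A: finish data reads before freeing space */

	ubuf.tail = (tail + sz) % BUF_SIZE;
	return (int)sz;
}
```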
* Re: perf events ring buffer memory barrier on powerpc 2013-11-01 16:11 ` Peter Zijlstra @ 2013-11-02 17:46 ` Paul E. McKenney 0 siblings, 0 replies; 74+ messages in thread From: Paul E. McKenney @ 2013-11-02 17:46 UTC (permalink / raw) To: Peter Zijlstra Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling On Fri, Nov 01, 2013 at 05:11:29PM +0100, Peter Zijlstra wrote: > On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote: > > > void kbuf_write(int sz, void *buf) > > > { > > > u64 tail = ACCESS_ONCE(ubuf->tail); /* last location userspace read */ > > > u64 offset = kbuf->head; /* we already know where we last wrote */ > > > u64 head = offset + sz; > > > > > > if (!space(tail, offset, head)) { > > > /* discard @buf */ > > > return; > > > } > > > > > > /* > > > * Ensure that if we see the userspace tail (ubuf->tail) such > > > * that there is space to write @buf without overwriting data > > > * userspace hasn't seen yet, we won't in fact store data before > > > * that read completes. > > > */ > > > > > > smp_mb(); /* A, matches with D */ > > > > > > write(kbuf->data + offset, buf, sz); > > > kbuf->head = head % kbuf->size; > > > > > > /* > > > * Ensure that we write all the @buf data before we update the > > > * userspace visible ubuf->head pointer. > > > */ > > > smp_wmb(); /* B, matches with C */ > > > > > > ubuf->head = kbuf->head; > > > } > > > > Now the whole crux of the question is if we need barrier A at all, since > > > the STORES issued by the @buf writes are dependent on the ubuf->tail > > > read. > > > > The dependency you are talking about is via the "if" statement? > > Even C/C++11 is not required to respect control dependencies. 
> > But surely we must be able to make it so; otherwise you'd never be able > to write: > > void *ptr = obj1; > > void foo(void) > { > > /* create obj2, obj3 */ > > smp_wmb(); /* ensure the objs are complete */ > > /* expose either obj2 or obj3 */ > if (x) > ptr = obj2; > else > ptr = obj3; OK, the smp_wmb() orders the creation and the exposing. But the compiler can do this: ptr = obj3; if (x) ptr = obj2; And that could momentarily expose obj3 to readers, and these readers might be fatally disappointed by the free() below. If you instead said: if (x) ACCESS_ONCE(ptr) = obj2; else ACCESS_ONCE(ptr) = obj3; then the general consensus appears to be that the compiler would not be permitted to carry out the above optimization. Since you have the smp_wmb(), readers that are properly ordered (e.g., smp_rmb() or rcu_dereference()) would be prevented from seeing pre-initialization state. > /* free the unused one */ > if (x) > free(obj3); > else > free(obj2); > } > > Earlier you said that 'volatile' or '__atomic' avoids speculative > writes; so would: > > volatile void *ptr = obj1; > > Make the compiler respect control dependencies again? If so, could we > somehow mark that !space() condition volatile? The compiler should, but the CPU is still free to ignore the control dependencies in the general case. We might be able to rely on weakly ordered hardware refraining from speculating stores, but not sure that this applies across all architectures of interest. We definitely can -not- rely on weakly ordered hardware refraining from speculating loads. > Currently the above would be considered a valid pattern. But you're > saying its not because the compiler is free to expose both obj2 and obj3 > (for however short a time) and thus the free of the 'unused' object is > incorrect and can cause use-after-free. Yes, it is definitely unsafe and invalid in absence of ACCESS_ONCE(). 
> In fact; how can we be sure that: > > void *ptr = NULL; > > void bar(void) > { > void *obj = malloc(...); > > /* fill obj */ > > if (!err) > rcu_assign_pointer(ptr, obj); > else > free(obj); > } > > Does not get 'optimized' into: > > void bar(void) > { > void *obj = malloc(...); > void *old_ptr = ptr; > > /* fill obj */ > > rcu_assign_pointer(ptr, obj); > if (err) { /* because runtime profile data says this is unlikely */ > ptr = old_ptr; > free(obj); > } > } In this particular case, the barrier() implied by the smp_wmb() in rcu_assign_pointer() will prevent this "optimization". However, other "optimizations" are the reason why I am working to introduce ACCESS_ONCE() into rcu_assign_pointer. > We _MUST_ be able to rely on control flow, otherwise me might as well > all go back to writing kernels in asm. It isn't -that- bad! ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 74+ messages in thread
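[Editorial note: Paul's objection and fix condense into one pattern: with plain stores the compiler may rewrite the two-armed store as an unconditional store plus a fix-up, transiently exposing the wrong object; volatile stores via ACCESS_ONCE() forbid inventing that extra store. A minimal userspace sketch, object names following Peter's example; it runs single-threaded, so the test only checks the functional result, not the ordering:]

```c
/* The kernel's ACCESS_ONCE() of the era: a volatile access the compiler
 * may not tear, duplicate, or invent. */
#define ACCESS_ONCE(x) (*(volatile __typeof__(x) *)&(x))

static int obj1 = 1, obj2 = 2, obj3 = 3;
static int *ptr = &obj1;

/* With plain "ptr = ..." the compiler could emit
 *     ptr = &obj3; if (x) ptr = &obj2;
 * momentarily exposing obj3 even when x is true. The volatile store
 * below may only happen once, on the taken arm. */
static void expose(int x)
{
	if (x)
		ACCESS_ONCE(ptr) = &obj2;
	else
		ACCESS_ONCE(ptr) = &obj3;
}
```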
* Re: perf events ring buffer memory barrier on powerpc 2013-10-31 6:40 ` Paul E. McKenney ` (2 preceding siblings ...) 2013-11-01 16:11 ` Peter Zijlstra @ 2013-11-01 16:18 ` Peter Zijlstra 2013-11-02 17:49 ` Paul E. McKenney 3 siblings, 1 reply; 74+ messages in thread From: Peter Zijlstra @ 2013-11-01 16:18 UTC (permalink / raw) To: Paul E. McKenney Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote: > The dependency you are talking about is via the "if" statement? > Even C/C++11 is not required to respect control dependencies. > > This one is a bit annoying. The x86 TSO means that you really only > need barrier(), ARM (recent ARM, anyway) and Power could use a weaker > barrier, and so on -- but smp_mb() emits a full barrier. > > Perhaps a new smp_tmb() for TSO semantics, where reads are ordered > before reads, writes before writes, and reads before writes, but not > writes before reads? Another approach would be to define a per-arch > barrier for this particular case. Supposing a sane language where we can rely on control flow; would that change the story? I'm afraid I'm now terminally confused between actual proper memory model issues and fucked compilers. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-11-01 16:18 ` Peter Zijlstra @ 2013-11-02 17:49 ` Paul E. McKenney 0 siblings, 0 replies; 74+ messages in thread From: Paul E. McKenney @ 2013-11-02 17:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Victor Kaplansky, Oleg Nesterov, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, tony.luck On Fri, Nov 01, 2013 at 05:18:19PM +0100, Peter Zijlstra wrote: > On Wed, Oct 30, 2013 at 11:40:15PM -0700, Paul E. McKenney wrote: > > The dependency you are talking about is via the "if" statement? > > Even C/C++11 is not required to respect control dependencies. > > > > This one is a bit annoying. The x86 TSO means that you really only > > need barrier(), ARM (recent ARM, anyway) and Power could use a weaker > > barrier, and so on -- but smp_mb() emits a full barrier. > > > > Perhaps a new smp_tmb() for TSO semantics, where reads are ordered > > before reads, writes before writes, and reads before writes, but not > > writes before reads? Another approach would be to define a per-arch > > barrier for this particular case. > > Supposing a sane language where we can rely on control flow; would that > change the story? > > I'm afraid I'm now terminally confused between actual proper memory > model issues and fucked compilers. Power and ARM won't speculate stores, but they will happily speculate loads. Not sure about Itanium, perhaps Tony knows. And yes, reordering by the compilers and CPUs does sometimes seem a bit intertwined. Thanx, Paul ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-30 9:27 ` Paul E. McKenney 2013-10-30 11:25 ` Peter Zijlstra @ 2013-10-30 13:28 ` Victor Kaplansky 2013-10-30 15:51 ` Peter Zijlstra 2013-10-31 4:32 ` Paul E. McKenney 1 sibling, 2 replies; 74+ messages in thread From: Victor Kaplansky @ 2013-10-30 13:28 UTC (permalink / raw) To: paulmck Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Oleg Nesterov, Peter Zijlstra "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/30/2013 11:27:25 AM: > If you were to back up that insistence with a description of the orderings > you are relying on, why other orderings are not important, and how the > important orderings are enforced, I might be tempted to pay attention > to your opinion. > > Thanx, Paul NP, though, I feel too embarrassed to explain things about memory barriers when one of the authors of Documentation/memory-barriers.txt is on the cc: list ;-) Disclaimer: it is anyway impossible to prove lack of *any* problem. Having said that, let's look into an example in Documentation/circular-buffers.txt: > THE PRODUCER > ------------ > > The producer will look something like this: > > spin_lock(&producer_lock); > > unsigned long head = buffer->head; > unsigned long tail = ACCESS_ONCE(buffer->tail); > > if (CIRC_SPACE(head, tail, buffer->size) >= 1) { > /* insert one item into the buffer */ > struct item *item = buffer[head]; > > produce_item(item); > > smp_wmb(); /* commit the item before incrementing the head */ > > buffer->head = (head + 1) & (buffer->size - 1); > > /* wake_up() will make sure that the head is committed before > * waking anyone up */ > wake_up(consumer); > } > > spin_unlock(&producer_lock); We can see that the authors of the document didn't put any memory barrier after the "buffer->tail" read and before "produce_item(item)", and I think they have a good reason.
Let's consider an imaginary smp_mb() right before "produce_item(item);". Such a barrier would ensure that the memory read on "buffer->tail" is completed before the store to the memory pointed to by "item" is committed. However, the store to the memory pointed to by "item" cannot be completed before the conditional branch implied by the "if ()" is proven to execute the body of the if(), and the latter cannot be proven before the read of "buffer->tail" is completed. Let's look at it another way. Imagine that somehow a store to the data pointed to by "item" is completed before we read "buffer->tail". That would mean the store was completed speculatively. But speculative execution of conditional stores is prohibited by the C/C++ standard; otherwise any conditional store at any random place in the code could pollute shared memory. On the other hand, if the compiler or processor can prove that the condition in the above if() is going to be true (or if the speculative store writes the same value as was there before the write), the speculative store *is* allowed. In that case we should not be bothered by the fact that the memory pointed to by "item" is written before a read from "buffer->tail" is completed. Regards, -- Victor ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-30 13:28 ` Victor Kaplansky @ 2013-10-30 15:51 ` Peter Zijlstra 2013-10-30 18:29 ` Peter Zijlstra 2013-10-31 4:33 ` Paul E. McKenney 1 sibling, 2 replies; 74+ messages in thread From: Peter Zijlstra @ 2013-10-30 15:51 UTC (permalink / raw) To: Victor Kaplansky Cc: paulmck, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Oleg Nesterov On Wed, Oct 30, 2013 at 03:28:54PM +0200, Victor Kaplansky wrote: > one of the authors of Documentation/memory-barriers.txt is on cc: list ;-) > > Disclaimer: it is anyway impossible to prove lack of *any* problem. > > Having said that, lets look into an example in > Documentation/circular-buffers.txt: > > We can see that authors of the document didn't put any memory barrier Note that both documents have the same author list ;-) Anyway, I didn't know about the circular thing, I suppose I should use the CIRC_SPACE() thing :-) ^ permalink raw reply [flat|nested] 74+ messages in thread
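[Editorial note: the helpers Peter refers to live in include/linux/circ_buf.h and are plain mask arithmetic, reproduced here as a sketch; the buffer size must be a power of two, and one slot is kept empty so that head == tail unambiguously means "empty":]

```c
/* Occupied entries in the ring; head and tail are indices into a
 * power-of-two-sized buffer (mirrors include/linux/circ_buf.h). */
#define CIRC_CNT(head, tail, size)   (((head) - (tail)) & ((size) - 1))

/* Free entries, leaving one slot open to distinguish full from empty:
 * the space seen by the producer is the count measured with head
 * advanced by one. */
#define CIRC_SPACE(head, tail, size) CIRC_CNT((tail), ((head) + 1), (size))
```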
* Re: perf events ring buffer memory barrier on powerpc 2013-10-30 15:51 ` Peter Zijlstra @ 2013-10-30 18:29 ` Peter Zijlstra 2013-10-30 19:11 ` Peter Zijlstra 2013-10-31 4:33 ` Paul E. McKenney 1 sibling, 1 reply; 74+ messages in thread From: Peter Zijlstra @ 2013-10-30 18:29 UTC (permalink / raw) To: Victor Kaplansky Cc: paulmck, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Oleg Nesterov On Wed, Oct 30, 2013 at 04:51:16PM +0100, Peter Zijlstra wrote: > On Wed, Oct 30, 2013 at 03:28:54PM +0200, Victor Kaplansky wrote: > > one of the authors of Documentation/memory-barriers.txt is on cc: list ;-) > > > > Disclaimer: it is anyway impossible to prove lack of *any* problem. > > > > Having said that, lets look into an example in > > Documentation/circular-buffers.txt: > > > > > We can see that authors of the document didn't put any memory barrier > > Note that both documents have the same author list ;-) > > Anyway, I didn't know about the circular thing, I suppose I should use > CIRC_SPACE() thing :-) The below removes 80 bytes from ring_buffer.o of which 50 bytes are from perf_output_begin(), it also removes 30 lines of code, so yay! (x86_64 build) And it appears to still work.. although I've not stressed the no-space bits. 
---
 kernel/events/ring_buffer.c | 74 ++++++++++++++------------------------------
 1 file changed, 22 insertions(+), 52 deletions(-)

diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 9c2ddfbf4525..e4a51fa10595 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -12,40 +12,10 @@
 #include <linux/perf_event.h>
 #include <linux/vmalloc.h>
 #include <linux/slab.h>
+#include <linux/circ_buf.h>
 
 #include "internal.h"
 
-static bool perf_output_space(struct ring_buffer *rb, unsigned long tail,
-			      unsigned long offset, unsigned long head)
-{
-	unsigned long sz = perf_data_size(rb);
-	unsigned long mask = sz - 1;
-
-	/*
-	 * check if user-writable
-	 * overwrite : over-write its own tail
-	 * !overwrite: buffer possibly drops events.
-	 */
-	if (rb->overwrite)
-		return true;
-
-	/*
-	 * verify that payload is not bigger than buffer
-	 * otherwise masking logic may fail to detect
-	 * the "not enough space" condition
-	 */
-	if ((head - offset) > sz)
-		return false;
-
-	offset = (offset - tail) & mask;
-	head = (head - tail) & mask;
-
-	if ((int)(head - offset) < 0)
-		return false;
-
-	return true;
-}
-
 static void perf_output_wakeup(struct perf_output_handle *handle)
 {
 	atomic_set(&handle->rb->poll, POLL_IN);
@@ -115,8 +85,7 @@ static void perf_output_put_handle(struct perf_output_handle *handle)
 	rb->user_page->data_head = head;
 
 	/*
-	 * Now check if we missed an update, rely on the (compiler)
-	 * barrier in atomic_dec_and_test() to re-read rb->head.
+	 * Now check if we missed an update.
 	 */
 	if (unlikely(head != local_read(&rb->head))) {
 		local_inc(&rb->nest);
@@ -135,7 +104,7 @@ int perf_output_begin(struct perf_output_handle *handle,
 {
 	struct ring_buffer *rb;
 	unsigned long tail, offset, head;
-	int have_lost;
+	int have_lost, page_shift;
 	struct perf_sample_data sample_data;
 	struct {
 		struct perf_event_header header;
@@ -161,7 +130,7 @@ int perf_output_begin(struct perf_output_handle *handle,
 		goto out;
 
 	have_lost = local_read(&rb->lost);
-	if (have_lost) {
+	if (unlikely(have_lost)) {
 		lost_event.header.size = sizeof(lost_event);
 		perf_event_header__init_id(&lost_event.header, &sample_data,
 					   event);
@@ -171,32 +140,33 @@ int perf_output_begin(struct perf_output_handle *handle,
 	perf_output_get_handle(handle);
 
 	do {
-		/*
-		 * Userspace could choose to issue a mb() before updating the
-		 * tail pointer. So that all reads will be completed before the
-		 * write is issued.
-		 *
-		 * See perf_output_put_handle().
-		 */
 		tail = ACCESS_ONCE(rb->user_page->data_tail);
-		smp_mb();
 		offset = head = local_read(&rb->head);
-		head += size;
-		if (unlikely(!perf_output_space(rb, tail, offset, head)))
+		if (!rb->overwrite &&
+		    unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
 			goto fail;
+		head += size;
 	} while (local_cmpxchg(&rb->head, offset, head) != offset);
 
+	/*
+	 * Userspace SHOULD issue an MB before writing the tail; see
+	 * perf_output_put_handle().
+	 */
+	smp_mb();
+
 	if (head - local_read(&rb->wakeup) > rb->watermark)
 		local_add(rb->watermark, &rb->wakeup);
 
-	handle->page = offset >> (PAGE_SHIFT + page_order(rb));
-	handle->page &= rb->nr_pages - 1;
-	handle->size = offset & ((PAGE_SIZE << page_order(rb)) - 1);
-	handle->addr = rb->data_pages[handle->page];
-	handle->addr += handle->size;
-	handle->size = (PAGE_SIZE << page_order(rb)) - handle->size;
+	page_shift = PAGE_SHIFT + page_order(rb);
+
+	handle->page = (offset >> page_shift) & (rb->nr_pages - 1);
+
+	offset &= page_shift - 1;
+
+	handle->addr = rb->data_pages[handle->page] + offset;
+	handle->size = (1 << page_shift) - offset;
 
-	if (have_lost) {
+	if (unlikely(have_lost)) {
 		lost_event.header.type = PERF_RECORD_LOST;
 		lost_event.header.misc = 0;
 		lost_event.id = event->id;

^ permalink raw reply related [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-30 18:29 ` Peter Zijlstra @ 2013-10-30 19:11 ` Peter Zijlstra 0 siblings, 0 replies; 74+ messages in thread From: Peter Zijlstra @ 2013-10-30 19:11 UTC (permalink / raw) To: Victor Kaplansky Cc: paulmck, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Oleg Nesterov On Wed, Oct 30, 2013 at 07:29:30PM +0100, Peter Zijlstra wrote: > + page_shift = PAGE_SHIFT + page_order(rb); > + > + handle->page = (offset >> page_shift) & (rb->nr_pages - 1); > + > + offset &= page_shift - 1; offset &= (1UL << page_shift) - 1; Weird that it even appeared to work.. /me wonders if he even booted the right kernel. > + > + handle->addr = rb->data_pages[handle->page] + offset; > + handle->size = (1 << page_shift) - offset; ^ permalink raw reply [flat|nested] 74+ messages in thread
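[Editorial note: the slip Peter catches above is easy to make because page_shift is a shift count: the in-page mask must be (1UL << page_shift) - 1, whereas page_shift - 1 is just the number 11 for 4 KiB pages. A small model of the corrected index/offset split; function names are made up for the sketch:]

```c
/* Split a ring-buffer offset into a data-page index and an offset
 * within that page; nr_pages must be a power of two. */
static unsigned long page_index(unsigned long offset, unsigned int page_shift,
				unsigned long nr_pages)
{
	return (offset >> page_shift) & (nr_pages - 1);
}

static unsigned long page_offset(unsigned long offset, unsigned int page_shift)
{
	return offset & ((1UL << page_shift) - 1);   /* the corrected mask */
}
```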
* Re: perf events ring buffer memory barrier on powerpc 2013-10-30 15:51 ` Peter Zijlstra 2013-10-30 18:29 ` Peter Zijlstra @ 2013-10-31 4:33 ` Paul E. McKenney 1 sibling, 0 replies; 74+ messages in thread From: Paul E. McKenney @ 2013-10-31 4:33 UTC (permalink / raw) To: Peter Zijlstra Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Oleg Nesterov On Wed, Oct 30, 2013 at 04:51:16PM +0100, Peter Zijlstra wrote: > On Wed, Oct 30, 2013 at 03:28:54PM +0200, Victor Kaplansky wrote: > > one of the authors of Documentation/memory-barriers.txt is on cc: list ;-) > > > > Disclaimer: it is anyway impossible to prove lack of *any* problem. > > > > Having said that, lets look into an example in > > Documentation/circular-buffers.txt: > > > > > We can see that authors of the document didn't put any memory barrier > > Note that both documents have the same author list ;-) > > Anyway, I didn't know about the circular thing, I suppose I should use > CIRC_SPACE() thing :-) Interesting that we didn't seem to supply a definition... ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-30 13:28 ` Victor Kaplansky 2013-10-30 15:51 ` Peter Zijlstra @ 2013-10-31 4:32 ` Paul E. McKenney 2013-10-31 9:04 ` Peter Zijlstra 2013-10-31 9:59 ` Victor Kaplansky 1 sibling, 2 replies; 74+ messages in thread From: Paul E. McKenney @ 2013-10-31 4:32 UTC (permalink / raw) To: Victor Kaplansky Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Oleg Nesterov, Peter Zijlstra On Wed, Oct 30, 2013 at 03:28:54PM +0200, Victor Kaplansky wrote: > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/30/2013 > 11:27:25 AM: > > > If you were to back up that insistence with a description of the > orderings > > you are relying on, why other orderings are not important, and how the > > important orderings are enforced, I might be tempted to pay attention > > to your opinion. > > > > Thanx, Paul > > NP, though, I feel too embarrassed to explain things about memory barriers > when > one of the authors of Documentation/memory-barriers.txt is on cc: list ;-) > > Disclaimer: it is anyway impossible to prove lack of *any* problem. If you want to play the "omit memory barriers" game, then proving a negative is in fact the task before you. > Having said that, lets look into an example in > Documentation/circular-buffers.txt: And the correctness of this code has been called into question. :-( An embarrassingly long time ago -- I need to get this either proven or fixed. 
> > THE PRODUCER > > ------------ > > > > The producer will look something like this: > > > > spin_lock(&producer_lock); > > > > unsigned long head = buffer->head; > > unsigned long tail = ACCESS_ONCE(buffer->tail); > > > > if (CIRC_SPACE(head, tail, buffer->size) >= 1) { > > /* insert one item into the buffer */ > > struct item *item = buffer[head]; > > > > produce_item(item); > > > > smp_wmb(); /* commit the item before incrementing the head */ > > > > buffer->head = (head + 1) & (buffer->size - 1); > > > > /* wake_up() will make sure that the head is committed before > > * waking anyone up */ > > wake_up(consumer); > > } > > > > spin_unlock(&producer_lock); > > We can see that authors of the document didn't put any memory barrier > after "buffer->tail" read and before "produce_item(item)" and I think they > have a good reason. > > Lets consider an imaginary smp_mb() right before "produce_item(item);". > Such a barrier will ensure that - > > - the memory read on "buffer->tail" is completed > before store to memory pointed by "item" is committed. > > However, the store to "buffer->tail" anyway cannot be completed before > conditional branch implied by "if ()" is proven to execute body statement of the if(). > And the latter cannot be proven before read of "buffer->tail" is completed. > > Lets see this other way. Lets imagine that somehow a store to the data > pointed by "item" is completed before we read "buffer->tail". That would mean, that the store > was completed speculatively. But speculative execution of conditional stores is > prohibited by C/C++ standard, otherwise any conditional store at any random place of code could pollute > shared memory. Before C/C++11, the closest thing to such a prohibition is use of volatile, for example, ACCESS_ONCE(). Even in C/C++11, you have to use atomics to get anything resembling this prohibition.
If you just use normal variables, the compiler is within its rights to transform something like the following: if (a) b = 1; else b = 42; Into: b = 42; if (a) b = 1; Many other similar transformations are permitted. Some are used to allow vector instructions to be used -- the compiler can do a write with an overly wide vector instruction, then clean up the clobbered variables later, if it wishes. Again, if the variables are not marked volatile, or, in C/C++11, atomic. > On the other hand, if compiler or processor can prove that condition in > above if() is going > to be true (or if speculative store writes the same value as it was before > write), the > speculative store *is* allowed. In this case we should not be bothered by > the fact that > memory pointed by "item" is written before a read from "buffer->tail" is > completed. The compilers don't always know as much as they might about the underlying hardware's memory model. Of course, if this code is architecture specific, it can avoid DEC Alpha's fun and games, which could also violate your assumptions in the above paragraph: http://www.openvms.compaq.com/wizard/wiz_2637.html Anyway, proving or fixing the code in Documentation/circular-buffers.txt has been on my list for too long, so I will take a closer look at it. Thanx, Paul ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-31 4:32 ` Paul E. McKenney @ 2013-10-31 9:04 ` Peter Zijlstra 2013-10-31 15:07 ` Paul E. McKenney 2013-10-31 9:59 ` Victor Kaplansky 1 sibling, 1 reply; 74+ messages in thread From: Peter Zijlstra @ 2013-10-31 9:04 UTC (permalink / raw) To: Paul E. McKenney Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Oleg Nesterov On Wed, Oct 30, 2013 at 09:32:58PM -0700, Paul E. McKenney wrote: > Before C/C++11, the closest thing to such a prohibition is use of > volatile, for example, ACCESS_ONCE(). Even in C/C++11, you have to > use atomics to get anything resembing this prohibition. > > If you just use normal variables, the compiler is within its rights > to transform something like the following: > > if (a) > b = 1; > else > b = 42; > > Into: > > b = 42; > if (a) > b = 1; > > Many other similar transformations are permitted. Some are used to all > vector instructions to be used -- the compiler can do a write with an > overly wide vector instruction, then clean up the clobbered variables > later, if it wishes. Again, if the variables are not marked volatile, > or, in C/C++11, atomic. While I've heard you tell this story before, my mind keeps boggling how we've been able to use shared memory at all, all these years. It seems to me stuff should have broken left, right and center if compilers were really aggressive about this. ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-31 9:04 ` Peter Zijlstra @ 2013-10-31 15:07 ` Paul E. McKenney 2013-10-31 15:19 ` Peter Zijlstra 0 siblings, 1 reply; 74+ messages in thread From: Paul E. McKenney @ 2013-10-31 15:07 UTC (permalink / raw) To: Peter Zijlstra Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Oleg Nesterov On Thu, Oct 31, 2013 at 10:04:57AM +0100, Peter Zijlstra wrote: > On Wed, Oct 30, 2013 at 09:32:58PM -0700, Paul E. McKenney wrote: > > Before C/C++11, the closest thing to such a prohibition is use of > > volatile, for example, ACCESS_ONCE(). Even in C/C++11, you have to > > use atomics to get anything resembing this prohibition. > > > > If you just use normal variables, the compiler is within its rights > > to transform something like the following: > > > > if (a) > > b = 1; > > else > > b = 42; > > > > Into: > > > > b = 42; > > if (a) > > b = 1; > > > > Many other similar transformations are permitted. Some are used to all > > vector instructions to be used -- the compiler can do a write with an > > overly wide vector instruction, then clean up the clobbered variables > > later, if it wishes. Again, if the variables are not marked volatile, > > or, in C/C++11, atomic. > > While I've heard you tell this story before, my mind keeps boggling how > we've been able to use shared memory at all, all these years. > > It seems to me stuff should have broken left, right and center if > compilers were really aggressive about this. Sometimes having stupid compilers is a good thing. But they really are getting more aggressive. Thanx, Paul ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-31 15:07 ` Paul E. McKenney @ 2013-10-31 15:19 ` Peter Zijlstra 2013-11-01 9:28 ` Paul E. McKenney 0 siblings, 1 reply; 74+ messages in thread From: Peter Zijlstra @ 2013-10-31 15:19 UTC (permalink / raw) To: Paul E. McKenney Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Oleg Nesterov On Thu, Oct 31, 2013 at 08:07:56AM -0700, Paul E. McKenney wrote: > On Thu, Oct 31, 2013 at 10:04:57AM +0100, Peter Zijlstra wrote: > > On Wed, Oct 30, 2013 at 09:32:58PM -0700, Paul E. McKenney wrote: > > > Before C/C++11, the closest thing to such a prohibition is use of > > > volatile, for example, ACCESS_ONCE(). Even in C/C++11, you have to > > > use atomics to get anything resembing this prohibition. > > > > > > If you just use normal variables, the compiler is within its rights > > > to transform something like the following: > > > > > > if (a) > > > b = 1; > > > else > > > b = 42; > > > > > > Into: > > > > > > b = 42; > > > if (a) > > > b = 1; > > > > > > Many other similar transformations are permitted. Some are used to all > > > vector instructions to be used -- the compiler can do a write with an > > > overly wide vector instruction, then clean up the clobbered variables > > > later, if it wishes. Again, if the variables are not marked volatile, > > > or, in C/C++11, atomic. > > > > While I've heard you tell this story before, my mind keeps boggling how > > we've been able to use shared memory at all, all these years. > > > > It seems to me stuff should have broken left, right and center if > > compilers were really aggressive about this. > > Sometimes having stupid compilers is a good thing. But they really are > getting more aggressive. But surely we cannot go mark all data structures lodged in shared memory as volatile, that's insane. I'm sure you're quite worried about this as well. 
Suppose we have:

struct foo {
	unsigned long value;
	void *ptr;
	unsigned long value1;
};

And our ptr member is RCU managed. Then while the assignment using
rcu_assign_pointer() will use the volatile cast, what stops the compiler
from wrecking ptr while writing either of the value* members and
'fixing' her up after?

This is a completely untenable position.

How do the C/C++ people propose to deal with this?

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-31 15:19 ` Peter Zijlstra @ 2013-11-01 9:28 ` Paul E. McKenney 2013-11-01 10:30 ` Peter Zijlstra 0 siblings, 1 reply; 74+ messages in thread From: Paul E. McKenney @ 2013-11-01 9:28 UTC (permalink / raw) To: Peter Zijlstra Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Oleg Nesterov On Thu, Oct 31, 2013 at 04:19:55PM +0100, Peter Zijlstra wrote: > On Thu, Oct 31, 2013 at 08:07:56AM -0700, Paul E. McKenney wrote: > > On Thu, Oct 31, 2013 at 10:04:57AM +0100, Peter Zijlstra wrote: > > > On Wed, Oct 30, 2013 at 09:32:58PM -0700, Paul E. McKenney wrote: > > > > Before C/C++11, the closest thing to such a prohibition is use of > > > > volatile, for example, ACCESS_ONCE(). Even in C/C++11, you have to > > > > use atomics to get anything resembing this prohibition. > > > > > > > > If you just use normal variables, the compiler is within its rights > > > > to transform something like the following: > > > > > > > > if (a) > > > > b = 1; > > > > else > > > > b = 42; > > > > > > > > Into: > > > > > > > > b = 42; > > > > if (a) > > > > b = 1; > > > > > > > > Many other similar transformations are permitted. Some are used to all > > > > vector instructions to be used -- the compiler can do a write with an > > > > overly wide vector instruction, then clean up the clobbered variables > > > > later, if it wishes. Again, if the variables are not marked volatile, > > > > or, in C/C++11, atomic. > > > > > > While I've heard you tell this story before, my mind keeps boggling how > > > we've been able to use shared memory at all, all these years. > > > > > > It seems to me stuff should have broken left, right and center if > > > compilers were really aggressive about this. > > > > Sometimes having stupid compilers is a good thing. But they really are > > getting more aggressive. 
> > But surely we cannot go mark all data structures lodged in shared memory > as volatile, that's insane. > > I'm sure you're quite worried about this as well. Suppose we have: > > struct foo { > unsigned long value; > void *ptr; > unsigned long value1; > }; > > And our ptr member is RCU managed. Then while the assignment using: > rcu_assign_ptr() will use the volatile cast, what stops the compiler > from wrecking ptr while writing either of the value* members and > 'fixing' her up after? Nothing at all! We can reduce the probability by putting the pointer at one end or the other, so that if the compiler uses (say) vector instructions to aggregate individual assignments to the other fields, it will be less likely to hit "ptr". But yes, this is ugly and it would be really hard to get all this right, and would often conflict with cache-locality needs. > This is a completely untenable position. Indeed it is! C/C++ never was intended to be used for parallel programming, and this is but one of the problems that can arise when we nevertheless use it for parallel programming. As compilers get smarter (for some definition of "smarter") and as more systems have special-purpose hardware (such as vector units) that are visible to the compiler, we can expect more of this kind of trouble. This was one of many reasons that I decided to help with the C/C++11 effort, whatever anyone might think about the results. > How do the C/C++ people propose to deal with this? By marking "ptr" as atomic, thus telling the compiler not to mess with it. And thus requiring that all accesses to it be decorated, which in the case of RCU could be buried in the RCU accessors. Thanx, Paul ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc
  2013-11-01  9:28 ` Paul E. McKenney
@ 2013-11-01 10:30   ` Peter Zijlstra
  2013-11-02 15:20     ` Paul E. McKenney
  0 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2013-11-01 10:30 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Oleg Nesterov

On Fri, Nov 01, 2013 at 02:28:14AM -0700, Paul E. McKenney wrote:
> > This is a completely untenable position.
>
> Indeed it is!
>
> C/C++ never was intended to be used for parallel programming,

And yet pretty much all kernels ever written for SMP systems are written
in it; what drugs are those people smoking?

Furthermore there's a gazillion parallel userspace programs.

> and this is
> but one of the problems that can arise when we nevertheless use it for
> parallel programming.  As compilers get smarter (for some definition of
> "smarter") and as more systems have special-purpose hardware (such as
> vector units) that are visible to the compiler, we can expect more of
> this kind of trouble.
>
> This was one of many reasons that I decided to help with the C/C++11
> effort, whatever anyone might think about the results.

Well, I applaud your efforts, but given the results I think the C/C++
people are all completely insane.

> > How do the C/C++ people propose to deal with this?
>
> By marking "ptr" as atomic, thus telling the compiler not to mess with it.
> And thus requiring that all accesses to it be decorated, which in the
> case of RCU could be buried in the RCU accessors.

This seems contradictory; marking it atomic would look like:

struct foo {
	unsigned long value;
	__atomic void *ptr;
	unsigned long value1;
};

Clearly we cannot hide this definition in accessors, because then
accesses to value* won't see the annotation.
That said, mandating that we mark all 'shared' data with __atomic is
completely untenable and is not backwards compatible.

To be safe we must assume all data is shared unless indicated otherwise.

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-11-01 10:30 ` Peter Zijlstra @ 2013-11-02 15:20 ` Paul E. McKenney 2013-11-04 9:07 ` Peter Zijlstra 0 siblings, 1 reply; 74+ messages in thread From: Paul E. McKenney @ 2013-11-02 15:20 UTC (permalink / raw) To: Peter Zijlstra Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Oleg Nesterov On Fri, Nov 01, 2013 at 11:30:17AM +0100, Peter Zijlstra wrote: > On Fri, Nov 01, 2013 at 02:28:14AM -0700, Paul E. McKenney wrote: > > > This is a completely untenable position. > > > > Indeed it is! > > > > C/C++ never was intended to be used for parallel programming, > > And yet pretty much all kernels ever written for SMP systems are written > in it; what drugs are those people smoking? There was a time when I wished that the C/C++ standards people had added concurrency to the language 30 years ago, but I eventually realized that any attempt at that time would have been totally broken. > Furthermore there's a gazillion parallel userspace programs. Most of which have very unaggressive concurrency designs. > > and this is > > but one of the problems that can arise when we nevertheless use it for > > parallel programming. As compilers get smarter (for some definition of > > "smarter") and as more systems have special-purpose hardware (such as > > vector units) that are visible to the compiler, we can expect more of > > this kind of trouble. > > > > This was one of many reasons that I decided to help with the C/C++11 > > effort, whatever anyone might think about the results. > > Well, I applaud your efforts, but given the results I think the C/C++ > people are all completely insane. If it makes you feel any better, they have the same opinion of all of us who use C/C++ for concurrency given that the standard provides no guarantee. > > > How do the C/C++ people propose to deal with this? 
> > > > By marking "ptr" as atomic, thus telling the compiler not to mess with it. > > And thus requiring that all accesses to it be decorated, which in the > > case of RCU could be buried in the RCU accessors. > > This seems contradictory; marking it atomic would look like: > > struct foo { > unsigned long value; > __atomic void *ptr; > unsigned long value1; > }; > > Clearly we cannot hide this definition in accessors, because then > accesses to value* won't see the annotation. #define __rcu __atomic Though there are probably placement restrictions for __atomic that current use of __rcu doesn't pay attention to. > That said; mandating we mark all 'shared' data with __atomic is > completely untenable and is not backwards compatible. > > To be safe we must assume all data shared unless indicated otherwise. Something similar to the compiler directives forcing twos-complement interpretation of signed overflow could be attractive. Not sure what it would do to code generation, though. Thanx, Paul ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc
  2013-11-02 15:20 ` Paul E. McKenney
@ 2013-11-04  9:07   ` Peter Zijlstra
  2013-11-04 10:00     ` Paul E. McKenney
  0 siblings, 1 reply; 74+ messages in thread
From: Peter Zijlstra @ 2013-11-04 9:07 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt,
	Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Oleg Nesterov

On Sat, Nov 02, 2013 at 08:20:48AM -0700, Paul E. McKenney wrote:
> On Fri, Nov 01, 2013 at 11:30:17AM +0100, Peter Zijlstra wrote:
> > Furthermore there's a gazillion parallel userspace programs.
>
> Most of which have very unaggressive concurrency designs.

pthread_mutex_t A, B;

char data_A[x];
int counter_B = 1;

void funA(void)
{
	pthread_mutex_lock(&A);
	memset(data_A, 0, sizeof(data_A));
	pthread_mutex_unlock(&A);
}

void funB(void)
{
	pthread_mutex_lock(&B);
	counter_B++;
	pthread_mutex_unlock(&B);
}

void funC(void)
{
	pthread_mutex_lock(&B);
	printf("%d\n", counter_B);
	pthread_mutex_unlock(&B);
}

Then run funA, funB, funC concurrently, and end with a funC.

Then explain to the user that his unaggressive program can print:
	0
	1

Because the memset() thought it might be a cute idea to overwrite
counter_B and fix it up 'later'. Which, if I understood you right, is
valid in C/C++ :-(

Not that any actual memset() implementation exhibiting this trait would
survive; it would be shot on the spot.

> > > By marking "ptr" as atomic, thus telling the compiler not to mess with it.
> > > And thus requiring that all accesses to it be decorated, which in the
> > > case of RCU could be buried in the RCU accessors.
> >
> > This seems contradictory; marking it atomic would look like:
> >
> > struct foo {
> >	unsigned long value;
> >	__atomic void *ptr;
> >	unsigned long value1;
> > };
> >
> > Clearly we cannot hide this definition in accessors, because then
> > accesses to value* won't see the annotation.
>
> #define __rcu __atomic

Yeah, except we don't use __rcu all that consistently; in fact I don't
know if I ever added it.

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-11-04 9:07 ` Peter Zijlstra @ 2013-11-04 10:00 ` Paul E. McKenney 0 siblings, 0 replies; 74+ messages in thread From: Paul E. McKenney @ 2013-11-04 10:00 UTC (permalink / raw) To: Peter Zijlstra Cc: Victor Kaplansky, Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Oleg Nesterov On Mon, Nov 04, 2013 at 10:07:44AM +0100, Peter Zijlstra wrote: > On Sat, Nov 02, 2013 at 08:20:48AM -0700, Paul E. McKenney wrote: > > On Fri, Nov 01, 2013 at 11:30:17AM +0100, Peter Zijlstra wrote: > > > Furthermore there's a gazillion parallel userspace programs. > > > > Most of which have very unaggressive concurrency designs. > > pthread_mutex_t A, B; > > char data_A[x]; > int counter_B = 1; > > void funA(void) > { > pthread_mutex_lock(&A); > memset(data_A, 0, sizeof(data_A)); > pthread_mutex_unlock(&A); > } > > void funB(void) > { > pthread_mutex_lock(&B); > counter_B++; > pthread_mutex_unlock(&B); > } > > void funC(void) > { > pthread_mutex_lock(&B) > printf("%d\n", counter_B); > pthread_mutex_unlock(&B); > } > > Then run: funA, funB, funC concurrently, and end with a funC. > > Then explain to userman than his unaggressive program can return: > 0 > 1 > > Because the memset() thought it might be a cute idea to overwrite > counter_B and fix it up 'later'. Which if I understood you right is > valid in C/C++ :-( > > Not that any actual memset implementation exhibiting this trait wouldn't > be shot on the spot. Even without such a malicious memcpy() implementation I must still explain about false sharing when the developer notices that the unaggressive program isn't running as fast as expected. > > > > By marking "ptr" as atomic, thus telling the compiler not to mess with it. > > > > And thus requiring that all accesses to it be decorated, which in the > > > > case of RCU could be buried in the RCU accessors. 
> > > > > > This seems contradictory; marking it atomic would look like: > > > > > > struct foo { > > > unsigned long value; > > > __atomic void *ptr; > > > unsigned long value1; > > > }; > > > > > > Clearly we cannot hide this definition in accessors, because then > > > accesses to value* won't see the annotation. > > > > #define __rcu __atomic > > Yeah, except we don't use __rcu all that consistently; in fact I don't > know if I ever added it. There are more than 300 of them in the kernel. Plus sparse can be convinced to yell at you if you don't use them. So lack of __rcu could be fixed without too much trouble. The C/C++11 need to annotate functions that take arguments or return values taken from rcu_dereference() is another story. But the compilers have to get significantly more aggressive or developers have to be doing unusual things that result in rcu_dereference() returning something whose value the compiler can predict exactly. Thanx, Paul ^ permalink raw reply [flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc
  2013-10-31  4:32 ` Paul E. McKenney
  2013-10-31  9:04   ` Peter Zijlstra
@ 2013-10-31  9:59   ` Victor Kaplansky
  2013-10-31 12:28     ` David Laight
  2013-10-31 15:25     ` Paul E. McKenney
  1 sibling, 2 replies; 74+ messages in thread
From: Victor Kaplansky @ 2013-10-31 9:59 UTC (permalink / raw)
To: paulmck
Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Peter Zijlstra

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/31/2013
06:32:58 AM:

> If you want to play the "omit memory barriers" game, then proving a
> negative is in fact the task before you.

Generally it is not fair. Otherwise, anyone could put an smp_mb() at a
random place and expect others to "prove" that it is not needed.

It is not fair also because it should be virtually impossible to prove
the lack of any problem. OTOH, if a problem exists, it should be easy for
proponents of a memory barrier to build a test case or design a scenario
demonstrating the problem.

Actually, advocates of the memory barrier in our case do have an argument
-- the rule of thumb saying that barriers should be paired. I consider
this rule only as a general recommendation to look into potentially risky
places. And indeed, in our case, if the store to the circular buffer
wasn't conditional, it would require a memory barrier to prevent the
store from being performed before the read of @tail. But in our case the
store is conditional, so no memory barrier is required.

> And the correctness of this code has been called into question.  :-(
> An embarrassingly long time ago -- I need to get this either proven
> or fixed.

I agree.

> Before C/C++11, the closest thing to such a prohibition is use of
> volatile, for example, ACCESS_ONCE().  Even in C/C++11, you have to
> use atomics to get anything resembling this prohibition.
> > If you just use normal variables, the compiler is within its rights > to transform something like the following: > > if (a) > b = 1; > else > b = 42; > > Into: > > b = 42; > if (a) > b = 1; > > Many other similar transformations are permitted. Some are used to all > vector instructions to be used -- the compiler can do a write with an > overly wide vector instruction, then clean up the clobbered variables > later, if it wishes. Again, if the variables are not marked volatile, > or, in C/C++11, atomic. All this can justify only compiler barrier() which is almost free from performance point of view, since current gcc anyway doesn't perform store hoisting optimization in our case. (And I'm not getting into philosophical discussion whether kernel code should consider future possible bugs/features in gcc or C/C++11 standard). > The compilers don't always know as much as they might about the underlying > hardware's memory model. That's correct in general. But can you point out a problem that really exists? > Of course, if this code is architecture specific, > it can avoid DEC Alpha's fun and games, which could also violate your > assumptions in the above paragraph: > > http://www.openvms.compaq.com/wizard/wiz_2637.html Are you talking about this paragraph from above link: "For instance, your producer must issue a "memory barrier" instruction after writing the data to shared memory and before inserting it on the queue; likewise, your consumer must issue a memory barrier instruction after removing an item from the queue and before reading from its memory. Otherwise, you risk seeing stale data, since, while the Alpha processor does provide coherent memory, it does not provide implicit ordering of reads and writes. (That is, the write of the producer's data might reach memory after the write of the queue, such that the consumer might read the new item from the queue but get the previous values from the item's memory." 
If yes, I don't think it explains the need of memory barrier on Alpha in our case (we all agree about the need of smp_wmb() right before @head update by producer). If not, could you please point to specific paragraph? > > Anyway, proving or fixing the code in Documentation/circular-buffers.txt > has been on my list for too long, so I will take a closer look at it. Thanks! I'm concerned more about performance overhead imposed by the full memory barrier in kfifo circular buffers. Even if it is needed on Alpha (I don't understand why) we could try to solve this with some memory barrier which is effective only on architectures which really need it. Regards, -- Victor ^ permalink raw reply [flat|nested] 74+ messages in thread
* RE: perf events ring buffer memory barrier on powerpc 2013-10-31 9:59 ` Victor Kaplansky @ 2013-10-31 12:28 ` David Laight 2013-10-31 12:55 ` Victor Kaplansky 2013-10-31 15:25 ` Paul E. McKenney 1 sibling, 1 reply; 74+ messages in thread From: David Laight @ 2013-10-31 12:28 UTC (permalink / raw) To: Victor Kaplansky, paulmck Cc: Michael Neuling, Mathieu Desnoyers, Peter Zijlstra, LKML, Oleg Nesterov, Linux PPC dev, Anton Blanchard, Frederic Weisbecker > "For instance, your producer must issue a "memory barrier" instruction > after writing the data to shared memory and before inserting it on > the queue; likewise, your consumer must issue a memory barrier > instruction after removing an item from the queue and before reading > from its memory. Otherwise, you risk seeing stale data, since, while > the Alpha processor does provide coherent memory, it does not provide > implicit ordering of reads and writes. (That is, the write of the > producer's data might reach memory after the write of the queue, such > that the consumer might read the new item from the queue but get the > previous values from the item's memory." > > If yes, I don't think it explains the need of memory barrier on Alpha > in our case (we all agree about the need of smp_wmb() right before @head > update by producer). If not, could you please point to specific paragraph? My understanding is that the extra read barrier the alpha needs isn't to control the order the cpu performs the memory cycles in, but rather to wait while the cache system performs all outstanding operations. So even though the wmb() in the writer ensures the writes are correctly ordered, the reader can read the old value from the second location from its local cache. David ^ permalink raw reply [flat|nested] 74+ messages in thread
* RE: perf events ring buffer memory barrier on powerpc
  2013-10-31 12:28 ` David Laight
@ 2013-10-31 12:55   ` Victor Kaplansky
  0 siblings, 0 replies; 74+ messages in thread
From: Victor Kaplansky @ 2013-10-31 12:55 UTC (permalink / raw)
To: David Laight
Cc: Anton Blanchard, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Neuling, Oleg Nesterov, paulmck,
	Peter Zijlstra

"David Laight" <David.Laight@aculab.com> wrote on 10/31/2013 02:28:56 PM:

> So even though the wmb() in the writer ensures the writes are correctly
> ordered, the reader can read the old value from the second location from
> its local cache.

In the case of a circular buffer, the only thing the producer reads is
@tail, and nothing wrong will happen if the producer reads an old value
of @tail. Moreover, adherents of smp_mb() insert it *after* the read of
@tail, so it cannot prevent reading of an old value anyway.
--
	Victor

^ permalink raw reply	[flat|nested] 74+ messages in thread
* Re: perf events ring buffer memory barrier on powerpc 2013-10-31 9:59 ` Victor Kaplansky 2013-10-31 12:28 ` David Laight @ 2013-10-31 15:25 ` Paul E. McKenney 2013-11-01 16:06 ` Victor Kaplansky 1 sibling, 1 reply; 74+ messages in thread From: Paul E. McKenney @ 2013-10-31 15:25 UTC (permalink / raw) To: Victor Kaplansky Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker, LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman, Michael Neuling, Oleg Nesterov, Peter Zijlstra On Thu, Oct 31, 2013 at 11:59:21AM +0200, Victor Kaplansky wrote: > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/31/2013 > 06:32:58 AM: > > > If you want to play the "omit memory barriers" game, then proving a > > negative is in fact the task before you. > > Generally it is not fair. Otherwise, anyone could put an smp_mb() at a > random place, and expect others to "prove" that it is not needed. > > It is not fair also because it should be virtually impossible to prove lack > of any problem. OTH, if a problem exists, it should be easy for proponents > of a memory barrier to build a test case or design a scenario demonstrating > the problem. I really don't care about "fair" -- I care instead about the kernel working reliably. And it should also be easy for proponents of removing memory barriers to clearly articulate what orderings their code does and does not need. > Actually, advocates of the memory barrier in our case do have an argument - > - the rule of thumb saying that barriers should be paired. I consider this > rule only as a general recommendation to look into potentially risky > places. > And indeed, in our case if the store to circular wasn't conditional, it > would require a memory barrier to prevent the store to be performed before > the read of @tail. But in our case the store is conditional, so no memory > barrier is required. You are assuming control dependencies that the C language does not provide. 
Now, for all I know right now, there might well be some other reason why a full barrier is not required, but the "if" statement cannot be that reason. Please review section 1.10 of the C++11 standard (or the corresponding section of the C11 standard, if you prefer). The point is that the C/C++11 covers only data dependencies, not control dependencies. > > And the correctness of this code has been called into question. :-( > > An embarrassingly long time ago -- I need to get this either proven > > or fixed. > > I agree. Glad we agree on something! > > Before C/C++11, the closest thing to such a prohibition is use of > > volatile, for example, ACCESS_ONCE(). Even in C/C++11, you have to > > use atomics to get anything resembing this prohibition. > > > > If you just use normal variables, the compiler is within its rights > > to transform something like the following: > > > > if (a) > > b = 1; > > else > > b = 42; > > > > Into: > > > > b = 42; > > if (a) > > b = 1; > > > > Many other similar transformations are permitted. Some are used to all > > vector instructions to be used -- the compiler can do a write with an > > overly wide vector instruction, then clean up the clobbered variables > > later, if it wishes. Again, if the variables are not marked volatile, > > or, in C/C++11, atomic. > > All this can justify only compiler barrier() which is almost free from > performance point of view, since current gcc anyway doesn't perform store > hoisting optimization in our case. If the above example doesn't get you to give up your incorrect assumption about "if" statements having much effect on ordering, you need more help than I can give you just now. > (And I'm not getting into philosophical discussion whether kernel code > should consider future possible bugs/features in gcc or C/C++11 > standard). Should you wish to get into that discussion in the near future, you will need to find someone else to discuss it with. 
> > The compilers don't always know as much as they might about the > underlying > > hardware's memory model. > > That's correct in general. But can you point out a problem that really > exists? We will see. In the meantime, can you summarize the ordering requirements of your code? > > Of course, if this code is architecture specific, > > it can avoid DEC Alpha's fun and games, which could also violate your > > assumptions in the above paragraph: > > > > http://www.openvms.compaq.com/wizard/wiz_2637.html > > Are you talking about this paragraph from above link: > > "For instance, your producer must issue a "memory barrier" instruction > after writing the data to shared memory and before inserting it on > the queue; likewise, your consumer must issue a memory barrier > instruction after removing an item from the queue and before reading > from its memory. Otherwise, you risk seeing stale data, since, while > the Alpha processor does provide coherent memory, it does not provide > implicit ordering of reads and writes. (That is, the write of the > producer's data might reach memory after the write of the queue, such > that the consumer might read the new item from the queue but get the > previous values from the item's memory." > > If yes, I don't think it explains the need of memory barrier on Alpha > in our case (we all agree about the need of smp_wmb() right before @head > update by producer). If not, could you please point to specific paragraph? Did you miss the following passage in the paragraph you quoted? "... likewise, your consumer must issue a memory barrier instruction after removing an item from the queue and before reading from its memory." That is why DEC Alpha readers need a read-side memory barrier -- it says so right there. And as either you or Peter noted earlier in this thread, this barrier can be supplied by smp_read_barrier_depends(). I can sympathize if you are having trouble believing this. 
After all, it took the DEC Alpha architects a full hour to convince me,
and that was in a face-to-face meeting instead of over email. (Just for
the record, it took me even longer to convince them that their earlier
documentation did not clearly indicate the need for these read-side
barriers.) But regardless of whether or not I sympathize, DEC Alpha is
what it is.

> > Anyway, proving or fixing the code in
> > Documentation/circular-buffers.txt has been on my list for too long,
> > so I will take a closer look at it.
>
> Thanks!
>
> I'm concerned more about performance overhead imposed by the full
> memory barrier in kfifo circular buffers. Even if it is needed on
> Alpha (I don't understand why) we could try to solve this with some
> memory barrier which is effective only on architectures which really
> need it.

By exactly how much does the memory barrier slow your code down on some
example system? (Yes, I can believe that it is a problem, but is it
really a problem in your exact situation?)

							Thanx, Paul
* Re: perf events ring buffer memory barrier on powerpc
  2013-10-31 15:25       ` Paul E. McKenney
@ 2013-11-01 16:06         ` Victor Kaplansky
  2013-11-01 16:25           ` David Laight
  2013-11-02 15:46           ` Paul E. McKenney
  0 siblings, 2 replies; 74+ messages in thread
From: Victor Kaplansky @ 2013-11-01 16:06 UTC (permalink / raw)
To: paulmck
Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Peter Zijlstra

"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/31/2013 05:25:43 PM:

> I really don't care about "fair" -- I care instead about the kernel
> working reliably.

Though I don't see how putting in a memory barrier without a deep
understanding of why it is needed helps kernel reliability, I do agree
that reliability is more important than performance.

> And it should also be easy for proponents of removing memory barriers
> to clearly articulate what orderings their code does and does not
> need.

I intentionally took a simplified example of a circular buffer from
Documentation/circular-buffers.txt. I think both sides agree about the
memory ordering requirements in the example. At least I didn't see
anyone argue about them.

> You are assuming control dependencies that the C language does not
> provide. Now, for all I know right now, there might well be some other
> reason why a full barrier is not required, but the "if" statement
> cannot be that reason.
>
> Please review section 1.10 of the C++11 standard (or the corresponding
> section of the C11 standard, if you prefer). The point is that the
> C/C++11 covers only data dependencies, not control dependencies.

I feel you made a wrong assumption about my expertise in compilers. I
don't need to reread section 1.10 of the C++11 standard, because I do
agree that potentially the compiler can break the code in our case.
And I do agree that a compiler barrier() or some other means (including
a change of the standard) can be required in the future to prevent the
compiler from moving memory accesses around.

But a "broken" compiler is a much wider issue than can be deeply
discussed in this thread. I'm pretty sure that the kernel has tons of
very subtle code that actually creates locks and memory ordering. Such
code usually just uses the "barrier()" approach to tell gcc not to
combine or move memory accesses around it.

Let's just agree, for the sake of this memory barrier discussion, that
we *do* need a compiler barrier to tell gcc not to combine or move
memory accesses around it.

> Glad we agree on something!

I'm glad too!

> Did you miss the following passage in the paragraph you quoted?
>
> "... likewise, your consumer must issue a memory barrier
> instruction after removing an item from the queue and before
> reading from its memory."
>
> That is why DEC Alpha readers need a read-side memory barrier -- it
> says so right there. And as either you or Peter noted earlier in this
> thread, this barrier can be supplied by smp_read_barrier_depends().

I did not miss that passage. That passage explains why a consumer on an
Alpha processor, after reading @head, is required to execute an
additional smp_read_barrier_depends() before it can *read* from the
memory pointed to by @tail. And I think that I understand why --
because the reader has to wait till the local caches are fully updated,
and only then can it read data from the data buffer.

But on the producer side, after we read @tail, we don't need to wait for
an update of the local caches before we start to *write* data to the
buffer, since the producer is the only one who writes data there!

> I can sympathize if you are having trouble believing this. After all,
> it took the DEC Alpha architects a full hour to convince me, and that
> was in a face-to-face meeting instead of over email.
> (Just for the record, it took me even longer to convince them that
> their earlier documentation did not clearly indicate the need for
> these read-side barriers.) But regardless of whether or not I
> sympathize, DEC Alpha is what it is.

Again, I do understand the quirkiness of DEC Alpha, and I still think
that there is no need for a *full* memory barrier on the producer side
-- the one before writing data to the buffer, which you've put in the
kfifo implementation.

Regards,
-- Victor
* RE: perf events ring buffer memory barrier on powerpc
  2013-11-01 16:06         ` Victor Kaplansky
@ 2013-11-01 16:25           ` David Laight
  2013-11-01 16:30             ` Victor Kaplansky
  2013-11-02 15:46           ` Paul E. McKenney
  1 sibling, 1 reply; 74+ messages in thread
From: David Laight @ 2013-11-01 16:25 UTC (permalink / raw)
To: Victor Kaplansky, paulmck
Cc: Michael Neuling, Mathieu Desnoyers, Peter Zijlstra, LKML,
	Oleg Nesterov, Linux PPC dev, Anton Blanchard, Frederic Weisbecker

> But "broken" compiler is much wider issue to be deeply discussed in
> this thread. I'm pretty sure that kernel have tons of very subtle
> code that actually creates locks and memory ordering. Such code
> usually just use the "barrier()" approach to tell gcc not to combine
> or move memory accesses around it.

gcc will do unexpected memory accesses for bit fields that are adjacent
to volatile data. In particular it may generate 64-bit sized (and
aligned) RMW cycles when accessing bit fields. And yes, this has caused
real problems.

	David
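The hazard David describes is a matter of layout, so the shape of it
can at least be sketched. The struct below is a hypothetical example of
mine, not taken from any real driver; a single-threaded test obviously
cannot trigger the race, it only pins down which fields share the
danger zone.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: a bit field packed adjacent to data that
 * another CPU updates concurrently.  gcc before 4.8 (and 4.7.3) could
 * compile the bit-field store below as a 64-bit read-modify-write that
 * also covers 'flag', silently undoing a concurrent store to it. */
struct status {
	unsigned int bits : 4;	/* updated by this CPU */
	uint32_t flag;		/* updated concurrently elsewhere */
};

void set_bits(struct status *s, unsigned int v)
{
	/* Looks like it touches only 'bits'; the generated access may
	 * be wider than the bit field on affected compilers. */
	s->bits = v & 0xf;
}
```

Single-threaded the assignment is always correct; the bug only shows up
when another CPU writes 'flag' in the window of the wide RMW.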
* RE: perf events ring buffer memory barrier on powerpc
  2013-11-01 16:25           ` David Laight
@ 2013-11-01 16:30             ` Victor Kaplansky
  2013-11-03 20:57               ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 74+ messages in thread
From: Victor Kaplansky @ 2013-11-01 16:30 UTC (permalink / raw)
To: David Laight
Cc: Anton Blanchard, Frederic Weisbecker, LKML, Linux PPC dev,
	Mathieu Desnoyers, Michael Neuling, Oleg Nesterov, paulmck,
	Peter Zijlstra

"David Laight" <David.Laight@aculab.com> wrote on 11/01/2013 06:25:29 PM:

> gcc will do unexpected memory accesses for bit fields that are
> adjacent to volatile data.
> In particular it may generate 64bit sized (and aligned) RMW cycles
> when accessing bit fields.
> And yes, this has caused real problems.

Thanks, I am aware of this bug/feature in gcc.

-- Victor
* Re: perf events ring buffer memory barrier on powerpc
  2013-11-01 16:30             ` Victor Kaplansky
@ 2013-11-03 20:57               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 74+ messages in thread
From: Benjamin Herrenschmidt @ 2013-11-03 20:57 UTC (permalink / raw)
To: Victor Kaplansky
Cc: David Laight, Michael Neuling, Mathieu Desnoyers, Peter Zijlstra,
	Frederic Weisbecker, LKML, Oleg Nesterov, Linux PPC dev,
	Anton Blanchard, paulmck

On Fri, 2013-11-01 at 18:30 +0200, Victor Kaplansky wrote:
> "David Laight" <David.Laight@aculab.com> wrote on 11/01/2013 06:25:29 PM:
> > gcc will do unexpected memory accesses for bit fields that are
> > adjacent to volatile data.
> > In particular it may generate 64bit sized (and aligned) RMW cycles
> > when accessing bit fields.
> > And yes, this has caused real problems.
>
> Thanks, I am aware about this bug/feature in gcc.

AFAIK, this has been fixed in 4.8 and 4.7.3 ...

Cheers,
Ben.
* Re: perf events ring buffer memory barrier on powerpc
  2013-11-01 16:06         ` Victor Kaplansky
  2013-11-01 16:25           ` David Laight
@ 2013-11-02 15:46           ` Paul E. McKenney
  1 sibling, 0 replies; 74+ messages in thread
From: Paul E. McKenney @ 2013-11-02 15:46 UTC (permalink / raw)
To: Victor Kaplansky
Cc: Anton Blanchard, Benjamin Herrenschmidt, Frederic Weisbecker,
	LKML, Linux PPC dev, Mathieu Desnoyers, Michael Ellerman,
	Michael Neuling, Oleg Nesterov, Peter Zijlstra

On Fri, Nov 01, 2013 at 06:06:58PM +0200, Victor Kaplansky wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote on 10/31/2013
> 05:25:43 PM:
>
> > I really don't care about "fair" -- I care instead about the kernel
> > working reliably.
>
> Though I don't see how putting a memory barrier without deep
> understanding why it is needed helps kernel reliability, I do agree
> that reliability is more important than performance.

True enough. Of course, the same applies to removing memory barriers.

> > And it should also be easy for proponents of removing memory
> > barriers to clearly articulate what orderings their code does and
> > does not need.
>
> I intentionally took a simplified example of circle buffer from
> Documentation/circular-buffers.txt. I think both sides agree about
> memory ordering requirements in the example. At least I didn't see
> anyone argued about them.

Hard to say. No one has actually stated them clearly, so how could we
know whether or not we agree?

> > You are assuming control dependencies that the C language does not
> > provide. Now, for all I know right now, there might well be some
> > other reason why a full barrier is not required, but the "if"
> > statement cannot be that reason.
> >
> > Please review section 1.10 of the C++11 standard (or the
> > corresponding section of the C11 standard, if you prefer). The point
> > is that the C/C++11 covers only data dependencies, not control
> > dependencies.
>
> I feel you made a wrong assumption about my expertise in compilers.
> I don't need to reread section 1.10 of the C++11 standard, because I
> do agree that potentially compiler can break the code in our case.
> And I do agree that a compiler barrier() or some other means
> (including a change of the standard) can be required in future to
> prevent a compiler from moving memory accesses around.

I was simply reacting to what seemed to me to be your statement that
control dependencies affect ordering. They don't. The C/C++ standard
does not in any way respect control dependencies. In fact, there are
implementations that do not respect control dependencies.

But don't take my word for it, actually try it out on a weakly ordered
system. Or try out either ppcmem or armmem, which does a full
state-space search. Here is the paper:

	http://www.cl.cam.ac.uk/~pes20/ppc-supplemental/pldi105-sarkar.pdf

And here is the web-based tool:

	http://www.cl.cam.ac.uk/~pes20/ppcmem/

And here is a much faster version that you can run locally:

	http://www.cl.cam.ac.uk/~pes20/weakmemory/index.html

> But "broken" compiler is much wider issue to be deeply discussed in
> this thread. I'm pretty sure that kernel have tons of very subtle
> code that actually creates locks and memory ordering. Such code
> usually just use the "barrier()" approach to tell gcc not to combine
> or move memory accesses around it.
>
> Let's just agree for the sake of this memory barrier discussion that
> we *do* need compiler barrier to tell gcc not to combine or move
> memory accesses around it.

Sometimes barrier() is indeed all you need, other times more is needed.

> > Glad we agree on something!
>
> I'm glad too!
>
> > Did you miss the following passage in the paragraph you quoted?
> >
> > "... likewise, your consumer must issue a memory barrier
> > instruction after removing an item from the queue and before
> > reading from its memory."
> >
> > That is why DEC Alpha readers need a read-side memory barrier -- it
> > says so right there.
> > And as either you or Peter noted earlier in this thread, this
> > barrier can be supplied by smp_read_barrier_depends().
>
> I did not miss that passage. That passage explains why consumer on
> Alpha processor after reading @head is required to execute an
> additional smp_read_barrier_depends() before it can *read* from memory
> pointed by @tail. And I think that I understand why - because the
> reader have to wait till local caches are fully updated and only then
> it can read data from the data buffer.
>
> But on the producer side, after we read @tail, we don't need to wait
> for update of local caches before we start *write* data to the buffer,
> since the producer is the only one who write data there!

Well, we cannot allow the producer to clobber data while the consumer
is reading it out. That said, I do agree that we should get some help
from the fact that one element of the array is left empty, so that the
producer goes through a full write before clobbering the cell that the
consumer just vacated.

> > I can sympathize if you are having trouble believing this. After
> > all, it took the DEC Alpha architects a full hour to convince me,
> > and that was in a face-to-face meeting instead of over email. (Just
> > for the record, it took me even longer to convince them that their
> > earlier documentation did not clearly indicate the need for these
> > read-side barriers.) But regardless of whether or not I sympathize,
> > DEC Alpha is what it is.
>
> Again, I do understand quirkiness of the DEC Alpha, and I still think
> that there is no need in *full* memory barrier on producer side - the
> one before writing data to the buffer and which you've put in kfifo
> implementation.

There really does need to be some sort of memory barrier there to order
the read of the index before the write into the array element. Now, it
might well be that this barrier is supplied by the unlock-lock pair
guarding the producer, but either way, there does need to be some
ordering.
							Thanx, Paul
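To make the barrier placement under dispute concrete, here is a minimal
userspace model of the one-producer/one-consumer ring (a sketch of
mine, not the kfifo or perf code; the names ring_put/ring_get are
invented, and plain compiler barriers stand in for the smp_* primitives,
so this only models *where* the four barriers sit, not Alpha's hardware
behaviour). The barrier after the producer's read of ->tail is the one
Paul insists on above; it corresponds to the smp_mb() in __kfifo_put().

```c
#include <assert.h>

#define BUF_SIZE 16	/* power of two; one slot is kept empty */

/* Compiler barrier standing in for smp_mb()/smp_rmb()/smp_wmb(). */
#define barrier() __asm__ __volatile__("" ::: "memory")

struct ring {
	unsigned int head;	/* written only by the producer */
	unsigned int tail;	/* written only by the consumer */
	int buf[BUF_SIZE];
};

static int ring_put(struct ring *r, int v)
{
	unsigned int head = r->head;
	unsigned int tail = r->tail;		/* would be ACCESS_ONCE() */

	if (((head + 1) & (BUF_SIZE - 1)) == (tail & (BUF_SIZE - 1)))
		return 0;			/* full */
	barrier();	/* (A) read ->tail before writing data  */
	r->buf[head & (BUF_SIZE - 1)] = v;
	barrier();	/* (B) write data before publishing ->head */
	r->head = head + 1;
	return 1;
}

static int ring_get(struct ring *r, int *v)
{
	unsigned int head = r->head;		/* would be ACCESS_ONCE() */
	unsigned int tail = r->tail;

	if (head == tail)
		return 0;			/* empty */
	barrier();	/* (C) read ->head before reading data */
	*v = r->buf[tail & (BUF_SIZE - 1)];
	barrier();	/* (D) finish reading before freeing the slot */
	r->tail = tail + 1;
	return 1;
}
```

In the perf-side comment quoted later in the thread, A pairs with D and
B pairs with C; the single-threaded test below checks only the index
arithmetic and the one-slot-empty full/empty conditions.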
* Re: perf events ring buffer memory barrier on powerpc
  2013-10-28 13:26     ` Peter Zijlstra
  2013-10-28 16:34       ` Paul E. McKenney
@ 2013-10-28 19:09       ` Oleg Nesterov
  1 sibling, 0 replies; 74+ messages in thread
From: Oleg Nesterov @ 2013-10-28 19:09 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Victor Kaplansky, Frederic Weisbecker, Anton Blanchard,
	Benjamin Herrenschmidt, LKML, Linux PPC dev, Mathieu Desnoyers,
	Michael Ellerman, Michael Neuling, Paul McKenney

On 10/28, Peter Zijlstra wrote:
>
> Lets add Paul and Oleg to the thread; this is getting far more 'fun'
> that it should be ;-)

Heh. All I can say is that I would like to see the authoritative
answer, you know who can shed a light ;)

But to avoid the confusion, wmb() added by this patch looks "obviously
correct" to me.

> + * Since the mmap() consumer (userspace) can run on a different CPU:
> + *
> + *   kernel                       user
> + *
> + *   READ ->data_tail             READ ->data_head
> + *   smp_rmb()    (A)             smp_rmb()    (C)
> + *   WRITE $data                  READ $data
> + *   smp_wmb()    (B)             smp_mb()     (D)
> + *   STORE ->data_head            WRITE ->data_tail
> + *
> + * Where A pairs with D, and B pairs with C.
> + *
> + * I don't think A needs to be a full barrier because we won't in fact
> + * write data until we see the store from userspace.

this matches the intuition, but ...

> + * So we simply don't issue the data WRITE until we observe it.

why do we need any barrier (rmb) then? how can it help to serialize
with "WRITE $data"?

(of course there could be other reasons for this rmb(), just I can't
really understand "A pairs with D").

And this reminds me about the memory barrier in kfifo.c which I was
not able to understand. Hmm, it has already gone away, and now I do not
understand kfifo.c at all ;)

But I have found the commit, attached below. Note that that commit
added the full barrier into __kfifo_put(). And to me it looks the same
as "A" above.
Following the logic above we could say that we do not need a barrier
(at least the full one), we won't in fact write into the "unread" area
until we see the store to ->out from __kfifo_get()?

In short. I am confused, I _feel_ that "A" has to be a full barrier,
but I can't prove this. And let me suggest the artificial/simplified
example,

	bool	BUSY;
	data_t	DATA;

	bool try_to_get(data_t *data)
	{
		if (!BUSY)
			return false;

		rmb();

		*data = DATA;
		mb();
		BUSY = false;

		return true;
	}

	bool try_to_put(data_t *data)
	{
		if (BUSY)
			return false;

		mb();	// XXXXXXXX: do we really need it? I think yes.

		DATA = *data;
		wmb();
		BUSY = true;

		return true;
	}

Again, following the description above we could turn the mb() in _put()
into barrier(), but I do not think we can rely on the control
dependency.

Oleg.

---
commit a45bce49545739a940f8bd4ca85c3b7435564893
Author: Paul E. McKenney <paulmck@us.ibm.com>
Date:   Fri Sep 29 02:00:11 2006 -0700

    [PATCH] memory ordering in __kfifo primitives

    Both __kfifo_put() and __kfifo_get() have header comments stating
    that if there is but one concurrent reader and one concurrent
    writer, locking is not necessary.  This is almost the case, but a
    couple of memory barriers are needed.  Another option would be to
    change the header comments to remove the bit about locking not
    being needed, and to change those callers who currently don't use
    locking to add the required locking.  The attachment analyzes this
    approach, but the patch below seems simpler.

    Signed-off-by: Paul E. McKenney <paulmck@us.ibm.com>
    Cc: Stelian Pop <stelian@popies.net>
    Signed-off-by: Andrew Morton <akpm@osdl.org>
    Signed-off-by: Linus Torvalds <torvalds@osdl.org>

diff --git a/kernel/kfifo.c b/kernel/kfifo.c
index 64ab045..5d1d907 100644
--- a/kernel/kfifo.c
+++ b/kernel/kfifo.c
@@ -122,6 +122,13 @@ unsigned int __kfifo_put(struct kfifo *fifo,

 	len = min(len, fifo->size - fifo->in + fifo->out);

+	/*
+	 * Ensure that we sample the fifo->out index -before- we
+	 * start putting bytes into the kfifo.
+	 */
+
+	smp_mb();
+
 	/* first put the data starting from fifo->in to buffer end */
 	l = min(len, fifo->size - (fifo->in & (fifo->size - 1)));
 	memcpy(fifo->buffer + (fifo->in & (fifo->size - 1)), buffer, l);
@@ -129,6 +136,13 @@ unsigned int __kfifo_put(struct kfifo *fifo,

 	/* then put the rest (if any) at the beginning of the buffer */
 	memcpy(fifo->buffer, buffer + l, len - l);

+	/*
+	 * Ensure that we add the bytes to the kfifo -before-
+	 * we update the fifo->in index.
+	 */
+
+	smp_wmb();
+
 	fifo->in += len;

 	return len;
@@ -154,6 +168,13 @@ unsigned int __kfifo_get(struct kfifo *fifo,

 	len = min(len, fifo->in - fifo->out);

+	/*
+	 * Ensure that we sample the fifo->in index -before- we
+	 * start removing bytes from the kfifo.
+	 */
+
+	smp_rmb();
+
 	/* first get the data from fifo->out until the end of the buffer */
 	l = min(len, fifo->size - (fifo->out & (fifo->size - 1)));
 	memcpy(buffer, fifo->buffer + (fifo->out & (fifo->size - 1)), l);
@@ -161,6 +182,13 @@ unsigned int __kfifo_get(struct kfifo *fifo,

 	/* then get the rest (if any) from the beginning of the buffer */
 	memcpy(buffer + l, fifo->buffer, len - l);

+	/*
+	 * Ensure that we remove the bytes from the kfifo -before-
+	 * we update the fifo->out index.
+	 */
+
+	smp_mb();
+
 	fifo->out += len;

 	return len;