* Re: [RFC][PATCH 0/5] arch: atomic rework
@ 2014-02-26 3:06 George Spelvin
2014-02-26 5:22 ` Paul E. McKenney
0 siblings, 1 reply; 285+ messages in thread
From: George Spelvin @ 2014-02-26 3:06 UTC (permalink / raw)
To: paulmck
Cc: akpm, dhowells, gcc, linux, linux-arch, linux-kernel, mingo,
peterz, Ramana.Radhakrishnan, torvalds, triegel, will.deacon
<paulmck@linux.vnet.ibm.com> wrote:
> <torvalds@linux-foundation.org> wrote:
>> I have for the last several years been 100% convinced that the Intel
>> memory ordering is the right thing, and that people who like weak
>> memory ordering are wrong and should try to avoid reproducing if at
>> all possible.
>
> Are ARM and Power really the bad boys here? Or are they instead playing
> the role of the canary in the coal mine?
To paraphrase some older threads, I think Linus's argument is that
weak memory ordering is like branch delay slots: a way to make a simple
implementation simpler, but ends up being no help to a more aggressive
implementation.
Branch delay slots give a one-cycle bonus to in-order cores, but
once you go superscalar and add branch prediction, they stop helping,
and once you go full out of order, they're just an annoyance.
Likewise, I can see the point that weak ordering can help make a simple
cache interface simpler, but once you start doing speculative loads,
you've already bought and paid for all the hardware you need to do
stronger coherency.
Another thing that requires all the strong-coherency machinery is
a high-performance implementation of the various memory barrier and
synchronization operations. Yes, a low-performance (drain the pipeline)
implementation is tolerable if the instructions aren't used frequently,
but once you're really trying, it doesn't save complexity.
Once you're there, strong coherency doesn't actually cost you any
time outside of critical synchronization code, and it both simplifies
and speeds up the tricky synchronization software.
So PPC's and ARM's weak ordering is not the direction the future is going.
Rather, weak ordering is something that's only useful in a limited
technology window, which is rapidly passing.
If you can find someone in IBM who's worked on the Z series cache
coherency (extremely strong ordering), they probably have some useful
insights. The big question is if strong ordering, once you've accepted
the implementation complexity and area, actually costs anything in
execution time. If there's an unavoidable cost which weak ordering saves,
that's significant.
^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-26  3:06 [RFC][PATCH 0/5] arch: atomic rework George Spelvin
@ 2014-02-26  5:22 ` Paul E. McKenney
  0 siblings, 0 replies; 285+ messages in thread
From: Paul E. McKenney @ 2014-02-26 5:22 UTC (permalink / raw)
To: George Spelvin
Cc: akpm, dhowells, gcc, linux-arch, linux-kernel, mingo, peterz,
    Ramana.Radhakrishnan, torvalds, triegel, will.deacon

On Tue, Feb 25, 2014 at 10:06:53PM -0500, George Spelvin wrote:
> <paulmck@linux.vnet.ibm.com> wrote:
> > <torvalds@linux-foundation.org> wrote:
> >> I have for the last several years been 100% convinced that the Intel
> >> memory ordering is the right thing, and that people who like weak
> >> memory ordering are wrong and should try to avoid reproducing if at
> >> all possible.
> >
> > Are ARM and Power really the bad boys here?  Or are they instead playing
> > the role of the canary in the coal mine?
>
> To paraphrase some older threads, I think Linus's argument is that
> weak memory ordering is like branch delay slots: a way to make a simple
> implementation simpler, but ends up being no help to a more aggressive
> implementation.
>
> Branch delay slots give a one-cycle bonus to in-order cores, but
> once you go superscalar and add branch prediction, they stop helping,
> and once you go full out of order, they're just an annoyance.
>
> Likewise, I can see the point that weak ordering can help make a simple
> cache interface simpler, but once you start doing speculative loads,
> you've already bought and paid for all the hardware you need to do
> stronger coherency.
>
> Another thing that requires all the strong-coherency machinery is
> a high-performance implementation of the various memory barrier and
> synchronization operations.  Yes, a low-performance (drain the pipeline)
> implementation is tolerable if the instructions aren't used frequently,
> but once you're really trying, it doesn't save complexity.
>
> Once you're there, strong coherency doesn't actually cost you any
> time outside of critical synchronization code, and it both simplifies
> and speeds up the tricky synchronization software.
>
> So PPC's and ARM's weak ordering is not the direction the future is going.
> Rather, weak ordering is something that's only useful in a limited
> technology window, which is rapidly passing.

That does indeed appear to be Intel's story.  Might well be correct.
Time will tell.

> If you can find someone in IBM who's worked on the Z series cache
> coherency (extremely strong ordering), they probably have some useful
> insights.  The big question is if strong ordering, once you've accepted
> the implementation complexity and area, actually costs anything in
> execution time.  If there's an unavoidable cost which weak ordering saves,
> that's significant.

There has been a lot of ink spilled on this argument.  ;-)

PPC has much larger CPU counts than does the mainframe.  On the other
hand, there are large x86 systems.  Some claim that there are differences
in latency due to the different approaches, and there could be a long
argument about whether all this is inherent in the memory ordering or
whether it is due to implementation issues.  I don't claim to know the
answer.  I do know that ARM and PPC are here now, and that I need to
deal with them.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
@ 2014-02-18 12:12 Peter Sewell
  2014-02-18 12:53 ` Peter Zijlstra
  ` (3 more replies)
  0 siblings, 4 replies; 285+ messages in thread
From: Peter Sewell @ 2014-02-18 12:12 UTC (permalink / raw)
To: Peter Sewell, mark.batty@cl.cam.ac.uk, Paul McKenney, peterz,
    Torvald Riegel, torvalds, Will Deacon, Ramana.Radhakrishnan,
    dhowells, linux-arch, linux-kernel, akpm, mingo, gcc

Several of you have said that the standard and compiler should not
permit speculative writes of atomics, or (effectively) that the
compiler should preserve dependencies.  In simple examples it's easy
to see what that means, but in general it's not so clear what the
language should guarantee, because dependencies may go via non-atomic
code in other compilation units, and we have to consider the extent to
which it's desirable to limit optimisation there.

For example, suppose we have, in one compilation unit:

  void f(int ra, int *rb) {
    if (ra == 42)
      *rb = 42;
    else
      *rb = 42;
  }

and in another compilation unit the bodies of two threads:

  // Thread 0
  r1 = x;
  f(r1, &r2);
  y = r2;

  // Thread 1
  r3 = y;
  f(r3, &r4);
  x = r4;

where accesses to x and y are annotated C11 atomic
memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
r1, r2, r3, r4, ra, rb are not annotated, and x and y initially hold 0.

(Of course, this is an artificial example, to make the point below as
simply as possible - in real code the branches of the conditional
might not be syntactically identical, just equivalent after macro
expansion and other optimisation.)

In the source program there's a dependency from the read of x to the
write of y in Thread 0, and from the read of y to the write of x on
Thread 1.  Dependency-respecting compilation would preserve those and
the ARM and POWER architectures both respect them, so the reads of x
and y could not give 42.

But a compiler might well optimise the (non-atomic) body of f() to
just *rb=42, making the threads effectively

  // Thread 0
  r1 = x;
  y = 42;

  // Thread 1
  r3 = y;
  x = 42;

(GCC does this at O1, O2, and O3) and the ARM and POWER architectures
permit those two reads to see 42.  That is moreover actually observable
on current ARM hardware.

So as far as we can see, either:

1) if you can accept the latter behaviour (if the Linux codebase does
   not rely on its absence), the language definition should permit it,
   and current compiler optimisations can be used, or

2) otherwise, the language definition should prohibit it but the
   compiler would have to preserve dependencies even in compilation
   units that have no mention of atomics.  It's unclear what the
   (runtime and compiler development) cost of that would be in
   practice - perhaps Torvald could comment?

For more context, this example is taken from a summary of the thin-air
problem by Mark Batty and myself,
<www.cl.cam.ac.uk/~pes20/cpp/notes42.html>, and the problem with
dependencies via other compilation units was AFAIK first pointed out
by Hans Boehm.

Peter

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 12:12 Peter Sewell
@ 2014-02-18 12:53 ` Peter Zijlstra
  2014-02-18 16:08   ` Peter Sewell
  2014-02-18 14:56 ` Paul E. McKenney
  ` (2 subsequent siblings)
  3 siblings, 1 reply; 285+ messages in thread
From: Peter Zijlstra @ 2014-02-18 12:53 UTC (permalink / raw)
To: Peter Sewell
Cc: mark.batty@cl.cam.ac.uk, Paul McKenney, Torvald Riegel, torvalds,
    Will Deacon, Ramana.Radhakrishnan, dhowells, linux-arch,
    linux-kernel, akpm, mingo, gcc

On Tue, Feb 18, 2014 at 12:12:06PM +0000, Peter Sewell wrote:
> Several of you have said that the standard and compiler should not
> permit speculative writes of atomics, or (effectively) that the
> compiler should preserve dependencies.

The example below only deals with control dependencies; so I'll limit
myself to that.

> In simple examples it's easy
> to see what that means, but in general it's not so clear what the
> language should guarantee, because dependencies may go via non-atomic
> code in other compilation units, and we have to consider the extent to
> which it's desirable to limit optimisation there.
>
> For example, suppose we have, in one compilation unit:
>
>   void f(int ra, int *rb) {
>     if (ra == 42)
>       *rb = 42;
>     else
>       *rb = 42;
>   }
>
> and in another compilation unit the bodies of two threads:
>
>   // Thread 0
>   r1 = x;
>   f(r1, &r2);
>   y = r2;
>
>   // Thread 1
>   r3 = y;
>   f(r3, &r4);
>   x = r4;
>
> where accesses to x and y are annotated C11 atomic
> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
> r1, r2, r3, r4, ra, rb are not annotated, and x and y initially hold 0.

So I'm intuitively OK with this; however, I would expect something like:

  void f(_Atomic int ra, _Atomic int *rb);

to preserve dependencies and not make the conditional go away, simply
because in that case, in:

  if (ra == 42)

the 'ra' usage can be seen as an atomic load.

> So as far as we can see, either:
>
> 1) if you can accept the latter behaviour (if the Linux codebase does
>    not rely on its absence), the language definition should permit it,
>    and current compiler optimisations can be used,

Currently there's exactly 1 site in the Linux kernel that relies on
control dependencies as far as I know -- the one I put in.  And it's
limited to a single function, so no cross-translation-unit funnies
there.

Of course, nobody is going to tell me when or where they'll put in the
next one, since it's now documented as accepted practice.

However, PaulMck and our RCU usage very much do cross all sorts of TU
boundaries; but those are data dependencies.

~ Peter

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 12:53 ` Peter Zijlstra
@ 2014-02-18 16:08   ` Peter Sewell
  0 siblings, 0 replies; 285+ messages in thread
From: Peter Sewell @ 2014-02-18 16:08 UTC (permalink / raw)
To: Peter Zijlstra
Cc: mark.batty@cl.cam.ac.uk, Paul McKenney, Torvald Riegel, torvalds,
    Will Deacon, ramana.radhakrishnan, dhowells, linux-arch,
    linux-kernel, akpm, mingo, gcc

On 18 February 2014 12:53, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Feb 18, 2014 at 12:12:06PM +0000, Peter Sewell wrote:
>> Several of you have said that the standard and compiler should not
>> permit speculative writes of atomics, or (effectively) that the
>> compiler should preserve dependencies.
>
> The example below only deals with control dependencies; so I'll limit
> myself to that.

Data/address dependencies are, if anything, even less clear - see a
paragraph on that in my reply to Paul.

>> In simple examples it's easy
>> to see what that means, but in general it's not so clear what the
>> language should guarantee, because dependencies may go via non-atomic
>> code in other compilation units, and we have to consider the extent to
>> which it's desirable to limit optimisation there.
>>
>> For example, suppose we have, in one compilation unit:
>>
>>   void f(int ra, int *rb) {
>>     if (ra == 42)
>>       *rb = 42;
>>     else
>>       *rb = 42;
>>   }
>>
>> and in another compilation unit the bodies of two threads:
>>
>>   // Thread 0
>>   r1 = x;
>>   f(r1, &r2);
>>   y = r2;
>>
>>   // Thread 1
>>   r3 = y;
>>   f(r3, &r4);
>>   x = r4;
>>
>> where accesses to x and y are annotated C11 atomic
>> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
>> r1, r2, r3, r4, ra, rb are not annotated, and x and y initially hold 0.
>
> So I'm intuitively OK with this; however, I would expect something like:
>
>   void f(_Atomic int ra, _Atomic int *rb);
>
> to preserve dependencies and not make the conditional go away, simply
> because in that case, in:
>
>   if (ra == 42)
>
> the 'ra' usage can be seen as an atomic load.
>
>> So as far as we can see, either:
>>
>> 1) if you can accept the latter behaviour (if the Linux codebase does
>>    not rely on its absence), the language definition should permit it,
>>    and current compiler optimisations can be used,
>
> Currently there's exactly 1 site in the Linux kernel that relies on
> control dependencies as far as I know -- the one I put in.

ok, thanks

> And it's
> limited to a single function, so no cross-translation-unit funnies
> there.

One can imagine a language definition that treats code that lies
entirely within a single compilation unit specially (e.g. if it's
somehow annotated as relying on dependencies).  But I imagine it would
be pretty unappealing to work with.

> Of course, nobody is going to tell me when or where they'll put in the
> next one, since it's now documented as accepted practice.

Is that not fragile?

> However, PaulMck and our RCU usage very much do cross all sorts of TU
> boundaries; but those are data dependencies.

yes - though again see my reply to Paul's note

thanks,
Peter

> ~ Peter

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 12:12 Peter Sewell
  2014-02-18 12:53 ` Peter Zijlstra
@ 2014-02-18 14:56 ` Paul E. McKenney
  2014-02-18 15:16   ` Mark Batty
  2014-02-18 15:33   ` Peter Sewell
  2014-02-18 17:38 ` Linus Torvalds
  2014-02-18 20:43 ` Torvald Riegel
  3 siblings, 2 replies; 285+ messages in thread
From: Paul E. McKenney @ 2014-02-18 14:56 UTC (permalink / raw)
To: Peter Sewell
Cc: mark.batty@cl.cam.ac.uk, peterz, Torvald Riegel, torvalds,
    Will Deacon, Ramana.Radhakrishnan, dhowells, linux-arch,
    linux-kernel, akpm, mingo, gcc

On Tue, Feb 18, 2014 at 12:12:06PM +0000, Peter Sewell wrote:
> Several of you have said that the standard and compiler should not
> permit speculative writes of atomics, or (effectively) that the
> compiler should preserve dependencies.  In simple examples it's easy
> to see what that means, but in general it's not so clear what the
> language should guarantee, because dependencies may go via non-atomic
> code in other compilation units, and we have to consider the extent to
> which it's desirable to limit optimisation there.
>
> For example, suppose we have, in one compilation unit:
>
>   void f(int ra, int *rb) {
>     if (ra == 42)
>       *rb = 42;
>     else
>       *rb = 42;
>   }

Hello, Peter!

Nice example!

The relevant portion of Documentation/memory-barriers.txt in my -rcu tree
says the following about the control dependency in the above construct:

------------------------------------------------------------------------

	q = ACCESS_ONCE(a);
	if (q) {
		barrier();
		ACCESS_ONCE(b) = p;
		do_something();
	} else {
		barrier();
		ACCESS_ONCE(b) = p;
		do_something_else();
	}

The initial ACCESS_ONCE() is required to prevent the compiler from
proving the value of 'a', and the pair of barrier() invocations are
required to prevent the compiler from pulling the two identical stores
to 'b' out from the legs of the "if" statement.

------------------------------------------------------------------------

So yes, current compilers need significant help if it is necessary to
maintain dependencies in that sort of code.

Similar examples came up in the data-dependency discussions in the
standards committee, which led to the [[carries_dependency]] attribute for
C11 and C++11.  Of course, current compilers don't have this attribute,
and the current Linux kernel code doesn't have any other marking for
data dependencies passing across function boundaries.  (Maybe some time
as an assist for detecting pointer leaks out of RCU read-side critical
sections, but efforts along those lines are a bit stalled at the moment.)

More on data dependencies below...

> and in another compilation unit the bodies of two threads:
>
>   // Thread 0
>   r1 = x;
>   f(r1, &r2);
>   y = r2;
>
>   // Thread 1
>   r3 = y;
>   f(r3, &r4);
>   x = r4;
>
> where accesses to x and y are annotated C11 atomic
> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
> r1, r2, r3, r4, ra, rb are not annotated, and x and y initially hold 0.
>
> (Of course, this is an artificial example, to make the point below as
> simply as possible - in real code the branches of the conditional
> might not be syntactically identical, just equivalent after macro
> expansion and other optimisation.)
>
> In the source program there's a dependency from the read of x to the
> write of y in Thread 0, and from the read of y to the write of x on
> Thread 1.  Dependency-respecting compilation would preserve those and
> the ARM and POWER architectures both respect them, so the reads of x
> and y could not give 42.
>
> But a compiler might well optimise the (non-atomic) body of f() to
> just *rb=42, making the threads effectively
>
>   // Thread 0
>   r1 = x;
>   y = 42;
>
>   // Thread 1
>   r3 = y;
>   x = 42;
>
> (GCC does this at O1, O2, and O3) and the ARM and POWER architectures
> permit those two reads to see 42.  That is moreover actually observable
> on current ARM hardware.

I do agree that this could happen on current compilers and hardware.

Agreed, but as Peter Zijlstra noted in this thread, this optimization
is to a control dependency, not a data dependency.

> So as far as we can see, either:
>
> 1) if you can accept the latter behaviour (if the Linux codebase does
>    not rely on its absence), the language definition should permit it,
>    and current compiler optimisations can be used,
>
> or
>
> 2) otherwise, the language definition should prohibit it but the
>    compiler would have to preserve dependencies even in compilation
>    units that have no mention of atomics.  It's unclear what the
>    (runtime and compiler development) cost of that would be in
>    practice - perhaps Torvald could comment?

For current compilers, we have to rely on coding conventions within
the Linux kernel in combination with non-standard extensions to gcc
and specified compiler flags to disable undesirable behavior.  I have a
start on specifying this in a document I am preparing for the standards
committee, a very early draft of which may be found here:

http://www2.rdrop.com/users/paulmck/scalability/paper/consume.2014.02.16c.pdf

Section 3 shows the results of a manual scan through the Linux kernel's
dependency chains, and Section 4.1 lists a probably incomplete (and no
doubt erroneous) list of coding standards required to make dependency
chains work on current compilers.  Any comments and suggestions are more
than welcome!

> For more context, this example is taken from a summary of the thin-air
> problem by Mark Batty and myself,
> <www.cl.cam.ac.uk/~pes20/cpp/notes42.html>, and the problem with
> dependencies via other compilation units was AFAIK first pointed out
> by Hans Boehm.

Nice document!

One point of confusion for me...  Example 4 says "language must allow".
Shouldn't that be "language is permitted to allow"?  Seems like an
implementation is always within its rights to avoid an optimization if
its implementation prevents it from safely detecting the opportunity
for that optimization.  Or am I missing something here?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 14:56 ` Paul E. McKenney
@ 2014-02-18 15:16   ` Mark Batty
  2014-02-18 17:17     ` Paul E. McKenney
  2014-02-18 15:33   ` Peter Sewell
  1 sibling, 1 reply; 285+ messages in thread
From: Mark Batty @ 2014-02-18 15:16 UTC (permalink / raw)
To: paulmck
Cc: Peter Sewell, peterz, Torvald Riegel, torvalds, Will Deacon,
    Ramana.Radhakrishnan, dhowells, linux-arch, linux-kernel, akpm,
    mingo, gcc

Hi Paul,

Thanks for the document. I'm looking forward to reading the bits about
dependency chains in Linux.

> One point of confusion for me...  Example 4 says "language must allow".
> Shouldn't that be "language is permitted to allow"?

When we say "allow", we mean that the optimised execution should be
allowed by the specification, and, implicitly, the unoptimised
execution should remain allowed too.  We want to be concrete about what
the language specification allows, and that's why we say "must".  It is
not to disallow the unoptimised execution.

> Seems like an
> implementation is always within its rights to avoid an optimization if
> its implementation prevents it from safely detecting the opportunity
> for that optimization.

That's right.

- Mark

> Or am I missing something here?
>
> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 15:16   ` Mark Batty
@ 2014-02-18 17:17     ` Paul E. McKenney
  0 siblings, 0 replies; 285+ messages in thread
From: Paul E. McKenney @ 2014-02-18 17:17 UTC (permalink / raw)
To: Mark Batty
Cc: Peter Sewell, peterz, Torvald Riegel, torvalds, Will Deacon,
    Ramana.Radhakrishnan, dhowells, linux-arch, linux-kernel, akpm,
    mingo, gcc

On Tue, Feb 18, 2014 at 03:16:33PM +0000, Mark Batty wrote:
> Hi Paul,
>
> Thanks for the document. I'm looking forward to reading the bits about
> dependency chains in Linux.

And I am looking forward to your thoughts on those bits!

> > One point of confusion for me...  Example 4 says "language must allow".
> > Shouldn't that be "language is permitted to allow"?
>
> When we say "allow", we mean that the optimised execution should be
> allowed by the specification, and, implicitly, the unoptimised
> execution should remain allowed too.  We want to be concrete about what
> the language specification allows, and that's why we say "must".  It is
> not to disallow the unoptimised execution.

OK, got it!

							Thanx, Paul

> > Seems like an
> > implementation is always within its rights to avoid an optimization if
> > its implementation prevents it from safely detecting the opportunity
> > for that optimization.
>
> That's right.
>
> - Mark
>
> > Or am I missing something here?
> >
> > 							Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-18 14:56 ` Paul E. McKenney
  2014-02-18 15:16   ` Mark Batty
@ 2014-02-18 15:33   ` Peter Sewell
  2014-02-18 16:47     ` Paul E. McKenney
  1 sibling, 1 reply; 285+ messages in thread
From: Peter Sewell @ 2014-02-18 15:33 UTC (permalink / raw)
To: Paul McKenney
Cc: mark.batty@cl.cam.ac.uk, peterz, Torvald Riegel, torvalds,
    Will Deacon, Ramana.Radhakrishnan, dhowells, linux-arch,
    linux-kernel, akpm, mingo, gcc

Hi Paul,

On 18 February 2014 14:56, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> On Tue, Feb 18, 2014 at 12:12:06PM +0000, Peter Sewell wrote:
>> Several of you have said that the standard and compiler should not
>> permit speculative writes of atomics, or (effectively) that the
>> compiler should preserve dependencies.  In simple examples it's easy
>> to see what that means, but in general it's not so clear what the
>> language should guarantee, because dependencies may go via non-atomic
>> code in other compilation units, and we have to consider the extent to
>> which it's desirable to limit optimisation there.
>>
>> For example, suppose we have, in one compilation unit:
>>
>>   void f(int ra, int *rb) {
>>     if (ra == 42)
>>       *rb = 42;
>>     else
>>       *rb = 42;
>>   }
>
> Hello, Peter!
>
> Nice example!
>
> The relevant portion of Documentation/memory-barriers.txt in my -rcu tree
> says the following about the control dependency in the above construct:
>
> ------------------------------------------------------------------------
>
> 	q = ACCESS_ONCE(a);
> 	if (q) {
> 		barrier();
> 		ACCESS_ONCE(b) = p;
> 		do_something();
> 	} else {
> 		barrier();
> 		ACCESS_ONCE(b) = p;
> 		do_something_else();
> 	}
>
> The initial ACCESS_ONCE() is required to prevent the compiler from
> proving the value of 'a', and the pair of barrier() invocations are
> required to prevent the compiler from pulling the two identical stores
> to 'b' out from the legs of the "if" statement.

thanks

> ------------------------------------------------------------------------
>
> So yes, current compilers need significant help if it is necessary to
> maintain dependencies in that sort of code.
>
> Similar examples came up in the data-dependency discussions in the
> standards committee, which led to the [[carries_dependency]] attribute for
> C11 and C++11.  Of course, current compilers don't have this attribute,
> and the current Linux kernel code doesn't have any other marking for
> data dependencies passing across function boundaries.  (Maybe some time
> as an assist for detecting pointer leaks out of RCU read-side critical
> sections, but efforts along those lines are a bit stalled at the moment.)
>
> More on data dependencies below...
>
>> and in another compilation unit the bodies of two threads:
>>
>>   // Thread 0
>>   r1 = x;
>>   f(r1, &r2);
>>   y = r2;
>>
>>   // Thread 1
>>   r3 = y;
>>   f(r3, &r4);
>>   x = r4;
>>
>> where accesses to x and y are annotated C11 atomic
>> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to
>> r1, r2, r3, r4, ra, rb are not annotated, and x and y initially hold 0.
>>
>> (Of course, this is an artificial example, to make the point below as
>> simply as possible - in real code the branches of the conditional
>> might not be syntactically identical, just equivalent after macro
>> expansion and other optimisation.)
>>
>> In the source program there's a dependency from the read of x to the
>> write of y in Thread 0, and from the read of y to the write of x on
>> Thread 1.  Dependency-respecting compilation would preserve those and
>> the ARM and POWER architectures both respect them, so the reads of x
>> and y could not give 42.
>>
>> But a compiler might well optimise the (non-atomic) body of f() to
>> just *rb=42, making the threads effectively
>>
>>   // Thread 0
>>   r1 = x;
>>   y = 42;
>>
>>   // Thread 1
>>   r3 = y;
>>   x = 42;
>>
>> (GCC does this at O1, O2, and O3) and the ARM and POWER architectures
>> permit those two reads to see 42.  That is moreover actually observable
>> on current ARM hardware.
>
> I do agree that this could happen on current compilers and hardware.
>
> Agreed, but as Peter Zijlstra noted in this thread, this optimization
> is to a control dependency, not a data dependency.

Indeed.  In principle (again as Hans has observed) a compiler might
well convert between the two, e.g. if operating on single-bit values,
or where value-range analysis has shown that a variable can only
contain one of a small set of values.  I don't know whether that
happens in practice?  Then there are also cases where a compiler is
very likely to remove data/address dependencies, e.g. if some constant C
is #define'd to be 0 then an array access indexed by x * C will have
the dependency on x removed.  The runtime and compiler development
costs of preventing that are also unclear to me.

Given that, whether it's reasonable to treat control and data
dependencies differently seems to be an open question.

>> So as far as we can see, either:
>>
>> 1) if you can accept the latter behaviour (if the Linux codebase does
>>    not rely on its absence), the language definition should permit it,
>>    and current compiler optimisations can be used,
>>
>> or
>>
>> 2) otherwise, the language definition should prohibit it but the
>>    compiler would have to preserve dependencies even in compilation
>>    units that have no mention of atomics.  It's unclear what the
>>    (runtime and compiler development) cost of that would be in
>>    practice - perhaps Torvald could comment?
>
> For current compilers, we have to rely on coding conventions within
> the Linux kernel in combination with non-standard extensions to gcc
> and specified compiler flags to disable undesirable behavior.  I have a
> start on specifying this in a document I am preparing for the standards
> committee, a very early draft of which may be found here:
>
> http://www2.rdrop.com/users/paulmck/scalability/paper/consume.2014.02.16c.pdf
>
> Section 3 shows the results of a manual scan through the Linux kernel's
> dependency chains, and Section 4.1 lists a probably incomplete (and no
> doubt erroneous) list of coding standards required to make dependency
> chains work on current compilers.  Any comments and suggestions are more
> than welcome!

Thanks, that's very interesting (especially the non-local dependency
chains).  At a first glance, the "4.1 Rules for C-Language RCU Users"
seem pretty fragile - they're basically trying to guess the limits of
compiler optimisation smartness.

>> For more context, this example is taken from a summary of the thin-air
>> problem by Mark Batty and myself,
>> <www.cl.cam.ac.uk/~pes20/cpp/notes42.html>, and the problem with
>> dependencies via other compilation units was AFAIK first pointed out
>> by Hans Boehm.
>
> Nice document!
>
> One point of confusion for me...  Example 4 says "language must allow".
> Shouldn't that be "language is permitted to allow"?  Seems like an
> implementation is always within its rights to avoid an optimization if
> its implementation prevents it from safely detecting the opportunity
> for that optimization.  Or am I missing something here?

We're saying that the language definition must allow it, not that any
particular implementation must be able to exhibit it.

best,
Peter

> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 15:33 ` Peter Sewell @ 2014-02-18 16:47 ` Paul E. McKenney 0 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-18 16:47 UTC (permalink / raw) To: Peter Sewell Cc: mark.batty@cl.cam.ac.uk, peterz, Torvald Riegel, torvalds, Will Deacon, Ramana.Radhakrishnan, dhowells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Feb 18, 2014 at 03:33:35PM +0000, Peter Sewell wrote: > Hi Paul, > > On 18 February 2014 14:56, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > > On Tue, Feb 18, 2014 at 12:12:06PM +0000, Peter Sewell wrote: > >> Several of you have said that the standard and compiler should not > >> permit speculative writes of atomics, or (effectively) that the > >> compiler should preserve dependencies. In simple examples it's easy > >> to see what that means, but in general it's not so clear what the > >> language should guarantee, because dependencies may go via non-atomic > >> code in other compilation units, and we have to consider the extent to > >> which it's desirable to limit optimisation there. > >> > >> For example, suppose we have, in one compilation unit: > >> > >> void f(int ra, int*rb) { > >> if (ra==42) > >> *rb=42; > >> else > >> *rb=42; > >> } > > > > Hello, Peter! > > > > Nice example! 
> > > > The relevant portion of Documentation/memory-barriers.txt in my -rcu tree > > says the following about the control dependency in the above construct: > > > > ------------------------------------------------------------------------ > > > > q = ACCESS_ONCE(a); > > if (q) { > > barrier(); > > ACCESS_ONCE(b) = p; > > do_something(); > > } else { > > barrier(); > > ACCESS_ONCE(b) = p; > > do_something_else(); > > } > > > > The initial ACCESS_ONCE() is required to prevent the compiler from > > proving the value of 'a', and the pair of barrier() invocations are > > required to prevent the compiler from pulling the two identical stores > > to 'b' out from the legs of the "if" statement. > > thanks > > > ------------------------------------------------------------------------ > > > > So yes, current compilers need significant help if it is necessary to > > maintain dependencies in that sort of code. > > > > Similar examples came up in the data-dependency discussions in the > > standards committee, which led to the [[carries_dependency]] attribute for > > C11 and C++11. Of course, current compilers don't have this attribute, > > and the current Linux kernel code doesn't have any other marking for > > data dependencies passing across function boundaries. (Maybe some time > > as an assist for detecting pointer leaks out of RCU read-side critical > > sections, but efforts along those lines are a bit stalled at the moment.) > > > > More on data dependencies below... > > > >> and in another compilation unit the bodies of two threads: > >> > >> // Thread 0 > >> r1 = x; > >> f(r1,&r2); > >> y = r2; > >> > >> // Thread 1 > >> r3 = y; > >> f(r3,&r4); > >> x = r4; > >> > >> where accesses to x and y are annotated C11 atomic > >> memory_order_relaxed or Linux ACCESS_ONCE(), accesses to > >> r1,r2,r3,r4,ra,rb are not annotated, and x and y initially hold 0. 
> >>
> >> (Of course, this is an artificial example, to make the point below as
> >> simply as possible - in real code the branches of the conditional
> >> might not be syntactically identical, just equivalent after macro
> >> expansion and other optimisation.)
> >>
> >> In the source program there's a dependency from the read of x to the
> >> write of y in Thread 0, and from the read of y to the write of x on
> >> Thread 1.  Dependency-respecting compilation would preserve those and
> >> the ARM and POWER architectures both respect them, so the reads of x
> >> and y could not give 42.
> >>
> >> But a compiler might well optimise the (non-atomic) body of f() to
> >> just *rb=42, making the threads effectively
> >>
> >>   // Thread 0
> >>   r1 = x;
> >>   y = 42;
> >>
> >>   // Thread 1
> >>   r3 = y;
> >>   x = 42;
> >>
> >> (GCC does this at O1, O2, and O3) and the ARM and POWER architectures
> >> permit those two reads to see 42.  That is moreover actually observable
> >> on current ARM hardware.
> >
> > I do agree that this could happen on current compilers and hardware.
> >
> > Agreed, but as Peter Zijlstra noted in this thread, this optimization
> > is to a control dependency, not a data dependency.
>
> Indeed.  In principle (again as Hans has observed) a compiler might
> well convert between the two, e.g. if operating on single-bit values,
> or where value-range analysis has shown that a variable can only
> contain one of a small set of values.  I don't know whether that
> happens in practice?  Then there are also cases where a compiler is
> very likely to remove data/address dependencies, eg if some constant C
> is #define'd to be 0 then an array access indexed by x * C will have
> the dependency on x removed.  The runtime and compiler development
> costs of preventing that are also unclear to me.
>
> Given that, whether it's reasonable to treat control and data
> dependencies differently seems to be an open question.
Here is another (admittedly fanciful and probably buggy) implementation
of f() that relies on data dependencies (according to C11 and C++11),
but which could not be relied on to preserve those data dependencies
given current pre-C11 compilers:

	int arr[2] = { 42, 43 };
	int *bigarr;

	int f(int ra)
	{
		return arr[ra != 42];
	}

	// Thread 0
	r1 = atomic_load_explicit(&gidx, memory_order_consume);
	r2 = bigarr[f(r1)];

	// Thread 1
	r3 = random() % BIGARR_SIZE;
	bigarr[r3] = some_integer();
	atomic_store_explicit(&gidx, r3, memory_order_release);

	// Main program
	bigarr = kmalloc(BIGARR_SIZE * sizeof(*bigarr), ...);
	// Note: bigarr currently contains pre-initialization garbage
	// Spawn threads 0 and 1

Many compilers would be happy to convert f() into something like the
following:

	int f(int ra)
	{
		if (ra == 42)
			return arr[0];
		else
			return arr[1];
	}

And many would argue that this is a perfectly reasonable conversion.
However, this conversion breaks the data dependency and allows Thread
0's load from bigarr[] to be speculated, so that r2 might end up
containing pre-initialization garbage.  This is why the
consume.2014.02.16c.pdf document advises against attempting to carry
dependencies through relational operators and booleans (&& and ||) when
using current compilers (hmmm...  I need to state that advice more
strongly).  And again, this is one of the reasons for the
[[carries_dependency]] attribute in C11 -- to signal the compiler to be
careful in a given function.

Again, this example is fanciful.  It is intended to illustrate a data
dependency that could be broken given current compilers and hardware.
It is -not- intended as an example of good code for the Linux kernel,
quite the opposite, in fact.  That said, I would very much welcome a
more realistic example.
> >> So as far as we can see, either:
> >>
> >> 1) if you can accept the latter behaviour (if the Linux codebase does
> >>    not rely on its absence), the language definition should permit it,
> >>    and current compiler optimisations can be used,
> >>
> >> or
> >>
> >> 2) otherwise, the language definition should prohibit it but the
> >>    compiler would have to preserve dependencies even in compilation
> >>    units that have no mention of atomics.  It's unclear what the
> >>    (runtime and compiler development) cost of that would be in
> >>    practice - perhaps Torvald could comment?
> >
> > For current compilers, we have to rely on coding conventions within
> > the Linux kernel in combination with non-standard extensions to gcc
> > and specified compiler flags to disable undesirable behavior.  I have
> > a start on specifying this in a document I am preparing for the
> > standards committee, a very early draft of which may be found here:
> >
> > http://www2.rdrop.com/users/paulmck/scalability/paper/consume.2014.02.16c.pdf
> >
> > Section 3 shows the results of a manual scan through the Linux kernel's
> > dependency chains, and Section 4.1 lists a probably incomplete (and no
> > doubt erroneous) list of coding standards required to make dependency
> > chains work on current compilers.  Any comments and suggestions are
> > more than welcome!
>
> Thanks, that's very interesting (especially the non-local dependency chains).
>
> At first glance, the "4.1 Rules for C-Language RCU Users" seem
> pretty fragile - they're basically trying to guess the limits of
> compiler optimisation smartness.

Agreed, but that is the world we currently must live in, given pre-C11
compilers and the tepid implementations of memory_order_consume in the
current C11 implementations that I am aware of.  Since the Linux kernel
must live in this world for some time to come, I might as well document
the limitations, fragile though they might be.
> >> For more context, this example is taken from a summary of the thin-air
> >> problem by Mark Batty and myself,
> >> <www.cl.cam.ac.uk/~pes20/cpp/notes42.html>, and the problem with
> >> dependencies via other compilation units was AFAIK first pointed out
> >> by Hans Boehm.
> >
> > Nice document!
> >
> > One point of confusion for me...  Example 4 says "language must allow".
> > Shouldn't that be "language is permitted to allow"?  Seems like an
> > implementation is always within its rights to avoid an optimization if
> > its implementation prevents it from safely detecting the opportunity
> > for that optimization.  Or am I missing something here?
>
> We're saying that the language definition must allow it, not that any
> particular implementation must be able to exhibit it.

Ah, got it.  You had me worried there for a bit!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 12:12 Peter Sewell 2014-02-18 12:53 ` Peter Zijlstra 2014-02-18 14:56 ` Paul E. McKenney @ 2014-02-18 17:38 ` Linus Torvalds 2014-02-18 18:21 ` Peter Sewell 2014-02-18 20:43 ` Torvald Riegel 3 siblings, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-18 17:38 UTC (permalink / raw) To: Peter.Sewell Cc: mark.batty@cl.cam.ac.uk, Paul McKenney, Peter Zijlstra, Torvald Riegel, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, Linux Kernel Mailing List, Andrew Morton, Ingo Molnar, gcc On Tue, Feb 18, 2014 at 4:12 AM, Peter Sewell <Peter.Sewell@cl.cam.ac.uk> wrote: > > For example, suppose we have, in one compilation unit: > > void f(int ra, int*rb) { > if (ra==42) > *rb=42; > else > *rb=42; > } So this is a great example, and in general I really like your page at: > For more context, this example is taken from a summary of the thin-air > problem by Mark Batty and myself, > <www.cl.cam.ac.uk/~pes20/cpp/notes42.html>, and the problem with > dependencies via other compilation units was AFAIK first pointed out > by Hans Boehm. and the reason I like your page is that it really talks about the problem by pointing to the "unoptimized" code, and what hardware would do. As mentioned, I think that's actually the *correct* way to think about the problem space, because it allows the programmer to take hardware characteristics into account, without having to try to "describe" them at a source level. As to your example of if (ra) atomic_write(rb, A); else atomic_write(rb, B); I really think that it is ok to combine that into atomic_write(rb, ra ? A:B); (by virtue of "exact same behavior on actual hardware"), and then the only remaining question is whether the "ra?A:B" can be optimized to remove the conditional if A==B as in your example where both are "42". Agreed? 
Now, I would argue that the "naive" translation of that is unambiguous,
and since "ra" is not volatile or magic in any way, then "ra?42:42" can
obviously be optimized into just 42 - by the exact same rule that says
"the compiler can do any transformation that is equivalent in the
hardware".  The compiler can *locally* decide that that is the right
thing to do, and any programmer that complains about that decision is
just crazy.

So my "local machine behavior equivalency" rule means that that
function can be optimized into a single "store 42 atomically into rb".

Now, if it's *not* compiled locally, and is instead implemented as a
macro (or inline function), there are obviously situations where "ra ?
A : B" ends up having to do other things.  In particular, "ra" may be
volatile or an atomic read that has ordering semantics, and then that
expression doesn't become just "42", but that's a separate issue.  It's
not all that dissimilar to "function calls are sequence points", though,
and obviously if the source of "ra" has semantic meaning, you have to
honor that semantic meaning.

Agreed?

               Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 17:38 ` Linus Torvalds @ 2014-02-18 18:21 ` Peter Sewell 2014-02-18 18:49 ` Linus Torvalds 2014-02-18 20:46 ` Torvald Riegel 0 siblings, 2 replies; 285+ messages in thread From: Peter Sewell @ 2014-02-18 18:21 UTC (permalink / raw) To: Linus Torvalds Cc: mark.batty@cl.cam.ac.uk, Paul McKenney, Peter Zijlstra, Torvald Riegel, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, Linux Kernel Mailing List, Andrew Morton, Ingo Molnar, gcc On 18 February 2014 17:38, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Tue, Feb 18, 2014 at 4:12 AM, Peter Sewell <Peter.Sewell@cl.cam.ac.uk> wrote: >> >> For example, suppose we have, in one compilation unit: >> >> void f(int ra, int*rb) { >> if (ra==42) >> *rb=42; >> else >> *rb=42; >> } > > So this is a great example, and in general I really like your page at: > >> For more context, this example is taken from a summary of the thin-air >> problem by Mark Batty and myself, >> <www.cl.cam.ac.uk/~pes20/cpp/notes42.html>, and the problem with >> dependencies via other compilation units was AFAIK first pointed out >> by Hans Boehm. > > and the reason I like your page is that it really talks about the > problem by pointing to the "unoptimized" code, and what hardware would > do. Thanks. It's certainly necessary to separately understand what compiler optimisation and the hardware might do, to get anywhere here. But... > As mentioned, I think that's actually the *correct* way to think about > the problem space, because it allows the programmer to take hardware > characteristics into account, without having to try to "describe" them > at a source level. ...to be clear, I am ultimately after a decent source-level description of what programmers can depend on, and we (Mark and I) view that page as identifying constraints on what that description can say. 
There are too many compiler optimisations for people to reason directly in terms of the set of all transformations that they do, so we need some more concise and comprehensible envelope identifying what is allowed, as an interface between compiler writers and users. AIUI that's basically what Torvald is arguing. The C11 spec in its current form is not yet fully up to that task, for one thing because it doesn't attempt to cover all the h/w interactions that you and Paul list, but that is where we're trying to go with our formalisation work. > As to your example of > > if (ra) > atomic_write(rb, A); > else > atomic_write(rb, B); > > I really think that it is ok to combine that into > > atomic_write(rb, ra ? A:B); > > (by virtue of "exact same behavior on actual hardware"), and then the > only remaining question is whether the "ra?A:B" can be optimized to > remove the conditional if A==B as in your example where both are "42". > Agreed? y > Now, I would argue that the "naive" translation of that is > unambiguous, and since "ra" is not volatile or magic in any way, then > "ra?42:42" can obviously be optimized into just 42 - by the exact same > rule that says "the compiler can do any transformation that is > equivalent in the hardware". The compiler can *locally* decide that > that is the right thing to do, and any programmer that complains about > that decision is just crazy. > > So my "local machine behavior equivalency" rule means that that > function can be optimized into a single "store 42 atomically into rb". This is a bit more subtle, because (on ARM and POWER) removing the dependency and conditional branch is actually in general *not* equivalent in the hardware, in a concurrent context. That notwithstanding, I tend to agree that preventing that optimisation for non-atomics would be prohibitively costly (though I'd like real data). 
It's tempting then to permit more-or-less any optimisation for thread-local accesses but rule out value-range analysis and suchlike for shared-memory accesses. Whether that would be viable from a compiler point of view, I don't know. In C, one will at best only be able to get an approximate analysis of which is which, just for a start. > Now, if it's *not* compiled locally, and is instead implemented as a > macro (or inline function), there are obviously situations where "ra ? > A : B" ends up having to do other things. In particular, X may be > volatile or an atomic read that has ordering semantics, and then that > expression doesn't become just "42", but that's a separate issue. It's > not all that dissimilar to "function calls are sequence points", > though, and obviously if the source of "ra" has semantic meaning, you > have to honor that semantic meaning. separate point, indeed Peter ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 18:21 ` Peter Sewell @ 2014-02-18 18:49 ` Linus Torvalds 2014-02-18 19:47 ` Paul E. McKenney 2014-02-18 20:46 ` Torvald Riegel 1 sibling, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-18 18:49 UTC (permalink / raw) To: Peter.Sewell Cc: mark.batty@cl.cam.ac.uk, Paul McKenney, Peter Zijlstra, Torvald Riegel, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, Linux Kernel Mailing List, Andrew Morton, Ingo Molnar, gcc On Tue, Feb 18, 2014 at 10:21 AM, Peter Sewell <Peter.Sewell@cl.cam.ac.uk> wrote: > > This is a bit more subtle, because (on ARM and POWER) removing the > dependency and conditional branch is actually in general *not* equivalent > in the hardware, in a concurrent context. So I agree, but I think that's a generic issue with non-local memory ordering, and is not at all specific to the optimization wrt that "x?42:42" expression. If you have a value that you loaded with a non-relaxed load, and you pass that value off to a non-local function that you don't know what it does, in my opinion that implies that the compiler had better add the necessary serialization to say "whatever that other function does, we guarantee the semantics of the load". So on ppc, if you do a load with "consume" or "acquire" and then call another function without having had something in the caller that serializes the load, you'd better add the lwsync or whatever before the call. Exactly because the function call itself otherwise basically breaks the visibility into ordering. You've basically turned a load-with-ordering-guarantees into just an integer that you passed off to something that doesn't know about the ordering guarantees - and you need that "lwsync" in order to still guarantee the ordering. Tough titties. That's what a CPU with weak memory ordering semantics gets in order to have sufficient memory ordering. And I don't think it's actually a problem in practice. 
If you are doing loads with ordered semantics, you're not going to pass
the result off willy-nilly to random functions (or you really *do*
require the ordering, because the load that did the "acquire" was
actually for a lock!).

So I really think that the "local optimization" is correct regardless.

               Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 18:49 ` Linus Torvalds @ 2014-02-18 19:47 ` Paul E. McKenney 0 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-18 19:47 UTC (permalink / raw) To: Linus Torvalds Cc: Peter.Sewell, mark.batty@cl.cam.ac.uk, Peter Zijlstra, Torvald Riegel, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, Linux Kernel Mailing List, Andrew Morton, Ingo Molnar, gcc On Tue, Feb 18, 2014 at 10:49:27AM -0800, Linus Torvalds wrote: > On Tue, Feb 18, 2014 at 10:21 AM, Peter Sewell > <Peter.Sewell@cl.cam.ac.uk> wrote: > > > > This is a bit more subtle, because (on ARM and POWER) removing the > > dependency and conditional branch is actually in general *not* equivalent > > in the hardware, in a concurrent context. > > So I agree, but I think that's a generic issue with non-local memory > ordering, and is not at all specific to the optimization wrt that > "x?42:42" expression. > > If you have a value that you loaded with a non-relaxed load, and you > pass that value off to a non-local function that you don't know what > it does, in my opinion that implies that the compiler had better add > the necessary serialization to say "whatever that other function does, > we guarantee the semantics of the load". > > So on ppc, if you do a load with "consume" or "acquire" and then call > another function without having had something in the caller that > serializes the load, you'd better add the lwsync or whatever before > the call. Exactly because the function call itself otherwise basically > breaks the visibility into ordering. You've basically turned a > load-with-ordering-guarantees into just an integer that you passed off > to something that doesn't know about the ordering guarantees - and you > need that "lwsync" in order to still guarantee the ordering. > > Tough titties. That's what a CPU with weak memory ordering semantics > gets in order to have sufficient memory ordering. 
And that is in fact what C11 compilers are supposed to do if the function doesn't have the [[carries_dependency]] attribute on the corresponding argument or return of the non-local function. If the function is marked with [[carries_dependency]], then the compiler has the information needed in both compilations to make things work correctly. Thanx, Paul > And I don't think it's actually a problem in practice. If you are > doing loads with ordered semantics, you're not going to pass the > result off willy-nilly to random functions (or you really *do* require > the ordering, because the load that did the "acquire" was actually for > a lock! > > So I really think that the "local optimization" is correct regardless. > > Linus > ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 18:21 ` Peter Sewell 2014-02-18 18:49 ` Linus Torvalds @ 2014-02-18 20:46 ` Torvald Riegel 1 sibling, 0 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-18 20:46 UTC (permalink / raw) To: Peter.Sewell Cc: Linus Torvalds, mark.batty@cl.cam.ac.uk, Paul McKenney, Peter Zijlstra, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, Linux Kernel Mailing List, Andrew Morton, Ingo Molnar, gcc On Tue, 2014-02-18 at 18:21 +0000, Peter Sewell wrote: > On 18 February 2014 17:38, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > On Tue, Feb 18, 2014 at 4:12 AM, Peter Sewell <Peter.Sewell@cl.cam.ac.uk> wrote: > >> > >> For example, suppose we have, in one compilation unit: > >> > >> void f(int ra, int*rb) { > >> if (ra==42) > >> *rb=42; > >> else > >> *rb=42; > >> } > > > > So this is a great example, and in general I really like your page at: > > > >> For more context, this example is taken from a summary of the thin-air > >> problem by Mark Batty and myself, > >> <www.cl.cam.ac.uk/~pes20/cpp/notes42.html>, and the problem with > >> dependencies via other compilation units was AFAIK first pointed out > >> by Hans Boehm. > > > > and the reason I like your page is that it really talks about the > > problem by pointing to the "unoptimized" code, and what hardware would > > do. > > Thanks. It's certainly necessary to separately understand what compiler > optimisation and the hardware might do, to get anywhere here. But... > > > As mentioned, I think that's actually the *correct* way to think about > > the problem space, because it allows the programmer to take hardware > > characteristics into account, without having to try to "describe" them > > at a source level. > > ...to be clear, I am ultimately after a decent source-level description of what > programmers can depend on, and we (Mark and I) view that page as > identifying constraints on what that description can say. 
There are too
> many compiler optimisations for people to reason directly in terms of
> the set of all transformations that they do, so we need some more
> concise and comprehensible envelope identifying what is allowed,
> as an interface between compiler writers and users.  AIUI that's
> basically what Torvald is arguing.

Yes, that's one reason.  Another one is that if a programmer actually
wants to use atomics in a machine-independent / portable way, he/she
also does not want to reason about how all those transformations might
interact with the machine's memory model.

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 12:12 Peter Sewell ` (2 preceding siblings ...) 2014-02-18 17:38 ` Linus Torvalds @ 2014-02-18 20:43 ` Torvald Riegel 2014-02-18 21:29 ` Paul E. McKenney 2014-02-18 23:48 ` Peter Sewell 3 siblings, 2 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-18 20:43 UTC (permalink / raw) To: Peter.Sewell Cc: mark.batty@cl.cam.ac.uk, Paul McKenney, peterz, torvalds, Will Deacon, Ramana.Radhakrishnan, dhowells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, 2014-02-18 at 12:12 +0000, Peter Sewell wrote: > Several of you have said that the standard and compiler should not > permit speculative writes of atomics, or (effectively) that the > compiler should preserve dependencies. In simple examples it's easy > to see what that means, but in general it's not so clear what the > language should guarantee, because dependencies may go via non-atomic > code in other compilation units, and we have to consider the extent to > which it's desirable to limit optimisation there. [...] > 2) otherwise, the language definition should prohibit it but the > compiler would have to preserve dependencies even in compilation > units that have no mention of atomics. It's unclear what the > (runtime and compiler development) cost of that would be in > practice - perhaps Torvald could comment? If I'm reading the standard correctly, it requires that data dependencies are preserved through loads and stores, including nonatomic ones. That sounds convenient because it allows programmers to use temporary storage. However, what happens if a dependency "arrives" at a store for which the alias set isn't completely known? Then we either have to add a barrier to enforce the ordering at this point, or we have to assume that all other potentially aliasing memory locations would also have to start carrying dependencies (which might be in other functions in other compilation units). Neither option is good. 
The first might introduce
barriers in places in which they might not be required (or the
programmer has to use kill_dependency() quite often to avoid all these).
The second is bad because points-to analysis is hard, so in practice the
points-to set will not be precisely known for a lot of pointers.  So
this might not just creep into other functions via calls of
[[carries_dependency]] functions, but also through normal loads and
stores, likely prohibiting many optimizations.

Furthermore, the dependency tracking can currently only be
"disabled/enabled" on a function granularity (via
[[carries_dependency]]).  Thus, if we have big functions, then
dependency tracking may slow down a lot of code in the big function.  If
we have small functions, there's a lot of attributes to be added.

If a function may only carry a dependency but doesn't necessarily (eg,
depending on input parameters), then the programmer has to make a
trade-off whether he/she wants to benefit from mo_consume but slow down
other calls due to additional barriers (ie, when this function is called
from non-[[carries_dependency]] functions), or vice versa.  (IOW,
because of the function granularity, other code's performance is
affected.)

If a compiler wants to implement dependency tracking just for a few
constructs (e.g., operators -> + ...) and use barriers otherwise, then
this decision must be compatible with how all this is handled in other
compilation units.  Thus, compiler optimizations effectively become part
of the ABI, which doesn't seem right.

I hope these examples illustrate my concerns about the implementability
in practice of this.  It's also why I've suggested moving from an
opt-out approach as in the current standard (ie, with kill_dependency())
to an opt-in approach for conservative dependency tracking (e.g., with a
preserve_dependencies(exp) call, where exp will not be optimized in a
way that removes any dependencies).
This wouldn't help with many
optimizations being prevented, but it should at least help programmers
contain the problem to smaller regions of code.

I'm not aware of any implementation that tries to track dependencies, so
I can't give any real performance numbers.  This could perhaps be
simulated, but I'm not sure whether a realistic case could be made
without at least supporting [[carries_dependency]] properly in the
compiler, which would be some work.

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 20:43 ` Torvald Riegel @ 2014-02-18 21:29 ` Paul E. McKenney 0 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-18 21:29 UTC (permalink / raw) To: Torvald Riegel Cc: Peter.Sewell, mark.batty@cl.cam.ac.uk, peterz, torvalds, Will Deacon, Ramana.Radhakrishnan, dhowells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Feb 18, 2014 at 09:43:31PM +0100, Torvald Riegel wrote: > On Tue, 2014-02-18 at 12:12 +0000, Peter Sewell wrote: > > Several of you have said that the standard and compiler should not > > permit speculative writes of atomics, or (effectively) that the > > compiler should preserve dependencies. In simple examples it's easy > > to see what that means, but in general it's not so clear what the > > language should guarantee, because dependencies may go via non-atomic > > code in other compilation units, and we have to consider the extent to > > which it's desirable to limit optimisation there. > > [...] > > > 2) otherwise, the language definition should prohibit it but the > > compiler would have to preserve dependencies even in compilation > > units that have no mention of atomics. It's unclear what the > > (runtime and compiler development) cost of that would be in > > practice - perhaps Torvald could comment? > > If I'm reading the standard correctly, it requires that data > dependencies are preserved through loads and stores, including nonatomic > ones. That sounds convenient because it allows programmers to use > temporary storage. > > However, what happens if a dependency "arrives" at a store for which the > alias set isn't completely known? 
Then we either have to add a barrier > to enforce the ordering at this point, or we have to assume that all > other potentially aliasing memory locations would also have to start > carrying dependencies (which might be in other functions in other > compilation units). Neither option is good. The first might introduce > barriers in places in which they might not be required (or the > programmer has to use kill_dependency() quite often to avoid all these). > The second is bad because points-to analysis is hard, so in practice the > points-to set will not be precisely known for a lot of pointers. So > this might not just creep into other functions via calls of > [[carries_dependency]] functions, but also through normal loads and > stores, likely prohibiting many optimizations. I cannot immediately think of a situation where a store carrying a dependency into a non-trivially aliased object wouldn't be a usage error, so perhaps emitting a barrier and a diagnostic at that point is best. > Furthermore, the dependency tracking can currently only be > "disabled/enabled" on a function granularity (via > [[carries_dependency]]). Thus, if we have big functions, then > dependency tracking may slow down a lot of code in the big function. If > we have small functions, there's a lot of attributes to be added. > > If a function may only carry a dependency but doesn't necessarily (eg, > depending on input parameters), then the programmer has to make a > trade-off whether he/she want's to benefit from mo_consume but slow down > other calls due to additional barriers (ie, when this function is called > from non-[[carries_dependency]] functions), or vice versa. (IOW, > because of the function granularity, other code's performance is > affected.) > > If a compiler wants to implement dependency tracking just for a few > constructs (e.g., operators -> + ...) and use barriers otherwise, then > this decision must be compatible with how all this is handled in other > compilation units. 
Thus, compiler optimizations effectively become part > of the ABI, which doesn't seem right. > > I hope these examples illustrate my concerns about the implementability > in practice of this. It's also why I've suggested to move from an > opt-out approach as in the current standard (ie, with kill_dependency()) > to an opt-in approach for conservative dependency tracking (e.g., with a > preserve_dependencies(exp) call, where exp will not be optimized in a > way that removes any dependencies). This wouldn't help with many > optimizations being prevented, but it should at least help programmers > contain the problem to smaller regions of code. > > I'm not aware of any implementation that tries to track dependencies, so > I can't give any real performance numbers. This could perhaps be > simulated, but I'm not sure whether a realistic case would be made > without at least supporting [[carries_dependency]] properly in the > compiler, which would be some work. Another approach would be to use start-tracking/stop-tracking directives that could be buried into rcu_read_lock() and rcu_read_unlock(). There are issues with nesting and conditional use of rcu_read_lock() and rcu_read_unlock(), but it does give you nicer granularity properties. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 20:43 ` Torvald Riegel 2014-02-18 21:29 ` Paul E. McKenney @ 2014-02-18 23:48 ` Peter Sewell 2014-02-19 9:46 ` Torvald Riegel 1 sibling, 1 reply; 285+ messages in thread From: Peter Sewell @ 2014-02-18 23:48 UTC (permalink / raw) To: Torvald Riegel Cc: mark.batty@cl.cam.ac.uk, Paul McKenney, Peter Zijlstra, Linus Torvalds, Will Deacon, ramana.radhakrishnan, David Howells, linux-arch, Linux Kernel Mailing List, Andrew Morton, Ingo Molnar, gcc On 18 February 2014 20:43, Torvald Riegel <triegel@redhat.com> wrote: > On Tue, 2014-02-18 at 12:12 +0000, Peter Sewell wrote: >> Several of you have said that the standard and compiler should not >> permit speculative writes of atomics, or (effectively) that the >> compiler should preserve dependencies. In simple examples it's easy >> to see what that means, but in general it's not so clear what the >> language should guarantee, because dependencies may go via non-atomic >> code in other compilation units, and we have to consider the extent to >> which it's desirable to limit optimisation there. > > [...] > >> 2) otherwise, the language definition should prohibit it but the >> compiler would have to preserve dependencies even in compilation >> units that have no mention of atomics. It's unclear what the >> (runtime and compiler development) cost of that would be in >> practice - perhaps Torvald could comment? > > If I'm reading the standard correctly, it requires that data > dependencies are preserved through loads and stores, including nonatomic > ones. That sounds convenient because it allows programmers to use > temporary storage. The standard only needs this for consume chains, but if one wanted to get rid of thin-air values by requiring implementations to respect all (reads-from union dependency) cycles, AFAICS we'd need it pretty much everywhere. 
I don't myself think that's likely to be a realistic proposal, but it
does keep coming up, and it'd be very interesting to know the actual
cost on some credible workload.

> However, what happens if a dependency "arrives" at a store for which the
> alias set isn't completely known?  Then we either have to add a barrier
> to enforce the ordering at this point, or we have to assume that all
> other potentially aliasing memory locations would also have to start
> carrying dependencies (which might be in other functions in other
> compilation units).  Neither option is good.  The first might introduce
> barriers in places in which they might not be required (or the
> programmer has to use kill_dependency() quite often to avoid all these).
> The second is bad because points-to analysis is hard, so in practice the
> points-to set will not be precisely known for a lot of pointers.  So
> this might not just creep into other functions via calls of
> [[carries_dependency]] functions, but also through normal loads and
> stores, likely prohibiting many optimizations.
>
> Furthermore, the dependency tracking can currently only be
> "disabled/enabled" on a function granularity (via
> [[carries_dependency]]).  Thus, if we have big functions, then
> dependency tracking may slow down a lot of code in the big function.  If
> we have small functions, there's a lot of attributes to be added.
>
> If a function may only carry a dependency but doesn't necessarily (eg,
> depending on input parameters), then the programmer has to make a
> trade-off whether he/she wants to benefit from mo_consume but slow down
> other calls due to additional barriers (ie, when this function is called
> from non-[[carries_dependency]] functions), or vice versa.  (IOW,
> because of the function granularity, other code's performance is
> affected.)
>
> If a compiler wants to implement dependency tracking just for a few
> constructs (e.g., operators -> + ...)
and use barriers otherwise, then > this decision must be compatible with how all this is handled in other > compilation units. Thus, compiler optimizations effectively become part > of the ABI, which doesn't seem right. > > I hope these examples illustrate my concerns about the implementability > in practice of this. It's also why I've suggested to move from an > opt-out approach as in the current standard (ie, with kill_dependency()) > to an opt-in approach for conservative dependency tracking (e.g., with a > preserve_dependencies(exp) call, where exp will not be optimized in a > way that removes any dependencies). This wouldn't help with many > optimizations being prevented, but it should at least help programmers > contain the problem to smaller regions of code. > > I'm not aware of any implementation that tries to track dependencies, so > I can't give any real performance numbers. This could perhaps be > simulated, but I'm not sure whether a realistic case would be made > without at least supporting [[carries_dependency]] properly in the > compiler, which would be some work. > ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 23:48 ` Peter Sewell @ 2014-02-19 9:46 ` Torvald Riegel 0 siblings, 0 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-19 9:46 UTC (permalink / raw) To: Peter.Sewell Cc: mark.batty@cl.cam.ac.uk, Paul McKenney, Peter Zijlstra, Linus Torvalds, Will Deacon, ramana.radhakrishnan, David Howells, linux-arch, Linux Kernel Mailing List, Andrew Morton, Ingo Molnar, gcc On Tue, 2014-02-18 at 23:48 +0000, Peter Sewell wrote: > On 18 February 2014 20:43, Torvald Riegel <triegel@redhat.com> wrote: > > On Tue, 2014-02-18 at 12:12 +0000, Peter Sewell wrote: > >> Several of you have said that the standard and compiler should not > >> permit speculative writes of atomics, or (effectively) that the > >> compiler should preserve dependencies. In simple examples it's easy > >> to see what that means, but in general it's not so clear what the > >> language should guarantee, because dependencies may go via non-atomic > >> code in other compilation units, and we have to consider the extent to > >> which it's desirable to limit optimisation there. > > > > [...] > > > >> 2) otherwise, the language definition should prohibit it but the > >> compiler would have to preserve dependencies even in compilation > >> units that have no mention of atomics. It's unclear what the > >> (runtime and compiler development) cost of that would be in > >> practice - perhaps Torvald could comment? > > > > If I'm reading the standard correctly, it requires that data > > dependencies are preserved through loads and stores, including nonatomic > > ones. That sounds convenient because it allows programmers to use > > temporary storage. > > The standard only needs this for consume chains, That's right, and the runtime cost / implementation problems of mo_consume was what I was making statements about. Sorry if that wasn't clear. ^ permalink raw reply [flat|nested] 285+ messages in thread
* [RFC][PATCH 0/5] arch: atomic rework
@ 2014-02-06 13:48 Peter Zijlstra
  2014-02-06 18:25 ` David Howells
       [not found] ` <52F93B7C.2090304@tilera.com>
  0 siblings, 2 replies; 285+ messages in thread
From: Peter Zijlstra @ 2014-02-06 13:48 UTC (permalink / raw)
To: linux-arch, linux-kernel
Cc: torvalds, akpm, mingo, will.deacon, paulmck, Peter Zijlstra

Hi all,

A few too large patches here, mostly as RFC to see if we want to
continue with this before I sink more time into it.  I hope they make it
out to the lists.

This all started with me wanting to implement atomic_sub_release() for
all archs, but I got side-tracked a bit and it ended up cleaning up bits
and deleting almost 1400 lines of code.

It's been compiled on everything I have a compiler for; however, frv and
tile are missing because they're special and I was tired.

---
 Documentation/atomic_ops.txt | 31 -
 Documentation/memory-barriers.txt | 44 -
 a/arch/arc/include/asm/barrier.h | 37 -
 a/arch/hexagon/include/asm/barrier.h | 37 -
 arch/alpha/include/asm/atomic.h | 225 +++-----
 arch/alpha/include/asm/bitops.h | 3
 arch/arc/include/asm/atomic.h | 198 ++-----
 arch/arc/include/asm/bitops.h | 5
 arch/arm/include/asm/atomic.h | 301 ++++-------
 arch/arm/include/asm/barrier.h | 3
 arch/arm/include/asm/bitops.h | 4
 arch/arm64/include/asm/atomic.h | 212 +++-----
 arch/arm64/include/asm/barrier.h | 3
 arch/arm64/include/asm/bitops.h | 9
 arch/avr32/include/asm/atomic.h | 95 +--
 arch/avr32/include/asm/bitops.h | 9
 arch/blackfin/include/asm/atomic.h | 5
 arch/blackfin/include/asm/barrier.h | 3
 arch/blackfin/include/asm/bitops.h | 14
 arch/blackfin/mach-common/smp.c | 2
 arch/c6x/include/asm/bitops.h | 8
 arch/cris/include/arch-v10/arch/system.h | 2
 arch/cris/include/asm/atomic.h | 66 +-
 arch/cris/include/asm/bitops.h | 9
 arch/frv/include/asm/atomic.h | 7
 arch/frv/include/asm/bitops.h | 6
 arch/hexagon/include/asm/atomic.h | 74 +-
 arch/hexagon/include/asm/bitops.h | 4
 arch/ia64/include/asm/atomic.h | 212 ++++--
 arch/ia64/include/asm/barrier.h | 3
 arch/ia64/include/asm/bitops.h | 9
 arch/ia64/include/uapi/asm/cmpxchg.h | 9
 arch/m32r/include/asm/atomic.h | 191 ++-----
 arch/m32r/include/asm/bitops.h | 6
 arch/m32r/kernel/smp.c | 4
 arch/m68k/include/asm/atomic.h | 130 +----
 arch/m68k/include/asm/bitops.h | 7
 arch/metag/include/asm/atomic.h | 6
 arch/metag/include/asm/atomic_lnkget.h | 159 +----
 arch/metag/include/asm/atomic_lock1.h | 100 +--
 arch/metag/include/asm/barrier.h | 3
 arch/metag/include/asm/bitops.h | 6
 arch/mips/include/asm/atomic.h | 570 +++++++---------
 arch/mips/include/asm/barrier.h | 3
 arch/mips/include/asm/bitops.h | 11
 arch/mips/kernel/irq.c | 4
 arch/mn10300/include/asm/atomic.h | 199 +------
 arch/mn10300/include/asm/bitops.h | 4
 arch/mn10300/mm/tlb-smp.c | 6
 arch/openrisc/include/asm/bitops.h | 9
 arch/parisc/include/asm/atomic.h | 121 ++--
 arch/parisc/include/asm/bitops.h | 4
 arch/powerpc/include/asm/atomic.h | 214 +++----
 arch/powerpc/include/asm/barrier.h | 3
 arch/powerpc/include/asm/bitops.h | 6
 arch/powerpc/kernel/crash.c | 2
 arch/powerpc/kernel/misc_32.S | 19
 arch/s390/include/asm/atomic.h | 93 ++-
 arch/s390/include/asm/barrier.h | 5
 arch/s390/include/asm/bitops.h | 1
 arch/s390/kernel/time.c | 4
 arch/s390/kvm/diag.c | 2
 arch/s390/kvm/intercept.c | 2
 arch/s390/kvm/interrupt.c | 16
 arch/s390/kvm/kvm-s390.c | 14
 arch/s390/kvm/sigp.c | 6
 arch/score/include/asm/bitops.h | 7
 arch/sh/include/asm/atomic-grb.h | 164 +----
 arch/sh/include/asm/atomic-irq.h | 88 +--
 arch/sh/include/asm/atomic-llsc.h | 135 +---
 arch/sh/include/asm/atomic.h | 6
 arch/sh/include/asm/bitops.h | 7
 arch/sparc/include/asm/atomic_32.h | 30 -
 arch/sparc/include/asm/atomic_64.h | 53 --
 arch/sparc/include/asm/barrier_32.h | 1
 arch/sparc/include/asm/barrier_64.h | 3
 arch/sparc/include/asm/bitops_32.h | 4
 arch/sparc/include/asm/bitops_64.h | 4
 arch/sparc/include/asm/processor.h | 2
 arch/sparc/kernel/smp_64.c | 2
 arch/sparc/lib/atomic32.c | 28 -
 arch/sparc/lib/atomic_64.S | 167 ++----
 arch/sparc/lib/ksyms.c | 20
 arch/tile/include/asm/atomic_32.h | 10
 arch/tile/include/asm/atomic_64.h | 6
 arch/tile/include/asm/barrier.h | 14
 arch/tile/include/asm/bitops.h | 1
 arch/tile/include/asm/bitops_32.h | 8
 arch/tile/include/asm/bitops_64.h | 4
 arch/x86/include/asm/atomic.h | 39 -
 arch/x86/include/asm/atomic64_32.h | 20
 arch/x86/include/asm/atomic64_64.h | 22
 arch/x86/include/asm/barrier.h | 4
 arch/x86/include/asm/bitops.h | 6
 arch/x86/include/asm/sync_bitops.h | 2
 arch/x86/kernel/apic/hw_nmi.c | 2
 arch/xtensa/include/asm/atomic.h | 318 +++---------
 arch/xtensa/include/asm/bitops.h | 4
 block/blk-iopoll.c | 4
 crypto/chainiv.c | 2
 drivers/base/power/domain.c | 2
 drivers/block/mtip32xx/mtip32xx.c | 4
 drivers/cpuidle/coupled.c | 2
 drivers/firewire/ohci.c | 2
 drivers/gpu/drm/drm_irq.c | 10
 drivers/gpu/drm/i915/i915_irq.c | 6
 drivers/md/bcache/bcache.h | 2
 drivers/md/bcache/closure.h | 2
 drivers/md/dm-bufio.c | 8
 drivers/md/dm-snap.c | 4
 drivers/md/dm.c | 2
 drivers/md/raid5.c | 2
 drivers/media/usb/dvb-usb-v2/dvb_usb_core.c | 6
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 6
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 34 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sp.c | 26 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_sriov.c | 12
 drivers/net/ethernet/broadcom/cnic.c | 8
 drivers/net/ethernet/brocade/bna/bnad.c | 6
 drivers/net/ethernet/chelsio/cxgb/cxgb2.c | 2
 drivers/net/ethernet/chelsio/cxgb3/sge.c | 6
 drivers/net/ethernet/chelsio/cxgb4/sge.c | 2
 drivers/net/ethernet/chelsio/cxgb4vf/sge.c | 2
 drivers/net/ethernet/intel/i40e/i40e_main.c | 2
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 4
 drivers/net/wireless/ti/wlcore/main.c | 2
 drivers/pci/xen-pcifront.c | 4
 drivers/s390/scsi/zfcp_aux.c | 2
 drivers/s390/scsi/zfcp_erp.c | 68 +-
 drivers/s390/scsi/zfcp_fc.c | 8
 drivers/s390/scsi/zfcp_fsf.c | 30 -
 drivers/s390/scsi/zfcp_qdio.c | 14
 drivers/scsi/isci/remote_device.c | 2
 drivers/target/loopback/tcm_loop.c | 4
 drivers/target/target_core_alua.c | 26 -
 drivers/target/target_core_device.c | 6
 drivers/target/target_core_iblock.c | 2
 drivers/target/target_core_pr.c | 56 +-
 drivers/target/target_core_transport.c | 16
 drivers/target/target_core_ua.c | 10
 drivers/tty/n_tty.c | 2
 drivers/tty/serial/mxs-auart.c | 4
 drivers/usb/gadget/tcm_usb_gadget.c | 4
 drivers/usb/serial/usb_wwan.c | 2
 drivers/vhost/scsi.c | 2
 drivers/w1/w1_family.c | 4
 drivers/xen/xen-pciback/pciback_ops.c | 4
 fs/btrfs/btrfs_inode.h | 2
 fs/btrfs/extent_io.c | 2
 fs/btrfs/inode.c | 6
 fs/buffer.c | 2
 fs/ext4/resize.c | 2
 fs/gfs2/glock.c | 8
 fs/gfs2/glops.c | 2
 fs/gfs2/lock_dlm.c | 4
 fs/gfs2/recovery.c | 2
 fs/gfs2/sys.c | 4
 fs/jbd2/commit.c | 6
 fs/nfs/dir.c | 12
 fs/nfs/inode.c | 2
 fs/nfs/nfs4filelayoutdev.c | 4
 fs/nfs/nfs4state.c | 4
 fs/nfs/pagelist.c | 6
 fs/nfs/pnfs.c | 2
 fs/nfs/pnfs.h | 2
 fs/nfs/write.c | 4
 fs/ubifs/lpt_commit.c | 4
 fs/ubifs/tnc_commit.c | 4
 include/asm-generic/atomic.h | 176 ++----
 include/asm-generic/atomic64.h | 17
 include/asm-generic/barrier.h | 8
 include/asm-generic/bitops.h | 8
 include/asm-generic/bitops/atomic.h | 2
 include/asm-generic/bitops/lock.h | 2
 include/linux/atomic.h | 13
 include/linux/buffer_head.h | 2
 include/linux/genhd.h | 2
 include/linux/interrupt.h | 8
 include/linux/netdevice.h | 2
 include/linux/sched.h | 6
 include/linux/sunrpc/sched.h | 8
 include/linux/sunrpc/xprt.h | 8
 include/linux/tracehook.h | 2
 include/net/ip_vs.h | 4
 kernel/debug/debug_core.c | 4
 kernel/futex.c | 2
 kernel/kmod.c | 2
 kernel/rcu/tree.c | 22
 kernel/rcu/tree_plugin.h | 8
 kernel/sched/cpupri.c | 6
 kernel/sched/wait.c | 2
 lib/atomic64.c | 77 +-
 mm/backing-dev.c | 2
 mm/filemap.c | 4
 net/atm/pppoatm.c | 2
 net/bluetooth/hci_event.c | 4
 net/core/dev.c | 8
 net/core/link_watch.c | 2
 net/ipv4/inetpeer.c | 2
 net/netfilter/nf_conntrack_core.c | 2
 net/rds/ib_recv.c | 4
 net/rds/iw_recv.c | 4
 net/rds/send.c | 6
 net/rds/tcp_send.c | 2
 net/sunrpc/auth.c | 2
 net/sunrpc/auth_gss/auth_gss.c | 2
 net/sunrpc/backchannel_rqst.c | 4
 net/sunrpc/xprt.c | 4
 net/sunrpc/xprtsock.c | 16
 net/unix/af_unix.c | 2
 sound/pci/bt87x.c | 4
 211 files changed, 2188 insertions(+), 3563 deletions(-)

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 13:48 Peter Zijlstra
@ 2014-02-06 18:25 ` David Howells
  2014-02-06 18:30 ` Peter Zijlstra
  ` (3 more replies)
       [not found] ` <52F93B7C.2090304@tilera.com>
  1 sibling, 4 replies; 285+ messages in thread
From: David Howells @ 2014-02-06 18:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: dhowells, linux-arch, linux-kernel, torvalds, akpm, mingo,
    will.deacon, paulmck, ramana.radhakrishnan

Is it worth considering a move towards using C11 atomics and barriers
and compiler intrinsics inside the kernel?  The compiler _ought_ to be
able to do these.

One thing I'm not sure of, though, is how well gcc's atomics will cope
with interrupt handlers touching atomics on CPUs without suitable atomic
instructions - that said, userspace does have to deal with signals
getting underfoot, but then userspace can't normally disable interrupts.

David

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-06 18:25 ` David Howells @ 2014-02-06 18:30 ` Peter Zijlstra 2014-02-06 18:42 ` Paul E. McKenney ` (2 subsequent siblings) 3 siblings, 0 replies; 285+ messages in thread From: Peter Zijlstra @ 2014-02-06 18:30 UTC (permalink / raw) To: David Howells Cc: linux-arch, linux-kernel, torvalds, akpm, mingo, will.deacon, paulmck, ramana.radhakrishnan On Thu, Feb 06, 2014 at 06:25:49PM +0000, David Howells wrote: > > Is it worth considering a move towards using C11 atomics and barriers and > compiler intrinsics inside the kernel? The compiler _ought_ to be able to do > these. > > One thing I'm not sure of, though, is how well gcc's atomics will cope with > interrupt handlers touching atomics on CPUs without suitable atomic > instructions - that said, userspace does have to deal with signals getting > underfoot. but then userspace can't normally disable interrupts. I can do an asm-generic/atomic_c11.h if people want. ^ permalink raw reply [flat|nested] 285+ messages in thread
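A generic header along those lines could be sketched on top of the GCC
`__atomic` builtins. The following is illustrative only (a simplified
`atomic_t` and a handful of ops, not Peter's actual patch), and it
deliberately leaves open the thread's main question of whether the C11
orders are strong enough on every architecture:

```c
/* Hypothetical asm-generic/atomic_c11.h sketch -- illustrative only.
 * Maps a few Linux-style atomic ops onto the GCC __atomic builtins.
 */
typedef struct { int counter; } atomic_t;

static inline int atomic_read(const atomic_t *v)
{
	return __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
}

static inline void atomic_add(int i, atomic_t *v)
{
	/* Void-returning arithmetic ops have no ordering requirement. */
	__atomic_fetch_add(&v->counter, i, __ATOMIC_RELAXED);
}

static inline int atomic_add_return(int i, atomic_t *v)
{
	/* Value-returning ops must be fully ordered; whether
	 * __ATOMIC_SEQ_CST is strong enough on every architecture is
	 * exactly what gets debated later in this thread. */
	return __atomic_add_fetch(&v->counter, i, __ATOMIC_SEQ_CST);
}
```

Note that `__atomic_fetch_add` returns the old value and
`__atomic_add_fetch` the new one, mirroring the kernel's distinction
between `atomic_add()` and `atomic_add_return()`.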
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-06 18:25 ` David Howells 2014-02-06 18:30 ` Peter Zijlstra @ 2014-02-06 18:42 ` Paul E. McKenney 2014-02-06 18:55 ` Ramana Radhakrishnan 2014-02-06 19:21 ` Linus Torvalds 3 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-06 18:42 UTC (permalink / raw) To: David Howells Cc: Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, will.deacon, ramana.radhakrishnan On Thu, Feb 06, 2014 at 06:25:49PM +0000, David Howells wrote: > > Is it worth considering a move towards using C11 atomics and barriers and > compiler intrinsics inside the kernel? The compiler _ought_ to be able to do > these. Makes sense to me! > One thing I'm not sure of, though, is how well gcc's atomics will cope with > interrupt handlers touching atomics on CPUs without suitable atomic > instructions - that said, userspace does have to deal with signals getting > underfoot. but then userspace can't normally disable interrupts. Perhaps make the C11 definitions so that any arch can override any specific definition? Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-06 18:25 ` David Howells 2014-02-06 18:30 ` Peter Zijlstra 2014-02-06 18:42 ` Paul E. McKenney @ 2014-02-06 18:55 ` Ramana Radhakrishnan 2014-02-06 18:59 ` Will Deacon 2014-02-06 19:21 ` Linus Torvalds 3 siblings, 1 reply; 285+ messages in thread From: Ramana Radhakrishnan @ 2014-02-06 18:55 UTC (permalink / raw) To: David Howells Cc: Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, Will Deacon, paulmck, gcc On 02/06/14 18:25, David Howells wrote: > > Is it worth considering a move towards using C11 atomics and barriers and > compiler intrinsics inside the kernel? The compiler _ought_ to be able to do > these. It sounds interesting to me, if we can make it work properly and reliably. + gcc@gcc.gnu.org for others in the GCC community to chip in. > > One thing I'm not sure of, though, is how well gcc's atomics will cope with > interrupt handlers touching atomics on CPUs without suitable atomic > instructions - that said, userspace does have to deal with signals getting > underfoot. but then userspace can't normally disable interrupts. > > David > Ramana ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 18:55 ` Ramana Radhakrishnan
@ 2014-02-06 18:59 ` Will Deacon
  2014-02-06 19:27 ` Paul E. McKenney
  2014-02-06 21:09 ` Torvald Riegel
  0 siblings, 2 replies; 285+ messages in thread
From: Will Deacon @ 2014-02-06 18:59 UTC (permalink / raw)
To: Ramana Radhakrishnan
Cc: David Howells, Peter Zijlstra, linux-arch, linux-kernel, torvalds,
    akpm, mingo, paulmck, gcc

On Thu, Feb 06, 2014 at 06:55:01PM +0000, Ramana Radhakrishnan wrote:
> On 02/06/14 18:25, David Howells wrote:
> >
> > Is it worth considering a move towards using C11 atomics and barriers and
> > compiler intrinsics inside the kernel?  The compiler _ought_ to be able to do
> > these.
>
> It sounds interesting to me, if we can make it work properly and
> reliably. + gcc@gcc.gnu.org for others in the GCC community to chip in.

Given my (albeit limited) experience playing with the C11 spec and GCC, I
really think this is a bad idea for the kernel.  It seems that nobody really
agrees on exactly how the C11 atomics map to real architectural
instructions on anything but the trivial architectures.  For example, should
the following code fire the assert?


	extern atomic<int> foo, bar, baz;

	void thread1(void)
	{
		foo.store(42, memory_order_relaxed);
		bar.fetch_add(1, memory_order_seq_cst);
		baz.store(42, memory_order_relaxed);
	}

	void thread2(void)
	{
		while (baz.load(memory_order_seq_cst) != 42) {
			/* do nothing */
		}

		assert(foo.load(memory_order_seq_cst) == 42);
	}


To answer that question, you need to go and look at the definitions of
synchronises-with, happens-before, dependency_ordered_before and a whole
pile of vaguely written waffle to realise that you don't know.  Certainly,
the code that arm64 GCC currently spits out would allow the assertion to
fire on some microarchitectures.

There are also so many ways to blow your head off it's untrue.
For example, cmpxchg takes a separate memory model parameter for failure
and success, but then there are restrictions on the sets you can use for
each.  It's not hard to find well-known memory-ordering experts shouting
"Just use memory_model_seq_cst for everything, it's too hard otherwise".
Then there's the fun of load-consume vs load-acquire (arm64 GCC
completely ignores consume atm and optimises all of the data
dependencies away) as well as the definition of "data races", which seem
to be used as an excuse to miscompile a program at the earliest
opportunity.

Trying to introduce system concepts (writes to devices, interrupts,
non-coherent agents) into this mess is going to be an uphill battle
IMHO.  I'd just rather stick to the semantics we have and the asm
volatile barriers.

That's not to say there's no room for improvement in what we have in the
kernel.  Certainly, I'd welcome allowing more relaxed operations on
architectures that support them, but it needs to be something that at
least the different architecture maintainers can understand how to
implement efficiently behind an uncomplicated interface.  I don't think
that interface is C11.

Just my thoughts on the matter...

Will

^ permalink raw reply	[flat|nested] 285+ messages in thread
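Will's cmpxchg point can be made concrete: C11's
`atomic_compare_exchange_strong_explicit()` takes one memory order for
the success case and a second one for failure, and the failure order is
restricted (it may not be `memory_order_release` or
`memory_order_acq_rel`, nor stronger than the success order). A minimal
sketch of the two-order form:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrates the two-memory-order form of cmpxchg discussed above. */
static bool try_claim(atomic_int *lock)
{
	int expected = 0;
	/* Success: acquire ordering -- we now "own" the lock.
	 * Failure: relaxed is fine -- we learned nothing that needs
	 * ordering.  Passing, say, memory_order_release as the failure
	 * order is one of the restrictions Will mentions. */
	return atomic_compare_exchange_strong_explicit(lock, &expected, 1,
						       memory_order_acquire,
						       memory_order_relaxed);
}
```

On failure, the CAS also writes the observed value back into `expected`,
which is another frequent source of surprise with this API.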
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-06 18:59 ` Will Deacon @ 2014-02-06 19:27 ` Paul E. McKenney 2014-02-06 21:17 ` Torvald Riegel 2014-02-06 21:09 ` Torvald Riegel 1 sibling, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-06 19:27 UTC (permalink / raw) To: Will Deacon Cc: Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote: > On Thu, Feb 06, 2014 at 06:55:01PM +0000, Ramana Radhakrishnan wrote: > > On 02/06/14 18:25, David Howells wrote: > > > > > > Is it worth considering a move towards using C11 atomics and barriers and > > > compiler intrinsics inside the kernel? The compiler _ought_ to be able to do > > > these. > > > > > > It sounds interesting to me, if we can make it work properly and > > reliably. + gcc@gcc.gnu.org for others in the GCC community to chip in. > > Given my (albeit limited) experience playing with the C11 spec and GCC, I > really think this is a bad idea for the kernel. It seems that nobody really > agrees on exactly how the C11 atomics map to real architectural > instructions on anything but the trivial architectures. For example, should > the following code fire the assert? > > > extern atomic<int> foo, bar, baz; > > void thread1(void) > { > foo.store(42, memory_order_relaxed); > bar.fetch_add(1, memory_order_seq_cst); > baz.store(42, memory_order_relaxed); > } > > void thread2(void) > { > while (baz.load(memory_order_seq_cst) != 42) { > /* do nothing */ > } > > assert(foo.load(memory_order_seq_cst) == 42); > } > > > To answer that question, you need to go and look at the definitions of > synchronises-with, happens-before, dependency_ordered_before and a whole > pile of vaguely written waffle to realise that you don't know. Certainly, > the code that arm64 GCC currently spits out would allow the assertion to fire > on some microarchitectures. Yep! 
I believe that a memory_order_seq_cst fence in combination with the fetch_add() would do the trick on many architectures, however. All of this is one reason that any C11 definitions need to be individually overridable by individual architectures. > There are also so many ways to blow your head off it's untrue. For example, > cmpxchg takes a separate memory model parameter for failure and success, but > then there are restrictions on the sets you can use for each. It's not hard > to find well-known memory-ordering experts shouting "Just use > memory_model_seq_cst for everything, it's too hard otherwise". Then there's > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume > atm and optimises all of the data dependencies away) as well as the definition > of "data races", which seem to be used as an excuse to miscompile a program > at the earliest opportunity. Trust me, rcu_dereference() is not going to be defined in terms of memory_order_consume until the compilers implement it both correctly and efficiently. They are not there yet, and there is currently no shortage of compiler writers who would prefer to ignore memory_order_consume. And rcu_dereference() will need per-arch overrides for some time during any transition to memory_order_consume. > Trying to introduce system concepts (writes to devices, interrupts, > non-coherent agents) into this mess is going to be an uphill battle IMHO. I'd > just rather stick to the semantics we have and the asm volatile barriers. And barrier() isn't going to go away any time soon, either. And ACCESS_ONCE() needs to keep volatile semantics until there is some memory_order_whatever that prevents loads and stores from being coalesced. > That's not to say I don't there's no room for improvement in what we have > in the kernel. 
Certainly, I'd welcome allowing more relaxed operations on > architectures that support them, but it needs to be something that at least > the different architecture maintainers can understand how to implement > efficiently behind an uncomplicated interface. I don't think that interface is > C11. > > Just my thoughts on the matter... C11 does not provide a good interface for the Linux kernel, nor was it intended to do so. It might provide good implementations for some of the atomic ops for some architectures. This could reduce the amount of assembly written for new architectures, and could potentially allow the compiler to do a better job of optimizing (scary thought!). But for this to work, that architecture's Linux-kernel maintainer and gcc maintainer would need to be working together. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
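The per-operation override Paul describes maps naturally onto the
`#ifndef` idiom already used in asm-generic headers. A hedged sketch
(illustrative names, plain `int` in place of the kernel's `atomic_t`):
the architecture defines in assembly only what it wants, and the generic
header fills in C11-based fallbacks for the rest:

```c
/* Hypothetical arch header: provide only what the arch wants to
 * hand-code.  Here the "asm" version is faked with a builtin. */
#define atomic_add_return atomic_add_return
static inline int atomic_add_return(int i, int *v)
{
	/* Stand-in for an arch-specific fully ordered implementation. */
	return __atomic_add_fetch(v, i, __ATOMIC_SEQ_CST);
}

/* Hypothetical generic header: C11-based fallback for anything the
 * arch did not override. */
#ifndef atomic_inc
static inline void atomic_inc(int *v)
{
	/* atomic_inc() has no ordering requirement, so relaxed suffices,
	 * per Paul's fetch_add(1, memory_order_relaxed) example. */
	__atomic_fetch_add(v, 1, __ATOMIC_RELAXED);
}
#endif
```

The `#define foo foo` trick is how existing asm-generic headers detect
that an architecture has supplied its own version of an operation.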
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-06 19:27 ` Paul E. McKenney @ 2014-02-06 21:17 ` Torvald Riegel 2014-02-06 22:11 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-06 21:17 UTC (permalink / raw) To: paulmck Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote: > On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote: > > On Thu, Feb 06, 2014 at 06:55:01PM +0000, Ramana Radhakrishnan wrote: > > > On 02/06/14 18:25, David Howells wrote: > > > > > > > > Is it worth considering a move towards using C11 atomics and barriers and > > > > compiler intrinsics inside the kernel? The compiler _ought_ to be able to do > > > > these. > > > > > > > > > It sounds interesting to me, if we can make it work properly and > > > reliably. + gcc@gcc.gnu.org for others in the GCC community to chip in. > > > > Given my (albeit limited) experience playing with the C11 spec and GCC, I > > really think this is a bad idea for the kernel. It seems that nobody really > > agrees on exactly how the C11 atomics map to real architectural > > instructions on anything but the trivial architectures. For example, should > > the following code fire the assert? > > > > > > extern atomic<int> foo, bar, baz; > > > > void thread1(void) > > { > > foo.store(42, memory_order_relaxed); > > bar.fetch_add(1, memory_order_seq_cst); > > baz.store(42, memory_order_relaxed); > > } > > > > void thread2(void) > > { > > while (baz.load(memory_order_seq_cst) != 42) { > > /* do nothing */ > > } > > > > assert(foo.load(memory_order_seq_cst) == 42); > > } > > > > > > To answer that question, you need to go and look at the definitions of > > synchronises-with, happens-before, dependency_ordered_before and a whole > > pile of vaguely written waffle to realise that you don't know. 
Certainly, > > the code that arm64 GCC currently spits out would allow the assertion to fire > > on some microarchitectures. > > Yep! I believe that a memory_order_seq_cst fence in combination with the > fetch_add() would do the trick on many architectures, however. All of > this is one reason that any C11 definitions need to be individually > overridable by individual architectures. "Overridable" in which sense? Do you want to change the semantics on the language level in the sense of altering the memory model, or rather use a different implementation under the hood to, for example, fix deficiencies in the compilers? > > There are also so many ways to blow your head off it's untrue. For example, > > cmpxchg takes a separate memory model parameter for failure and success, but > > then there are restrictions on the sets you can use for each. It's not hard > > to find well-known memory-ordering experts shouting "Just use > > memory_model_seq_cst for everything, it's too hard otherwise". Then there's > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume > > atm and optimises all of the data dependencies away) as well as the definition > > of "data races", which seem to be used as an excuse to miscompile a program > > at the earliest opportunity. > > Trust me, rcu_dereference() is not going to be defined in terms of > memory_order_consume until the compilers implement it both correctly and > efficiently. They are not there yet, and there is currently no shortage > of compiler writers who would prefer to ignore memory_order_consume. Do you have any input on http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448? In particular, the language standard's definition of dependencies? > And rcu_dereference() will need per-arch overrides for some time during > any transition to memory_order_consume. > > > Trying to introduce system concepts (writes to devices, interrupts, > > non-coherent agents) into this mess is going to be an uphill battle IMHO. 
I'd > > just rather stick to the semantics we have and the asm volatile barriers. > > And barrier() isn't going to go away any time soon, either. And > ACCESS_ONCE() needs to keep volatile semantics until there is some > memory_order_whatever that prevents loads and stores from being coalesced. I'd be happy to discuss something like this in ISO C++ SG1 (or has this been discussed in the past already?). But it needs to have a paper I suppose. Will you be in Issaquah for the C++ meeting next week? ^ permalink raw reply [flat|nested] 285+ messages in thread
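For readers following the consume-versus-acquire argument, the API shape
under discussion looks like the sketch below. It is single-threaded, so
it only demonstrates the interface; the ordering question is inherently
cross-thread, and (as Will notes for arm64) current GCC simply promotes
`memory_order_consume` to acquire rather than tracking dependencies:

```c
#include <stdatomic.h>
#include <stddef.h>

struct obj { int data; };
static _Atomic(struct obj *) gp;	/* the published pointer */

static void publish(struct obj *p)
{
	atomic_store_explicit(&gp, p, memory_order_release);
}

static int reader(void)
{
	/* The dependency from the loaded pointer to the dereference is
	 * what memory_order_consume is supposed to order cheaply (no
	 * barrier needed on ARM/Power for address dependencies). */
	struct obj *p = atomic_load_explicit(&gp, memory_order_consume);
	return p ? p->data : -1;
}
```

This is the same publish/subscribe shape as the kernel's
rcu_assign_pointer()/rcu_dereference() pair, which is why Paul cares
whether compilers ever implement consume efficiently.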
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-06 21:17 ` Torvald Riegel @ 2014-02-06 22:11 ` Paul E. McKenney 2014-02-06 23:44 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-06 22:11 UTC (permalink / raw) To: Torvald Riegel Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote: > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote: > > On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote: > > > On Thu, Feb 06, 2014 at 06:55:01PM +0000, Ramana Radhakrishnan wrote: > > > > On 02/06/14 18:25, David Howells wrote: > > > > > > > > > > Is it worth considering a move towards using C11 atomics and barriers and > > > > > compiler intrinsics inside the kernel? The compiler _ought_ to be able to do > > > > > these. > > > > > > > > > > > > It sounds interesting to me, if we can make it work properly and > > > > reliably. + gcc@gcc.gnu.org for others in the GCC community to chip in. > > > > > > Given my (albeit limited) experience playing with the C11 spec and GCC, I > > > really think this is a bad idea for the kernel. It seems that nobody really > > > agrees on exactly how the C11 atomics map to real architectural > > > instructions on anything but the trivial architectures. For example, should > > > the following code fire the assert? 
> > > > > > > > > extern atomic<int> foo, bar, baz; > > > > > > void thread1(void) > > > { > > > foo.store(42, memory_order_relaxed); > > > bar.fetch_add(1, memory_order_seq_cst); > > > baz.store(42, memory_order_relaxed); > > > } > > > > > > void thread2(void) > > > { > > > while (baz.load(memory_order_seq_cst) != 42) { > > > /* do nothing */ > > > } > > > > > > assert(foo.load(memory_order_seq_cst) == 42); > > > } > > > > > > > > > To answer that question, you need to go and look at the definitions of > > > synchronises-with, happens-before, dependency_ordered_before and a whole > > > pile of vaguely written waffle to realise that you don't know. Certainly, > > > the code that arm64 GCC currently spits out would allow the assertion to fire > > > on some microarchitectures. > > > > Yep! I believe that a memory_order_seq_cst fence in combination with the > > fetch_add() would do the trick on many architectures, however. All of > > this is one reason that any C11 definitions need to be individually > > overridable by individual architectures. > > "Overridable" in which sense? Do you want to change the semantics on > the language level in the sense of altering the memory model, or rather > use a different implementation under the hood to, for example, fix > deficiencies in the compilers? We need the architecture maintainer to be able to select either an assembly-language implementation or a C11-atomics implementation for any given Linux-kernel operation. For example, a given architecture might be able to use fetch_add(1, memory_order_relaxed) for atomic_inc() but assembly for atomic_add_return(). This is because atomic_inc() is not required to have any particular ordering properties, while as discussed previously, atomic_add_return() requires tighter ordering than the C11 standard provides. > > > There are also so many ways to blow your head off it's untrue. 
For example, > > > cmpxchg takes a separate memory model parameter for failure and success, but > > > then there are restrictions on the sets you can use for each. It's not hard > > > to find well-known memory-ordering experts shouting "Just use > > > memory_model_seq_cst for everything, it's too hard otherwise". Then there's > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume > > > atm and optimises all of the data dependencies away) as well as the definition > > > of "data races", which seem to be used as an excuse to miscompile a program > > > at the earliest opportunity. > > > > Trust me, rcu_dereference() is not going to be defined in terms of > > memory_order_consume until the compilers implement it both correctly and > > efficiently. They are not there yet, and there is currently no shortage > > of compiler writers who would prefer to ignore memory_order_consume. > > Do you have any input on > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448? In particular, the > language standard's definition of dependencies? Let's see... 1.10p9 says that a dependency must be carried unless: — B is an invocation of any specialization of std::kill_dependency (29.3), or — A is the left operand of a built-in logical AND (&&, see 5.14) or logical OR (||, see 5.15) operator, or — A is the left operand of a conditional (?:, see 5.16) operator, or — A is the left operand of the built-in comma (,) operator (5.18); So the use of "flag" before the "?" is ignored. But the "flag - flag" after the "?" will carry a dependency, so the code fragment in 59448 needs to do the ordering rather than just optimizing "flag - flag" out of existence. One way to do that on both ARM and Power is to actually emit code for "flag - flag", but there are a number of other ways to make that work. BTW, there is some discussion on 1.10p9's handling of && and ||, and that clause is likely to change. 
And yes, I am behind on analyzing usage in the Linux kernel to find out
if Linux cares...

> > And rcu_dereference() will need per-arch overrides for some time during
> > any transition to memory_order_consume.
> >
> > > Trying to introduce system concepts (writes to devices, interrupts,
> > > non-coherent agents) into this mess is going to be an uphill battle IMHO. I'd
> > > just rather stick to the semantics we have and the asm volatile barriers.
> >
> > And barrier() isn't going to go away any time soon, either.  And
> > ACCESS_ONCE() needs to keep volatile semantics until there is some
> > memory_order_whatever that prevents loads and stores from being coalesced.
>
> I'd be happy to discuss something like this in ISO C++ SG1 (or has this
> been discussed in the past already?).  But it needs to have a paper I
> suppose.

The current position of the usual suspects other than me is that this
falls into the category of forward-progress guarantees, which are
considered (again, by the usual suspects other than me) to be out
of scope.

> Will you be in Issaquah for the C++ meeting next week?

Weather permitting, I will be there!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
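Will's litmus test, rendered as a C11 `<stdatomic.h>` sketch with the memory_order_seq_cst fence that Paul suggests placed alongside the fetch_add(). The pthread harness and the run_litmus() name are ours, purely illustrative; passing on whatever machine runs it of course proves nothing about arm64 microarchitectures.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int foo, bar, baz;

static void *thread1(void *arg)
{
	(void)arg;
	atomic_store_explicit(&foo, 42, memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* the suggested fence */
	atomic_fetch_add_explicit(&bar, 1, memory_order_seq_cst);
	atomic_store_explicit(&baz, 42, memory_order_relaxed);
	return NULL;
}

static void *thread2(void *out)
{
	/* Spin until thread1's final store is visible. */
	while (atomic_load_explicit(&baz, memory_order_seq_cst) != 42)
		;
	*(int *)out = atomic_load_explicit(&foo, memory_order_seq_cst);
	return NULL;
}

/* Returns the value of foo that thread2 observes after seeing baz == 42. */
int run_litmus(void)
{
	int observed = 0;
	pthread_t t1, t2;

	pthread_create(&t2, NULL, thread2, &observed);
	pthread_create(&t1, NULL, thread1, NULL);
	pthread_join(t1, NULL);
	pthread_join(t2, NULL);
	return observed;
}
```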
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-06 22:11 ` Paul E. McKenney @ 2014-02-06 23:44 ` Torvald Riegel 2014-02-07 4:20 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-06 23:44 UTC (permalink / raw) To: paulmck Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote: > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote: > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote: > > > On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote: > > > > There are also so many ways to blow your head off it's untrue. For example, > > > > cmpxchg takes a separate memory model parameter for failure and success, but > > > > then there are restrictions on the sets you can use for each. It's not hard > > > > to find well-known memory-ordering experts shouting "Just use > > > > memory_model_seq_cst for everything, it's too hard otherwise". Then there's > > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume > > > > atm and optimises all of the data dependencies away) as well as the definition > > > > of "data races", which seem to be used as an excuse to miscompile a program > > > > at the earliest opportunity. > > > > > > Trust me, rcu_dereference() is not going to be defined in terms of > > > memory_order_consume until the compilers implement it both correctly and > > > efficiently. They are not there yet, and there is currently no shortage > > > of compiler writers who would prefer to ignore memory_order_consume. > > > > Do you have any input on > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448? In particular, the > > language standard's definition of dependencies? > > Let's see... 
1.10p9 says that a dependency must be carried unless:
>
> — B is an invocation of any specialization of std::kill_dependency (29.3), or
> — A is the left operand of a built-in logical AND (&&, see 5.14) or logical OR (||, see 5.15) operator,
> or
> — A is the left operand of a conditional (?:, see 5.16) operator, or
> — A is the left operand of the built-in comma (,) operator (5.18);
>
> So the use of "flag" before the "?" is ignored.  But the "flag - flag"
> after the "?" will carry a dependency, so the code fragment in 59448
> needs to do the ordering rather than just optimizing "flag - flag" out
> of existence.  One way to do that on both ARM and Power is to actually
> emit code for "flag - flag", but there are a number of other ways to
> make that work.

And that's what would concern me, considering that these requirements
seem to be able to creep out easily.  Also, whereas the other atomics
just constrain compilers wrt. reordering across atomic accesses or
changes to the atomic accesses themselves, the dependencies are new
requirements on pieces of otherwise non-synchronizing code.  The latter
seems far more involved to me.

> BTW, there is some discussion on 1.10p9's handling of && and ||, and
> that clause is likely to change.  And yes, I am behind on analyzing
> usage in the Linux kernel to find out if Linux cares...

Do you have any pointers to these discussions (e.g., LWG issues)?

> > > And rcu_dereference() will need per-arch overrides for some time during
> > > any transition to memory_order_consume.
> > >
> > > > Trying to introduce system concepts (writes to devices, interrupts,
> > > > non-coherent agents) into this mess is going to be an uphill battle IMHO. I'd
> > > > just rather stick to the semantics we have and the asm volatile barriers.
> > >
> > > And barrier() isn't going to go away any time soon, either.
And > > > ACCESS_ONCE() needs to keep volatile semantics until there is some > > > memory_order_whatever that prevents loads and stores from being coalesced. > > > > I'd be happy to discuss something like this in ISO C++ SG1 (or has this > > been discussed in the past already?). But it needs to have a paper I > > suppose. > > The current position of the usual suspects other than me is that this > falls into the category of forward-progress guarantees, which are > considers (again, by the usual suspects other than me) to be out > of scope. But I think we need to better describe forward progress, even though that might be tricky. We made at least some progress on http://cplusplus.github.io/LWG/lwg-active.html#2159 in Chicago, even though we can't constrain the OS schedulers too much, and for lock-free we're in this weird position that on most general-purpose schedulers and machines, obstruction-free algorithms are likely to work just fine like lock-free, most of the time, in practice... We also need to discuss forward progress guarantees for any parallelism/concurrency abstractions, I believe: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3874.pdf Hopefully we'll get some more acceptance of this being in scope... > > Will you be in Issaquah for the C++ meeting next week? > > Weather permitting, I will be there! Great, maybe we can find some time in SG1 to discuss this then. Even if the standard doesn't want to include it, SG1 should be a good forum to understand everyone's concerns around that, with the hope that this would help potential non-standard extensions to be still checked by the same folks that did the rest of the memory model. ^ permalink raw reply [flat|nested] 285+ messages in thread
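The 1.10p9 dependency machinery under discussion corresponds to memory_order_consume and kill_dependency() in C11's `<stdatomic.h>`. A minimal sketch follows; the publisher/reader names are ours, loosely modeled on the rcu_assign_pointer()/rcu_dereference() pattern, and current GCC conservatively promotes consume to acquire rather than tracking dependencies.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

static _Atomic(int *) gp;
static int gv = 42;

/* The value loaded carries a dependency to the dereference, so a
 * memory_order_consume load orders the two (the rcu_dereference() idiom). */
int consume_reader(void)
{
	int *p = atomic_load_explicit(&gp, memory_order_consume);

	if (p == NULL)
		return -1;
	return *p;	/* dependency carried: ordered after the load */
}

/* kill_dependency() (C11 7.17.3.1, std::kill_dependency in 29.3) explicitly
 * ends the chain; after it, no dependency ordering is owed. */
int consume_reader_killed(void)
{
	int *p = atomic_load_explicit(&gp, memory_order_consume);

	if (p == NULL)
		return -1;
	return *kill_dependency(p);
}

/* Publish the pointer; release pairs with the consume loads above. */
void publisher(void)
{
	atomic_store_explicit(&gp, &gv, memory_order_release);
}
```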
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-06 23:44 ` Torvald Riegel @ 2014-02-07 4:20 ` Paul E. McKenney 2014-02-07 7:44 ` Peter Zijlstra 2014-02-10 0:06 ` Torvald Riegel 0 siblings, 2 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-07 4:20 UTC (permalink / raw) To: Torvald Riegel Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote: > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote: > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote: > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote: > > > > On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote: > > > > > There are also so many ways to blow your head off it's untrue. For example, > > > > > cmpxchg takes a separate memory model parameter for failure and success, but > > > > > then there are restrictions on the sets you can use for each. It's not hard > > > > > to find well-known memory-ordering experts shouting "Just use > > > > > memory_model_seq_cst for everything, it's too hard otherwise". Then there's > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume > > > > > atm and optimises all of the data dependencies away) as well as the definition > > > > > of "data races", which seem to be used as an excuse to miscompile a program > > > > > at the earliest opportunity. > > > > > > > > Trust me, rcu_dereference() is not going to be defined in terms of > > > > memory_order_consume until the compilers implement it both correctly and > > > > efficiently. They are not there yet, and there is currently no shortage > > > > of compiler writers who would prefer to ignore memory_order_consume. > > > > > > Do you have any input on > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448? In particular, the > > > language standard's definition of dependencies? > > > > Let's see... 
1.10p9 says that a dependency must be carried unless: > > > > — B is an invocation of any specialization of std::kill_dependency (29.3), or > > — A is the left operand of a built-in logical AND (&&, see 5.14) or logical OR (||, see 5.15) operator, > > or > > — A is the left operand of a conditional (?:, see 5.16) operator, or > > — A is the left operand of the built-in comma (,) operator (5.18); > > > > So the use of "flag" before the "?" is ignored. But the "flag - flag" > > after the "?" will carry a dependency, so the code fragment in 59448 > > needs to do the ordering rather than just optimizing "flag - flag" out > > of existence. One way to do that on both ARM and Power is to actually > > emit code for "flag - flag", but there are a number of other ways to > > make that work. > > And that's what would concern me, considering that these requirements > seem to be able to creep out easily. Also, whereas the other atomics > just constrain compilers wrt. reordering across atomic accesses or > changes to the atomic accesses themselves, the dependencies are new > requirements on pieces of otherwise non-synchronizing code. The latter > seems far more involved to me. Well, the wording of 1.10p9 is pretty explicit on this point. There are only a few exceptions to the rule that dependencies from memory_order_consume loads must be tracked. And to your point about requirements being placed on pieces of otherwise non-synchronizing code, we already have that with plain old load acquire and store release -- both of these put ordering constraints that affect the surrounding non-synchronizing code. This issue got a lot of discussion, and the compromise is that dependencies cannot leak into or out of functions unless the relevant parameters or return values are annotated with [[carries_dependency]]. This means that the compiler can see all the places where dependencies must be tracked. This is described in 7.6.4. 
If a dependency chain headed by a memory_order_consume load goes into
or out of a function without the aid of the [[carries_dependency]]
attribute, the compiler needs to do something else to enforce ordering,
e.g., emit a memory barrier.

From a Linux-kernel viewpoint, this is a bit ugly, as it requires
annotations and use of kill_dependency, but it was the best I could do
at the time.  If things go as they usually do, there will be some other
reason why those are needed...

> > BTW, there is some discussion on 1.10p9's handling of && and ||, and
> > that clause is likely to change.  And yes, I am behind on analyzing
> > usage in the Linux kernel to find out if Linux cares...
>
> Do you have any pointers to these discussions (e.g., LWG issues)?

Nope, just a bare email thread.  I would guess that it will come up
next week.  The question is whether dependencies should be carried
through && or || at all, and if so how.  My current guess is that &&
and || should not carry dependencies.

> > > > And rcu_dereference() will need per-arch overrides for some time during
> > > > any transition to memory_order_consume.
> > > >
> > > > > Trying to introduce system concepts (writes to devices, interrupts,
> > > > > non-coherent agents) into this mess is going to be an uphill battle IMHO. I'd
> > > > > just rather stick to the semantics we have and the asm volatile barriers.
> > > >
> > > > And barrier() isn't going to go away any time soon, either.  And
> > > > ACCESS_ONCE() needs to keep volatile semantics until there is some
> > > > memory_order_whatever that prevents loads and stores from being coalesced.
> > >
> > > I'd be happy to discuss something like this in ISO C++ SG1 (or has this
> > > been discussed in the past already?).  But it needs to have a paper I
> > > suppose.
> >
> > The current position of the usual suspects other than me is that this
> > falls into the category of forward-progress guarantees, which are
> > considered (again, by the usual suspects other than me) to be out
> > of scope.
>
> But I think we need to better describe forward progress, even though
> that might be tricky.  We made at least some progress on
> http://cplusplus.github.io/LWG/lwg-active.html#2159 in Chicago, even
> though we can't constrain the OS schedulers too much, and for lock-free
> we're in this weird position that on most general-purpose schedulers and
> machines, obstruction-free algorithms are likely to work just fine like
> lock-free, most of the time, in practice...

Yep, there is a draft paper by Alistarh et al. making this point.
They could go quite a bit further.  With a reasonably short set of
additional constraints, you can get bounded execution times out of
locking as well.  They were not amused when I suggested this, Bjoern
Brandenberg's dissertation notwithstanding.  ;-)

> We also need to discuss forward progress guarantees for any
> parallelism/concurrency abstractions, I believe:
> http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3874.pdf
>
> Hopefully we'll get some more acceptance of this being in scope...

That would be good.  Just in case C11 is to be applicable to real-time
software.

> > > Will you be in Issaquah for the C++ meeting next week?
> >
> > Weather permitting, I will be there!
>
> Great, maybe we can find some time in SG1 to discuss this then.  Even if
> the standard doesn't want to include it, SG1 should be a good forum to
> understand everyone's concerns around that, with the hope that this
> would help potential non-standard extensions to be still checked by the
> same folks that did the rest of the memory model.

Sounds good!  Hopefully some discussion of out-of-thin-air values as well.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-07 4:20 ` Paul E. McKenney
@ 2014-02-07 7:44 ` Peter Zijlstra
  2014-02-07 16:50 ` Paul E. McKenney
  2014-02-10 0:06 ` Torvald Riegel
  1 sibling, 1 reply; 285+ messages in thread
From: Peter Zijlstra @ 2014-02-07 7:44 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Torvald Riegel, Will Deacon, Ramana Radhakrishnan, David Howells,
	linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote:
> Hopefully some discussion of out-of-thin-air values as well.

Yes, absolutely shoot store speculation in the head already.  Then drive
a wooden stake through its heart.

C11/C++11 should not be allowed to claim itself a memory model until that
is sorted.

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-07 7:44 ` Peter Zijlstra @ 2014-02-07 16:50 ` Paul E. McKenney 2014-02-07 16:55 ` Will Deacon 2014-02-07 18:44 ` Torvald Riegel 0 siblings, 2 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-07 16:50 UTC (permalink / raw) To: Peter Zijlstra Cc: Torvald Riegel, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote: > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote: > > Hopefully some discussion of out-of-thin-air values as well. > > Yes, absolutely shoot store speculation in the head already. Then drive > a wooden stake through its hart. > > C11/C++11 should not be allowed to claim itself a memory model until that > is sorted. There actually is a proposal being put forward, but it might not make ARM and Power people happy because it involves adding a compare, a branch, and an ISB/isync after every relaxed load... Me, I agree with you, much preferring the no-store-speculation approach. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-07 16:50 ` Paul E. McKenney @ 2014-02-07 16:55 ` Will Deacon 2014-02-07 17:06 ` Peter Zijlstra 2014-02-07 18:02 ` Paul E. McKenney 2014-02-07 18:44 ` Torvald Riegel 1 sibling, 2 replies; 285+ messages in thread From: Will Deacon @ 2014-02-07 16:55 UTC (permalink / raw) To: Paul E. McKenney Cc: Peter Zijlstra, Torvald Riegel, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc Hi Paul, On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote: > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote: > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote: > > > Hopefully some discussion of out-of-thin-air values as well. > > > > Yes, absolutely shoot store speculation in the head already. Then drive > > a wooden stake through its hart. > > > > C11/C++11 should not be allowed to claim itself a memory model until that > > is sorted. > > There actually is a proposal being put forward, but it might not make ARM > and Power people happy because it involves adding a compare, a branch, > and an ISB/isync after every relaxed load... Me, I agree with you, > much preferring the no-store-speculation approach. Can you elaborate a bit on this please? We don't permit speculative stores in the ARM architecture, so it seems counter-intuitive that GCC needs to emit any additional instructions to prevent that from happening. Stores can, of course, be observed out-of-order but that's a lot more reasonable :) Will ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-07 16:55 ` Will Deacon @ 2014-02-07 17:06 ` Peter Zijlstra 2014-02-07 17:13 ` Will Deacon ` (2 more replies) 2014-02-07 18:02 ` Paul E. McKenney 1 sibling, 3 replies; 285+ messages in thread From: Peter Zijlstra @ 2014-02-07 17:06 UTC (permalink / raw) To: Will Deacon Cc: Paul E. McKenney, Torvald Riegel, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote: > Hi Paul, > > On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote: > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote: > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote: > > > > Hopefully some discussion of out-of-thin-air values as well. > > > > > > Yes, absolutely shoot store speculation in the head already. Then drive > > > a wooden stake through its hart. > > > > > > C11/C++11 should not be allowed to claim itself a memory model until that > > > is sorted. > > > > There actually is a proposal being put forward, but it might not make ARM > > and Power people happy because it involves adding a compare, a branch, > > and an ISB/isync after every relaxed load... Me, I agree with you, > > much preferring the no-store-speculation approach. > > Can you elaborate a bit on this please? We don't permit speculative stores > in the ARM architecture, so it seems counter-intuitive that GCC needs to > emit any additional instructions to prevent that from happening. > > Stores can, of course, be observed out-of-order but that's a lot more > reasonable :) This is more about the compiler speculating on stores; imagine: if (x) y = 1; else y = 2; The compiler is allowed to change that into: y = 2; if (x) y = 1; Which is of course a big problem when you want to rely on the ordering. There's further problems where things like memset() can write outside the specified address range. 
Examples are memset() using single instructions to wipe entire cachelines
and then 'restoring' the tail bit.  While valid for single threaded, it's
a complete disaster for concurrent code.

There's more, but it all boils down to doing stores you don't expect in
a 'sane' concurrent environment and/or don't respect the control flow.

^ permalink raw reply	[flat|nested] 285+ messages in thread
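Peter's conditional-store example can be written out by hand to make the hazard concrete: the transformed form performs an unconditional store whose transient value other threads could observe, whereas computing into a local first yields the single store that concurrent code actually needs. The function names are ours, for illustration only.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int y;

/* The problematic transformation, spelled out: "y = 2" happens
 * unconditionally, so another thread can observe 2 even when x is true. */
void speculative_store(int x)
{
	atomic_store_explicit(&y, 2, memory_order_relaxed);	/* transient 2 leaks */
	if (x)
		atomic_store_explicit(&y, 1, memory_order_relaxed);
}

/* What ordering-sensitive code needs: select the value in a local,
 * then emit exactly one store. */
void single_store(int x)
{
	int tmp = x ? 1 : 2;

	atomic_store_explicit(&y, tmp, memory_order_relaxed);
}
```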
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-07 17:06 ` Peter Zijlstra @ 2014-02-07 17:13 ` Will Deacon 2014-02-07 17:20 ` Peter Zijlstra 2014-02-07 18:03 ` Paul E. McKenney 2014-02-07 17:46 ` Joseph S. Myers 2014-02-07 18:43 ` Torvald Riegel 2 siblings, 2 replies; 285+ messages in thread From: Will Deacon @ 2014-02-07 17:13 UTC (permalink / raw) To: Peter Zijlstra Cc: Paul E. McKenney, Torvald Riegel, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Fri, Feb 07, 2014 at 05:06:54PM +0000, Peter Zijlstra wrote: > On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote: > > Hi Paul, > > > > On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote: > > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote: > > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote: > > > > > Hopefully some discussion of out-of-thin-air values as well. > > > > > > > > Yes, absolutely shoot store speculation in the head already. Then drive > > > > a wooden stake through its hart. > > > > > > > > C11/C++11 should not be allowed to claim itself a memory model until that > > > > is sorted. > > > > > > There actually is a proposal being put forward, but it might not make ARM > > > and Power people happy because it involves adding a compare, a branch, > > > and an ISB/isync after every relaxed load... Me, I agree with you, > > > much preferring the no-store-speculation approach. > > > > Can you elaborate a bit on this please? We don't permit speculative stores > > in the ARM architecture, so it seems counter-intuitive that GCC needs to > > emit any additional instructions to prevent that from happening. 
> > > > Stores can, of course, be observed out-of-order but that's a lot more > > reasonable :) > > This is more about the compiler speculating on stores; imagine: > > if (x) > y = 1; > else > y = 2; > > The compiler is allowed to change that into: > > y = 2; > if (x) > y = 1; > > Which is of course a big problem when you want to rely on the ordering. Understood, but that doesn't explain why Paul wants to add ISB/isync instructions which affect the *CPU* rather than the compiler! Will ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-07 17:13 ` Will Deacon @ 2014-02-07 17:20 ` Peter Zijlstra 2014-02-07 18:03 ` Paul E. McKenney 1 sibling, 0 replies; 285+ messages in thread From: Peter Zijlstra @ 2014-02-07 17:20 UTC (permalink / raw) To: Will Deacon Cc: Paul E. McKenney, Torvald Riegel, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Fri, Feb 07, 2014 at 05:13:36PM +0000, Will Deacon wrote: > Understood, but that doesn't explain why Paul wants to add ISB/isync > instructions which affect the *CPU* rather than the compiler! I doubt Paul wants it, but yeah, I'm curious about that proposal as well, sounds like someone took a big toke from the bong again; it seems a favourite past time amongst committees. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-07 17:13 ` Will Deacon 2014-02-07 17:20 ` Peter Zijlstra @ 2014-02-07 18:03 ` Paul E. McKenney 1 sibling, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-07 18:03 UTC (permalink / raw) To: Will Deacon Cc: Peter Zijlstra, Torvald Riegel, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Fri, Feb 07, 2014 at 05:13:36PM +0000, Will Deacon wrote: > On Fri, Feb 07, 2014 at 05:06:54PM +0000, Peter Zijlstra wrote: > > On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote: > > > Hi Paul, > > > > > > On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote: > > > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote: > > > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote: > > > > > > Hopefully some discussion of out-of-thin-air values as well. > > > > > > > > > > Yes, absolutely shoot store speculation in the head already. Then drive > > > > > a wooden stake through its hart. > > > > > > > > > > C11/C++11 should not be allowed to claim itself a memory model until that > > > > > is sorted. > > > > > > > > There actually is a proposal being put forward, but it might not make ARM > > > > and Power people happy because it involves adding a compare, a branch, > > > > and an ISB/isync after every relaxed load... Me, I agree with you, > > > > much preferring the no-store-speculation approach. > > > > > > Can you elaborate a bit on this please? We don't permit speculative stores > > > in the ARM architecture, so it seems counter-intuitive that GCC needs to > > > emit any additional instructions to prevent that from happening. 
> > > > > > Stores can, of course, be observed out-of-order but that's a lot more > > > reasonable :) > > > > This is more about the compiler speculating on stores; imagine: > > > > if (x) > > y = 1; > > else > > y = 2; > > > > The compiler is allowed to change that into: > > > > y = 2; > > if (x) > > y = 1; > > > > Which is of course a big problem when you want to rely on the ordering. > > Understood, but that doesn't explain why Paul wants to add ISB/isync > instructions which affect the *CPU* rather than the compiler! Hey!!! -I- don't want to add those instructions! Others do. Unfortunately, lots of others. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-07 17:06 ` Peter Zijlstra 2014-02-07 17:13 ` Will Deacon @ 2014-02-07 17:46 ` Joseph S. Myers 2014-02-07 18:43 ` Torvald Riegel 2 siblings, 0 replies; 285+ messages in thread From: Joseph S. Myers @ 2014-02-07 17:46 UTC (permalink / raw) To: Peter Zijlstra Cc: Will Deacon, Paul E. McKenney, Torvald Riegel, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Fri, 7 Feb 2014, Peter Zijlstra wrote: > There's further problems where things like memset() can write outside > the specified address range. Examples are memset() using single > instructions to wipe entire cachelines and then 'restoring' the tail > bit. If memset (or any C library function) modifies bytes it's not permitted to modify in the abstract machine, that's a simple bug and should be reported as usual. We've made GCC follow that part of the memory model by default (so a store to a non-bit-field structure field doesn't do a read-modify-write to a word containing another field, for example) and I think it's pretty obvious that glibc should do so as well. (Of course, memset is not an atomic operation, and you need to allow for that if you use it on an _Atomic object - which is I think valid, unless the object is also volatile, but perhaps ill-advised.) -- Joseph S. Myers joseph@codesourcery.com ^ permalink raw reply [flat|nested] 285+ messages in thread
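A sketch of Joseph's point about memset() on objects containing _Atomic members: the byte-wise wipe may be valid but is not an atomic operation, so code with concurrent readers wants per-member atomic stores instead. The struct and function names are ours, purely illustrative.

```c
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

struct counters {
	atomic_int hits;
	atomic_int misses;
};

/* Clears the bytes, but is NOT atomic: a concurrent reader of either
 * member could observe a torn or intermediate state. */
void reset_bytes(struct counters *c)
{
	memset(c, 0, sizeof(*c));
}

/* Per-member atomic stores keep concurrent readers within the C11
 * memory model. */
void reset_atomically(struct counters *c)
{
	atomic_store_explicit(&c->hits, 0, memory_order_relaxed);
	atomic_store_explicit(&c->misses, 0, memory_order_relaxed);
}
```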
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-07 17:06 ` Peter Zijlstra 2014-02-07 17:13 ` Will Deacon 2014-02-07 17:46 ` Joseph S. Myers @ 2014-02-07 18:43 ` Torvald Riegel 2 siblings, 0 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-07 18:43 UTC (permalink / raw) To: Peter Zijlstra Cc: Will Deacon, Paul E. McKenney, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Fri, 2014-02-07 at 18:06 +0100, Peter Zijlstra wrote: > On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote: > > Hi Paul, > > > > On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote: > > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote: > > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote: > > > > > Hopefully some discussion of out-of-thin-air values as well. > > > > > > > > Yes, absolutely shoot store speculation in the head already. Then drive > > > > a wooden stake through its hart. > > > > > > > > C11/C++11 should not be allowed to claim itself a memory model until that > > > > is sorted. > > > > > > There actually is a proposal being put forward, but it might not make ARM > > > and Power people happy because it involves adding a compare, a branch, > > > and an ISB/isync after every relaxed load... Me, I agree with you, > > > much preferring the no-store-speculation approach. > > > > Can you elaborate a bit on this please? We don't permit speculative stores > > in the ARM architecture, so it seems counter-intuitive that GCC needs to > > emit any additional instructions to prevent that from happening. 
> > > > Stores can, of course, be observed out-of-order but that's a lot more > > reasonable :) > > This is more about the compiler speculating on stores; imagine: > > if (x) > y = 1; > else > y = 2; > > The compiler is allowed to change that into: > > y = 2; > if (x) > y = 1; If you write the example like that, this is indeed allowed because it's all sequential code (and there's no volatiles in there, at least you didn't show them :). A store to y would happen in either case. You cannot observe the difference between both examples in a data-race-free program. Are there supposed to be atomic/non-sequential accesses in there? If so, please update the example. > Which is of course a big problem when you want to rely on the ordering. > > There's further problems where things like memset() can write outside > the specified address range. Examples are memset() using single > instructions to wipe entire cachelines and then 'restoring' the tail > bit. As Joseph said, this would be a bug IMO. > While valid for single threaded, its a complete disaster for concurrent > code. > > There's more, but it all boils down to doing stores you don't expect in > a 'sane' concurrent environment and/or don't respect the control flow. A few of those got fixed already, because they violated the memory model's requirements. If you have further examples that are valid code in the C11/C++11 model, please report them. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-07 16:55 ` Will Deacon 2014-02-07 17:06 ` Peter Zijlstra @ 2014-02-07 18:02 ` Paul E. McKenney 2014-02-10 0:27 ` Torvald Riegel 2014-02-10 11:48 ` Peter Zijlstra 1 sibling, 2 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-07 18:02 UTC (permalink / raw) To: Will Deacon Cc: Peter Zijlstra, Torvald Riegel, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote: > Hi Paul, > > On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote: > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote: > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote: > > > > Hopefully some discussion of out-of-thin-air values as well. > > > > > > Yes, absolutely shoot store speculation in the head already. Then drive > > > a wooden stake through its hart. > > > > > > C11/C++11 should not be allowed to claim itself a memory model until that > > > is sorted. > > > > There actually is a proposal being put forward, but it might not make ARM > > and Power people happy because it involves adding a compare, a branch, > > and an ISB/isync after every relaxed load... Me, I agree with you, > > much preferring the no-store-speculation approach. > > Can you elaborate a bit on this please? We don't permit speculative stores > in the ARM architecture, so it seems counter-intuitive that GCC needs to > emit any additional instructions to prevent that from happening. Requiring a compare/branch/ISB after each relaxed load enables a simple(r) proof that out-of-thin-air values cannot be observed in the face of any compiler optimization that refrains from reordering a prior relaxed load with a subsequent relaxed store. > Stores can, of course, be observed out-of-order but that's a lot more > reasonable :) So let me try an example. 
I am sure that Torvald Riegel will jump in with any needed corrections or amplifications:

Initial state: x == y == 0

T1: r1 = atomic_load_explicit(x, memory_order_relaxed);
    atomic_store_explicit(r1, y, memory_order_relaxed);

T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
    atomic_store_explicit(r2, x, memory_order_relaxed);

One would intuitively expect r1 == r2 == 0 as the only possible outcome. But suppose that the compiler used specialization optimizations, as it would if there were a function that has a very lightweight implementation for some values and a very heavyweight one for others. In particular, suppose that the lightweight implementation was for the value 42. Then the compiler might do something like the following:

Initial state: x == y == 0

T1: r1 = atomic_load_explicit(x, memory_order_relaxed);
    if (r1 == 42)
        atomic_store_explicit(42, y, memory_order_relaxed);
    else
        atomic_store_explicit(r1, y, memory_order_relaxed);

T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
    atomic_store_explicit(r2, x, memory_order_relaxed);

Suddenly we have an explicit constant 42 showing up. Of course, if the compiler carefully avoided speculative stores (as both Peter and I believe that it should if its code generation is to be regarded as anything other than an act of vandalism, the words in the standard notwithstanding), there would be no problem.
But currently, a number of compiler writers see absolutely nothing wrong with transforming the optimized-for-42 version above into something like this:

Initial state: x == y == 0

T1: r1 = atomic_load_explicit(x, memory_order_relaxed);
    atomic_store_explicit(42, y, memory_order_relaxed);
    if (r1 != 42)
        atomic_store_explicit(r1, y, memory_order_relaxed);

T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
    atomic_store_explicit(r2, x, memory_order_relaxed);

And then it is a short and uncontroversial step to the following:

Initial state: x == y == 0

T1: atomic_store_explicit(42, y, memory_order_relaxed);
    r1 = atomic_load_explicit(x, memory_order_relaxed);
    if (r1 != 42)
        atomic_store_explicit(r1, y, memory_order_relaxed);

T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
    atomic_store_explicit(r2, x, memory_order_relaxed);

This can of course result in r1 == r2 == 42, even though the constant 42 never appeared in the original code. This is one way to generate an out-of-thin-air value.

As near as I can tell, compiler writers hate the idea of prohibiting speculative-store optimizations because it requires them to introduce both control and data dependency tracking into their compilers. Many of them seem to hate dependency tracking with a purple passion. At least, such a hatred would go a long way towards explaining the incomplete and high-overhead implementations of memory_order_consume, the long and successful use of idioms based on the memory_order_consume pattern notwithstanding [*]. ;-)

That said, the Java guys are talking about introducing something vaguely resembling memory_order_consume (and thus resembling the rcu_assign_pointer() and rcu_dereference() portions of RCU) to solve Java out-of-thin-air issues involving initialization, so perhaps there is hope.

Thanx, Paul

[*] http://queue.acm.org/detail.cfm?id=2488549
http://www.rdrop.com/users/paulmck/RCU/rclockpdcsproof.pdf

^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-07 18:02 ` Paul E. McKenney @ 2014-02-10 0:27 ` Torvald Riegel 2014-02-10 0:56 ` Linus Torvalds ` (4 more replies) 2014-02-10 11:48 ` Peter Zijlstra 1 sibling, 5 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-10 0:27 UTC (permalink / raw) To: paulmck Cc: Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote: > On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote: > > Hi Paul, > > > > On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote: > > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote: > > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote: > > > > > Hopefully some discussion of out-of-thin-air values as well. > > > > > > > > Yes, absolutely shoot store speculation in the head already. Then drive > > > > a wooden stake through its hart. > > > > > > > > C11/C++11 should not be allowed to claim itself a memory model until that > > > > is sorted. > > > > > > There actually is a proposal being put forward, but it might not make ARM > > > and Power people happy because it involves adding a compare, a branch, > > > and an ISB/isync after every relaxed load... Me, I agree with you, > > > much preferring the no-store-speculation approach. > > > > Can you elaborate a bit on this please? We don't permit speculative stores > > in the ARM architecture, so it seems counter-intuitive that GCC needs to > > emit any additional instructions to prevent that from happening. > > Requiring a compare/branch/ISB after each relaxed load enables a simple(r) > proof that out-of-thin-air values cannot be observed in the face of any > compiler optimization that refrains from reordering a prior relaxed load > with a subsequent relaxed store. 
> > > Stores can, of course, be observed out-of-order but that's a lot more > > reasonable :) > > So let me try an example. I am sure that Torvald Riegel will jump in > with any needed corrections or amplifications: > > Initial state: x == y == 0 > > T1: r1 = atomic_load_explicit(x, memory_order_relaxed); > atomic_store_explicit(r1, y, memory_order_relaxed); > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > atomic_store_explicit(r2, x, memory_order_relaxed); > > One would intuitively expect r1 == r2 == 0 as the only possible outcome. > But suppose that the compiler used specialization optimizations, as it > would if there was a function that has a very lightweight implementation > for some values and a very heavyweight one for other. In particular, > suppose that the lightweight implementation was for the value 42. > Then the compiler might do something like the following: > > Initial state: x == y == 0 > > T1: r1 = atomic_load_explicit(x, memory_order_relaxed); > if (r1 == 42) > atomic_store_explicit(42, y, memory_order_relaxed); > else > atomic_store_explicit(r1, y, memory_order_relaxed); > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > atomic_store_explicit(r2, x, memory_order_relaxed); > > Suddenly we have an explicit constant 42 showing up. Of course, if > the compiler carefully avoided speculative stores (as both Peter and > I believe that it should if its code generation is to be regarded as > anything other than an act of vandalism, the words in the standard > notwithstanding), there would be no problem. 
But currently, a number
> > of compiler writers see absolutely nothing wrong with transforming
> > the optimized-for-42 version above with something like this:
> >
> > Initial state: x == y == 0
> >
> > T1: r1 = atomic_load_explicit(x, memory_order_relaxed);
> > atomic_store_explicit(42, y, memory_order_relaxed);
> > if (r1 != 42)
> > atomic_store_explicit(r1, y, memory_order_relaxed);
> >
> > T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
> > atomic_store_explicit(r2, x, memory_order_relaxed);

Intuitively, this is wrong because this lets the program take a step the abstract machine wouldn't do. This is different to the sequential code that Peter posted because it uses atomics, and thus one can't easily assume that the difference is not observable.

For this to be correct, the compiler would actually have to prove that the speculative store is "as-if correct", which in turn would mean that it needs to be aware of all potential observers, and check whether those observers aren't actually affected by the speculative store.

I would guess that the compilers you have in mind don't really do that. If they do, then I don't see why this should be okay, unless you think out-of-thin-air values are something good (which I wouldn't agree with).

> And then it is a short and uncontroversial step to the following:
>
> Initial state: x == y == 0
>
> T1: atomic_store_explicit(42, y, memory_order_relaxed);
> r1 = atomic_load_explicit(x, memory_order_relaxed);
> if (r1 != 42)
> atomic_store_explicit(r1, y, memory_order_relaxed);
>
> T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
> atomic_store_explicit(r2, x, memory_order_relaxed);
>
> This can of course result in r1 == r2 == 42, even though the constant
> 42 never appeared in the original code. This is one way to generate
> an out-of-thin-air value.
> > As near as I can tell, compiler writers hate the idea of prohibiting > speculative-store optimizations because it requires them to introduce > both control and data dependency tracking into their compilers. I wouldn't characterize the situation like this (although I can't speak for others, obviously). IMHO, it's perfectly fine on sequential / non-synchronizing code, because we know the difference isn't observable by a correct program. For synchronizing code, compilers just shouldn't do it, or they would have to truly prove that speculation is harmless. That will be hard, so I think it should just be avoided. Synchronization code will likely have been tuned anyway (especially if it uses relaxed MO), so I don't see a large need for trying to optimize using speculative atomic stores. Thus, I think there's an easy and practical solution. > Many of > them seem to hate dependency tracking with a purple passion. At least, > such a hatred would go a long way towards explaining the incomplete > and high-overhead implementations of memory_order_consume, the long > and successful use of idioms based on the memory_order_consume pattern > notwithstanding [*]. ;-) I still think that's different because it blurs the difference between sequential code and synchronizing code (ie, atomic accesses). With consume MO, the simple solution above doesn't work anymore, because suddenly synchronizing code does affect optimizations in sequential code, even if that wouldn't reorder across the synchronizing code (which would be clearly "visible" to the implementation of the optimization). ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-10 0:27 ` Torvald Riegel @ 2014-02-10 0:56 ` Linus Torvalds 2014-02-10 1:16 ` Torvald Riegel 2014-02-10 3:21 ` Paul E. McKenney ` (3 subsequent siblings) 4 siblings, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-10 0:56 UTC (permalink / raw) To: Torvald Riegel Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel <triegel@redhat.com> wrote: > > I wouldn't characterize the situation like this (although I can't speak > for others, obviously). IMHO, it's perfectly fine on sequential / > non-synchronizing code, because we know the difference isn't observable > by a correct program. What BS is that? If you use an "atomic_store_explicit()", by definition you're either (a) f*cking insane (b) not doing sequential non-synchronizing code and a compiler that assumes that the programmer is insane may actually be correct more often than not, but it's still a shit compiler. Agreed? So I don't see how any sane person can say that speculative writes are ok. They are clearly not ok. Speculative stores are a bad idea in general. They are completely invalid for anything that says "atomic". This is not even worth discussing. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-10 0:56 ` Linus Torvalds
@ 2014-02-10 1:16 ` Torvald Riegel
2014-02-10 1:24 ` Linus Torvalds
0 siblings, 1 reply; 285+ messages in thread
From: Torvald Riegel @ 2014-02-10 1:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Sun, 2014-02-09 at 16:56 -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel <triegel@redhat.com> wrote:
> >
> > I wouldn't characterize the situation like this (although I can't speak
> > for others, obviously). IMHO, it's perfectly fine on sequential /
> > non-synchronizing code, because we know the difference isn't observable
> > by a correct program.
>
> What BS is that? If you use an "atomic_store_explicit()", by
> definition you're either
>
> (a) f*cking insane
> (b) not doing sequential non-synchronizing code
>
> and a compiler that assumes that the programmer is insane may actually
> be correct more often than not, but it's still a shit compiler.
> Agreed?

Due to all the expletives, I can't really understand what you are saying. Nor does what I guess it might mean seem to relate to what I said.

(a) seems to say that you don't like requiring programmers to mark atomic accesses specially. Is that the case?

If so, I have to disagree. If you're writing concurrent code, marking the atomic accesses is the least of your problems. Instead, for the concurrent code I've written, it rather improved readability; it made code more verbose, but in turn it made the synchronization easier to see.

Besides this question of taste (and I don't care what the Kernel style guides are), there is a real technical argument here, though: Requiring all synchronizing memory accesses to be annotated makes the compiler aware of what is sequential code, and what is not. Without it, one can't exploit data-race-freedom.
So unless we want to slow down optimization of sequential code, we need to tell the compiler what is what. If every access is potentially synchronizing, then this doesn't just prevent speculative stores. (b) seems as if you are saying that if there is a specially annotated atomic access in the code, that this isn't sequential/non-synchronizing code anymore. I don't agree with that, obviously. > So I don't see how any sane person can say that speculative writes are > ok. They are clearly not ok. We are discussing programs written against the C11/C++11 memory model here. At least that's the part that got forwarded to gcc@gcc.gnu.org, and was subject of the nearest emails in the thread. This memory model requires programs to be data-race-free. Thus, we can optimize accordingly. If you don't like that, then this isn't C11/C++11 anymore. Which would be fine, but then complain about that specifically. > Speculative stores are a bad idea in general. I disagree in the context of C11/C++11 programs. At least from a correctness point of view. > They are completely > invalid for anything that says "atomic". I agree, as you will see when you read the other emails I posted as part of this thread. But those two things are separate. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-10 1:16 ` Torvald Riegel @ 2014-02-10 1:24 ` Linus Torvalds 2014-02-10 1:46 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-10 1:24 UTC (permalink / raw) To: Torvald Riegel Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sun, Feb 9, 2014 at 5:16 PM, Torvald Riegel <triegel@redhat.com> wrote: > > (a) seems to say that you don't like requiring programmers to mark > atomic accesses specially. Is that the case? In Paul's example, they were marked specially. And you seemed to argue that Paul's example could possibly return anything but 0/0. If you really think that, I hope to God that you have nothing to do with the C standard or any actual compiler I ever use. Because such a standard or compiler would be shit. It's sadly not too uncommon. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-10 1:24 ` Linus Torvalds
@ 2014-02-10 1:46 ` Torvald Riegel
2014-02-10 2:04 ` Linus Torvalds
0 siblings, 1 reply; 285+ messages in thread
From: Torvald Riegel @ 2014-02-10 1:46 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Sun, 2014-02-09 at 17:24 -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 5:16 PM, Torvald Riegel <triegel@redhat.com> wrote:
> >
> > (a) seems to say that you don't like requiring programmers to mark
> > atomic accesses specially. Is that the case?
>
> In Paul's example, they were marked specially.
>
> And you seemed to argue that Paul's example could possibly return
> anything but 0/0.

Just read my reply to Paul again. Here's an excerpt:

> Initial state: x == y == 0
>
> T1: r1 = atomic_load_explicit(x, memory_order_relaxed);
> atomic_store_explicit(42, y, memory_order_relaxed);
> if (r1 != 42)
> atomic_store_explicit(r1, y, memory_order_relaxed);
>
> T2: r2 = atomic_load_explicit(y, memory_order_relaxed);
> atomic_store_explicit(r2, x, memory_order_relaxed);

Intuitively, this is wrong because this lets the program take a step the abstract machine wouldn't do. This is different to the sequential code that Peter posted because it uses atomics, and thus one can't easily assume that the difference is not observable.

IOW, I wrote that such a compiler transformation would be wrong in my opinion. Thus, it should *not* return 42. If you still see how what I wrote could be misunderstood, please let me know because I really don't see it.

Yes, I don't do so by swearing or calling others insane, and I try to see the reasoning that those compilers' authors might have had, even if I don't agree with it. In my personal opinion, that's a good thing.

> If you really think that, I hope to God that you have nothing to do
> with the C standard or any actual compiler I ever use.
> > Because such a standard or compiler would be shit. It's sadly not too uncommon. Thanks for the kind words. Perhaps writing that was quicker than reading what I actually wrote. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-10 1:46 ` Torvald Riegel
@ 2014-02-10 2:04 ` Linus Torvalds
0 siblings, 0 replies; 285+ messages in thread
From: Linus Torvalds @ 2014-02-10 2:04 UTC (permalink / raw)
To: Torvald Riegel
Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Sun, Feb 9, 2014 at 5:46 PM, Torvald Riegel <triegel@redhat.com> wrote:
>
> IOW, I wrote that such a compiler transformation would be wrong in my
> opinion. Thus, it should *not* return 42.

Ahh, I am happy to have misunderstood. The "intuitively" threw me, because I thought that was building up to a "but", and misread the rest.

I then reacted strongly, because I've seen so much total crap (the type-based C aliasing rules topping my list) etc. coming out of standards groups because it allows them to generate wrong code that goes faster, that I just assume compiler people are out to do stupid things in the name of "..but the standard allows it".

Linus

^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-10 0:27 ` Torvald Riegel 2014-02-10 0:56 ` Linus Torvalds @ 2014-02-10 3:21 ` Paul E. McKenney 2014-02-10 3:45 ` Paul E. McKenney ` (2 subsequent siblings) 4 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-10 3:21 UTC (permalink / raw) To: Torvald Riegel Cc: Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote: > On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote: > > On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote: > > > Hi Paul, > > > > > > On Fri, Feb 07, 2014 at 04:50:28PM +0000, Paul E. McKenney wrote: > > > > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote: > > > > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote: > > > > > > Hopefully some discussion of out-of-thin-air values as well. > > > > > > > > > > Yes, absolutely shoot store speculation in the head already. Then drive > > > > > a wooden stake through its hart. > > > > > > > > > > C11/C++11 should not be allowed to claim itself a memory model until that > > > > > is sorted. > > > > > > > > There actually is a proposal being put forward, but it might not make ARM > > > > and Power people happy because it involves adding a compare, a branch, > > > > and an ISB/isync after every relaxed load... Me, I agree with you, > > > > much preferring the no-store-speculation approach. > > > > > > Can you elaborate a bit on this please? We don't permit speculative stores > > > in the ARM architecture, so it seems counter-intuitive that GCC needs to > > > emit any additional instructions to prevent that from happening. 
> > > > Requiring a compare/branch/ISB after each relaxed load enables a simple(r) > > proof that out-of-thin-air values cannot be observed in the face of any > > compiler optimization that refrains from reordering a prior relaxed load > > with a subsequent relaxed store. > > > > > Stores can, of course, be observed out-of-order but that's a lot more > > > reasonable :) > > > > So let me try an example. I am sure that Torvald Riegel will jump in > > with any needed corrections or amplifications: > > > > Initial state: x == y == 0 > > > > T1: r1 = atomic_load_explicit(x, memory_order_relaxed); > > atomic_store_explicit(r1, y, memory_order_relaxed); > > > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > > atomic_store_explicit(r2, x, memory_order_relaxed); > > > > One would intuitively expect r1 == r2 == 0 as the only possible outcome. > > But suppose that the compiler used specialization optimizations, as it > > would if there was a function that has a very lightweight implementation > > for some values and a very heavyweight one for other. In particular, > > suppose that the lightweight implementation was for the value 42. > > Then the compiler might do something like the following: > > > > Initial state: x == y == 0 > > > > T1: r1 = atomic_load_explicit(x, memory_order_relaxed); > > if (r1 == 42) > > atomic_store_explicit(42, y, memory_order_relaxed); > > else > > atomic_store_explicit(r1, y, memory_order_relaxed); > > > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > > atomic_store_explicit(r2, x, memory_order_relaxed); > > > > Suddenly we have an explicit constant 42 showing up. Of course, if > > the compiler carefully avoided speculative stores (as both Peter and > > I believe that it should if its code generation is to be regarded as > > anything other than an act of vandalism, the words in the standard > > notwithstanding), there would be no problem. 
But currently, a number > > of compiler writers see absolutely nothing wrong with transforming > > the optimized-for-42 version above with something like this: > > > > Initial state: x == y == 0 > > > > T1: r1 = atomic_load_explicit(x, memory_order_relaxed); > > atomic_store_explicit(42, y, memory_order_relaxed); > > if (r1 != 42) > > atomic_store_explicit(r1, y, memory_order_relaxed); > > > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > > atomic_store_explicit(r2, x, memory_order_relaxed); > > Intuitively, this is wrong because this let's the program take a step > the abstract machine wouldn't do. This is different to the sequential > code that Peter posted because it uses atomics, and thus one can't > easily assume that the difference is not observable. > > For this to be correct, the compiler would actually have to prove that > the speculative store is "as-if correct", which in turn would mean that > it needs to be aware of all potential observers, and check whether those > observers aren't actually affected by the speculative store. > > I would guess that the compilers you have in mind don't really do that. > If they do, then I don't see why this should be okay, unless you think > out-of-thin-air values are something good (which I wouldn't agree with). OK, we agree that pulling the atomic store to y out of its "if" statement is a bad thing. Very good! Now we just have to convince others on the committee. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-10 0:27 ` Torvald Riegel 2014-02-10 0:56 ` Linus Torvalds 2014-02-10 3:21 ` Paul E. McKenney @ 2014-02-10 3:45 ` Paul E. McKenney 2014-02-10 11:46 ` Peter Zijlstra 2014-02-10 19:09 ` Linus Torvalds 4 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-10 3:45 UTC (permalink / raw) To: Torvald Riegel Cc: Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote: > On Fri, 2014-02-07 at 10:02 -0800, Paul E. McKenney wrote: > > On Fri, Feb 07, 2014 at 04:55:48PM +0000, Will Deacon wrote: [ . . . ] > > And then it is a short and uncontroversial step to the following: > > > > Initial state: x == y == 0 > > > > T1: atomic_store_explicit(42, y, memory_order_relaxed); > > r1 = atomic_load_explicit(x, memory_order_relaxed); > > if (r1 != 42) > > atomic_store_explicit(r1, y, memory_order_relaxed); > > > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > > atomic_store_explicit(r2, x, memory_order_relaxed); > > > > This can of course result in r1 == r2 == 42, even though the constant > > 42 never appeared in the original code. This is one way to generate > > an out-of-thin-air value. > > > > As near as I can tell, compiler writers hate the idea of prohibiting > > speculative-store optimizations because it requires them to introduce > > both control and data dependency tracking into their compilers. > > I wouldn't characterize the situation like this (although I can't speak > for others, obviously). IMHO, it's perfectly fine on sequential / > non-synchronizing code, because we know the difference isn't observable > by a correct program. For synchronizing code, compilers just shouldn't > do it, or they would have to truly prove that speculation is harmless. > That will be hard, so I think it should just be avoided. 
> > Synchronization code will likely have been tuned anyway (especially if > it uses relaxed MO), so I don't see a large need for trying to optimize > using speculative atomic stores. > > Thus, I think there's an easy and practical solution. I like this approach, but there has been resistance to it in the past. Definitely worth a good try, though! > > Many of > > them seem to hate dependency tracking with a purple passion. At least, > > such a hatred would go a long way towards explaining the incomplete > > and high-overhead implementations of memory_order_consume, the long > > and successful use of idioms based on the memory_order_consume pattern > > notwithstanding [*]. ;-) > > I still think that's different because it blurs the difference between > sequential code and synchronizing code (ie, atomic accesses). With > consume MO, the simple solution above doesn't work anymore, because > suddenly synchronizing code does affect optimizations in sequential > code, even if that wouldn't reorder across the synchronizing code (which > would be clearly "visible" to the implementation of the optimization). I understand that memory_order_consume is a bit harder on compiler writers than the other memory orders, but it is also pretty valuable. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-10 0:27 ` Torvald Riegel ` (2 preceding siblings ...) 2014-02-10 3:45 ` Paul E. McKenney @ 2014-02-10 11:46 ` Peter Zijlstra 2014-02-10 19:09 ` Linus Torvalds 4 siblings, 0 replies; 285+ messages in thread From: Peter Zijlstra @ 2014-02-10 11:46 UTC (permalink / raw) To: Torvald Riegel Cc: paulmck, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Mon, Feb 10, 2014 at 01:27:51AM +0100, Torvald Riegel wrote: > > Initial state: x == y == 0 > > > > T1: r1 = atomic_load_explicit(x, memory_order_relaxed); > > atomic_store_explicit(42, y, memory_order_relaxed); > > if (r1 != 42) > > atomic_store_explicit(r1, y, memory_order_relaxed); > > > > T2: r2 = atomic_load_explicit(y, memory_order_relaxed); > > atomic_store_explicit(r2, x, memory_order_relaxed); > > Intuitively, this is wrong because this let's the program take a step > the abstract machine wouldn't do. This is different to the sequential > code that Peter posted because it uses atomics, and thus one can't > easily assume that the difference is not observable. Yeah, my bad for not being familiar with the atrocious crap C11 made of atomics :/ ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-10 0:27 ` Torvald Riegel
` (3 preceding siblings ...)
2014-02-10 11:46 ` Peter Zijlstra
@ 2014-02-10 19:09 ` Linus Torvalds
2014-02-11 15:59 ` Paul E. McKenney
2014-02-12 5:39 ` Torvald Riegel
4 siblings, 2 replies; 285+ messages in thread
From: Linus Torvalds @ 2014-02-10 19:09 UTC (permalink / raw)
To: Torvald Riegel
Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel <triegel@redhat.com> wrote:
>
> Intuitively, this is wrong because this lets the program take a step
> the abstract machine wouldn't do. This is different from the sequential
> code that Peter posted because it uses atomics, and thus one can't
> easily assume that the difference is not observable.

Btw, what is the definition of "observable" for the atomics?

Because I'm hoping that it's not the same as for volatiles, where
"observable" is about the virtual machine itself, and as such volatile
accesses cannot be combined or optimized at all.

Now, I claim that atomic accesses cannot be done speculatively for
writes, and not re-done for reads (because the value could change),
but *combining* them would be possible and good.

For example, we often have multiple independent atomic accesses that
could certainly be combined: testing the individual bits of an atomic
value with helper functions, causing things like "load atomic, test
bit, load same atomic, test another bit". The two atomic loads could
be done as a single load without possibly changing semantics on a real
machine, but if "visibility" is defined in the same way it is for
"volatile", that wouldn't be a valid transformation. Right now we use
"volatile" semantics for these kinds of things, and they really can
hurt.

Same goes for multiple writes (possibly due to setting bits):
combining multiple accesses into a single one is generally fine, it's
*adding* write accesses speculatively that is broken by design.

At the same time, you can't combine atomic loads or stores infinitely
- "visibility" on a real machine definitely is about timeliness.
Removing all but the last write when there are multiple consecutive
writes is generally fine, even if you unroll a loop to generate those
writes. But if what remains is a loop, it might be a busy-loop
basically waiting for something, so it would be wrong ("untimely") to
hoist a store in a loop entirely past the end of the loop, or hoist a
load in a loop to before the loop.

Does the standard allow for that kind of behavior?

Linus

^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-10 19:09 ` Linus Torvalds
@ 2014-02-11 15:59 ` Paul E. McKenney
2014-02-12 6:06 ` Torvald Riegel
2014-02-12 5:39 ` Torvald Riegel
1 sibling, 1 reply; 285+ messages in thread
From: Paul E. McKenney @ 2014-02-11 15:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, Feb 10, 2014 at 11:09:24AM -0800, Linus Torvalds wrote:
> On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel <triegel@redhat.com> wrote:
> >
> > Intuitively, this is wrong because this lets the program take a step
> > the abstract machine wouldn't do. This is different from the sequential
> > code that Peter posted because it uses atomics, and thus one can't
> > easily assume that the difference is not observable.
>
> Btw, what is the definition of "observable" for the atomics?
>
> Because I'm hoping that it's not the same as for volatiles, where
> "observable" is about the virtual machine itself, and as such volatile
> accesses cannot be combined or optimized at all.
>
> Now, I claim that atomic accesses cannot be done speculatively for
> writes, and not re-done for reads (because the value could change),
> but *combining* them would be possible and good.
>
> For example, we often have multiple independent atomic accesses that
> could certainly be combined: testing the individual bits of an atomic
> value with helper functions, causing things like "load atomic, test
> bit, load same atomic, test another bit". The two atomic loads could
> be done as a single load without possibly changing semantics on a real
> machine, but if "visibility" is defined in the same way it is for
> "volatile", that wouldn't be a valid transformation. Right now we use
> "volatile" semantics for these kinds of things, and they really can
> hurt.
>
> Same goes for multiple writes (possibly due to setting bits):
> combining multiple accesses into a single one is generally fine, it's
> *adding* write accesses speculatively that is broken by design.
>
> At the same time, you can't combine atomic loads or stores infinitely
> - "visibility" on a real machine definitely is about timeliness.
> Removing all but the last write when there are multiple consecutive
> writes is generally fine, even if you unroll a loop to generate those
> writes. But if what remains is a loop, it might be a busy-loop
> basically waiting for something, so it would be wrong ("untimely") to
> hoist a store in a loop entirely past the end of the loop, or hoist a
> load in a loop to before the loop.
>
> Does the standard allow for that kind of behavior?

You asked! ;-)

So the current standard allows merging of both loads and stores, unless of
course ordering constraints prevent the merging. Volatile semantics may be
used to prevent this merging, if desired, for example, for real-time code.

Infinite merging is intended to be prohibited, but I am not certain that
the current wording is bullet-proof (1.10p24 and 1.10p25).

The only prohibition against speculative stores that I can see is in a
non-normative note, and it can be argued to apply only to things that are
not atomics (1.10p22). I don't see any prohibition against reordering
a store to precede a load preceding a conditional branch -- which would
not be speculative if the branch was known to be taken and the load
hit in the store buffer. In a system where stores could be reordered,
some other CPU might perceive the store as happening before the load
that controlled the conditional branch. This needs to be addressed.

Why this hole? At the time, the current formalizations of popular
CPU architectures did not exist, and it was not clear that all popular
hardware avoided speculative stores.

There is also fun with "out of thin air" values, which everyone agrees
should be prohibited, but where there is no agreement on how to prohibit
them in a mathematically constructive manner. The current draft contains
a clause simply stating that out-of-thin-air values are prohibited,
which doesn't help someone constructing tools to analyze C++ code.
One proposal requires that subsequent atomic stores never be reordered
before prior atomic loads, which requires useless ordering code to be
emitted on ARM and PowerPC (you may have seen Will Deacon's and Peter
Zijlstra's reaction to this proposal a few days ago). Note that Itanium
already pays this price in order to provide full single-variable cache
coherence. This out-of-thin-air discussion is also happening in the
Java community in preparation for a new rev of the Java memory model.

There will also be some discussions on memory_order_consume, which is
intended to (eventually) implement rcu_dereference(). The compiler
writers don't like tracking dependencies, but there may be some ways of
constraining optimizations so as to preserve the dependencies in the
common case, while providing some syntax to force preservation of
dependencies that would otherwise be optimized out. One example of this
is where you have an RCU-protected array that might sometimes contain
only a single element. In the single-element case, the compiler knows
a priori which element will be used, and will therefore optimize the
dependency away, so that the reader might see pre-initialization state.
But this is rare, so if some syntax needs to be added in this case,
I believe we should be OK with it. (If syntax is needed for plain old
dereferences, it is thumbs down all the way as far as I am concerned.
Ditto for things like stripping the bottom bits off of a decorated
pointer.)

No doubt other memory-model issues will come up, but those are the ones
I know about at the moment. As I said to begin with, hey, you asked!

That said, I would very much appreciate any thoughts or suggestions on
handling these issues.

Thanx, Paul

^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-11 15:59 ` Paul E. McKenney
@ 2014-02-12 6:06 ` Torvald Riegel
2014-02-12 9:19 ` Peter Zijlstra
2014-02-12 17:39 ` Paul E. McKenney
0 siblings, 2 replies; 285+ messages in thread
From: Torvald Riegel @ 2014-02-12 6:06 UTC (permalink / raw)
To: paulmck
Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Tue, 2014-02-11 at 07:59 -0800, Paul E. McKenney wrote:
> On Mon, Feb 10, 2014 at 11:09:24AM -0800, Linus Torvalds wrote:
> > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel <triegel@redhat.com> wrote:
> > >
> > > Intuitively, this is wrong because this lets the program take a step
> > > the abstract machine wouldn't do. This is different from the sequential
> > > code that Peter posted because it uses atomics, and thus one can't
> > > easily assume that the difference is not observable.
> >
> > Btw, what is the definition of "observable" for the atomics?
> >
> > Because I'm hoping that it's not the same as for volatiles, where
> > "observable" is about the virtual machine itself, and as such volatile
> > accesses cannot be combined or optimized at all.
> >
> > Now, I claim that atomic accesses cannot be done speculatively for
> > writes, and not re-done for reads (because the value could change),
> > but *combining* them would be possible and good.
> >
> > For example, we often have multiple independent atomic accesses that
> > could certainly be combined: testing the individual bits of an atomic
> > value with helper functions, causing things like "load atomic, test
> > bit, load same atomic, test another bit". The two atomic loads could
> > be done as a single load without possibly changing semantics on a real
> > machine, but if "visibility" is defined in the same way it is for
> > "volatile", that wouldn't be a valid transformation. Right now we use
> > "volatile" semantics for these kinds of things, and they really can
> > hurt.
> >
> > Same goes for multiple writes (possibly due to setting bits):
> > combining multiple accesses into a single one is generally fine, it's
> > *adding* write accesses speculatively that is broken by design.
> >
> > At the same time, you can't combine atomic loads or stores infinitely
> > - "visibility" on a real machine definitely is about timeliness.
> > Removing all but the last write when there are multiple consecutive
> > writes is generally fine, even if you unroll a loop to generate those
> > writes. But if what remains is a loop, it might be a busy-loop
> > basically waiting for something, so it would be wrong ("untimely") to
> > hoist a store in a loop entirely past the end of the loop, or hoist a
> > load in a loop to before the loop.
> >
> > Does the standard allow for that kind of behavior?
>
> You asked! ;-)
>
> So the current standard allows merging of both loads and stores, unless of
> course ordering constraints prevent the merging. Volatile semantics may be
> used to prevent this merging, if desired, for example, for real-time code.

Agreed.

> Infinite merging is intended to be prohibited, but I am not certain that
> the current wording is bullet-proof (1.10p24 and 1.10p25).

Yeah, maybe not. But it at least seems to rather clearly indicate the
intent ;)

> The only prohibition against speculative stores that I can see is in a
> non-normative note, and it can be argued to apply only to things that are
> not atomics (1.10p22).

I think this one is specifically about speculative stores that would
affect memory locations that the abstract machine would not write to,
and that might be observable or create data races. While a compiler
could potentially prove that such stores aren't leading to a difference
in the behavior of the program (e.g., by proving that there are no
observers anywhere and this isn't overlapping with any volatile
locations), I think that this is hard in general and most compilers
will just not do such things. In GCC, bugs in that category were fixed
after researchers doing fuzz-testing found them (IIRC, speculative
stores by loops).

> I don't see any prohibition against reordering
> a store to precede a load preceding a conditional branch -- which would
> not be speculative if the branch was known to be taken and the load
> hit in the store buffer. In a system where stores could be reordered,
> some other CPU might perceive the store as happening before the load
> that controlled the conditional branch. This needs to be addressed.

I don't know the specifics of your example, but from how I understand
it, I don't see a problem if the compiler can prove that the store will
always happen.

To be more specific, if the compiler can prove that the store will
happen anyway, and the region of code can be assumed to always run
atomically (e.g., there's no loop or such in there), then it is known
that we have one atomic region of code that will always perform the
store, so we might as well do the stuff in the region in some order.

Now, if any of the memory accesses are atomic, then the whole region of
code containing those accesses is often not atomic because other threads
might observe intermediate results in a data-race-free way.

(I know that this isn't a very precise formulation, but I hope it brings
my line of reasoning across.)

> Why this hole? At the time, the current formalizations of popular
> CPU architectures did not exist, and it was not clear that all popular
> hardware avoided speculative stores.

I'm not quite sure which hole you see there. Can you elaborate?

^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-12 6:06 ` Torvald Riegel
@ 2014-02-12 9:19 ` Peter Zijlstra
2014-02-12 17:42 ` Paul E. McKenney
2014-02-14 5:07 ` Torvald Riegel
1 sibling, 2 replies; 285+ messages in thread
From: Peter Zijlstra @ 2014-02-12 9:19 UTC (permalink / raw)
To: Torvald Riegel
Cc: paulmck, Linus Torvalds, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

> I don't know the specifics of your example, but from how I understand
> it, I don't see a problem if the compiler can prove that the store will
> always happen.
>
> To be more specific, if the compiler can prove that the store will
> happen anyway, and the region of code can be assumed to always run
> atomically (e.g., there's no loop or such in there), then it is known
> that we have one atomic region of code that will always perform the
> store, so we might as well do the stuff in the region in some order.
>
> Now, if any of the memory accesses are atomic, then the whole region of
> code containing those accesses is often not atomic because other threads
> might observe intermediate results in a data-race-free way.
>
> (I know that this isn't a very precise formulation, but I hope it brings
> my line of reasoning across.)

So given something like:

	if (x)
		y = 3;

assuming both x and y are atomic (so don't gimme crap for not knowing
the C11 atomic incantations); and you can prove x is always true; you
don't see a problem with not emitting the conditional?

Avoiding the conditional changes the result; see that control dependency
email from earlier. In the above example the load of X and the store to
Y are strictly ordered, due to control dependencies. Not emitting the
condition and maybe not even emitting the load completely wrecks this.

It's therefore an invalid optimization to take out the conditional or
speculate the store, since it takes out the dependency.

^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-12 9:19 ` Peter Zijlstra
@ 2014-02-12 17:42 ` Paul E. McKenney
2014-02-12 18:12 ` Peter Zijlstra
2014-02-14 5:07 ` Torvald Riegel
1 sibling, 1 reply; 285+ messages in thread
From: Paul E. McKenney @ 2014-02-12 17:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Torvald Riegel, Linus Torvalds, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Wed, Feb 12, 2014 at 10:19:07AM +0100, Peter Zijlstra wrote:
> > I don't know the specifics of your example, but from how I understand
> > it, I don't see a problem if the compiler can prove that the store will
> > always happen.
> >
> > To be more specific, if the compiler can prove that the store will
> > happen anyway, and the region of code can be assumed to always run
> > atomically (e.g., there's no loop or such in there), then it is known
> > that we have one atomic region of code that will always perform the
> > store, so we might as well do the stuff in the region in some order.
> >
> > Now, if any of the memory accesses are atomic, then the whole region of
> > code containing those accesses is often not atomic because other threads
> > might observe intermediate results in a data-race-free way.
> >
> > (I know that this isn't a very precise formulation, but I hope it brings
> > my line of reasoning across.)
>
> So given something like:
>
> 	if (x)
> 		y = 3;
>
> assuming both x and y are atomic (so don't gimme crap for not knowing
> the C11 atomic incantations); and you can prove x is always true; you
> don't see a problem with not emitting the conditional?

You need volatile semantics to force the compiler to ignore any proofs
it might otherwise attempt to construct. Hence all the ACCESS_ONCE()
calls in my email to Torvald. (Hopefully I translated your example
reasonably.)

Thanx, Paul

> Avoiding the conditional changes the result; see that control dependency
> email from earlier. In the above example the load of X and the store to
> Y are strictly ordered, due to control dependencies. Not emitting the
> condition and maybe not even emitting the load completely wrecks this.
>
> It's therefore an invalid optimization to take out the conditional or
> speculate the store, since it takes out the dependency.

^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-12 17:42 ` Paul E. McKenney
@ 2014-02-12 18:12 ` Peter Zijlstra
2014-02-17 18:18 ` Paul E. McKenney
0 siblings, 1 reply; 285+ messages in thread
From: Peter Zijlstra @ 2014-02-12 18:12 UTC (permalink / raw)
To: Paul E. McKenney
Cc: Torvald Riegel, Linus Torvalds, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Wed, Feb 12, 2014 at 09:42:09AM -0800, Paul E. McKenney wrote:
> You need volatile semantics to force the compiler to ignore any proofs
> it might otherwise attempt to construct. Hence all the ACCESS_ONCE()
> calls in my email to Torvald. (Hopefully I translated your example
> reasonably.)

My brain gave out for today; but it did appear to have the right
structure.

I would prefer it if C11 would not require the volatile casts. It should
simply _never_ speculate with atomic writes, volatile or not.

^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-12 18:12 ` Peter Zijlstra
@ 2014-02-17 18:18 ` Paul E. McKenney
2014-02-17 20:39 ` Richard Biener
0 siblings, 1 reply; 285+ messages in thread
From: Paul E. McKenney @ 2014-02-17 18:18 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Torvald Riegel, Linus Torvalds, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Wed, Feb 12, 2014 at 07:12:05PM +0100, Peter Zijlstra wrote:
> On Wed, Feb 12, 2014 at 09:42:09AM -0800, Paul E. McKenney wrote:
> > You need volatile semantics to force the compiler to ignore any proofs
> > it might otherwise attempt to construct. Hence all the ACCESS_ONCE()
> > calls in my email to Torvald. (Hopefully I translated your example
> > reasonably.)
>
> My brain gave out for today; but it did appear to have the right
> structure.

I can relate. ;-)

> I would prefer it if C11 would not require the volatile casts. It should
> simply _never_ speculate with atomic writes, volatile or not.

I agree with not needing volatiles to prevent speculated writes.
However, they will sometimes be needed to prevent excessive load/store
combining. The compiler doesn't have the runtime feedback mechanisms
that the hardware has, and thus will need help from the developer from
time to time.

Or maybe the Linux kernel simply waits to transition to C11 relaxed
atomics until the compiler has learned to be sufficiently conservative
in its load-store combining decisions.

Thanx, Paul

^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-17 18:18 ` Paul E. McKenney
@ 2014-02-17 20:39 ` Richard Biener
2014-02-17 22:14 ` Paul E. McKenney
0 siblings, 1 reply; 285+ messages in thread
From: Richard Biener @ 2014-02-17 20:39 UTC (permalink / raw)
To: paulmck, Paul E. McKenney, Peter Zijlstra
Cc: Torvald Riegel, Linus Torvalds, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On February 17, 2014 7:18:15 PM GMT+01:00, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
>On Wed, Feb 12, 2014 at 07:12:05PM +0100, Peter Zijlstra wrote:
>> On Wed, Feb 12, 2014 at 09:42:09AM -0800, Paul E. McKenney wrote:
>> > You need volatile semantics to force the compiler to ignore any proofs
>> > it might otherwise attempt to construct. Hence all the ACCESS_ONCE()
>> > calls in my email to Torvald. (Hopefully I translated your example
>> > reasonably.)
>>
>> My brain gave out for today; but it did appear to have the right
>> structure.
>
>I can relate. ;-)
>
>> I would prefer it if C11 would not require the volatile casts. It should
>> simply _never_ speculate with atomic writes, volatile or not.
>
>I agree with not needing volatiles to prevent speculated writes.
>However, they will sometimes be needed to prevent excessive load/store
>combining. The compiler doesn't have the runtime feedback mechanisms
>that the hardware has, and thus will need help from the developer from
>time to time.
>
>Or maybe the Linux kernel simply waits to transition to C11 relaxed
>atomics until the compiler has learned to be sufficiently conservative
>in its load-store combining decisions.

Sounds backwards. Currently the compiler does nothing to the atomics.
I'm sure we'll eventually add something. But if testing coverage is zero
outside then surely things get worse, not better with time.

Richard.

> Thanx, Paul

^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-17 20:39 ` Richard Biener
@ 2014-02-17 22:14 ` Paul E. McKenney
2014-02-17 22:27 ` Torvald Riegel
0 siblings, 1 reply; 285+ messages in thread
From: Paul E. McKenney @ 2014-02-17 22:14 UTC (permalink / raw)
To: Richard Biener
Cc: Peter Zijlstra, Torvald Riegel, Linus Torvalds, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, Feb 17, 2014 at 09:39:54PM +0100, Richard Biener wrote:
> On February 17, 2014 7:18:15 PM GMT+01:00, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >On Wed, Feb 12, 2014 at 07:12:05PM +0100, Peter Zijlstra wrote:
> >> On Wed, Feb 12, 2014 at 09:42:09AM -0800, Paul E. McKenney wrote:
> >> > You need volatile semantics to force the compiler to ignore any proofs
> >> > it might otherwise attempt to construct. Hence all the ACCESS_ONCE()
> >> > calls in my email to Torvald. (Hopefully I translated your example
> >> > reasonably.)
> >>
> >> My brain gave out for today; but it did appear to have the right
> >> structure.
> >
> >I can relate. ;-)
> >
> >> I would prefer it if C11 would not require the volatile casts. It should
> >> simply _never_ speculate with atomic writes, volatile or not.
> >
> >I agree with not needing volatiles to prevent speculated writes.
> >However, they will sometimes be needed to prevent excessive load/store
> >combining. The compiler doesn't have the runtime feedback mechanisms
> >that the hardware has, and thus will need help from the developer from
> >time to time.
> >
> >Or maybe the Linux kernel simply waits to transition to C11 relaxed
> >atomics until the compiler has learned to be sufficiently conservative
> >in its load-store combining decisions.
>
> Sounds backwards. Currently the compiler does nothing to the atomics.
> I'm sure we'll eventually add something. But if testing coverage is zero
> outside then surely things get worse, not better with time.

Perhaps we solve this chicken-and-egg problem by creating a test suite?

Thanx, Paul

^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-17 22:14 ` Paul E. McKenney
@ 2014-02-17 22:27 ` Torvald Riegel
0 siblings, 0 replies; 285+ messages in thread
From: Torvald Riegel @ 2014-02-17 22:27 UTC (permalink / raw)
To: paulmck
Cc: Richard Biener, Peter Zijlstra, Linus Torvalds, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, 2014-02-17 at 14:14 -0800, Paul E. McKenney wrote:
> On Mon, Feb 17, 2014 at 09:39:54PM +0100, Richard Biener wrote:
> > On February 17, 2014 7:18:15 PM GMT+01:00, "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > >On Wed, Feb 12, 2014 at 07:12:05PM +0100, Peter Zijlstra wrote:
> > >> On Wed, Feb 12, 2014 at 09:42:09AM -0800, Paul E. McKenney wrote:
> > >> > You need volatile semantics to force the compiler to ignore any proofs
> > >> > it might otherwise attempt to construct. Hence all the ACCESS_ONCE()
> > >> > calls in my email to Torvald. (Hopefully I translated your example
> > >> > reasonably.)
> > >>
> > >> My brain gave out for today; but it did appear to have the right
> > >> structure.
> > >
> > >I can relate. ;-)
> > >
> > >> I would prefer it if C11 would not require the volatile casts. It should
> > >> simply _never_ speculate with atomic writes, volatile or not.
> > >
> > >I agree with not needing volatiles to prevent speculated writes.
> > >However, they will sometimes be needed to prevent excessive load/store
> > >combining. The compiler doesn't have the runtime feedback mechanisms
> > >that the hardware has, and thus will need help from the developer from
> > >time to time.
> > >
> > >Or maybe the Linux kernel simply waits to transition to C11 relaxed
> > >atomics until the compiler has learned to be sufficiently conservative
> > >in its load-store combining decisions.
> >
> > Sounds backwards. Currently the compiler does nothing to the atomics.
> > I'm sure we'll eventually add something. But if testing coverage is zero
> > outside then surely things get worse, not better with time.
>
> Perhaps we solve this chicken-and-egg problem by creating a test suite?

Perhaps. The test suite might also be a good set of examples showing
which cases we expect to be optimized in a certain way, and which not.
I suppose the uses of (the equivalent of) atomics in the kernel would
be a good start.

^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-12 9:19 ` Peter Zijlstra
2014-02-12 17:42 ` Paul E. McKenney
@ 2014-02-14 5:07 ` Torvald Riegel
2014-02-14 9:50 ` Peter Zijlstra
1 sibling, 1 reply; 285+ messages in thread
From: Torvald Riegel @ 2014-02-14 5:07 UTC (permalink / raw)
To: Peter Zijlstra
Cc: paulmck, Linus Torvalds, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Wed, 2014-02-12 at 10:19 +0100, Peter Zijlstra wrote:
> > I don't know the specifics of your example, but from how I understand
> > it, I don't see a problem if the compiler can prove that the store will
> > always happen.
> >
> > To be more specific, if the compiler can prove that the store will
> > happen anyway, and the region of code can be assumed to always run
> > atomically (e.g., there's no loop or such in there), then it is known
> > that we have one atomic region of code that will always perform the
> > store, so we might as well do the stuff in the region in some order.
> >
> > Now, if any of the memory accesses are atomic, then the whole region of
> > code containing those accesses is often not atomic because other threads
> > might observe intermediate results in a data-race-free way.
> >
> > (I know that this isn't a very precise formulation, but I hope it brings
> > my line of reasoning across.)
>
> So given something like:
>
> 	if (x)
> 		y = 3;
>
> assuming both x and y are atomic (so don't gimme crap for not knowing
> the C11 atomic incantations); and you can prove x is always true; you
> don't see a problem with not emitting the conditional?

That depends on what your goal is. It would be correct as far as the
standard is specified; this makes sense if all you want is indeed a
program that does what the abstract machine might do, and produces the
same output / side effects. If you're trying to preserve the branch in
the code emitted / executed by the implementation, then it would not
be correct. But those branches aren't specified as being part of the
observable side effects. In the common case, this makes sense because
it enables optimizations that are useful; this line of reasoning also
allows the compiler to merge some atomic accesses in the way that Linus
would like to see it.

> Avoiding the conditional changes the result; see that control dependency
> email from earlier.

It does not, with regard to how the standard defines "result".

> In the above example the load of X and the store to
> Y are strictly ordered, due to control dependencies. Not emitting the
> condition and maybe not even emitting the load completely wrecks this.

I think you're trying to solve this backwards. You are looking at this
with an implicit wishlist of what the compiler should do (or how you
want to use the hardware), but this is not a viable specification that
one can write a compiler against.

We do need clear rules for what the compiler is allowed to do or not
(e.g., a memory model that models multi-threaded executions). Otherwise
it's all hand-waving, and we're getting nowhere. Thus, the way to
approach this is to propose a feature or change to the standard, make
sure that this is consistent and has no unintended side effects for
other aspects of compilation or other code, and then ask the compiler
to implement it. IOW, we need a patch for where this all starts: in the
rules and requirements for compilation.

Paul and I are at the C++ meeting currently, and we had sessions in
which the concurrency study group talked about memory model issues like
dependency tracking and memory_order_consume. Paul shared uses of
atomics (or likewise) in the kernel, and we discussed how the memory
model currently handles various cases and why, how one could express
other requirements consistently, and what is actually implementable in
practice. I can't speak for Paul, but I thought those discussions were
productive.

> It's therefore an invalid optimization to take out the conditional or
> speculate the store, since it takes out the dependency.

^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-14 5:07 ` Torvald Riegel
@ 2014-02-14 9:50 ` Peter Zijlstra
2014-02-14 19:19 ` Torvald Riegel
0 siblings, 1 reply; 285+ messages in thread
From: Peter Zijlstra @ 2014-02-14 9:50 UTC (permalink / raw)
To: Torvald Riegel
Cc: paulmck, Linus Torvalds, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Thu, Feb 13, 2014 at 09:07:55PM -0800, Torvald Riegel wrote:
> That depends on what your goal is.

A compiler that we don't need to fight in order to generate sane code
would be nice. But as Linus said; we can continue to ignore you lot and
go on as we've done.

^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-14 9:50 ` Peter Zijlstra @ 2014-02-14 19:19 ` Torvald Riegel 0 siblings, 0 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-14 19:19 UTC (permalink / raw) To: Peter Zijlstra Cc: paulmck, Linus Torvalds, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Fri, 2014-02-14 at 10:50 +0100, Peter Zijlstra wrote: > On Thu, Feb 13, 2014 at 09:07:55PM -0800, Torvald Riegel wrote: > > That depends on what your goal is. First, I don't know why you quoted that, but without the context, quoting it doesn't make sense. Let me repeat the point. The standard is the rule set for the compiler. Period. The compiler does not just serve the semantics that you might have in your head. It does have to do something meaningful for all of its users. Thus, the goal for the compiler is to properly compile programs in the language as specified. If there is a deficiency in the standard (bug or missing feature) -- and thus the specification, we need to have a patch for the standard that fixes this deficiency. If you think that this is the case, that's where you fix it. If your goal is to do wishful thinking, imagine some kind of semantics in your head, and then assume that magically, implementations will do just that, then that's bound to fail. > A compiler that we don't need to fight in order to generate sane code > would be nice. But as Linus said; we can continue to ignore you lot and > go on as we've done. I don't see why it's so hard to understand that you need to specify semantics, and the place (or at least the base) for that is the standard. Aren't you guys the ones replying "send a patch"? :) This isn't any different. If you're uncomfortable working with the standard, then say so, and reach out to people that aren't. You can surely ignore the specification of the language(s) that you are depending on. But that won't help you. If you want a change, get involved. 
(Oh, and claiming that the other side doesn't get it doesn't count as getting involved.) There's no fight between people here. It's just a technical problem that we have to solve in the right way.
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-12 6:06 ` Torvald Riegel 2014-02-12 9:19 ` Peter Zijlstra @ 2014-02-12 17:39 ` Paul E. McKenney 1 sibling, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-12 17:39 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Feb 11, 2014 at 10:06:34PM -0800, Torvald Riegel wrote: > On Tue, 2014-02-11 at 07:59 -0800, Paul E. McKenney wrote: > > On Mon, Feb 10, 2014 at 11:09:24AM -0800, Linus Torvalds wrote: > > > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel <triegel@redhat.com> wrote: > > > > > > > > Intuitively, this is wrong because this let's the program take a step > > > > the abstract machine wouldn't do. This is different to the sequential > > > > code that Peter posted because it uses atomics, and thus one can't > > > > easily assume that the difference is not observable. > > > > > > Btw, what is the definition of "observable" for the atomics? > > > > > > Because I'm hoping that it's not the same as for volatiles, where > > > "observable" is about the virtual machine itself, and as such volatile > > > accesses cannot be combined or optimized at all. > > > > > > Now, I claim that atomic accesses cannot be done speculatively for > > > writes, and not re-done for reads (because the value could change), > > > but *combining* them would be possible and good. > > > > > > For example, we often have multiple independent atomic accesses that > > > could certainly be combined: testing the individual bits of an atomic > > > value with helper functions, causing things like "load atomic, test > > > bit, load same atomic, test another bit". The two atomic loads could > > > be done as a single load without possibly changing semantics on a real > > > machine, but if "visibility" is defined in the same way it is for > > > "volatile", that wouldn't be a valid transformation. 
Right now we use > > > "volatile" semantics for these kinds of things, and they really can > > > hurt. > > > > > > Same goes for multiple writes (possibly due to setting bits): > > > combining multiple accesses into a single one is generally fine, it's > > > *adding* write accesses speculatively that is broken by design.. > > > > > > At the same time, you can't combine atomic loads or stores infinitely > > > - "visibility" on a real machine definitely is about timeliness. > > > Removing all but the last write when there are multiple consecutive > > > writes is generally fine, even if you unroll a loop to generate those > > > writes. But if what remains is a loop, it might be a busy-loop > > > basically waiting for something, so it would be wrong ("untimely") to > > > hoist a store in a loop entirely past the end of the loop, or hoist a > > > load in a loop to before the loop. > > > > > > Does the standard allow for that kind of behavior? > > > > You asked! ;-) > > > > So the current standard allows merging of both loads and stores, unless of > > course ordering constraints prevent the merging. Volatile semantics may be > > used to prevent this merging, if desired, for example, for real-time code. > > Agreed. > > > Infinite merging is intended to be prohibited, but I am not certain that > > the current wording is bullet-proof (1.10p24 and 1.10p25). > > Yeah, maybe not. But it at least seems to rather clearly indicate the > intent ;) That is my hope. ;-) > > The only prohibition against speculative stores that I can see is in a > > non-normative note, and it can be argued to apply only to things that are > > not atomics (1.10p22). > > I think this one is specifically about speculative stores that would > affect memory locations that the abstract machine would not write to, > and that might be observable or create data races.
While a compiler > could potentially prove that such stores aren't leading to a difference > in the behavior of the program (e.g., by proving that there are no > observers anywhere and this isn't overlapping with any volatile > locations), I think that this is hard in general and most compilers will > just not do such things. In GCC, bugs in that category were fixed after > researchers doing fuzz-testing found them (IIRC, speculative stores by > loops). And that is my fear. ;-) > > I don't see any prohibition against reordering > > a store to precede a load preceding a conditional branch -- which would > > not be speculative if the branch was known to be taken and the load > > hit in the store buffer. In a system where stores could be reordered, > > some other CPU might perceive the store as happening before the load > > that controlled the conditional branch. This needs to be addressed. > > I don't know the specifics of your example, but from how I understand > it, I don't see a problem if the compiler can prove that the store will > always happen. The current Documentation/memory-barriers.txt formulation requires that both the load and the store have volatile semantics. Does that help? > To be more specific, if the compiler can prove that the store will > happen anyway, and the region of code can be assumed to always run > atomically (e.g., there's no loop or such in there), then it is known > that we have one atomic region of code that will always perform the > store, so we might as well do the stuff in the region in some order. And it would be very hard to write a program that proved that the store had been reordered prior to the load in this case. > Now, if any of the memory accesses are atomic, then the whole region of > code containing those accesses is often not atomic because other threads > might observe intermediate results in a data-race-free way. > > (I know that this isn't a very precise formulation, but I hope it brings > my line of reasoning across.)
> > Why this hole? At the time, the current formalizations of popular
> > CPU architectures did not exist, and it was not clear that all popular
> > hardware avoided speculative stores.
>
> I'm not quite sure which hole you see there. Can you elaborate?

Here is one attempt, based on Peter's example later in this thread:

	#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

	atomic_int x, y; /* Default initialization to zero. */
	int r1;

	void T0(void)
	{
		if (atomic_load(&ACCESS_ONCE(x), memory_order_relaxed))
			atomic_store(&ACCESS_ONCE(y), 1, memory_order_relaxed);
	}

	void T1(void)
	{
		r1 = atomic_load(y, memory_order_seq_cst);
		/* Might also need an atomic_thread_fence() here... */
		atomic_store(x, 1, memory_order_seq_cst);
	}

	assert(r1 == 0);

Peter and I would like the assertion to never trigger in this case, but the current C11 does not seem to guarantee this. I believe that the volatile casts forbid the compiler from deciding to omit T0's "if" even in cases where it could prove that x was always zero. And given the store to x, it should not be able to prove constant x here, right?

Again, I believe that current C11 does not guarantee that the assertion will never trigger, but it would be good if one of the successors to C11 did make that guarantee. ;-)

Thanx, Paul
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-10 19:09 ` Linus Torvalds 2014-02-11 15:59 ` Paul E. McKenney @ 2014-02-12 5:39 ` Torvald Riegel 2014-02-12 18:07 ` Paul E. McKenney 1 sibling, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-12 5:39 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, 2014-02-10 at 11:09 -0800, Linus Torvalds wrote: > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel <triegel@redhat.com> wrote: > > > > Intuitively, this is wrong because this let's the program take a step > > the abstract machine wouldn't do. This is different to the sequential > > code that Peter posted because it uses atomics, and thus one can't > > easily assume that the difference is not observable. > > Btw, what is the definition of "observable" for the atomics? > > Because I'm hoping that it's not the same as for volatiles, where > "observable" is about the virtual machine itself, and as such volatile > accesses cannot be combined or optimized at all. No, atomics aren't an observable behavior of the abstract machine (unless they are volatile). See 1.8.p8 (citing the C++ standard). > Now, I claim that atomic accesses cannot be done speculatively for > writes, and not re-done for reads (because the value could change), Agreed, unless the compiler can prove that this doesn't make a difference in the program at hand and it's not volatile atomics. In general, that will be hard and thus won't happen often I suppose, but if correctly proved it would fall under the as-if rule I think. > but *combining* them would be possible and good. Agreed. > For example, we often have multiple independent atomic accesses that > could certainly be combined: testing the individual bits of an atomic > value with helper functions, causing things like "load atomic, test > bit, load same atomic, test another bit". 
The two atomic loads could > be done as a single load without possibly changing semantics on a real > machine, but if "visibility" is defined in the same way it is for > "volatile", that wouldn't be a valid transformation. Right now we use > "volatile" semantics for these kinds of things, and they really can > hurt. Agreed. In your example, the compiler would have to prove that the abstract machine would always be able to run the two loads atomically (ie, as one load) without running into impossible/disallowed behavior of the program. But if there's no loop or branch or such in-between, this should be straight-forward because any hardware oddity or similar could merge those loads and it wouldn't be disallowed by the standard (considering that we're talking about a finite number of loads), so the compiler would be allowed to do it as well. > Same goes for multiple writes (possibly due to setting bits): > combining multiple accesses into a single one is generally fine, it's > *adding* write accesses speculatively that is broken by design.. Agreed. As Paul points out, this being correct assumes that there are no other ordering guarantees or memory accesses "interfering", but if the stores are to the same memory location and adjacent to each other in the program, then I don't see a reason why they wouldn't be combinable. > At the same time, you can't combine atomic loads or stores infinitely > - "visibility" on a real machine definitely is about timeliness. > Removing all but the last write when there are multiple consecutive > writes is generally fine, even if you unroll a loop to generate those > writes. But if what remains is a loop, it might be a busy-loop > basically waiting for something, so it would be wrong ("untimely") to > hoist a store in a loop entirely past the end of the loop, or hoist a > load in a loop to before the loop. Agreed. That's what 1.10p24 and 1.10p25 are meant to specify for loads, although those might not be bullet-proof as Paul points out. 
Forward progress is rather vaguely specified in the standard, but at least parts of the committee (and people in ISO C++ SG1, in particular) are working on trying to improve this. > Does the standard allow for that kind of behavior? I think the standard requires (or intends to require) the behavior that you (and I) seem to prefer in these examples.
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-12 5:39 ` Torvald Riegel @ 2014-02-12 18:07 ` Paul E. McKenney 2014-02-12 20:22 ` Linus Torvalds 0 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-12 18:07 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Feb 11, 2014 at 09:39:24PM -0800, Torvald Riegel wrote: > On Mon, 2014-02-10 at 11:09 -0800, Linus Torvalds wrote: > > On Sun, Feb 9, 2014 at 4:27 PM, Torvald Riegel <triegel@redhat.com> wrote: > > > > > > Intuitively, this is wrong because this let's the program take a step > > > the abstract machine wouldn't do. This is different to the sequential > > > code that Peter posted because it uses atomics, and thus one can't > > > easily assume that the difference is not observable. > > > > Btw, what is the definition of "observable" for the atomics? > > > > Because I'm hoping that it's not the same as for volatiles, where > > "observable" is about the virtual machine itself, and as such volatile > > accesses cannot be combined or optimized at all. > > No, atomics aren't an observable behavior of the abstract machine > (unless they are volatile). See 1.8.p8 (citing the C++ standard). Us Linux-kernel hackers will often need to use volatile semantics in combination with C11 atomics in most cases. The C11 atomics do cover some of the reasons we currently use ACCESS_ONCE(), but not all of them -- in particular, it allows load/store merging. > > Now, I claim that atomic accesses cannot be done speculatively for > > writes, and not re-done for reads (because the value could change), > > Agreed, unless the compiler can prove that this doesn't make a > difference in the program at hand and it's not volatile atomics. In > general, that will be hard and thus won't happen often I suppose, but if > correctly proved it would fall under the as-if rule I think. 
> > > but *combining* them would be possible and good. > > Agreed. In some cases, agreed. But many uses in the Linux kernel will need volatile semantics in combination with C11 atomics. Which is OK, for the foreseeable future, anyway. > > For example, we often have multiple independent atomic accesses that > > could certainly be combined: testing the individual bits of an atomic > > value with helper functions, causing things like "load atomic, test > > bit, load same atomic, test another bit". The two atomic loads could > > be done as a single load without possibly changing semantics on a real > > machine, but if "visibility" is defined in the same way it is for > > "volatile", that wouldn't be a valid transformation. Right now we use > > "volatile" semantics for these kinds of things, and they really can > > hurt. > > Agreed. In your example, the compiler would have to prove that the > abstract machine would always be able to run the two loads atomically > (ie, as one load) without running into impossible/disallowed behavior of > the program. But if there's no loop or branch or such in-between, this > should be straight-forward because any hardware oddity or similar could > merge those loads and it wouldn't be disallowed by the standard > (considering that we're talking about a finite number of loads), so the > compiler would be allowed to do it as well. As long as they are not marked volatile, agreed. Thanx, Paul > > Same goes for multiple writes (possibly due to setting bits): > > combining multiple accesses into a single one is generally fine, it's > > *adding* write accesses speculatively that is broken by design.. > > Agreed. As Paul points out, this being correct assumes that there are > no other ordering guarantees or memory accesses "interfering", but if > the stores are to the same memory location and adjacent to each other in > the program, then I don't see a reason why they wouldn't be combinable. 
> > At the same time, you can't combine atomic loads or stores infinitely > > - "visibility" on a real machine definitely is about timeliness. > > Removing all but the last write when there are multiple consecutive > > writes is generally fine, even if you unroll a loop to generate those > > writes. But if what remains is a loop, it might be a busy-loop > > basically waiting for something, so it would be wrong ("untimely") to > > hoist a store in a loop entirely past the end of the loop, or hoist a > > load in a loop to before the loop. > > Agreed. That's what 1.10p24 and 1.10p25 are meant to specify for loads, > although those might not be bullet-proof as Paul points out. Forward > progress is rather vaguely specified in the standard, but at least parts > of the committee (and people in ISO C++ SG1, in particular) are working > on trying to improve this. > > > Does the standard allow for that kind of behavior? > > I think the standard requires (or intends to require) the behavior that > you (and I) seem to prefer in these examples.
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-12 18:07 ` Paul E. McKenney @ 2014-02-12 20:22 ` Linus Torvalds 2014-02-13 0:23 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-12 20:22 UTC (permalink / raw) To: Paul McKenney Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > > Us Linux-kernel hackers will often need to use volatile semantics in > combination with C11 atomics in most cases. The C11 atomics do cover > some of the reasons we currently use ACCESS_ONCE(), but not all of them -- > in particular, it allows load/store merging. I really disagree with the "will need to use volatile". We should never need to use volatile (outside of whatever MMIO we do using C) if C11 defines atomics correctly. Allowing load/store merging is *fine*. All sane CPU's do that anyway - it's called a cache - and there's no actual reason to think that "ACCESS_ONCE()" has to mean our current "volatile". Now, it's possible that the C standards simply get atomics _wrong_, so that they create visible semantics that are different from what a CPU cache already does, but that's a plain bug in the standard if so. But merging loads and stores is fine. And I *guarantee* it is fine, exactly because CPU's already do it, so claiming that the compiler couldn't do it is just insanity. Now, there are things that are *not* fine, like speculative stores that could be visible to other threads. Those are *bugs* (either in the compiler or in the standard), and anybody who claims otherwise is not worth discussing with. But I really really disagree with the "we might have to use 'volatile'". 
Because if we *ever* have to use 'volatile' with the standard C atomic types, then we're just better off ignoring the atomic types entirely, because they are obviously broken shit - and we're better off doing it ourselves the way we have forever. Seriously. This is not even hyperbole. It really is as simple as that. Linus
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-12 20:22 ` Linus Torvalds @ 2014-02-13 0:23 ` Paul E. McKenney 2014-02-13 20:03 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-13 0:23 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote: > On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > > > Us Linux-kernel hackers will often need to use volatile semantics in > > combination with C11 atomics in most cases. The C11 atomics do cover > > some of the reasons we currently use ACCESS_ONCE(), but not all of them -- > > in particular, it allows load/store merging. > > I really disagree with the "will need to use volatile". > > We should never need to use volatile (outside of whatever MMIO we do > using C) if C11 defines atomics correctly. > > Allowing load/store merging is *fine*. All sane CPU's do that anyway - > it's called a cache - and there's no actual reason to think that > "ACCESS_ONCE()" has to mean our current "volatile". > > Now, it's possible that the C standards simply get atomics _wrong_, so > that they create visible semantics that are different from what a CPU > cache already does, but that's a plain bug in the standard if so. > > But merging loads and stores is fine. And I *guarantee* it is fine, > exactly because CPU's already do it, so claiming that the compiler > couldn't do it is just insanity. Agreed, both CPUs and compilers can merge loads and stores. But CPUs normally get their stores pushed through the store buffer in reasonable time, and CPUs also use things like invalidations to ensure that a store is seen in reasonable time by readers. Compilers don't always have these two properties, so we do need to be more careful of load and store merging by compilers. 
> Now, there are things that are *not* fine, like speculative stores > that could be visible to other threads. Those are *bugs* (either in > the compiler or in the standard), and anybody who claims otherwise is > not worth discussing with. And as near as I can tell, volatile semantics are required in C11 to avoid speculative stores. I might be wrong about this, and hope that I am wrong. But I am currently not seeing it in the current standard. (Though I expect that most compilers would avoid speculating stores, especially in the near term.) > But I really really disagree with the "we might have to use > 'volatile'". Because if we *ever* have to use 'volatile' with the > standard C atomic types, then we're just better off ignoring the > atomic types entirely, because they are obviously broken shit - and > we're better off doing it ourselves the way we have forever. > > Seriously. This is not even hyperbole. It really is as simple as that. Agreed, if we are talking about replacing ACCESS_ONCE() with C11 relaxed atomics any time soon. But someone porting Linux to a new CPU architecture might use a carefully chosen subset of C11 atomics to implement some of the Linux atomic operations, especially non-value-returning atomics such as atomic_inc(). Thanx, Paul
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-13 0:23 ` Paul E. McKenney @ 2014-02-13 20:03 ` Torvald Riegel 2014-02-14 2:01 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-13 20:03 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote: > On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote: > > On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney > > <paulmck@linux.vnet.ibm.com> wrote: > > > > > > Us Linux-kernel hackers will often need to use volatile semantics in > > > combination with C11 atomics in most cases. The C11 atomics do cover > > > some of the reasons we currently use ACCESS_ONCE(), but not all of them -- > > > in particular, it allows load/store merging. > > > > I really disagree with the "will need to use volatile". > > > > We should never need to use volatile (outside of whatever MMIO we do > > using C) if C11 defines atomics correctly. > > > > Allowing load/store merging is *fine*. All sane CPU's do that anyway - > > it's called a cache - and there's no actual reason to think that > > "ACCESS_ONCE()" has to mean our current "volatile". > > > > Now, it's possible that the C standards simply get atomics _wrong_, so > > that they create visible semantics that are different from what a CPU > > cache already does, but that's a plain bug in the standard if so. > > > > But merging loads and stores is fine. And I *guarantee* it is fine, > > exactly because CPU's already do it, so claiming that the compiler > > couldn't do it is just insanity. > > Agreed, both CPUs and compilers can merge loads and stores. But CPUs > normally get their stores pushed through the store buffer in reasonable > time, and CPUs also use things like invalidations to ensure that a > store is seen in reasonable time by readers. 
Compilers don't always > have these two properties, so we do need to be more careful of load > and store merging by compilers. The standard's _wording_ is a little vague about forward-progress guarantees, but I believe the vast majority of the people involved do want compilers to not prevent forward progress. There is of course a difference whether a compiler establishes _eventual_ forward progress in the sense of after 10 years or forward progress in a small bounded interval of time, but this is a QoI issue, and good compilers won't want to introduce unnecessary latencies. I believe that it is fine if the standard merely talks about eventual forward progress. > > Now, there are things that are *not* fine, like speculative stores > > that could be visible to other threads. Those are *bugs* (either in > > the compiler or in the standard), and anybody who claims otherwise is > > not worth discussing with. > > And as near as I can tell, volatile semantics are required in C11 to > avoid speculative stores. I might be wrong about this, and hope that > I am wrong. But I am currently not seeing it in the current standard. > (Though I expect that most compilers would avoid speculating stores, > especially in the near term. This really depends on how we define speculative stores. The memory model is absolutely clear that programs have to behave as if executed by the virtual machine, and that rules out speculative stores to volatiles and other locations. Under certain circumstances, there will be "speculative" stores in the sense that they will happen at different times as if you had a trivial implementation of the abstract machine. But to be allowed to do that, the compiler has to prove that such a transformation still fulfills the as-if rule. IOW, the abstract machine is what currently defines disallowed speculative stores. 
If you want to put *further* constraints on what implementations are allowed to do, I suppose it is best to talk about those and see how we can add rules that allow programmers to express those constraints. For example, control dependencies might be such a case. I don't have a specific suggestion -- maybe the control dependencies are best tackled similar to consume dependencies (even though we don't have a good solution for those yet). But using volatile accesses for that seems to be a big hammer, or even the wrong one.
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-13 20:03 ` Torvald Riegel @ 2014-02-14 2:01 ` Paul E. McKenney 2014-02-14 4:43 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-14 2:01 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, Feb 13, 2014 at 12:03:57PM -0800, Torvald Riegel wrote: > On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote: > > On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote: > > > On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney > > > <paulmck@linux.vnet.ibm.com> wrote: > > > > > > > > Us Linux-kernel hackers will often need to use volatile semantics in > > > > combination with C11 atomics in most cases. The C11 atomics do cover > > > > some of the reasons we currently use ACCESS_ONCE(), but not all of them -- > > > > in particular, it allows load/store merging. > > > > > > I really disagree with the "will need to use volatile". > > > > > > We should never need to use volatile (outside of whatever MMIO we do > > > using C) if C11 defines atomics correctly. > > > > > > Allowing load/store merging is *fine*. All sane CPU's do that anyway - > > > it's called a cache - and there's no actual reason to think that > > > "ACCESS_ONCE()" has to mean our current "volatile". > > > > > > Now, it's possible that the C standards simply get atomics _wrong_, so > > > that they create visible semantics that are different from what a CPU > > > cache already does, but that's a plain bug in the standard if so. > > > > > > But merging loads and stores is fine. And I *guarantee* it is fine, > > > exactly because CPU's already do it, so claiming that the compiler > > > couldn't do it is just insanity. > > > > Agreed, both CPUs and compilers can merge loads and stores. 
But CPUs > > normally get their stores pushed through the store buffer in reasonable > > time, and CPUs also use things like invalidations to ensure that a > > store is seen in reasonable time by readers. Compilers don't always > > have these two properties, so we do need to be more careful of load > > and store merging by compilers. > > The standard's _wording_ is a little vague about forward-progress > guarantees, but I believe the vast majority of the people involved do > want compilers to not prevent forward progress. There is of course a > difference whether a compiler establishes _eventual_ forward progress in > the sense of after 10 years or forward progress in a small bounded > interval of time, but this is a QoI issue, and good compilers won't want > to introduce unnecessary latencies. I believe that it is fine if the > standard merely talks about eventual forward progress. The compiler will need to earn my trust on this one. ;-) > > > Now, there are things that are *not* fine, like speculative stores > > > that could be visible to other threads. Those are *bugs* (either in > > > the compiler or in the standard), and anybody who claims otherwise is > > > not worth discussing with. > > > > And as near as I can tell, volatile semantics are required in C11 to > > avoid speculative stores. I might be wrong about this, and hope that > > I am wrong. But I am currently not seeing it in the current standard. > > (Though I expect that most compilers would avoid speculating stores, > > especially in the near term. > > This really depends on how we define speculative stores. The memory > model is absolutely clear that programs have to behave as if executed by > the virtual machine, and that rules out speculative stores to volatiles > and other locations. Under certain circumstances, there will be > "speculative" stores in the sense that they will happen at different > times as if you had a trivial implementation of the abstract machine. 
> But to be allowed to do that, the compiler has to prove that such a > transformation still fulfills the as-if rule. Agreed, although the as-if rule would ignore control dependencies, since these are not yet part of the standard (as you in fact note below). I nevertheless consider myself at least somewhat reassured that current C11 won't speculate stores. My remaining concerns involve the compiler proving to itself that a given branch is always taken, thus motivating it to optimize the branch away -- though this is more properly a control-dependency concern. > IOW, the abstract machine is what currently defines disallowed > speculative stores. If you want to put *further* constraints on what > implementations are allowed to do, I suppose it is best to talk about > those and see how we can add rules that allow programmers to express > those constraints. For example, control dependencies might be such a > case. I don't have a specific suggestion -- maybe the control > dependencies are best tackled similar to consume dependencies (even > though we don't have a good solution for those yet). But using > volatile accesses for that seems to be a big hammer, or even the wrong > one.

In current compilers, the two hammers we have are volatile and barrier(). But yes, it would be good to have something more focused. One option would be to propose memory_order_control loads to see how loudly the committee screams. One use case might be as follows:

	if (atomic_load(x, memory_order_control))
		atomic_store(y, memory_order_relaxed);

This could also be written:

	r1 = atomic_load(x, memory_order_control);
	if (r1)
		atomic_store(y, memory_order_relaxed);

A branch depending on the memory_order_control load could not be optimized out, though I suppose that the compiler could substitute a memory-barrier instruction for the branch. Seems like it would take a very large number of branches to equal the overhead of the memory barrier, though.
Another option would be to flag the conditional expression, prohibiting the compiler from optimizing out any conditional branches. Perhaps something like this: r1 = atomic_load(x, memory_order_control); if (control_dependency(r1)) atomic_store(y, memory_order_relaxed); Other thoughts? Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-14 2:01 ` Paul E. McKenney @ 2014-02-14 4:43 ` Torvald Riegel 2014-02-14 17:29 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-14 4:43 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, 2014-02-13 at 18:01 -0800, Paul E. McKenney wrote: > On Thu, Feb 13, 2014 at 12:03:57PM -0800, Torvald Riegel wrote: > > On Wed, 2014-02-12 at 16:23 -0800, Paul E. McKenney wrote: > > > On Wed, Feb 12, 2014 at 12:22:53PM -0800, Linus Torvalds wrote: > > > > On Wed, Feb 12, 2014 at 10:07 AM, Paul E. McKenney > > > > <paulmck@linux.vnet.ibm.com> wrote: > > > > > > > > > > Us Linux-kernel hackers will often need to use volatile semantics in > > > > > combination with C11 atomics in most cases. The C11 atomics do cover > > > > > some of the reasons we currently use ACCESS_ONCE(), but not all of them -- > > > > > in particular, it allows load/store merging. > > > > > > > > I really disagree with the "will need to use volatile". > > > > > > > > We should never need to use volatile (outside of whatever MMIO we do > > > > using C) if C11 defines atomics correctly. > > > > > > > > Allowing load/store merging is *fine*. All sane CPU's do that anyway - > > > > it's called a cache - and there's no actual reason to think that > > > > "ACCESS_ONCE()" has to mean our current "volatile". > > > > > > > > Now, it's possible that the C standards simply get atomics _wrong_, so > > > > that they create visible semantics that are different from what a CPU > > > > cache already does, but that's a plain bug in the standard if so. > > > > > > > > But merging loads and stores is fine. And I *guarantee* it is fine, > > > > exactly because CPU's already do it, so claiming that the compiler > > > > couldn't do it is just insanity. 
> > > > > > Agreed, both CPUs and compilers can merge loads and stores. But CPUs > > > normally get their stores pushed through the store buffer in reasonable > > > time, and CPUs also use things like invalidations to ensure that a > > > store is seen in reasonable time by readers. Compilers don't always > > > have these two properties, so we do need to be more careful of load > > > and store merging by compilers. > > > > The standard's _wording_ is a little vague about forward-progress > > guarantees, but I believe the vast majority of the people involved do > > want compilers to not prevent forward progress. There is of course a > > difference whether a compiler establishes _eventual_ forward progress in > > the sense of after 10 years or forward progress in a small bounded > > interval of time, but this is a QoI issue, and good compilers won't want > > to introduce unnecessary latencies. I believe that it is fine if the > > standard merely talks about eventual forward progress. > > The compiler will need to earn my trust on this one. ;-) > > > > > Now, there are things that are *not* fine, like speculative stores > > > > that could be visible to other threads. Those are *bugs* (either in > > > > the compiler or in the standard), and anybody who claims otherwise is > > > > not worth discussing with. > > > > > > And as near as I can tell, volatile semantics are required in C11 to > > > avoid speculative stores. I might be wrong about this, and hope that > > > I am wrong. But I am currently not seeing it in the current standard. > > > (Though I expect that most compilers would avoid speculating stores, > > > especially in the near term. > > > > This really depends on how we define speculative stores. The memory > > model is absolutely clear that programs have to behave as if executed by > > the virtual machine, and that rules out speculative stores to volatiles > > and other locations. 
Under certain circumstances, there will be > > "speculative" stores in the sense that they will happen at different > > times as if you had a trivial implementation of the abstract machine. > > But to be allowed to do that, the compiler has to prove that such a > > transformation still fulfills the as-if rule. > > Agreed, although the as-if rule would ignore control dependencies, since > these are not yet part of the standard (as you in fact note below). > I nevertheless consider myself at least somewhat reassured that current > C11 won't speculate stores. My remaining concerns involve the compiler > proving to itself that a given branch is always taken, thus motivating > it to optimize the branch away -- though this is more properly a > control-dependency concern. > > > IOW, the abstract machine is what currently defines disallowed > > speculative stores. If you want to put *further* constraints on what > > implementations are allowed to do, I suppose it is best to talk about > > those and see how we can add rules that allow programmers to express > > those constraints. For example, control dependencies might be such a > > case. I don't have a specific suggestion -- maybe the control > > dependencies are best tackled similar to consume dependencies (even > > though we don't have a good solution for those yets). But using > > volatile accesses for that seems to be a big hammer, or even the wrong > > one. > > In current compilers, the two hammers we have are volatile and barrier(). > But yes, it would be good to have something more focused. One option > would be to propose memory_order_control loads to see how loudly the > committee screams. 
> One use case might be as follows:
>
>	if (atomic_load(x, memory_order_control))
>		atomic_store(y, memory_order_relaxed);
>
> This could also be written:
>
>	r1 = atomic_load(x, memory_order_control);
>	if (r1)
>		atomic_store(y, memory_order_relaxed);
>
> A branch depending on the memory_order_control load could not be optimized
> out, though I suppose that the compiler could substitute a memory-barrier
> instruction for the branch.  Seems like it would take a very large number
> of branches to equal the overhead of the memory barrier, though.
>
> Another option would be to flag the conditional expression, prohibiting
> the compiler from optimizing out any conditional branches.  Perhaps
> something like this:
>
>	r1 = atomic_load(x, memory_order_control);
>	if (control_dependency(r1))
>		atomic_store(y, memory_order_relaxed);

That's the one I had in mind and talked to you about earlier today.  My
gut feeling is that this is preferable to the other because it "marks"
the if-statement, so the compiler knows exactly which branches matter.
I'm not sure one would need the other memory order for that, if indeed
all you want is relaxed -> branch -> relaxed.  But maybe there are
corner cases (see the weaker-than-relaxed discussion in SG1 today).
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-14 4:43 ` Torvald Riegel @ 2014-02-14 17:29 ` Paul E. McKenney 2014-02-14 19:21 ` Torvald Riegel 2014-02-14 19:50 ` Linus Torvalds 0 siblings, 2 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-14 17:29 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, Feb 13, 2014 at 08:43:01PM -0800, Torvald Riegel wrote: > On Thu, 2014-02-13 at 18:01 -0800, Paul E. McKenney wrote: [ . . . ] > > Another option would be to flag the conditional expression, prohibiting > > the compiler from optimizing out any conditional branches. Perhaps > > something like this: > > > > r1 = atomic_load(x, memory_order_control); > > if (control_dependency(r1)) > > atomic_store(y, memory_order_relaxed); > > That's the one I had in mind and talked to you about earlier today. My > gut feeling is that this is preferably over the other because it "marks" > the if-statement, so the compiler knows exactly which branches matter. > I'm not sure one would need the other memory order for that, if indeed > all you want is relaxed -> branch -> relaxed. But maybe there are > corner cases (see the weaker-than-relaxed discussion in SG1 today). Linus, Peter, any objections to marking places where we are relying on ordering from control dependencies against later stores? This approach seems to me to have significant documentation benefits. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-14 17:29 ` Paul E. McKenney @ 2014-02-14 19:21 ` Torvald Riegel 2014-02-14 19:50 ` Linus Torvalds 1 sibling, 0 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-14 19:21 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Fri, 2014-02-14 at 09:29 -0800, Paul E. McKenney wrote: > On Thu, Feb 13, 2014 at 08:43:01PM -0800, Torvald Riegel wrote: > > On Thu, 2014-02-13 at 18:01 -0800, Paul E. McKenney wrote: > > [ . . . ] > > > > Another option would be to flag the conditional expression, prohibiting > > > the compiler from optimizing out any conditional branches. Perhaps > > > something like this: > > > > > > r1 = atomic_load(x, memory_order_control); > > > if (control_dependency(r1)) > > > atomic_store(y, memory_order_relaxed); > > > > That's the one I had in mind and talked to you about earlier today. My > > gut feeling is that this is preferably over the other because it "marks" > > the if-statement, so the compiler knows exactly which branches matter. > > I'm not sure one would need the other memory order for that, if indeed > > all you want is relaxed -> branch -> relaxed. But maybe there are > > corner cases (see the weaker-than-relaxed discussion in SG1 today). > > Linus, Peter, any objections to marking places where we are relying on > ordering from control dependencies against later stores? This approach > seems to me to have significant documentation benefits. Let me note that at least as I'm concerned, that's just a quick idea. At least I haven't looked at (1) how to properly specify the semantics of this, (2) whether it has any bad effects on unrelated code, (3) and whether there are pitfalls for compiler implementations. It looks not too bad at first glance, though. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-14 17:29       ` Paul E. McKenney
  2014-02-14 19:21         ` Torvald Riegel
@ 2014-02-14 19:50         ` Linus Torvalds
  2014-02-14 20:02           ` Linus Torvalds
  2014-02-15 17:30           ` Torvald Riegel
  1 sibling, 2 replies; 285+ messages in thread
From: Linus Torvalds @ 2014-02-14 19:50 UTC (permalink / raw)
To: Paul McKenney
Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Fri, Feb 14, 2014 at 9:29 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> Linus, Peter, any objections to marking places where we are relying on
> ordering from control dependencies against later stores?  This approach
> seems to me to have significant documentation benefits.

Quite frankly, I think it's stupid, and the "documentation" is not a
benefit, it's just wrong.

How would you figure out whether your added "documentation" holds true
for particular branches but not others? How could you *ever* trust a
compiler that makes the dependency meaningful?

Again, let's keep this simple and sane:

 - if a compiler ever generates code where an atomic store movement is
   "visible" in any way, then that compiler is broken shit.

I don't understand why you even argue this. Seriously, Paul, you seem
to *want* to think that "broken shit" is acceptable, and that we
should then add magic markers to say "now you need to *not* be broken
shit".

Here's a magic marker for you: DON'T USE THAT BROKEN COMPILER.

And if a compiler can *prove* that whatever code movement it does
cannot make a difference, then let it do so. No amount of
"documentation" should matter.

Seriously, this whole discussion has been completely moronic. I don't
understand why you even bring shit like this up:

> >	r1 = atomic_load(x, memory_order_control);
> >	if (control_dependency(r1))
> >		atomic_store(y, memory_order_relaxed);

I mean, really? Anybody who writes code like that, or any compiler
where that "control_dependency()" marker makes any difference
what-so-ever for code generation should just be retroactively aborted.

There is absolutely *zero* reason for that "control_dependency()"
crap. If you ever find a reason for it, it is either because the
compiler is buggy, or because the standard is so shit that we should
never *ever* use the atomics.

Seriously. This thread has devolved into some kind of "just what kind
of idiotic compiler cesspool crap could we accept". Get away from that
f*cking mindset. We don't accept *any* crap.

Why are we still discussing this idiocy? It's irrelevant. If the
standard really allows random store speculation, the standard doesn't
matter, and sane people shouldn't waste their time arguing about it.

		Linus
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-14 19:50         ` Linus Torvalds
@ 2014-02-14 20:02           ` Linus Torvalds
  2014-02-15  2:08             ` Paul E. McKenney
  2014-02-15 17:45             ` Torvald Riegel
  1 sibling, 2 replies; 285+ messages in thread
From: Linus Torvalds @ 2014-02-14 20:02 UTC (permalink / raw)
To: Paul McKenney
Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Why are we still discussing this idiocy? It's irrelevant. If the
> standard really allows random store speculation, the standard doesn't
> matter, and sane people shouldn't waste their time arguing about it.

Btw, the other part of this coin is that our manual types (using
volatile and various architecture-specific stuff) and our manual
barriers and inline asm accesses are generally *fine*.

The C11 stuff doesn't buy us anything. The argument that "new
architectures might want to use it" is pure and utter bollocks, since
unless the standard gets the thing *right*, nobody sane would ever use
it for some new architecture, when the sane thing to do is to just
fill in the normal barriers and inline asms.

So I'm very very serious: either the compiler and the standard gets
things right, or we don't use it. There is no middle ground where "we
might use it for one or two architectures and add random hints".
That's just stupid.

The only "middle ground" is about which compiler version we end up
trusting _if_ it turns out that the compiler and standard do get
things right. From Torvald's explanations (once I don't mis-read them
;), my take-away so far has actually been that the standard *does* get
things right, but I do know from over-long personal experience that
compiler people sometimes want to be legalistic and twist the
documentation to the breaking point, at which point we just go "we'd
be crazy to use that".

See our use of "-fno-strict-aliasing", for example. The C standard
aliasing rules are a mistake, stupid, and wrong, and gcc uses those
stupid type-based alias rules even when statically *proving* the
aliasing gives the opposite result. End result: we turn the shit off.

Exact same deal wrt atomics. We are *not* going to add crazy "this
here is a control dependency" crap. There's a test, the compiler
*sees* the control dependency for chrissake, and it still generates
crap, we turn that broken "optimization" off. It really is that
simple.

		Linus
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-14 20:02 ` Linus Torvalds @ 2014-02-15 2:08 ` Paul E. McKenney 2014-02-15 2:44 ` Linus Torvalds 2014-02-15 17:45 ` Torvald Riegel 1 sibling, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-15 2:08 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Fri, Feb 14, 2014 at 12:02:23PM -0800, Linus Torvalds wrote: > On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Why are we still discussing this idiocy? It's irrelevant. If the > > standard really allows random store speculation, the standard doesn't > > matter, and sane people shouldn't waste their time arguing about it. > > Btw, the other part of this coin is that our manual types (using > volatile and various architecture-specific stuff) and our manual > barriers and inline asm accesses are generally *fine*. > > The C11 stuff doesn't buy us anything. The argument that "new > architectures might want to use it" is prue and utter bollocks, since > unless the standard gets the thing *right*, nobody sane would ever use > it for some new architecture, when the sane thing to do is to just > fill in the normal barriers and inline asms. > > So I'm very very serious: either the compiler and the standard gets > things right, or we don't use it. There is no middle ground where "we > might use it for one or two architectures and add random hints". > That's just stupid. > > The only "middle ground" is about which compiler version we end up > trusting _if_ it turns out that the compiler and standard do get > things right. 
From Torvald's explanations (once I don't mis-read them
> ;), my take-away so far has actually been that the standard *does* get
> things right, but I do know from over-long personal experience that
> compiler people sometimes want to be legalistic and twist the
> documentation to the breaking point, at which point we just go "we'd
> be crazy to use that".
>
> See our use of "-fno-strict-aliasing", for example. The C standard
> aliasing rules are a mistake, stupid, and wrong, and gcc uses those
> stupid type-based alias rules even when statically *proving* the
> aliasing gives the opposite result. End result: we turn the shit off.
>
> Exact same deal wrt atomics. We are *not* going to add crazy "this
> here is a control dependency" crap. There's a test, the compiler
> *sees* the control dependency for chrissake, and it still generates
> crap, we turn that broken "optimization" off. It really is that
> simple.

From what I can see at the moment, the standard -generally- avoids
speculative stores, but there are a few corner cases where it might
allow them.  I will be working with the committee to see exactly what
the situation is.  Might be that I am confused and that everything
really is OK, might be that I am right but the corner cases are things
that no sane kernel developer would do anyway, it might be that the
standard needs a bit of repair, or it might be that the corner cases are
somehow inherent and problematic (but I hope not!).  I will let you know
what I find, but it will probably be a few months.

In the meantime, agreed, we keep doing what we have been doing.  And
maybe in the long term as well, for that matter.

One way of looking at the discussion between Torvald and myself would be
as a seller (Torvald) and a buyer (me) haggling over the fine print in
a proposed contract (the standard).  Whether that makes you feel better
or worse about the situation I cannot say.  ;-)

							Thanx, Paul
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-15 2:08 ` Paul E. McKenney @ 2014-02-15 2:44 ` Linus Torvalds 2014-02-15 2:48 ` Linus Torvalds 2014-02-15 18:07 ` Torvald Riegel 0 siblings, 2 replies; 285+ messages in thread From: Linus Torvalds @ 2014-02-15 2:44 UTC (permalink / raw) To: Paul McKenney Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Fri, Feb 14, 2014 at 6:08 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > > One way of looking at the discussion between Torvald and myself would be > as a seller (Torvald) and a buyer (me) haggling over the fine print in > a proposed contract (the standard). Whether that makes you feel better > or worse about the situation I cannot say. ;-) Oh, I'm perfectly fine with that. But we're not desperate to buy, and I actually think the C11 people are - or at least should be - *way* more desperate to sell their model to us than we are to take it. Which means that as a buyer you should say "this is what we want, if you don't give us this, we'll just walk away". Not try to see how much we can pay for it. Because there is very little upside for us, and _unless_ the C11 standard gets things right it's just extra complexity for us, coupled with new compiler fragility and years of code generation bugs. Why would we want that extra complexity and inevitable compiler bugs? If we then have to fight compiler writers that point to the standard and say "..but look, the standard says we can do this", then at that point it went from "extra complexity and compiler bugs" to a whole 'nother level of frustration and pain. So just walk away unless the C11 standard gives us exactly what we want. Not "something kind of like what we'd use". EXACTLY. Because I'm not in the least interested in fighting compiler people that have a crappy standard they can point to. Been there, done that, got the T-shirt and learnt my lesson. 
And the thing is, I suspect that the Linux kernel is the most complete - and most serious - user of true atomics that the C11 people can sell their solution to. If we don't buy it, they have no serious user. Sure, they'll have lots of random other one-off users for their atomics, where each user wants one particular thing, but I suspect that we'll have the only really unified portable code base that handles pretty much *all* the serious odd cases that the C11 atomics can actually talk about to each other. Oh, they'll push things through with or without us, and it will be a collection of random stuff, where they tried to please everybody, with particularly compiler/architecture people who have no f*cking clue about how their stuff is used pushing to make it easy/efficient for their particular compiler/architecture. But we have real optimized uses of pretty much all relevant cases that people actually care about. We can walk away from them, and not really lose anything but a small convenience (and it's a convenience *only* if the standard gets things right). And conversely, the C11 people can walk away from us too. But if they can't make us happy (and by "make us happy", I really mean no stupid games on our part) I personally think they'll have a stronger standard, and a real use case, and real arguments. I'm assuming they want that. That's why I complain when you talk about things like marking control dependencies explicitly. That's *us* bending over backwards. And as a buyer, we have absolutely zero reason to do that. Tell the C11 people: "no speculative writes". Full stop. End of story. Because we're not buying anything else. Similarly, if we need to mark atomics "volatile", then now the C11 atomics are no longer even a "small convenience", now they are just extra complexity wrt what we already have. 
So just make it clear that if the C11 standard needs to mark atomics volatile in order to get non-speculative and non-reloading behavior, then the C11 atomics are useless to us, and we're not buying. Remember: a compiler can *always* do "as if" optimizations - if a compiler writer can prove that the end result acts 100% the same using an optimized sequence, then they can do whatever the hell they want. That's not the issue. But if we can *ever* see semantic impact of speculative writes, the compiler is buggy, and the compiler writers need to be aware that it is buggy. No ifs, buts, maybes about it. So I'm perfectly fine with you seeing yourself as a buyer. But I want you to be a really *picky* and anal buyer - one that knows he has the upper hand, and can walk away with no downside. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-15 2:44 ` Linus Torvalds @ 2014-02-15 2:48 ` Linus Torvalds 2014-02-15 6:35 ` Paul E. McKenney 2014-02-15 18:07 ` Torvald Riegel 1 sibling, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-15 2:48 UTC (permalink / raw) To: Paul McKenney Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Fri, Feb 14, 2014 at 6:44 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > And conversely, the C11 people can walk away from us too. But if they > can't make us happy (and by "make us happy", I really mean no stupid > games on our part) I personally think they'll have a stronger > standard, and a real use case, and real arguments. I'm assuming they > want that. I should have somebody who proof-reads my emails before I send them out. I obviously meant "if they *can* make us happy" (not "can't"). Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-15 2:48 ` Linus Torvalds @ 2014-02-15 6:35 ` Paul E. McKenney 2014-02-15 6:58 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-15 6:35 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Fri, Feb 14, 2014 at 06:48:02PM -0800, Linus Torvalds wrote: > On Fri, Feb 14, 2014 at 6:44 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > And conversely, the C11 people can walk away from us too. But if they > > can't make us happy (and by "make us happy", I really mean no stupid > > games on our part) I personally think they'll have a stronger > > standard, and a real use case, and real arguments. I'm assuming they > > want that. > > I should have somebody who proof-reads my emails before I send them out. > > I obviously meant "if they *can* make us happy" (not "can't"). Understood. My next step is to take a more detailed look at the piece of the standard that should support RCU. Depending on how that turns out, I might look at other parts of the standard vs. Linux's atomics and memory-ordering needs. Should be interesting. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-15 6:35 ` Paul E. McKenney @ 2014-02-15 6:58 ` Paul E. McKenney 0 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-15 6:58 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Fri, Feb 14, 2014 at 10:35:44PM -0800, Paul E. McKenney wrote: > On Fri, Feb 14, 2014 at 06:48:02PM -0800, Linus Torvalds wrote: > > On Fri, Feb 14, 2014 at 6:44 PM, Linus Torvalds > > <torvalds@linux-foundation.org> wrote: > > > > > > And conversely, the C11 people can walk away from us too. But if they > > > can't make us happy (and by "make us happy", I really mean no stupid > > > games on our part) I personally think they'll have a stronger > > > standard, and a real use case, and real arguments. I'm assuming they > > > want that. > > > > I should have somebody who proof-reads my emails before I send them out. > > > > I obviously meant "if they *can* make us happy" (not "can't"). > > Understood. My next step is to take a more detailed look at the piece > of the standard that should support RCU. Depending on how that turns > out, I might look at other parts of the standard vs. Linux's atomics > and memory-ordering needs. Should be interesting. ;-) And perhaps a better way to represent the roles is that I am not the buyer, but rather the purchasing agent for the -potential- buyer. -You- are of course the potential buyer. If I were to see myself as the buyer, then I must confess that the concerns you implicitly expressed in your prior email would be all too well-founded! Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-15 2:44 ` Linus Torvalds 2014-02-15 2:48 ` Linus Torvalds @ 2014-02-15 18:07 ` Torvald Riegel 2014-02-17 18:59 ` Joseph S. Myers 1 sibling, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-15 18:07 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Fri, 2014-02-14 at 18:44 -0800, Linus Torvalds wrote: > On Fri, Feb 14, 2014 at 6:08 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > > > One way of looking at the discussion between Torvald and myself would be > > as a seller (Torvald) and a buyer (me) haggling over the fine print in > > a proposed contract (the standard). Whether that makes you feel better > > or worse about the situation I cannot say. ;-) > > Oh, I'm perfectly fine with that. But we're not desperate to buy, and > I actually think the C11 people are - or at least should be - *way* > more desperate to sell their model to us than we are to take it. > > Which means that as a buyer you should say "this is what we want, if > you don't give us this, we'll just walk away". Not try to see how much > we can pay for it. Because there is very little upside for us, and > _unless_ the C11 standard gets things right it's just extra complexity > for us, coupled with new compiler fragility and years of code > generation bugs. I think there is an upside to you, mainly in that it allows compiler testing tools, potentially verification tools for atomics, and tools like cppmem that show allowed executions of code. I agree that the Linux community has been working well without this, and it's big enough to make running it's own show viable. This will be different for smaller projects, though. > Why would we want that extra complexity and inevitable compiler bugs? 
> If we then have to fight compiler writers that point to the standard > and say "..but look, the standard says we can do this", then at that > point it went from "extra complexity and compiler bugs" to a whole > 'nother level of frustration and pain. I see your point, but flip side of the coin is that if you get the standard to say what you want, then you can tell the compiler writers to look at the standard. Or show them bugs revealed by fuzz testing and such. > So just walk away unless the C11 standard gives us exactly what we > want. Not "something kind of like what we'd use". EXACTLY. Because I'm > not in the least interested in fighting compiler people that have a > crappy standard they can point to. Been there, done that, got the > T-shirt and learnt my lesson. > > And the thing is, I suspect that the Linux kernel is the most complete > - and most serious - user of true atomics that the C11 people can sell > their solution to. I agree, but there are likely also other big projects that could make use of C11 atomics on the userspace side (e.g., certain databases, ...). > If we don't buy it, they have no serious user. I disagree with that. That obviously depends on one's definition of "serious", but if you combine all C/C++ programs that use low-level atomics, then this is serious use as well. There's lots of shared-memory synchronization in userspace as well. > Sure, they'll have lots > of random other one-off users for their atomics, where each user wants > one particular thing, but I suspect that we'll have the only really > unified portable code base glibc is a counterexample that comes to mind, although it's a smaller code base. (It's currently not using C11 atomics, but transitioning there makes sense, and some thing I want to get to eventually.) > that handles pretty much *all* the serious > odd cases that the C11 atomics can actually talk about to each other. 
You certainly have lots of odd cases, but I would disagree with the assumption that only the Linux kernel will do full "testing" of the implementations. If you have plenty of userspace programs using the atomics, that's a pretty big test suite, and one that should help bring the compilers up to speed. So that might be a benefit even to the Linux kernel if it would use the C11 atomics. > Oh, they'll push things through with or without us, and it will be a > collection of random stuff, where they tried to please everybody, with > particularly compiler/architecture people who have no f*cking clue > about how their stuff is used pushing to make it easy/efficient for > their particular compiler/architecture. I'll ignore this... :) > But we have real optimized uses of pretty much all relevant cases that > people actually care about. You certainly cover a lot of cases. Finding out whether you cover all that "people care about" would require you to actually ask all people, which I'm sure you've done ;) > We can walk away from them, and not really lose anything but a small > convenience (and it's a convenience *only* if the standard gets things > right). > > And conversely, the C11 people can walk away from us too. But if they > can't make us happy (and by "make us happy", I really mean no stupid > games on our part) I personally think they'll have a stronger > standard, and a real use case, and real arguments. I'm assuming they > want that. I agree. > That's why I complain when you talk about things like marking control > dependencies explicitly. That's *us* bending over backwards. And as a > buyer, we have absolutely zero reason to do that. As I understood the situation, it was rather like the buyer trying to wrestle in an additional feature without having a proper specification of what the feature should actually do. That's why we've been talking about what it should actually do, and how ... > Tell the C11 people: "no speculative writes". Full stop. End of story. 
> Because we're not buying anything else.
>
> Similarly, if we need to mark atomics "volatile", then now the C11
> atomics are no longer even a "small convenience", now they are just
> extra complexity wrt what we already have. So just make it clear that
> if the C11 standard needs to mark atomics volatile in order to get
> non-speculative and non-reloading behavior, then the C11 atomics are
> useless to us, and we're not buying.

I hope what I wrote previously removes those concerns.

> Remember: a compiler can *always* do "as if" optimizations - if a
> compiler writer can prove that the end result acts 100% the same using
> an optimized sequence, then they can do whatever the hell they want.
> That's not the issue.

Good to know that we agree on that.

> But if we can *ever* see semantic impact of
> speculative writes, the compiler is buggy, and the compiler writers
> need to be aware that it is buggy. No ifs, buts, maybes about it.

Agreed.

> So I'm perfectly fine with you seeing yourself as a buyer. But I want
> you to be a really *picky* and anal buyer - one that knows he has the
> upper hand, and can walk away with no downside.

FWIW, I don't see myself as being the seller.  Instead, I hope we can
improve the situation for everyone involved, whether that involves C11
or something else.  I see a potential mutual benefit, and I want to try
to exploit it.  Regarding the distributions I'm concerned about, the
Linux kernel and GCC are very much in the same boat.

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-15 18:07 ` Torvald Riegel @ 2014-02-17 18:59 ` Joseph S. Myers 2014-02-17 19:19 ` Will Deacon 2014-02-17 19:41 ` Torvald Riegel 0 siblings, 2 replies; 285+ messages in thread From: Joseph S. Myers @ 2014-02-17 18:59 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sat, 15 Feb 2014, Torvald Riegel wrote: > glibc is a counterexample that comes to mind, although it's a smaller > code base. (It's currently not using C11 atomics, but transitioning > there makes sense, and some thing I want to get to eventually.) glibc is using C11 atomics (GCC builtins rather than _Atomic / <stdatomic.h>, but using __atomic_* with explicitly specified memory model rather than the older __sync_*) on AArch64, plus in certain cases on ARM and MIPS. -- Joseph S. Myers joseph@codesourcery.com ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-17 18:59 ` Joseph S. Myers @ 2014-02-17 19:19 ` Will Deacon 2014-02-17 19:41 ` Torvald Riegel 1 sibling, 0 replies; 285+ messages in thread From: Will Deacon @ 2014-02-17 19:19 UTC (permalink / raw) To: Joseph S. Myers Cc: Torvald Riegel, Linus Torvalds, Paul McKenney, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 17, 2014 at 06:59:31PM +0000, Joseph S. Myers wrote: > On Sat, 15 Feb 2014, Torvald Riegel wrote: > > > glibc is a counterexample that comes to mind, although it's a smaller > > code base. (It's currently not using C11 atomics, but transitioning > > there makes sense, and some thing I want to get to eventually.) > > glibc is using C11 atomics (GCC builtins rather than _Atomic / > <stdatomic.h>, but using __atomic_* with explicitly specified memory model > rather than the older __sync_*) on AArch64, plus in certain cases on ARM > and MIPS. Hmm, actually that results in a change in behaviour for the __sync_* primitives on AArch64. The documentation for those states that: `In most cases, these built-in functions are considered a full barrier. That is, no memory operand is moved across the operation, either forward or backward. Further, instructions are issued as necessary to prevent the processor from speculating loads across the operation and from queuing stores after the operation.' which is stronger than simply mapping them to memory_model_seq_cst, which seems to be what the AArch64 compiler is doing (so you get acquire + release instead of a full fence). Will ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 18:59 ` Joseph S. Myers
  2014-02-17 19:19   ` Will Deacon
@ 2014-02-17 19:41   ` Torvald Riegel
  2014-02-17 23:12     ` Joseph S. Myers
  1 sibling, 1 reply; 285+ messages in thread
From: Torvald Riegel @ 2014-02-17 19:41 UTC (permalink / raw)
To: Joseph S. Myers
Cc: Linus Torvalds, Paul McKenney, Will Deacon, Peter Zijlstra,
    Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
    akpm, mingo, gcc

On Mon, 2014-02-17 at 18:59 +0000, Joseph S. Myers wrote:
> On Sat, 15 Feb 2014, Torvald Riegel wrote:
>
> > glibc is a counterexample that comes to mind, although it's a smaller
> > code base.  (It's currently not using C11 atomics, but transitioning
> > there makes sense, and some thing I want to get to eventually.)
>
> glibc is using C11 atomics (GCC builtins rather than _Atomic /
> <stdatomic.h>, but using __atomic_* with explicitly specified memory model
> rather than the older __sync_*) on AArch64, plus in certain cases on ARM
> and MIPS.

I think the major steps remaining are moving the other architectures
over, and rechecking concurrent code (e.g., for the code that I have
seen, it was either asm variants (e.g., on x86), or built before C11;
ARM pthread_once was lacking memory barriers (see "pthread_once
unification" patches I posted)).  We also need/should move towards using
relaxed-MO atomic loads instead of plain loads.

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-17 19:41 ` Torvald Riegel @ 2014-02-17 23:12 ` Joseph S. Myers 0 siblings, 0 replies; 285+ messages in thread From: Joseph S. Myers @ 2014-02-17 23:12 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, 17 Feb 2014, Torvald Riegel wrote: > On Mon, 2014-02-17 at 18:59 +0000, Joseph S. Myers wrote: > > On Sat, 15 Feb 2014, Torvald Riegel wrote: > > > > > glibc is a counterexample that comes to mind, although it's a smaller > > > code base. (It's currently not using C11 atomics, but transitioning > > > there makes sense, and some thing I want to get to eventually.) > > > > glibc is using C11 atomics (GCC builtins rather than _Atomic / > > <stdatomic.h>, but using __atomic_* with explicitly specified memory model > > rather than the older __sync_*) on AArch64, plus in certain cases on ARM > > and MIPS. > > I think the major steps remaining is moving the other architectures > over, and rechecking concurrent code (e.g., for the code that I have I don't think we'll be ready to require GCC >= 4.7 to build glibc for another year or two, although probably we could move the requirement up from 4.4 to 4.6. (And some platforms only had the C11 atomics optimized later than 4.7.) -- Joseph S. Myers joseph@codesourcery.com ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-14 20:02 ` Linus Torvalds
  2014-02-15  2:08   ` Paul E. McKenney
@ 2014-02-15 17:45   ` Torvald Riegel
  2014-02-15 18:49     ` Linus Torvalds
  1 sibling, 1 reply; 285+ messages in thread
From: Torvald Riegel @ 2014-02-15 17:45 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
    David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Fri, 2014-02-14 at 12:02 -0800, Linus Torvalds wrote:
> On Fri, Feb 14, 2014 at 11:50 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Why are we still discussing this idiocy? It's irrelevant. If the
> > standard really allows random store speculation, the standard doesn't
> > matter, and sane people shouldn't waste their time arguing about it.
>
> Btw, the other part of this coin is that our manual types (using
> volatile and various architecture-specific stuff) and our manual
> barriers and inline asm accesses are generally *fine*.

AFAICT, it does work for you, but hasn't been exactly pain-free.

I think a major benefit of C11's memory model is that it gives a
*precise* specification for how a compiler is allowed to optimize.
There is a formalization of the model, which allows things like the
cppmem tool by the Cambridge group.  It also allows meaningful fuzz
testing: http://www.di.ens.fr/~zappa/projects/cmmtest/ ; this did reveal
several GCC compiler bugs.  I also think that reasoning about this model
is easier than reasoning about how lots of different, concrete compiler
optimizations would interact.

> The C11 stuff doesn't buy us anything. The argument that "new
> architectures might want to use it" is pure and utter bollocks, since
> unless the standard gets the thing *right*, nobody sane would ever use
> it for some new architecture, when the sane thing to do is to just
> fill in the normal barriers and inline asms.
>
> So I'm very very serious: either the compiler and the standard gets
> things right, or we don't use it. There is no middle ground where "we
> might use it for one or two architectures and add random hints".
> That's just stupid.
>
> The only "middle ground" is about which compiler version we end up
> trusting _if_ it turns out that the compiler and standard do get
> things right. From Torvald's explanations (once I don't mis-read them
> ;), my take-away so far has actually been that the standard *does* get
> things right, but I do know from over-long personal experience that
> compiler people sometimes want to be legalistic and twist the
> documentation to the breaking point, at which point we just go "we'd
> be crazy to use that".

I agree that compilers want to optimize, and sometimes there's probably
a little too much emphasis on applying an optimization vs. not
surprising users.  But we have to draw a line (e.g., what is undefined
behavior and what is not), because we need such a line to actually be
able to optimize.  Therefore, we need to get the rules into a shape that
both allows optimizations and isn't full of surprising corner cases.

The rules are the standard, so it's the standard we have to get right.
According to my experience, a lot of thought goes into how to design the
standard's language and library so that they are intuitive yet
efficient.  If you see issues in the standard, please bring them up.
Either report the defects directly and get involved yourself, or reach
out to somebody that is participating in the standards process.  The
standard certainly isn't perfect, so there is room to contribute.

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-15 17:45 ` Torvald Riegel @ 2014-02-15 18:49 ` Linus Torvalds 2014-02-17 19:55 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-15 18:49 UTC (permalink / raw) To: Torvald Riegel Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sat, Feb 15, 2014 at 9:45 AM, Torvald Riegel <triegel@redhat.com> wrote: > > I think a major benefit of C11's memory model is that it gives a > *precise* specification for how a compiler is allowed to optimize. Clearly it does *not*. This whole discussion is proof of that. It's not at all clear, and the standard apparently is at least debatably allowing things that shouldn't be allowed. It's also a whole lot more complicated than "volatile", so the likelihood of a compiler writer actually getting it right - even if the standard does - is lower. They've gotten "volatile" wrong too, after all (particularly in C++). Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-15 18:49 ` Linus Torvalds @ 2014-02-17 19:55 ` Torvald Riegel 2014-02-17 20:18 ` Linus Torvalds 2014-02-17 20:23 ` Paul E. McKenney 0 siblings, 2 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-17 19:55 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sat, 2014-02-15 at 10:49 -0800, Linus Torvalds wrote: > On Sat, Feb 15, 2014 at 9:45 AM, Torvald Riegel <triegel@redhat.com> wrote: > > > > I think a major benefit of C11's memory model is that it gives a > > *precise* specification for how a compiler is allowed to optimize. > > Clearly it does *not*. This whole discussion is proof of that. It's > not at all clear, It might not be an easy-to-understand specification, but as far as I'm aware it is precise. The Cambridge group's formalization certainly is precise. From that, one can derive (together with the usual rules for as-if etc.) what a compiler is allowed to do (assuming that the standard is indeed precise). My replies in this discussion have been based on reasoning about the standard, and not secret knowledge (with the exception of no-out-of-thin-air, which is required in the standard's prose but not yet formalized). I agree that I'm using the formalization as a kind of placeholder for the standard's prose (which isn't all that easy to follow for me either), but I guess there's no way around an ISO standard using prose. If you see a case in which the standard isn't precise, please bring it up or open a C++ CWG issue for it. > and the standard apparently is at least debatably > allowing things that shouldn't be allowed. Which example do you have in mind here? Haven't we resolved all the debated examples, or did I miss any? 
> It's also a whole lot more
> complicated than "volatile", so the likelihood of a compiler writer
> actually getting it right - even if the standard does - is lower.

It's not easy, that's for sure, but none of the high-performance
alternatives are easy either.  There are testing tools out there based
on the formalization of the model, and we've found bugs with them.

And the alternative of using something not specified by the standard is
even worse, I think, because then you have to guess what a compiler
might do, without having any constraints; IOW, one is resorting to "no
sane compiler would do that", and that doesn't seem to be very robust
either.

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-17 19:55 ` Torvald Riegel @ 2014-02-17 20:18 ` Linus Torvalds 2014-02-17 21:21 ` Torvald Riegel ` (2 more replies) 2014-02-17 20:23 ` Paul E. McKenney 1 sibling, 3 replies; 285+ messages in thread From: Linus Torvalds @ 2014-02-17 20:18 UTC (permalink / raw) To: Torvald Riegel Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel <triegel@redhat.com> wrote: > > Which example do you have in mind here? Haven't we resolved all the > debated examples, or did I miss any? Well, Paul seems to still think that the standard possibly allows speculative writes or possibly value speculation in ways that break the hardware-guaranteed orderings. And personally, I can't read standards paperwork. It is invariably written in some basically impossible-to-understand lawyeristic mode, and then it is read by people (compiler writers) that intentionally try to mis-use the words and do language-lawyering ("that depends on what the meaning of 'is' is"). The whole "lvalue vs rvalue expression vs 'what is a volatile access'" thing for C++ was/is a great example of that. So quite frankly, as a result I refuse to have anything to do with the process directly. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-17 20:18 ` Linus Torvalds @ 2014-02-17 21:21 ` Torvald Riegel 2014-02-17 22:02 ` Linus Torvalds 2014-02-17 23:10 ` Alec Teal 2014-02-18 3:00 ` Paul E. McKenney 2 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-17 21:21 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, 2014-02-17 at 12:18 -0800, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel <triegel@redhat.com> wrote: > > > > Which example do you have in mind here? Haven't we resolved all the > > debated examples, or did I miss any? > > Well, Paul seems to still think that the standard possibly allows > speculative writes or possibly value speculation in ways that break > the hardware-guaranteed orderings. That's true, I just didn't see any specific examples so far. > And personally, I can't read standards paperwork. It is invariably > written in some basically impossible-to-understand lawyeristic mode, Yeah, it's not the most intuitive form for things like the memory model. > and then it is read by people (compiler writers) that intentionally > try to mis-use the words and do language-lawyering ("that depends on > what the meaning of 'is' is"). That assumption about people working on compilers is a little too broad, don't you think? I think that it is important to stick to a specification, in the same way that one wouldn't expect a program with undefined behavior make any sense of it, magically, in cases where stuff is undefined. However, that of course doesn't include trying to exploit weasel-wording (BTW, both users and compiler writers try to do it). IMHO, weasel-wording in a standard is a problem in itself even if not exploited, and often it indicates that there is a real issue. 
There might be reasons to have weasel-wording (e.g., because there's no known better way to express it like in case of the not really precise no-out-of-thin-air rule today), but nonetheless those aren't ideal. > The whole "lvalue vs rvalue expression > vs 'what is a volatile access'" thing for C++ was/is a great example > of that. I'm not aware of the details of this. > So quite frankly, as a result I refuse to have anything to do with the > process directly. That's unfortunate. Then please work with somebody that isn't uncomfortable with participating directly in the process. But be warned, it may very well be a person working on compilers :) Have you looked at the formalization of the model by Batty et al.? The overview of this is prose, but the formalized model itself is all formal relations and logic. So there should be no language-lawyering issues with that form. (For me, the formalized model is much easier to reason about.) ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-17 21:21 ` Torvald Riegel @ 2014-02-17 22:02 ` Linus Torvalds 2014-02-17 22:25 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-17 22:02 UTC (permalink / raw) To: Torvald Riegel Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 17, 2014 at 1:21 PM, Torvald Riegel <triegel@redhat.com> wrote: > On Mon, 2014-02-17 at 12:18 -0800, Linus Torvalds wrote: >> and then it is read by people (compiler writers) that intentionally >> try to mis-use the words and do language-lawyering ("that depends on >> what the meaning of 'is' is"). > > That assumption about people working on compilers is a little too broad, > don't you think? Let's just say that *some* are that way, and those are the ones that I end up butting heads with. The sane ones I never have to argue with - point them at a bug, and they just say "yup, bug". The insane ones say "we don't need to fix that, because if you read this copy of the standards that have been translated to chinese and back, it clearly says that this is acceptable". >> The whole "lvalue vs rvalue expression >> vs 'what is a volatile access'" thing for C++ was/is a great example >> of that. > > I'm not aware of the details of this. The argument was that an lvalue doesn't actually "access" the memory (an rvalue does), so this: volatile int *p = ...; *p; doesn't need to generate a load from memory, because "*p" is still an lvalue (since you could assign things to it). This isn't an issue in C, because in C, expression statements are always rvalues, but C++ changed that. The people involved with the C++ standards have generally been totally clueless about their subtle changes. I may have misstated something, but basically some C++ people tried very hard to make "volatile" useless. We had other issues too. 
Like C compiler people who felt that the type-based aliasing should always override anything else, even if the variable accessed (through different types) was statically clearly aliasing and used the exact same pointer. That made it impossible to do a syntactically clean model of "this aliases", since the _only_ exception to the type-based aliasing rule was to generate a union for every possible access pairing. We turned off type-based aliasing (as I've mentioned before, I think it's a fundamentally broken feature to begin with, and a horrible horrible hack that adds no value for anybody but the HPC people). Gcc eventually ended up having some sane syntax for overriding it, but by then I was too disgusted with the people involved to even care. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-17 22:02 ` Linus Torvalds @ 2014-02-17 22:25 ` Torvald Riegel 2014-02-17 22:47 ` Linus Torvalds 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-17 22:25 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, 2014-02-17 at 14:02 -0800, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 1:21 PM, Torvald Riegel <triegel@redhat.com> wrote: > > On Mon, 2014-02-17 at 12:18 -0800, Linus Torvalds wrote: > >> and then it is read by people (compiler writers) that intentionally > >> try to mis-use the words and do language-lawyering ("that depends on > >> what the meaning of 'is' is"). > > > > That assumption about people working on compilers is a little too broad, > > don't you think? > > Let's just say that *some* are that way, and those are the ones that I > end up butting heads with. > > The sane ones I never have to argue with - point them at a bug, and > they just say "yup, bug". The insane ones say "we don't need to fix > that, because if you read this copy of the standards that have been > translated to chinese and back, it clearly says that this is > acceptable". > > >> The whole "lvalue vs rvalue expression > >> vs 'what is a volatile access'" thing for C++ was/is a great example > >> of that. > > > > I'm not aware of the details of this. > > The argument was that an lvalue doesn't actually "access" the memory > (an rvalue does), so this: > > volatile int *p = ...; > > *p; > > doesn't need to generate a load from memory, because "*p" is still an > lvalue (since you could assign things to it). > > This isn't an issue in C, because in C, expression statements are > always rvalues, but C++ changed that. Huhh. I can see the problems that this creates in terms of C/C++ compatibility. 
> The people involved with the C++ > standards have generally been totally clueless about their subtle > changes. This isn't a fair characterization. There are many people that do care, and certainly not all are clueless. But it's a limited set of people, bugs happen, and not all of them will have the same goals. I think one way to prevent such problems in the future could be to have someone in the kernel community volunteer to look through standard revisions before they are published. The standard needs to be fixed, because compilers need to conform to the standard (e.g., a compiler's extension "fixing" the above wouldn't be conforming anymore because it emits more volatile reads than specified). Or maybe those of us working on the standard need to flag potential changes of interest to the kernel folks. But that may be less reliable than someone from the kernel side looking at them; I don't know. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-17 22:25 ` Torvald Riegel @ 2014-02-17 22:47 ` Linus Torvalds 2014-02-17 23:41 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-17 22:47 UTC (permalink / raw) To: Torvald Riegel Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 17, 2014 at 2:25 PM, Torvald Riegel <triegel@redhat.com> wrote: > On Mon, 2014-02-17 at 14:02 -0800, Linus Torvalds wrote: >> >> The argument was that an lvalue doesn't actually "access" the memory >> (an rvalue does), so this: >> >> volatile int *p = ...; >> >> *p; >> >> doesn't need to generate a load from memory, because "*p" is still an >> lvalue (since you could assign things to it). >> >> This isn't an issue in C, because in C, expression statements are >> always rvalues, but C++ changed that. > > Huhh. I can see the problems that this creates in terms of C/C++ > compatibility. That's not the biggest problem. The biggest problem is that you have compiler writers that don't care about sane *use* of the features they write a compiler for, they just care about the standard. So they don't care about C vs C++ compatibility. Even more importantly, they don't care about the *user* that uses only C++ and the fact that their reading of the standard results in *meaningless* behavior. They point to the standard and say "that's what the standard says, suck it", and silently generate code (or in this case, avoid generating code) that makes no sense. So it's not about C++ being incompatible with C, it's about C++ having insane and bad semantics unless you just admit that "oh, ok, I need to not just read the standard, I also need to use my brain, and admit that a C++ statement expression needs to act as if it is an "access" wrt volatile variables". In other words, as a compiler person, you do need to read more than the paper of standard. 
You need to also take into account what is reasonable behavior even when the standard could possibly be read some other way. And some compiler people don't. The "volatile access in statement expression" did get resolved, sanely, at least in gcc. I think gcc warns about some remaining cases. Btw, afaik, C++11 actually clarifies the standard to require the reads, because everybody *knew* that not requiring the read was insane and meaningless behavior, and clearly against the intent of "volatile". But that didn't stop compiler writers from saying "hey, the standard allows my insane and meaningless behavior, so I'll implement it and not consider it a bug". Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-17 22:47 ` Linus Torvalds
@ 2014-02-17 23:41   ` Torvald Riegel
  2014-02-18  0:18     ` Linus Torvalds
  0 siblings, 1 reply; 285+ messages in thread
From: Torvald Riegel @ 2014-02-17 23:41 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
    David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, 2014-02-17 at 14:47 -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 2:25 PM, Torvald Riegel <triegel@redhat.com> wrote:
> > On Mon, 2014-02-17 at 14:02 -0800, Linus Torvalds wrote:
> >>
> >> The argument was that an lvalue doesn't actually "access" the memory
> >> (an rvalue does), so this:
> >>
> >>    volatile int *p = ...;
> >>
> >>    *p;
> >>
> >> doesn't need to generate a load from memory, because "*p" is still an
> >> lvalue (since you could assign things to it).
> >>
> >> This isn't an issue in C, because in C, expression statements are
> >> always rvalues, but C++ changed that.
> >
> > Huhh.  I can see the problems that this creates in terms of C/C++
> > compatibility.
>
> That's not the biggest problem.
>
> The biggest problem is that you have compiler writers that don't care
> about sane *use* of the features they write a compiler for, they just
> care about the standard.
>
> So they don't care about C vs C++ compatibility. Even more
> importantly, they don't care about the *user* that uses only C++ and
> the fact that their reading of the standard results in *meaningless*
> behavior. They point to the standard and say "that's what the standard
> says, suck it", and silently generate code (or in this case, avoid
> generating code) that makes no sense.

There's an underlying problem here that's independent from the actual
instance that you're worried about here: "no sense" is ultimately a
matter of taste/objectives/priorities as long as the respective
specification is logically consistent.
If you want to be independent of your sanity being different from other people's sanity (e.g., compiler writers), you need to make sure that the specification is precise and says what you want. IOW, think about the specification being the program, and the people being computers; you better want a well-defined program in this case. > So it's not about C++ being incompatible with C, it's about C++ having > insane and bad semantics unless you just admit that "oh, ok, I need to > not just read the standard, I also need to use my brain, and admit > that a C++ statement expression needs to act as if it is an "access" > wrt volatile variables". 1) I agree that (IMO) a good standard strives for being easy to understand. 2) In practice, there is a trade-off between "Easy to understand" and actually producing a specification. A standard is not a tutorial. And that's for good reason, because (a) there might be more than one way to teach something and that should be allowed and (b) that the standard should carry the full precision but still be compact enough to be manageable. 3) Implementations can try to be nice to users by helping them avoiding error-prone corner cases or such. A warning for common problems is such a case. But an implementation has to draw a line somewhere, demarcating cases where it fully exploits what the standard says (eg, to allow optimizations) from cases where it is more conservative and does what the standard allows but in a potentially more intuitive way. That's especially the case if it's being asked to produce high-performance code. 4) There will be arguments for where the line actually is, simply because different users will have different goals. 5) The way to reduce 4) is to either make the standard more specific, or to provide better user documentation. If the standard has strict requirements, then there will be less misunderstanding. 6) To achieve 5), one way is to get involved in the standards process. 
^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-17 23:41 ` Torvald Riegel @ 2014-02-18 0:18 ` Linus Torvalds 2014-02-18 1:26 ` Paul E. McKenney 2014-02-18 15:38 ` Torvald Riegel 0 siblings, 2 replies; 285+ messages in thread From: Linus Torvalds @ 2014-02-18 0:18 UTC (permalink / raw) To: Torvald Riegel Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel <triegel@redhat.com> wrote: > > There's an underlying problem here that's independent from the actual > instance that you're worried about here: "no sense" is ultimately a > matter of taste/objectives/priorities as long as the respective > specification is logically consistent. Yes. But I don't think it's "independent". Exactly *because* some people will read standards without applying "does the resulting code generation actually make sense for the programmer that wrote the code", the standard has to be pretty clear. The standard often *isn't* pretty clear. It wasn't clear enough when it came to "volatile", and yet that was a *much* simpler concept than atomic accesses and memory ordering. And most of the time it's not a big deal. But because the C standard generally tries to be very portable, and cover different machines, there tends to be a mindset that anything inherently unportable is "undefined" or "implementation defined", and then the compiler writer is basically given free rein to do anything they want (with "implementation defined" at least requiring that it is reliably the same thing). And when it comes to memory ordering, *everything* is basically non-portable, because different CPU's very much have different rules. I worry that that means that the standard then takes the stance that "well, compiler re-ordering is no worse than CPU re-ordering, so we let the compiler do anything". 
And then we have to either add "volatile" to make sure the compiler doesn't do that, or use an overly strict memory model at the compiler level that makes it all pointless. So I really really hope that the standard doesn't give compiler writers free hands to do anything that they can prove is "equivalent" in the virtual C machine model. That's not how you get reliable results. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 0:18 ` Linus Torvalds @ 2014-02-18 1:26 ` Paul E. McKenney 2014-02-18 15:38 ` Torvald Riegel 1 sibling, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-18 1:26 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 17, 2014 at 04:18:52PM -0800, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel <triegel@redhat.com> wrote: > > > > There's an underlying problem here that's independent from the actual > > instance that you're worried about here: "no sense" is a ultimately a > > matter of taste/objectives/priorities as long as the respective > > specification is logically consistent. > > Yes. But I don't think it's "independent". > > Exactly *because* some people will read standards without applying > "does the resulting code generation actually make sense for the > programmer that wrote the code", the standard has to be pretty clear. > > The standard often *isn't* pretty clear. It wasn't clear enough when > it came to "volatile", and yet that was a *much* simpler concept than > atomic accesses and memory ordering. > > And most of the time it's not a big deal. But because the C standard > generally tries to be very portable, and cover different machines, > there tends to be a mindset that anything inherently unportable is > "undefined" or "implementation defined", and then the compiler writer > is basically given free reign to do anything they want (with > "implementation defined" at least requiring that it is reliably the > same thing). > > And when it comes to memory ordering, *everything* is basically > non-portable, because different CPU's very much have different rules. > I worry that that means that the standard then takes the stance that > "well, compiler re-ordering is no worse than CPU re-ordering, so we > let the compiler do anything". 
And then we have to either add > "volatile" to make sure the compiler doesn't do that, or use an overly > strict memory model at the compiler level that makes it all pointless. For whatever it is worth, this line of reasoning has been one reason why I have been objecting strenuously every time someone on the committee suggests eliminating "volatile" from the standard. Thanx, Paul > So I really really hope that the standard doesn't give compiler > writers free hands to do anything that they can prove is "equivalent" > in the virtual C machine model. That's not how you get reliable > results. > > Linus > ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 0:18 ` Linus Torvalds 2014-02-18 1:26 ` Paul E. McKenney @ 2014-02-18 15:38 ` Torvald Riegel 2014-02-18 16:55 ` Paul E. McKenney 1 sibling, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-18 15:38 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, 2014-02-17 at 16:18 -0800, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel <triegel@redhat.com> wrote: > > > > There's an underlying problem here that's independent from the actual > > instance that you're worried about here: "no sense" is a ultimately a > > matter of taste/objectives/priorities as long as the respective > > specification is logically consistent. > > Yes. But I don't think it's "independent". > > Exactly *because* some people will read standards without applying > "does the resulting code generation actually make sense for the > programmer that wrote the code", the standard has to be pretty clear. > > The standard often *isn't* pretty clear. It wasn't clear enough when > it came to "volatile", and yet that was a *much* simpler concept than > atomic accesses and memory ordering. > > And most of the time it's not a big deal. But because the C standard > generally tries to be very portable, and cover different machines, > there tends to be a mindset that anything inherently unportable is > "undefined" or "implementation defined", and then the compiler writer > is basically given free reign to do anything they want (with > "implementation defined" at least requiring that it is reliably the > same thing). Yes, that's how it works in general. And this makes sense, because all optimizations rely on that. Second, you can't keep something consistent (eg, between compilers) if it isn't specified. So if we want stricter rules, those need to be specified somewhere. 
> And when it comes to memory ordering, *everything* is basically > non-portable, because different CPU's very much have different rules. Well, the current set of memory orders (and the memory model as a whole) is portable, even though it might not allow one to exploit all hardware properties, and thus might perform sub-optimally in some cases. > I worry that that means that the standard then takes the stance that > "well, compiler re-ordering is no worse than CPU re-ordering, so we > let the compiler do anything". And then we have to either add > "volatile" to make sure the compiler doesn't do that, or use an overly > strict memory model at the compiler level that makes it all pointless. Using "volatile" is not a good option, I think, because synchronization between threads should be orthogonal to observable output of the abstract machine. The current memory model might not allow one to exploit all hardware properties, I agree. But then why don't we work on how to extend it to do so? We need to specify the behavior we want anyway, and this can't be independent of the language semantics, so it has to be conceptually integrated with the standard anyway. > So I really really hope that the standard doesn't give compiler > writers free hands to do anything that they can prove is "equivalent" > in the virtual C machine model. It does, but it also doesn't mean this can't be extended. So let's focus on whether we can find an extension. > That's not how you get reliable > results. In this general form, that's obviously a false claim. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 15:38 ` Torvald Riegel @ 2014-02-18 16:55 ` Paul E. McKenney 2014-02-18 19:57 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-18 16:55 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Feb 18, 2014 at 04:38:40PM +0100, Torvald Riegel wrote: > On Mon, 2014-02-17 at 16:18 -0800, Linus Torvalds wrote: > > On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel <triegel@redhat.com> wrote: > > > > > > There's an underlying problem here that's independent from the actual > > > instance that you're worried about here: "no sense" is a ultimately a > > > matter of taste/objectives/priorities as long as the respective > > > specification is logically consistent. > > > > Yes. But I don't think it's "independent". > > > > Exactly *because* some people will read standards without applying > > "does the resulting code generation actually make sense for the > > programmer that wrote the code", the standard has to be pretty clear. > > > > The standard often *isn't* pretty clear. It wasn't clear enough when > > it came to "volatile", and yet that was a *much* simpler concept than > > atomic accesses and memory ordering. > > > > And most of the time it's not a big deal. But because the C standard > > generally tries to be very portable, and cover different machines, > > there tends to be a mindset that anything inherently unportable is > > "undefined" or "implementation defined", and then the compiler writer > > is basically given free reign to do anything they want (with > > "implementation defined" at least requiring that it is reliably the > > same thing). > > Yes, that's how it works in general. And this makes sense, because all > optimizations rely on that. Second, you can't keep something consistent > (eg, between compilers) if it isn't specified. 
So if we want stricter > rules, those need to be specified somewhere. > > > And when it comes to memory ordering, *everything* is basically > > non-portable, because different CPU's very much have different rules. > > Well, the current set of memory orders (and the memory model as a whole) > is portable, even though it might not allow to exploit all hardware > properties, and thus might perform sub-optimally in some cases. > > > I worry that that means that the standard then takes the stance that > > "well, compiler re-ordering is no worse than CPU re-ordering, so we > > let the compiler do anything". And then we have to either add > > "volatile" to make sure the compiler doesn't do that, or use an overly > > strict memory model at the compiler level that makes it all pointless. > > Using "volatile" is not a good option, I think, because synchronization > between threads should be orthogonal to observable output of the > abstract machine. Are you thinking of "volatile" -instead- of atomics? My belief is that given the current standard there will be times that we need to use "volatile" -in- -addition- to atomics. > The current memory model might not allow to exploit all hardware > properties, I agree. > > But then why don't we work on how to extend it to do so? We need to > specify the behavior we want anyway, and this can't be independent of > the language semantics, so it has to be conceptually integrated with the > standard anyway. > > > So I really really hope that the standard doesn't give compiler > > writers free hands to do anything that they can prove is "equivalent" > > in the virtual C machine model. > > It does, but it also doesn't mean this can't be extended. So let's > focus on whether we can find an extension. > > > That's not how you get reliable > > results. > > In this general form, that's obviously a false claim. These two sentences starkly illustrate the difference in perspective between you two. You are talking past each other. 
Not sure how to fix this at the moment, but what else is new? ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 16:55 ` Paul E. McKenney @ 2014-02-18 19:57 ` Torvald Riegel 0 siblings, 0 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-18 19:57 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, 2014-02-18 at 08:55 -0800, Paul E. McKenney wrote: > On Tue, Feb 18, 2014 at 04:38:40PM +0100, Torvald Riegel wrote: > > On Mon, 2014-02-17 at 16:18 -0800, Linus Torvalds wrote: > > > On Mon, Feb 17, 2014 at 3:41 PM, Torvald Riegel <triegel@redhat.com> wrote: > > > > > > > > There's an underlying problem here that's independent from the actual > > > > instance that you're worried about here: "no sense" is a ultimately a > > > > matter of taste/objectives/priorities as long as the respective > > > > specification is logically consistent. > > > > > > Yes. But I don't think it's "independent". > > > > > > Exactly *because* some people will read standards without applying > > > "does the resulting code generation actually make sense for the > > > programmer that wrote the code", the standard has to be pretty clear. > > > > > > The standard often *isn't* pretty clear. It wasn't clear enough when > > > it came to "volatile", and yet that was a *much* simpler concept than > > > atomic accesses and memory ordering. > > > > > > And most of the time it's not a big deal. But because the C standard > > > generally tries to be very portable, and cover different machines, > > > there tends to be a mindset that anything inherently unportable is > > > "undefined" or "implementation defined", and then the compiler writer > > > is basically given free reign to do anything they want (with > > > "implementation defined" at least requiring that it is reliably the > > > same thing). > > > > Yes, that's how it works in general. And this makes sense, because all > > optimizations rely on that. 
Second, you can't keep something consistent > > (eg, between compilers) if it isn't specified. So if we want stricter > > rules, those need to be specified somewhere. > > > > > And when it comes to memory ordering, *everything* is basically > > > non-portable, because different CPU's very much have different rules. > > > > Well, the current set of memory orders (and the memory model as a whole) > > is portable, even though it might not allow to exploit all hardware > > properties, and thus might perform sub-optimally in some cases. > > > > > I worry that that means that the standard then takes the stance that > > > "well, compiler re-ordering is no worse than CPU re-ordering, so we > > > let the compiler do anything". And then we have to either add > > > "volatile" to make sure the compiler doesn't do that, or use an overly > > > strict memory model at the compiler level that makes it all pointless. > > > > Using "volatile" is not a good option, I think, because synchronization > > between threads should be orthogonal to observable output of the > > abstract machine. > > Are you thinking of "volatile" -instead- of atomics? My belief is that > given the current standard there will be times that we need to use > "volatile" -in- -addition- to atomics. No, I was talking about having to use "volatile" in addition to atomics to get synchronization properties for the atomics. ISTM that this would be unfortunate because the objective for "volatile" (ie, declaring what is output of the abstract machine) is orthogonal to synchronization between threads. So if we want to preserve control dependencies and get ordering guarantees through that, I wouldn't prefer it through hacks based on volatile. For example, if we have a macro or helper function that contains synchronization code, but the compiler can prove that the macro/function is used just on data only accessible to a single thread, then the "volatile" notation on it would prevent optimizations. 
> > The current memory model might not allow to exploit all hardware > > properties, I agree. > > > > But then why don't we work on how to extend it to do so? We need to > > specify the behavior we want anyway, and this can't be independent of > > the language semantics, so it has to be conceptually integrated with the > > standard anyway. > > > > > So I really really hope that the standard doesn't give compiler > > > writers free hands to do anything that they can prove is "equivalent" > > > in the virtual C machine model. > > > > It does, but it also doesn't mean this can't be extended. So let's > > focus on whether we can find an extension. > > > > > That's not how you get reliable > > > results. > > > > In this general form, that's obviously a false claim. > > These two sentences starkly illustrate the difference in perspective > between you two. You are talking past each other. Not sure how to fix > this at the moment, but what else is new? ;-) Well, we're still talking, and we are making little steps in clarifying things, so there is progress :) ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-17 20:18 ` Linus Torvalds 2014-02-17 21:21 ` Torvald Riegel @ 2014-02-17 23:10 ` Alec Teal 2014-02-18 0:05 ` Linus Torvalds 2014-02-18 3:00 ` Paul E. McKenney 2 siblings, 1 reply; 285+ messages in thread From: Alec Teal @ 2014-02-17 23:10 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On 17/02/14 20:18, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel<triegel@redhat.com> wrote: >> Which example do you have in mind here? Haven't we resolved all the >> debated examples, or did I miss any? > Well, Paul seems to still think that the standard possibly allows > speculative writes or possibly value speculation in ways that break > the hardware-guaranteed orderings. > > And personally, I can't read standards paperwork. It is invariably Can't => Don't - evidently. > written in some basically impossible-to-understand lawyeristic mode, You mean "unambiguous" - try reading a patent (Apple have 1000s of trivial ones; I tried reading one once thinking "how could they have phrased it so this got approved" - their technique was to make the reader want to start cutting themselves to prove they weren't numb to everything) > and then it is read by people (compiler writers) that intentionally > try to mis-use the words and do language-lawyering ("that depends on > what the meaning of 'is' is"). The whole "lvalue vs rvalue expression > vs 'what is a volatile access'" thing for C++ was/is a great example > of that. I'm not going to teach you what rvalues and lvalues are, but! http://lmgtfy.com/?q=what+are+rvalues might help. > > So quite frankly, as a result I refuse to have anything to do with the > process directly. Is this goodbye? > > Linus That aside, what is the problem? 
If the compiler has created code that has different program states than what would be created without optimisation, please file a bug report and/or send something to the mailing list USING A CIVIL TONE; there's no need for swear-words and profanities all the time - use them when you want to emphasise something. Additionally, if you are always angry, start calling that state "normal", then reserve such words for when you are outraged. There are so many emails from you bitching about stuff, I've lost track of what you're bitching about, you bitch that much about it. Like this standards stuff above (notice I said stuff, not "crap" or "shit"). What exactly is your problem? If the compiler is doing something the standard does not permit, or optimising something wrongly (read: "puts the program in a different state than if the optimisation was not applied"), that is REALLY serious, and you are right to report it; but whining like a n00b on Stack-overflow when a question gets closed is not helping. I tried reading back through the emails (I dismissed them previously) but there's just so much ranting, and rants about the standard too (I would trash this if I deemed the effort required to delete was less than the storage of the bytes the message takes up). Standardised behaviour is VERY important. So start again: what is the serious problem, have you got any code that would let me replicate it, and what is your version of GCC? Oh and lastly! Optimisations are not as casual as "oh, we could do this and it'd work better"; unlike kernel work or any other software that is being improved, it is very formal (and rightfully so). I seriously recommend you read the first 40 pages at least of a book called "Compiler Design, Analysis and Transformation" - it's not about the parsing phases or anything, but it develops a good introduction and later a good foundation for exploring the field further. 
Compilers do not operate on what I call "A-level logic" and to show what I mean I use the shovel-to-the-face of real analysis, "of course 1/x tends towards 0, it's not gonna be 5!!" = A-level logic. "Let epsilon > 0 be given, then there exists an N...." - formal proof. So when one says "the compiler can prove" it's not some silly thing powered by A-level logic, it is the implementation of something that can be proven to be correct (in the sense of the program states mentioned before) So yeah, calm down and explain - no lashing out at standards bodies, what is the problem? Alec ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-17 23:10 ` Alec Teal @ 2014-02-18 0:05 ` Linus Torvalds 2014-02-18 15:31 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-18 0:05 UTC (permalink / raw) To: Alec Teal Cc: Torvald Riegel, Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 17, 2014 at 3:10 PM, Alec Teal <a.teal@warwick.ac.uk> wrote: > > You mean "unambiguous" - try reading a patent (Apple have 1000s of trivial > ones, I tried reading one once thinking "how could they have phrased it so > this got approved", their technique was to make the reader want to start > cutting themselves to prove they wern't numb to everything) Oh, I agree, patent language is worse. > I'm not going to teach you what rvalues and lvalues, but! I know what lvalues and rvalues are. I *understand* the thinking that goes on behind the "let's not do the access, because it's not an rvalue, so there is no 'access' to the object". I understand it from a technical perspective. I don't understand the compiler writer that uses a *technicality* to argue against generating sane code that is obviously what the user actually asked for. See the difference? > So start again, what is the serious problem, have you got any code that > would let me replicate it, what is your version of GCC? The volatile problem is long fixed. The people who argued for the "legalistically correct", but insane behavior lost (and as mentioned, I think C++11 actually fixed the legalistic reading too). I'm bringing it up because I've had too many cases where compiler writers pointed to standard and said "that is ambiguous or undefined, so we can do whatever the hell we want, regardless of whether that's sensible, or regardless of whether there is a sensible way to get the behavior you want or not". > Oh and lastly! 
> Optimisations are not as casual as "oh, we could do this and > it'd work better" unlike kernel work or any other software that is being > improved, it is very formal (and rightfully so)

Alec, I know compilers. I don't do code generation (quite frankly, register allocation and instruction choice is when I give up), but I did actually write my own for static analysis, including turning things into SSA etc. No, I'm not a "compiler person", but I actually do know enough that I understand what goes on. And exactly because I know enough, I would *really* like atomics to be well-defined, and have very clear - and *local* - rules about how they can be combined and optimized. None of this "if you can prove that the read has value X" stuff. And things like value speculation should simply not be allowed, because that actually breaks the dependency chain that the CPU architects give guarantees for. Instead, make the rules be very clear, and very simple, like my suggestion. You can never remove a load because you can "prove" it has some value, but you can combine two consecutive atomic accesses. For example, CPU people actually do tend to give guarantees for certain things, like stores that are causally related being visible in a particular order. If the compiler starts doing value speculation on atomic accesses, you are quite possibly breaking things like that. It's just not a good idea. Don't do it. Write the standard so that it clearly is disallowed. Because you may think that a C standard is machine-independent, but that isn't really the case. The people who write code still write code for a particular machine. Our code works (in the general case) on different byte orderings, different register sizes, different memory ordering models. But in each *instance* we still end up actually coding for each machine. 
So the rules for atomics should be simple and *specific* enough that when you write code for a particular architecture, you can take the architecture memory ordering *and* the C atomics orderings into account, and do the right thing for that architecture.

And that very much means that doing things like value speculation MUST NOT HAPPEN. See? Even if you can prove that your code is "equivalent", it isn't.

So for example, let's say that you have a pointer, and you have some reason to believe that the pointer has a particular value. So you rewrite following the pointer from this:

    value = ptr->val;

into

    value = speculated->value;
    tmp = ptr;
    if (unlikely(tmp != speculated))
        value = tmp->value;

and maybe you can now make the critical code-path for the speculated case go faster (since now there is no data dependency for the speculated case, and the actual pointer chasing load is now no longer in the critical path), and you made things faster because your profiling showed that the speculated case was true 99% of the time. Wonderful, right? And clearly, the code "provably" does the same thing.

EXCEPT THAT IS NOT TRUE AT ALL.

It very much does not do the same thing at all, and by doing value speculation and "proving" something was true, the only thing you did was to make incorrect code run faster. Because now the causally related load of value from the pointer isn't actually causally related at all, and you broke the memory ordering.

This is why I don't like it when I see Torvald talk about "proving" things. It's bullshit. You can "prove" pretty much anything, and in the process lose sight of the bigger issue, namely that there is code that depends on

When it comes to atomic accesses, you don't play those kinds of games, exactly because the ordering of the accesses matter in ways that are not really sanely describable at a source code level. The *only* kinds of games you play are like the ones I described - combining accesses under very strict local rules. 
And the strict local rules really should be of the type "a store followed by a load to the same location with the same memory ordering constraints can be combined". Never *ever* of the kind "if you can prove X". I hope my example made it clear *why* I react so strongly when Torvald starts talking about "if you can prove the value is 1". Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 0:05 ` Linus Torvalds @ 2014-02-18 15:31 ` Torvald Riegel 2014-02-18 16:49 ` Linus Torvalds 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-18 15:31 UTC (permalink / raw) To: Linus Torvalds Cc: Alec Teal, Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: > And exactly because I know enough, I would *really* like atomics to be > well-defined, and have very clear - and *local* - rules about how they > can be combined and optimized. "Local"? > None of this "if you can prove that the read has value X" stuff. And > things like value speculation should simply not be allowed, because > that actually breaks the dependency chain that the CPU architects give > guarantees for. Instead, make the rules be very clear, and very > simple, like my suggestion. You can never remove a load because you > can "prove" it has some value, but you can combine two consecutive > atomic accesses. Sorry, but the rules *are* very clear. I *really* suggest looking at the formalization by Batty et al. And in these rules, proving that a read will always return value X has a well-defined meaning, and you can use it. That simply follows from how the model is built. What you seem to want just isn't covered by the model as it is today -- you can't infer from that that the model itself would be wrong. The dependency chains aren't modeled in the way you envision (except in what consume_mo tries, but that seems to be hard to implement); they are there on the level of the program logic as modeled by the abstract machine and the respective execution/output rules, but they are not built to represent those specific ordering guarantees the HW gives you. I would also be cautious about claiming that the rules you suggested would be very clear and very simple. 
I haven't seen a memory model spec from you that would be viable as the standard model for C/C++, nor have I seen proof that this would actually be easier to understand for programmers in general. > For example, CPU people actually do tend to give guarantees for > certain things, like stores that are causally related being visible in > a particular order. Yes, but that's not part of the model so far. If you want to exploit this, please make a suggestion for how to extend the model to cover this. You certainly expect compilers to actually optimize code, which usually removes or reorders stuff. Now, we could say that we just don't want that for atomic accesses, but even this isn't as clear-cut: How do we actually detect whether non-atomic code is intended by the programmer to express (1) a dependency that needs to be preserved as a dependency in the generated code, as opposed to (2) a dependency that is just program logic? IOW, consider the consume_mo example "*(p + flag - flag)", where flag is coming from a consume_mo load: Can you give a complete set of rules (for a full program) for when the compiler is allowed to optimize out flag-flag and when not? (And it should be practically implementable in a compiler, and not prevent optimizations where people expect them.) > If the compiler starts doing value speculation on > atomic accesses, you are quite possibly breaking things like that. > It's just not a good idea. Don't do it. Write the standard so that it > clearly is disallowed. You never want value speculation for anything possibly originating from an atomic load? Or what are the concrete rules you have in mind? > Because you may think that a C standard is machine-independent, but > that isn't really the case. The people who write code still write code > for a particular machine. Our code works (in the general case) on > different byte orderings, different register sizes, different memory > ordering models. 
But in each *instance* we still end up actually > coding for each machine. That's how *you* do it. I'm sure you are aware that for lots of programmers, having a machine-dependent standard is not helpful at all. As the standard is written, code is portable. It can certainly be beneficial for programmers to optimize for different machines, but the core of the memory model will always remain portable, because that's what arguably most programmers need. That doesn't prevent machine-dependent extensions, but those then should be well integrated with the rest of the model. > So the rules for atomics should be simple and *specific* enough that > when you write code for a particular architecture, you can take the > architecture memory ordering *and* the C atomics orderings into > account, and do the right thing for that architecture. That would be an extension, but not viable as a *general* requirement as far as the standard is concerned. > And that very much means that doing things like value speculation MUST > NOT HAPPEN. See? Even if you can prove that your code is "equivalent", > it isn't. I'm kind of puzzled why you keep stating generalizations of assertions that clearly aren't well-founded (ie, which might be true for the kernel or your wishes for how the world should be, but aren't true in general or for the objectives of others). It's not helpful to just pretend other people's opinions or worlds didn't exist in discussions such as this one. It's clear that the standard has to consider all users of C/C++. The kernel is a big user of it, but also just that. Why don't you just say that you do not *want* value speculation to happen, because it does X and you'd like to exploit HW guarantee Y etc.? That would be constructive, and actually correct in contrast to those other broad claims. (FWIW: Personally, I don't care whether you say something in a nice tone or in some other way WITH BIG LETTERS. I scan both for logically consistent reasoning, and filter out the rest. 
Therefore, avoiding excessive amounts of the rest would make the discussion more efficient for me, and I'd appreciate if everyone could work towards making it efficient. It's a tricky topic we're discussing, so I'd like to use my brain for the actual technical bits and not for all the noise.) > So for example, let's say that you have a pointer, and you have some > reason to believe that the pointer has a particular value. So you > rewrite following the pointer from this: > > value = ptr->val; > > into > > value = speculated->value; > tmp = ptr; > if (unlikely(tmp != speculated)) > value = tmp->value; > > and maybe you can now make the critical code-path for the speculated > case go faster (since now there is no data dependency for the > speculated case, and the actual pointer chasing load is now no longer > in the critical path), and you made things faster because your > profiling showed that the speculated case was true 99% of the time. > Wonderful, right? And clearly, the code "provably" does the same > thing. Please try to see that any proof is based on the language semantics as specified. Thus, you have to distinguish between (1) disagreeing with the proof being correct and (2) disliking the language semantics. I believe you mean the latter (let me know if you actually mean (1)); in that case, it's the language semantics you want to be different, and we need to discuss this. > EXCEPT THAT IS NOT TRUE AT ALL. > > It very much does not do the same thing at all, and by doing value > speculation and "proving" something was true, the only thing you did > was to make incorrect code run faster. Because now the causally > related load of value from the pointer isn't actually causally related > at all, and you broke the memory ordering. That's not how the language is specified, sorry. > This is why I don't like it when I see Torvald talk about "proving" > things. It's bullshit. 
You can "prove" pretty much anything, and in > the process lose sight of the bigger issue, namely that there is code > that depends on Depends on what? We really need to keep separate things separate here. We started the discussion with the topic of what behavior the memory model *as specified today* allows. In that model, the proofs are correct. This model does not let programmers exploit the control dependencies and such that you have in mind. It does let programmers write correct code, but this code might be somewhat less efficient (e.g., because it would use memory_order_acquire instead of a weaker MO and control dependencies). If you want to discuss that, fine; we'd then need to see how we can specify this properly and offer it to programmers, and how to extend the specification of the language semantics accordingly. That Paul is looking through the uses of synchronization and how they could (or could not) map to the memory model will be good input to this discussion. > When it comes to atomic accesses, you don't play those kinds of games, > exactly because the ordering of the accesses matter in ways that are > not really sanely describable at a source code level. They obviously need to be specifiable for a language that wants to provide portable atomics. For machine-dependent extensions, this could take machine properties into account, of course. But even those need to be modeled properly, at the source code level. How else is a C/C++ compiler to make sense of it? > The *only* kinds > of games you play are like the ones I described - combining accesses > under very strict local rules. > > And the strict local rules really should be of the type "a store > followed by a load to the same location with the same memory ordering > constraints can be combined". Never *ever* of the kind "if you can > prove X". I doubt that's a better way to put it. 
For example, you haven't defined "followed" -- if you'd do that in a complete way, you'd inevitably come to situations where something is checked for being provably correct. What you are doing, in fact, is relying on the proof that the adjacent store/load pair could always be executed atomically by the machine, without preventing forward progress. > I hope my example made it clear *why* I react so strongly when Torvald > starts talking about "if you can prove the value is 1". It's getting clearer to me, but from my perspective, you're objecting to a straw man while not requesting the thing you actually want. It seems you're coming from a bottom-up perspective, seeing C code as something defined like high-level assembler, which still has machine-dependent parts and such. The standard defines the language semantics kind of the other way around, top-down from an abstract machine, with memory_orders that at least to some extent allow one to exploit different HW mechanisms with different costs for enforcing ordering. I can see why you have this perspective, but the standard needs the top-down approach to suit all its users. I believe we should continue discussing how we can bridge the gap between both views. Neither view is inherently wrong; they are just different. No strong opinion will change that, but I'm still hopeful that constructive discussion can help bridge the gap. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 15:31 ` Torvald Riegel @ 2014-02-18 16:49 ` Linus Torvalds 2014-02-18 17:16 ` Paul E. McKenney 2014-02-18 21:21 ` Torvald Riegel 0 siblings, 2 replies; 285+ messages in thread From: Linus Torvalds @ 2014-02-18 16:49 UTC (permalink / raw) To: Torvald Riegel Cc: Alec Teal, Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote: > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: >> And exactly because I know enough, I would *really* like atomics to be >> well-defined, and have very clear - and *local* - rules about how they >> can be combined and optimized. > > "Local"? Yes. So I think that one of the big advantages of atomics over volatile is that they *can* be optimized, and as such I'm not at all against trying to generate much better code than for volatile accesses. But at the same time, that can go too far. For example, one of the things we'd want to use atomics for is page table accesses, where it is very important that we don't generate multiple accesses to the values, because parts of the values can be change *by*hardware* (ie accessed and dirty bits). So imagine that you have some clever global optimizer that sees that the program never ever actually sets the dirty bit at all in any thread, and then uses that kind of non-local knowledge to make optimization decisions. THAT WOULD BE BAD. Do you see what I'm aiming for? Any optimization that tries to prove anything from more than local state is by definition broken, because it assumes that everything is described by the program. But *local* optimizations are fine, as long as they follow the obvious rule of not actually making changes that are semantically visible. 
(In practice, I'd be impressed as hell for that particular example, and we actually do end up setting the dirty bit by hand in some situations, so the example is slightly made up, but there are other cases that might be more realistic in that sometimes we do things that are hidden from the compiler - in assembly etc - and the compiler might *think* it knows what is going on, but it doesn't actually see all accesses). > Sorry, but the rules *are* very clear. I *really* suggest to look at > the formalization by Batty et al. And in these rules, proving that a > read will always return value X has a well-defined meaning, and you can > use it. That simply follows from how the model is built. What "model"? That's the thing. I have tried to figure out whether the model is some abstract C model, or a model based on the actual hardware that the compiler is compiling for, and whether the model is one that assumes the compiler has complete knowledge of the system (see the example above). And it seems to be a mixture of it all. The definitions of the various orderings obviously very much imply that the compiler has to insert the proper barriers or sequence markers for that architecture, but then there is no apparent way to depend on any *other* architecture ordering guarantees. Like our knowledge that all architectures (with the exception of alpha, which really doesn't end up being something we worry about any more) end up having the load dependency ordering guarantee. > What you seem to want just isn't covered by the model as it is today -- > you can't infer from that that the model itself would be wrong. The > dependency chains aren't modeled in the way you envision it (except in > what consume_mo tries, but that seems to be hard to implement); they are > there on the level of the program logic as modeled by the abstract > machine and the respective execution/output rules, but they are not > built to represent those specific ordering guarantees the HW gives you. 
So this is a problem. It basically means that we cannot do the kinds of things we do now, which very much involve knowing what the memory ordering of a particular machine is, and combining that knowledge with our knowledge of code generation. Now, *most* of what we do is protected by locking and is all fine. But we do have a few rather subtle places in RCU and in the scheduler where we depend on the actual dependency chain. In *practice*, I seriously doubt any reasonable compiler can actually make a mess of it. The kinds of optimizations that would actually defeat the dependency chain are simply not realistic. And I suspect that will end up being what we rely on - there being no actual sane sequence that a compiler would ever do, even if we wouldn't have guarantees for some of it. And I suspect I can live with that. We _have_ lived with that for the longest time, after all. We very much do things that aren't covered by any existing C standard, and just basically use tricks to coax the compiler into never generating code that doesn't work (with our inline asm barriers etc being a prime example). > I would also be cautious claiming that the rules you suggested would be > very clear and very simple. I haven't seen a memory model spec from you > that would be viable as the standard model for C/C++, nor have I seen > proof that this would actually be easier to understand for programmers > in general. So personally, if I were to write the spec, I would have taken a completely different approach from what the standard obviously does. I'd have taken the approach of specifying the required semantics each atomic op (ie the memory operations would end up having to be annotated with their ordering constraints), and then said that the compiler can generate any code that is equivalent to that _on_the_target_machine_. Why? Just to avoid the whole "ok, which set of rules applies now" problem. 
>> For example, CPU people actually do tend to give guarantees for >> certain things, like stores that are causally related being visible in >> a particular order. > > Yes, but that's not part of the model so far. If you want to exploit > this, please make a suggestion for how to extend the model to cover > this. See above. This is exactly why I detest the C "model" thing. Now you need ways to describe those CPU guarantees, because if you can't describe them, you can't express them in the model. I would *much* have preferred the C standard to say that you have to generate code that is guaranteed to act the same way - on that machine - as the "naive direct unoptimized translation". IOW, I would *revel* in the fact that different machines are different, and basically just describe the "stupid" code generation. You'd get the guaranteed semantic baseline, but you'd *also* be able to know that whatever architecture guarantees you have would remain. Without having to describe those architecture issues. It would be *so* nice if the C standard had done that for pretty much everything that is "implementation dependent". Not just atomics. [ will look at the rest of your email later ] Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 16:49 ` Linus Torvalds @ 2014-02-18 17:16 ` Paul E. McKenney 2014-02-18 18:23 ` Peter Sewell 2014-02-18 21:40 ` Torvald Riegel 2014-02-18 21:21 ` Torvald Riegel 1 sibling, 2 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-18 17:16 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Alec Teal, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote: > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote: > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: > >> And exactly because I know enough, I would *really* like atomics to be > >> well-defined, and have very clear - and *local* - rules about how they > >> can be combined and optimized. > > > > "Local"? > > Yes. > > So I think that one of the big advantages of atomics over volatile is > that they *can* be optimized, and as such I'm not at all against > trying to generate much better code than for volatile accesses. > > But at the same time, that can go too far. For example, one of the > things we'd want to use atomics for is page table accesses, where it > is very important that we don't generate multiple accesses to the > values, because parts of the values can be change *by*hardware* (ie > accessed and dirty bits). > > So imagine that you have some clever global optimizer that sees that > the program never ever actually sets the dirty bit at all in any > thread, and then uses that kind of non-local knowledge to make > optimization decisions. THAT WOULD BE BAD. Might as well list other reasons why value proofs via whole-program analysis are unreliable for the Linux kernel: 1. As Linus said, changes from hardware. 2. Assembly code that is not visible to the compiler. Inline asms will -normally- let the compiler know what memory they change, but some just use the "memory" tag. 
Worse yet, I suspect that most compilers don't look all that carefully at .S files. Any number of other programs contain assembly files. 3. Kernel modules that have not yet been written. Now, the compiler could refrain from trying to prove anything about an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there is currently no way to communicate this information to the compiler other than marking the variable "volatile". Other programs have similar issues, e.g., via dlopen(). 4. Some drivers allow user-mode code to mmap() some of their state. Any changes undertaken by the user-mode code would be invisible to the compiler. 5. JITed code produced based on BPF: https://lwn.net/Articles/437981/ And probably other stuff as well. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 17:16 ` Paul E. McKenney @ 2014-02-18 18:23 ` Peter Sewell 2014-02-18 19:00 ` Linus Torvalds 2014-02-18 19:42 ` Paul E. McKenney 2014-02-18 21:40 ` Torvald Riegel 1 sibling, 2 replies; 285+ messages in thread From: Peter Sewell @ 2014-02-18 18:23 UTC (permalink / raw) To: Paul McKenney Cc: Linus Torvalds, Torvald Riegel, Alec Teal, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On 18 February 2014 17:16, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote: >> On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote: >> > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: >> >> And exactly because I know enough, I would *really* like atomics to be >> >> well-defined, and have very clear - and *local* - rules about how they >> >> can be combined and optimized. >> > >> > "Local"? >> >> Yes. >> >> So I think that one of the big advantages of atomics over volatile is >> that they *can* be optimized, and as such I'm not at all against >> trying to generate much better code than for volatile accesses. >> >> But at the same time, that can go too far. For example, one of the >> things we'd want to use atomics for is page table accesses, where it >> is very important that we don't generate multiple accesses to the >> values, because parts of the values can be change *by*hardware* (ie >> accessed and dirty bits). >> >> So imagine that you have some clever global optimizer that sees that >> the program never ever actually sets the dirty bit at all in any >> thread, and then uses that kind of non-local knowledge to make >> optimization decisions. THAT WOULD BE BAD. > > Might as well list other reasons why value proofs via whole-program > analysis are unreliable for the Linux kernel: > > 1. As Linus said, changes from hardware. > > 2. 
Assembly code that is not visible to the compiler. > Inline asms will -normally- let the compiler know what > memory they change, but some just use the "memory" tag. > Worse yet, I suspect that most compilers don't look all > that carefully at .S files. > > Any number of other programs contain assembly files. > > 3. Kernel modules that have not yet been written. Now, the > compiler could refrain from trying to prove anything about > an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there > is currently no way to communicate this information to the > compiler other than marking the variable "volatile". > > Other programs have similar issues, e.g., via dlopen(). > > 4. Some drivers allow user-mode code to mmap() some of their > state. Any changes undertaken by the user-mode code would > be invisible to the compiler. > > 5. JITed code produced based on BPF: https://lwn.net/Articles/437981/ > > And probably other stuff as well. interesting list. So are you saying that value-range-analysis and such-like (I say glibly, without really knowing what "such-like" refers to here) are fundamentally incompatible with the kernel code, or can you think of some way to tell the compiler a bound on the footprint of the "unseen" changes in each of those cases? Peter > Thanx, Paul > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 18:23 ` Peter Sewell @ 2014-02-18 19:00 ` Linus Torvalds 2014-02-18 19:42 ` Paul E. McKenney 1 sibling, 0 replies; 285+ messages in thread From: Linus Torvalds @ 2014-02-18 19:00 UTC (permalink / raw) To: Peter.Sewell Cc: Paul McKenney, Torvald Riegel, Alec Teal, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Feb 18, 2014 at 10:23 AM, Peter Sewell <Peter.Sewell@cl.cam.ac.uk> wrote: > > interesting list. So are you saying that value-range-analysis and > such-like (I say glibly, without really knowing what "such-like" > refers to here) are fundamentally incompatible with > the kernel code No, it's fine to do things like value-range-analysis, it's just that you have to do it on a local scope, and not think that you can see every single write to some variable. And as Paul points out, this is actually generally true even outside of kernels. We may be pretty unique in having some things that are literally done by hardware (eg the page table updates), but we're certainly not unique in having loadable modules or code that the compiler doesn't know about by virtue of being compiled separately (assembly files, JITted, whatever). And we are actually perfectly fine with compiler barriers. One of our most common barriers is simply this: #define barrier() __asm__ __volatile__("": : :"memory") which basically tells the compiler "something changes memory in ways you don't understand, so you cannot assume anything about memory contents". 
Obviously, a compiler is still perfectly fine optimizing things like local variables etc that haven't had their address taken across such a barrier (regardless of whether those local variables might be spilled to the stack frame etc), but equally obviously we'd require that this kind of thing makes sure that any atomic writes have been finalized by the time the barrier happens (it does *not* imply a memory ordering, just a compiler memory access barrier - we have separate things for ordering requirements, and those obviously often generate actual barrier instructions) And we're likely always going to have things like this, but if C11 atomics end up working well for us, we might have *fewer* of them. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 18:23 ` Peter Sewell 2014-02-18 19:00 ` Linus Torvalds @ 2014-02-18 19:42 ` Paul E. McKenney 1 sibling, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-18 19:42 UTC (permalink / raw) To: Peter Sewell Cc: Linus Torvalds, Torvald Riegel, Alec Teal, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Feb 18, 2014 at 06:23:47PM +0000, Peter Sewell wrote: > On 18 February 2014 17:16, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote: > >> On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote: > >> > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: > >> >> And exactly because I know enough, I would *really* like atomics to be > >> >> well-defined, and have very clear - and *local* - rules about how they > >> >> can be combined and optimized. > >> > > >> > "Local"? > >> > >> Yes. > >> > >> So I think that one of the big advantages of atomics over volatile is > >> that they *can* be optimized, and as such I'm not at all against > >> trying to generate much better code than for volatile accesses. > >> > >> But at the same time, that can go too far. For example, one of the > >> things we'd want to use atomics for is page table accesses, where it > >> is very important that we don't generate multiple accesses to the > >> values, because parts of the values can be change *by*hardware* (ie > >> accessed and dirty bits). > >> > >> So imagine that you have some clever global optimizer that sees that > >> the program never ever actually sets the dirty bit at all in any > >> thread, and then uses that kind of non-local knowledge to make > >> optimization decisions. THAT WOULD BE BAD. > > > > Might as well list other reasons why value proofs via whole-program > > analysis are unreliable for the Linux kernel: > > > > 1. 
As Linus said, changes from hardware. > > > > 2. Assembly code that is not visible to the compiler. > > Inline asms will -normally- let the compiler know what > > memory they change, but some just use the "memory" tag. > > Worse yet, I suspect that most compilers don't look all > > that carefully at .S files. > > > > Any number of other programs contain assembly files. > > > > 3. Kernel modules that have not yet been written. Now, the > > compiler could refrain from trying to prove anything about > > an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there > > is currently no way to communicate this information to the > > compiler other than marking the variable "volatile". > > > > Other programs have similar issues, e.g., via dlopen(). > > > > 4. Some drivers allow user-mode code to mmap() some of their > > state. Any changes undertaken by the user-mode code would > > be invisible to the compiler. > > > > 5. JITed code produced based on BPF: https://lwn.net/Articles/437981/ > > > > And probably other stuff as well. > > interesting list. So are you saying that value-range-analysis and > such-like (I say glibly, without really knowing what "such-like" > refers to here) are fundamentally incompatible with > the kernel code, or can you think of some way to tell the compiler a > bound on the footprint of the "unseen" changes in each of those cases? Other than the "volatile" keyword, no. Well, I suppose you could also create a function that changed the variables in question, then arrange to never call it, but in such a way that the compiler could not prove that it was never called. But ouch! Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 17:16 ` Paul E. McKenney 2014-02-18 18:23 ` Peter Sewell @ 2014-02-18 21:40 ` Torvald Riegel 2014-02-18 21:52 ` Peter Zijlstra 2014-02-18 22:58 ` Paul E. McKenney 1 sibling, 2 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-18 21:40 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Alec Teal, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote: > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote: > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote: > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: > > >> And exactly because I know enough, I would *really* like atomics to be > > >> well-defined, and have very clear - and *local* - rules about how they > > >> can be combined and optimized. > > > > > > "Local"? > > > > Yes. > > > > So I think that one of the big advantages of atomics over volatile is > > that they *can* be optimized, and as such I'm not at all against > > trying to generate much better code than for volatile accesses. > > > > But at the same time, that can go too far. For example, one of the > > things we'd want to use atomics for is page table accesses, where it > > is very important that we don't generate multiple accesses to the > > values, because parts of the values can be change *by*hardware* (ie > > accessed and dirty bits). > > > > So imagine that you have some clever global optimizer that sees that > > the program never ever actually sets the dirty bit at all in any > > thread, and then uses that kind of non-local knowledge to make > > optimization decisions. THAT WOULD BE BAD. > > Might as well list other reasons why value proofs via whole-program > analysis are unreliable for the Linux kernel: > > 1. As Linus said, changes from hardware. This is what's volatile is for, right? 
(Or the weak-volatile idea I mentioned). Compilers won't be able to prove something about the values of such variables, if marked (weak-)volatile. > 2. Assembly code that is not visible to the compiler. > Inline asms will -normally- let the compiler know what > memory they change, but some just use the "memory" tag. > Worse yet, I suspect that most compilers don't look all > that carefully at .S files. > > Any number of other programs contain assembly files. Are the annotations of changed memory really a problem? If the "memory" tag exists, isn't that supposed to mean all memory? To make a proof about a program for location X, the compiler has to analyze all uses of X. Thus, as soon as X escapes into an .S file, the compiler will simply not be able to prove a thing (except maybe due to the data-race-free requirement for non-atomics). The attempt to prove something isn't unreliable, simply because a correct compiler won't claim to be able to "prove" something. One thing that could undermine this is a program addressing objects other than through the mechanisms defined in the language. For example, if one thread lays out a data structure at a constant fixed memory address, and another one then uses the fixed memory address to get access to the object with a cast (e.g., (void*)0x123). > 3. Kernel modules that have not yet been written. Now, the > compiler could refrain from trying to prove anything about > an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there > is currently no way to communicate this information to the > compiler other than marking the variable "volatile". Even if the variable is just externally accessible, then the compiler knows that it can't do whole-program analysis about it. It is true that whole-program analysis will not be applicable in this case, but it will not be unreliable. I think that's an important difference. > Other programs have similar issues, e.g., via dlopen(). > > 4. 
Some drivers allow user-mode code to mmap() some of their > state. Any changes undertaken by the user-mode code would > be invisible to the compiler. A good point, but a compiler that doesn't try to (incorrectly) assume something about the semantics of mmap will simply see that the mmap'ed data will escape to stuff it can't analyze, so it will not be able to make a proof. This is different from, for example, malloc(), which is guaranteed to return "fresh" nonaliasing memory. > 5. JITed code produced based on BPF: https://lwn.net/Articles/437981/ This might be special, or not, depending on how the JITed code gets access to data. If this is via fixed addresses (e.g., (void*)0x123), then see above. If this is through function calls that the compiler can't analyze, then this is like 4. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 21:40 ` Torvald Riegel @ 2014-02-18 21:52 ` Peter Zijlstra 2014-02-19 9:52 ` Torvald Riegel 2014-02-18 22:58 ` Paul E. McKenney 1 sibling, 1 reply; 285+ messages in thread From: Peter Zijlstra @ 2014-02-18 21:52 UTC (permalink / raw) To: Torvald Riegel Cc: paulmck, Linus Torvalds, Alec Teal, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc > > 4. Some drivers allow user-mode code to mmap() some of their > > state. Any changes undertaken by the user-mode code would > > be invisible to the compiler. > > A good point, but a compiler that doesn't try to (incorrectly) assume > something about the semantics of mmap will simply see that the mmap'ed > data will escape to stuff it can't analyze, so it will not be able to > make a proof. > > This is different from, for example, malloc(), which is guaranteed to > return "fresh" nonaliasing memory. The kernel side of this is different.. it looks like 'normal' memory, we just happen to allow it to end up in userspace too. But on that point; how do you tell the compiler the difference between malloc() and mmap()? Is that some function attribute? ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 21:52 ` Peter Zijlstra @ 2014-02-19 9:52 ` Torvald Riegel 0 siblings, 0 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-19 9:52 UTC (permalink / raw) To: Peter Zijlstra Cc: paulmck, Linus Torvalds, Alec Teal, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, 2014-02-18 at 22:52 +0100, Peter Zijlstra wrote: > > > 4. Some drivers allow user-mode code to mmap() some of their > > > state. Any changes undertaken by the user-mode code would > > > be invisible to the compiler. > > > > A good point, but a compiler that doesn't try to (incorrectly) assume > > something about the semantics of mmap will simply see that the mmap'ed > > data will escape to stuff if can't analyze, so it will not be able to > > make a proof. > > > > This is different from, for example, malloc(), which is guaranteed to > > return "fresh" nonaliasing memory. > > The kernel side of this is different.. it looks like 'normal' memory, we > just happen to allow it to end up in userspace too. > > But on that point; how do you tell the compiler the difference between > malloc() and mmap()? Is that some function attribute? Yes: malloc The malloc attribute is used to tell the compiler that a function may be treated as if any non-NULL pointer it returns cannot alias any other pointer valid when the function returns and that the memory has undefined content. This often improves optimization. Standard functions with this property include malloc and calloc. realloc-like functions do not have this property as the memory pointed to does not have undefined content. I'm not quite sure whether GCC assumes malloc() to be indeed C's malloc even if the function attribute isn't used, and/or whether that is different for freestanding environments. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 21:40 ` Torvald Riegel 2014-02-18 21:52 ` Peter Zijlstra @ 2014-02-18 22:58 ` Paul E. McKenney 2014-02-19 10:59 ` Torvald Riegel 1 sibling, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-18 22:58 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Alec Teal, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote: > xagsmtp4.20140218214207.8481@vmsdvm9.vnet.ibm.com > X-Xagent-Gateway: vmsdvm9.vnet.ibm.com (XAGSMTP4 at VMSDVM9) > > On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote: > > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote: > > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote: > > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: > > > >> And exactly because I know enough, I would *really* like atomics to be > > > >> well-defined, and have very clear - and *local* - rules about how they > > > >> can be combined and optimized. > > > > > > > > "Local"? > > > > > > Yes. > > > > > > So I think that one of the big advantages of atomics over volatile is > > > that they *can* be optimized, and as such I'm not at all against > > > trying to generate much better code than for volatile accesses. > > > > > > But at the same time, that can go too far. For example, one of the > > > things we'd want to use atomics for is page table accesses, where it > > > is very important that we don't generate multiple accesses to the > > > values, because parts of the values can be change *by*hardware* (ie > > > accessed and dirty bits). > > > > > > So imagine that you have some clever global optimizer that sees that > > > the program never ever actually sets the dirty bit at all in any > > > thread, and then uses that kind of non-local knowledge to make > > > optimization decisions. THAT WOULD BE BAD. 
> > > > Might as well list other reasons why value proofs via whole-program > > analysis are unreliable for the Linux kernel: > > > > 1. As Linus said, changes from hardware. > > This is what's volatile is for, right? (Or the weak-volatile idea I > mentioned). > > Compilers won't be able to prove something about the values of such > variables, if marked (weak-)volatile. Yep. > > 2. Assembly code that is not visible to the compiler. > > Inline asms will -normally- let the compiler know what > > memory they change, but some just use the "memory" tag. > > Worse yet, I suspect that most compilers don't look all > > that carefully at .S files. > > > > Any number of other programs contain assembly files. > > Are the annotations of changed memory really a problem? If the "memory" > tag exists, isn't that supposed to mean all memory? > > To make a proof about a program for location X, the compiler has to > analyze all uses of X. Thus, as soon as X escapes into an .S file, then > the compiler will simply not be able to prove a thing (except maybe due > to the data-race-free requirement for non-atomics). The attempt to > prove something isn't unreliable, simply because a correct compiler > won't claim to be able to "prove" something. I am indeed less worried about inline assembler than I am about files full of assembly. Or files full of other languages. > One reason that could corrupt this is that if program addresses objects > other than through the mechanisms defined in the language. For example, > if one thread lays out a data structure at a constant fixed memory > address, and another one then uses the fixed memory address to get > access to the object with a cast (e.g., (void*)0x123). Or if the program uses gcc linker scripts to get the same effect. > > 3. Kernel modules that have not yet been written. 
Now, the > > compiler could refrain from trying to prove anything about > > an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there > > is currently no way to communicate this information to the > > compiler other than marking the variable "volatile". > > Even if the variable is just externally accessible, then the compiler > knows that it can't do whole-program analysis about it. > > It is true that whole-program analysis will not be applicable in this > case, but it will not be unreliable. I think that's an important > difference. Let me make sure that I understand what you are saying. If my program has "extern int foo;", the compiler will refrain from doing whole-program analysis involving "foo"? Or to ask it another way, when you say "whole-program analysis", are you restricting that analysis to the current translation unit? If so, I was probably not the only person thinking that you instead meant analysis across all translation units linked into the program. ;-) > > Other programs have similar issues, e.g., via dlopen(). > > > > 4. Some drivers allow user-mode code to mmap() some of their > > state. Any changes undertaken by the user-mode code would > > be invisible to the compiler. > > A good point, but a compiler that doesn't try to (incorrectly) assume > something about the semantics of mmap will simply see that the mmap'ed > data will escape to stuff if can't analyze, so it will not be able to > make a proof. > > This is different from, for example, malloc(), which is guaranteed to > return "fresh" nonaliasing memory. As Peter noted, this is the other end of mmap(). The -user- code sees that there is an mmap(), but the kernel code invokes functions that poke values into hardware registers (or into in-memory page tables) that, as a side effect, cause some of the kernel's memory to be accessible to some user program. Presumably the kernel code needs to do something to account for the possibility of usermode access whenever it accesses that memory. 
Volatile casts, volatile storage class on the declarations, barrier() calls, whatever. I echo Peter's question about how one tags functions like mmap(). I will also remember this for the next time someone on the committee discounts "volatile". ;-) > > 5. JITed code produced based on BPF: https://lwn.net/Articles/437981/ > > This might be special, or not, depending on how the JITed code gets > access to data. If this is via fixed addresses (e.g., (void*)0x123), > then see above. If this is through function calls that the compiler > can't analyze, then this is like 4. It could well be via the kernel reading its own symbol table, sort of a poor-person's reflection facility. I guess that would be for all intents and purposes equivalent to your (void*)0x123. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 22:58 ` Paul E. McKenney @ 2014-02-19 10:59 ` Torvald Riegel 2014-02-19 15:14 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-19 10:59 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Alec Teal, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, 2014-02-18 at 14:58 -0800, Paul E. McKenney wrote: > On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote: > > xagsmtp4.20140218214207.8481@vmsdvm9.vnet.ibm.com > > X-Xagent-Gateway: vmsdvm9.vnet.ibm.com (XAGSMTP4 at VMSDVM9) > > > > On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote: > > > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote: > > > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote: > > > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: > > > > >> And exactly because I know enough, I would *really* like atomics to be > > > > >> well-defined, and have very clear - and *local* - rules about how they > > > > >> can be combined and optimized. > > > > > > > > > > "Local"? > > > > > > > > Yes. > > > > > > > > So I think that one of the big advantages of atomics over volatile is > > > > that they *can* be optimized, and as such I'm not at all against > > > > trying to generate much better code than for volatile accesses. > > > > > > > > But at the same time, that can go too far. For example, one of the > > > > things we'd want to use atomics for is page table accesses, where it > > > > is very important that we don't generate multiple accesses to the > > > > values, because parts of the values can be change *by*hardware* (ie > > > > accessed and dirty bits). 
> > > > > > > > So imagine that you have some clever global optimizer that sees that > > > > the program never ever actually sets the dirty bit at all in any > > > > thread, and then uses that kind of non-local knowledge to make > > > > optimization decisions. THAT WOULD BE BAD. > > > > > > Might as well list other reasons why value proofs via whole-program > > > analysis are unreliable for the Linux kernel: > > > > > > 1. As Linus said, changes from hardware. > > > > This is what's volatile is for, right? (Or the weak-volatile idea I > > mentioned). > > > > Compilers won't be able to prove something about the values of such > > variables, if marked (weak-)volatile. > > Yep. > > > > 2. Assembly code that is not visible to the compiler. > > > Inline asms will -normally- let the compiler know what > > > memory they change, but some just use the "memory" tag. > > > Worse yet, I suspect that most compilers don't look all > > > that carefully at .S files. > > > > > > Any number of other programs contain assembly files. > > > > Are the annotations of changed memory really a problem? If the "memory" > > tag exists, isn't that supposed to mean all memory? > > > > To make a proof about a program for location X, the compiler has to > > analyze all uses of X. Thus, as soon as X escapes into an .S file, then > > the compiler will simply not be able to prove a thing (except maybe due > > to the data-race-free requirement for non-atomics). The attempt to > > prove something isn't unreliable, simply because a correct compiler > > won't claim to be able to "prove" something. > > I am indeed less worried about inline assembler than I am about files > full of assembly. Or files full of other languages. > > > One reason that could corrupt this is that if program addresses objects > > other than through the mechanisms defined in the language. 
For example, > > if one thread lays out a data structure at a constant fixed memory > > address, and another one then uses the fixed memory address to get > > access to the object with a cast (e.g., (void*)0x123). > > Or if the program uses gcc linker scripts to get the same effect. > > > > 3. Kernel modules that have not yet been written. Now, the > > > compiler could refrain from trying to prove anything about > > > an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there > > > is currently no way to communicate this information to the > > > compiler other than marking the variable "volatile". > > > > Even if the variable is just externally accessible, then the compiler > > knows that it can't do whole-program analysis about it. > > > > It is true that whole-program analysis will not be applicable in this > > case, but it will not be unreliable. I think that's an important > > difference. > > Let me make sure that I understand what you are saying. If my program has > "extern int foo;", the compiler will refrain from doing whole-program > analysis involving "foo"? Yes. If it can't be sure to actually have the whole program available, it can't do whole-program analysis, right? Things like the linker scripts you mention or other stuff outside of the language semantics complicates this somewhat, and maybe some compilers assume too much. There's also the point that data-race-freedom is required for non-atomics even if those are shared with non-C-code. But except those corner cases, a compiler sees whether something escapes and becomes visible/accessible to other entities. > Or to ask it another way, when you say > "whole-program analysis", are you restricting that analysis to the > current translation unit? No. 
I mean, you can do analysis of the current translation unit, but that will do just that; if the variable, for example, is accessible outside of this translation unit, the compiler can't make a whole-program proof about it, and thus can't do certain optimizations. > If so, I was probably not the only person thinking that you instead meant > analysis across all translation units linked into the program. ;-) That's roughly what I meant, but not just including translation units but truly all parts of the program, including non-C program parts. IOW, literally the whole program :) That's why I said that if you indeed do *whole program* analysis, then things should be fine (modulo corner cases such as linker scripts, later binary rewriting of code produced by the compiler, etc.). Many of the things you worried about *prevent* whole-program analysis, which means that they do not make it any less reliable. Does that clarify my line of thought? > > > Other programs have similar issues, e.g., via dlopen(). > > > > > > 4. Some drivers allow user-mode code to mmap() some of their > > > state. Any changes undertaken by the user-mode code would > > > be invisible to the compiler. > > > > A good point, but a compiler that doesn't try to (incorrectly) assume > > something about the semantics of mmap will simply see that the mmap'ed > > data will escape to stuff if can't analyze, so it will not be able to > > make a proof. > > > > This is different from, for example, malloc(), which is guaranteed to > > return "fresh" nonaliasing memory. > > As Peter noted, this is the other end of mmap(). The -user- code sees > that there is an mmap(), but the kernel code invokes functions that > poke values into hardware registers (or into in-memory page tables) > that, as a side effect, cause some of the kernel's memory to be > accessible to some user program. > > Presumably the kernel code needs to do something to account for the > possibility of usermode access whenever it accesses that memory. 
> Volatile casts, volatile storage class on the declarations, barrier() > calls, whatever. In this case, there should be another option except volatile: If userspace code is using the C11 memory model as well and lock-free atomics to synchronize, then this should have well-defined semantics without using volatile. On both sides, the compiler will see that mmap() (or similar) is called, so that means the data escapes to something unknown, which could create threads and so on. So first, it can't do whole-program analysis for this state anymore, and has to assume that other C11 threads are accessing this memory. Next, lock-free atomics are specified to be "address-free", meaning that they must work independent of where in memory the atomics are mapped (see C++ (e.g., N3690) 29.4p3; that's a "should" and non-normative, but essential IMO). Thus, this then boils down to just a simple case of synchronization. (Of course, the rest of the ABI has to match too for the data exchange to work.) > I echo Peter's question about how one tags functions like mmap(). > > I will also remember this for the next time someone on the committee > discounts "volatile". ;-) > > > > 5. JITed code produced based on BPF: https://lwn.net/Articles/437981/ > > > > This might be special, or not, depending on how the JITed code gets > > access to data. If this is via fixed addresses (e.g., (void*)0x123), > > then see above. If this is through function calls that the compiler > > can't analyze, then this is like 4. > > It could well be via the kernel reading its own symbol table, sort of > a poor-person's reflection facility. I guess that would be for all > intents and purposes equivalent to your (void*)0x123. If it is replacing code generated by the compiler, then yes. If the JIT is just filling in functions that had been undefined yet declared before, then the compiler will have seen the data escape through the function interfaces, and should be aware that there is other stuff. 
^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-19 10:59 ` Torvald Riegel @ 2014-02-19 15:14 ` Paul E. McKenney 2014-02-19 17:55 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-19 15:14 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Alec Teal, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Wed, Feb 19, 2014 at 11:59:08AM +0100, Torvald Riegel wrote: > On Tue, 2014-02-18 at 14:58 -0800, Paul E. McKenney wrote: > > On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote: > > > xagsmtp4.20140218214207.8481@vmsdvm9.vnet.ibm.com > > > X-Xagent-Gateway: vmsdvm9.vnet.ibm.com (XAGSMTP4 at VMSDVM9) > > > > > > On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote: > > > > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote: > > > > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote: > > > > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: > > > > > >> And exactly because I know enough, I would *really* like atomics to be > > > > > >> well-defined, and have very clear - and *local* - rules about how they > > > > > >> can be combined and optimized. > > > > > > > > > > > > "Local"? > > > > > > > > > > Yes. > > > > > > > > > > So I think that one of the big advantages of atomics over volatile is > > > > > that they *can* be optimized, and as such I'm not at all against > > > > > trying to generate much better code than for volatile accesses. > > > > > > > > > > But at the same time, that can go too far. For example, one of the > > > > > things we'd want to use atomics for is page table accesses, where it > > > > > is very important that we don't generate multiple accesses to the > > > > > values, because parts of the values can be change *by*hardware* (ie > > > > > accessed and dirty bits). 
> > > > > > > > > > So imagine that you have some clever global optimizer that sees that > > > > > the program never ever actually sets the dirty bit at all in any > > > > > thread, and then uses that kind of non-local knowledge to make > > > > > optimization decisions. THAT WOULD BE BAD. > > > > > > > > Might as well list other reasons why value proofs via whole-program > > > > analysis are unreliable for the Linux kernel: > > > > > > > > 1. As Linus said, changes from hardware. > > > > > > This is what's volatile is for, right? (Or the weak-volatile idea I > > > mentioned). > > > > > > Compilers won't be able to prove something about the values of such > > > variables, if marked (weak-)volatile. > > > > Yep. > > > > > > 2. Assembly code that is not visible to the compiler. > > > > Inline asms will -normally- let the compiler know what > > > > memory they change, but some just use the "memory" tag. > > > > Worse yet, I suspect that most compilers don't look all > > > > that carefully at .S files. > > > > > > > > Any number of other programs contain assembly files. > > > > > > Are the annotations of changed memory really a problem? If the "memory" > > > tag exists, isn't that supposed to mean all memory? > > > > > > To make a proof about a program for location X, the compiler has to > > > analyze all uses of X. Thus, as soon as X escapes into an .S file, then > > > the compiler will simply not be able to prove a thing (except maybe due > > > to the data-race-free requirement for non-atomics). The attempt to > > > prove something isn't unreliable, simply because a correct compiler > > > won't claim to be able to "prove" something. > > > > I am indeed less worried about inline assembler than I am about files > > full of assembly. Or files full of other languages. > > > > > One reason that could corrupt this is that if program addresses objects > > > other than through the mechanisms defined in the language. 
For example, > > > if one thread lays out a data structure at a constant fixed memory > > > address, and another one then uses the fixed memory address to get > > > access to the object with a cast (e.g., (void*)0x123). > > > > Or if the program uses gcc linker scripts to get the same effect. > > > > > > 3. Kernel modules that have not yet been written. Now, the > > > > compiler could refrain from trying to prove anything about > > > > an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there > > > > is currently no way to communicate this information to the > > > > compiler other than marking the variable "volatile". > > > > > > Even if the variable is just externally accessible, then the compiler > > > knows that it can't do whole-program analysis about it. > > > > > > It is true that whole-program analysis will not be applicable in this > > > case, but it will not be unreliable. I think that's an important > > > difference. > > > > Let me make sure that I understand what you are saying. If my program has > > "extern int foo;", the compiler will refrain from doing whole-program > > analysis involving "foo"? > > Yes. If it can't be sure to actually have the whole program available, > it can't do whole-program analysis, right? Things like the linker > scripts you mention or other stuff outside of the language semantics > complicates this somewhat, and maybe some compilers assume too much. > There's also the point that data-race-freedom is required for > non-atomics even if those are shared with non-C-code. > > But except those corner cases, a compiler sees whether something escapes > and becomes visible/accessible to other entities. The traditional response to "except those corner cases" is of course "Murphy was an optimist". ;-) That said, point taken -- you expect that the compiler will always be told of anything that would limit its ability to reason about the whole program. 
> > Or to ask it another way, when you say > > "whole-program analysis", are you restricting that analysis to the > > current translation unit? > > No. I mean, you can do analysis of the current translation unit, but > that will do just that; if the variable, for example, is accessible > outside of this translation unit, the compiler can't make a > whole-program proof about it, and thus can't do certain optimizations. I had to read this several times to find an interpretation that might make sense. That interpretation is "The compiler will do whole-program analysis only on those variables that it believes are accessed only by the current translation unit." Is that what you meant? > > If so, I was probably not the only person thinking that you instead meant > > analysis across all translation units linked into the program. ;-) > > That's roughly what I meant, but not just including translation units > but truly all parts of the program, including non-C program parts. IOW, > literally the whole program :) > > That's why I said that if you indeed do *whole program* analysis, then > things should be fine (modulo corner cases such as linker scripts, later > binary rewriting of code produced by the compiler, etc.). Many of the > things you worried about *prevent* whole-program analysis, which means > that they do not make it any less reliable. Does that clarify my line > of thought? If my interpretation above is correct, yes. It appears that you are much more confident than many kernel folks that the compiler will be informed of everything that might limit its omniscience. > > > > Other programs have similar issues, e.g., via dlopen(). > > > > > > > > 4. Some drivers allow user-mode code to mmap() some of their > > > > state. Any changes undertaken by the user-mode code would > > > > be invisible to the compiler. 
> > > > > > A good point, but a compiler that doesn't try to (incorrectly) assume > > > something about the semantics of mmap will simply see that the mmap'ed > > > data will escape to stuff if can't analyze, so it will not be able to > > > make a proof. > > > > > > This is different from, for example, malloc(), which is guaranteed to > > > return "fresh" nonaliasing memory. > > > > As Peter noted, this is the other end of mmap(). The -user- code sees > > that there is an mmap(), but the kernel code invokes functions that > > poke values into hardware registers (or into in-memory page tables) > > that, as a side effect, cause some of the kernel's memory to be > > accessible to some user program. > > > > Presumably the kernel code needs to do something to account for the > > possibility of usermode access whenever it accesses that memory. > > Volatile casts, volatile storage class on the declarations, barrier() > > calls, whatever. > > In this case, there should be another option except volatile: If > userspace code is using the C11 memory model as well and lock-free > atomics to synchronize, then this should have well-defined semantics > without using volatile. For user-mode programs that have not yet been written, this could be a reasonable approach. For existing user-mode binaries, C11 won't help, which leaves things like volatile and assembly (including the "memory" qualifier as used in barrier() macro). > On both sides, the compiler will see that mmap() (or similar) is called, > so that means the data escapes to something unknown, which could create > threads and so on. So first, it can't do whole-program analysis for > this state anymore, and has to assume that other C11 threads are > accessing this memory. Next, lock-free atomics are specified to be > "address-free", meaning that they must work independent of where in > memory the atomics are mapped (see C++ (e.g., N3690) 29.4p3; that's a > "should" and non-normative, but essential IMO). 
Thus, this then boils > down to just a simple case of synchronization. (Of course, the rest of > the ABI has to match too for the data exchange to work.) The compiler will see mmap() on the user side, but not on the kernel side. On the kernel side, something special is required. Agree that "address-free" would be nice as "shall" rather than "should". > > I echo Peter's question about how one tags functions like mmap(). > > > > I will also remember this for the next time someone on the committee > > discounts "volatile". ;-) > > > > > > 5. JITed code produced based on BPF: https://lwn.net/Articles/437981/ > > > > > > This might be special, or not, depending on how the JITed code gets > > > access to data. If this is via fixed addresses (e.g., (void*)0x123), > > > then see above. If this is through function calls that the compiler > > > can't analyze, then this is like 4. > > > > It could well be via the kernel reading its own symbol table, sort of > > a poor-person's reflection facility. I guess that would be for all > > intents and purposes equivalent to your (void*)0x123. > > If it is replacing code generated by the compiler, then yes. If the JIT > is just filling in functions that had been undefined yet declared > before, then the compiler will have seen the data escape through the > function interfaces, and should be aware that there is other stuff. So one other concern would then be things things like ftrace, kprobes, ksplice, and so on. These rewrite the kernel binary at runtime, though in very limited ways. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-19 15:14 ` Paul E. McKenney @ 2014-02-19 17:55 ` Torvald Riegel 2014-02-19 22:12 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-19 17:55 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Alec Teal, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Wed, 2014-02-19 at 07:14 -0800, Paul E. McKenney wrote: > On Wed, Feb 19, 2014 at 11:59:08AM +0100, Torvald Riegel wrote: > > On Tue, 2014-02-18 at 14:58 -0800, Paul E. McKenney wrote: > > > On Tue, Feb 18, 2014 at 10:40:15PM +0100, Torvald Riegel wrote: > > > > xagsmtp4.20140218214207.8481@vmsdvm9.vnet.ibm.com > > > > X-Xagent-Gateway: vmsdvm9.vnet.ibm.com (XAGSMTP4 at VMSDVM9) > > > > > > > > On Tue, 2014-02-18 at 09:16 -0800, Paul E. McKenney wrote: > > > > > On Tue, Feb 18, 2014 at 08:49:13AM -0800, Linus Torvalds wrote: > > > > > > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote: > > > > > > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: > > > > > > >> And exactly because I know enough, I would *really* like atomics to be > > > > > > >> well-defined, and have very clear - and *local* - rules about how they > > > > > > >> can be combined and optimized. > > > > > > > > > > > > > > "Local"? > > > > > > > > > > > > Yes. > > > > > > > > > > > > So I think that one of the big advantages of atomics over volatile is > > > > > > that they *can* be optimized, and as such I'm not at all against > > > > > > trying to generate much better code than for volatile accesses. > > > > > > > > > > > > But at the same time, that can go too far. 
For example, one of the > > > > > > things we'd want to use atomics for is page table accesses, where it > > > > > > is very important that we don't generate multiple accesses to the > > > > > > values, because parts of the values can be change *by*hardware* (ie > > > > > > accessed and dirty bits). > > > > > > > > > > > > So imagine that you have some clever global optimizer that sees that > > > > > > the program never ever actually sets the dirty bit at all in any > > > > > > thread, and then uses that kind of non-local knowledge to make > > > > > > optimization decisions. THAT WOULD BE BAD. > > > > > > > > > > Might as well list other reasons why value proofs via whole-program > > > > > analysis are unreliable for the Linux kernel: > > > > > > > > > > 1. As Linus said, changes from hardware. > > > > > > > > This is what's volatile is for, right? (Or the weak-volatile idea I > > > > mentioned). > > > > > > > > Compilers won't be able to prove something about the values of such > > > > variables, if marked (weak-)volatile. > > > > > > Yep. > > > > > > > > 2. Assembly code that is not visible to the compiler. > > > > > Inline asms will -normally- let the compiler know what > > > > > memory they change, but some just use the "memory" tag. > > > > > Worse yet, I suspect that most compilers don't look all > > > > > that carefully at .S files. > > > > > > > > > > Any number of other programs contain assembly files. > > > > > > > > Are the annotations of changed memory really a problem? If the "memory" > > > > tag exists, isn't that supposed to mean all memory? > > > > > > > > To make a proof about a program for location X, the compiler has to > > > > analyze all uses of X. Thus, as soon as X escapes into an .S file, then > > > > the compiler will simply not be able to prove a thing (except maybe due > > > > to the data-race-free requirement for non-atomics). 
The attempt to > > > > prove something isn't unreliable, simply because a correct compiler > > > > won't claim to be able to "prove" something. > > > > > > I am indeed less worried about inline assembler than I am about files > > > full of assembly. Or files full of other languages. > > > > > > > One reason that could corrupt this is that if program addresses objects > > > > other than through the mechanisms defined in the language. For example, > > > > if one thread lays out a data structure at a constant fixed memory > > > > address, and another one then uses the fixed memory address to get > > > > access to the object with a cast (e.g., (void*)0x123). > > > > > > Or if the program uses gcc linker scripts to get the same effect. > > > > > > > > 3. Kernel modules that have not yet been written. Now, the > > > > > compiler could refrain from trying to prove anything about > > > > > an EXPORT_SYMBOL() or EXPORT_SYMBOL_GPL() variable, but there > > > > > is currently no way to communicate this information to the > > > > > compiler other than marking the variable "volatile". > > > > > > > > Even if the variable is just externally accessible, then the compiler > > > > knows that it can't do whole-program analysis about it. > > > > > > > > It is true that whole-program analysis will not be applicable in this > > > > case, but it will not be unreliable. I think that's an important > > > > difference. > > > > > > Let me make sure that I understand what you are saying. If my program has > > > "extern int foo;", the compiler will refrain from doing whole-program > > > analysis involving "foo"? > > > > Yes. If it can't be sure to actually have the whole program available, > > it can't do whole-program analysis, right? Things like the linker > > scripts you mention or other stuff outside of the language semantics > > complicates this somewhat, and maybe some compilers assume too much. 
> > There's also the point that data-race-freedom is required for > > non-atomics even if those are shared with non-C-code. > > > > But except those corner cases, a compiler sees whether something escapes > > and becomes visible/accessible to other entities. > > The traditional response to "except those corner cases" is of course > "Murphy was an optimist". ;-) > > That said, point taken -- you expect that the compiler will always be > told of anything that would limit its ability to reason about the > whole program. > > > > Or to ask it another way, when you say > > > "whole-program analysis", are you restricting that analysis to the > > > current translation unit? > > > > No. I mean, you can do analysis of the current translation unit, but > > that will do just that; if the variable, for example, is accessible > > outside of this translation unit, the compiler can't make a > > whole-program proof about it, and thus can't do certain optimizations. > > I had to read this several times to find an interpretation that might > make sense. That interpretation is "The compiler will do whole-program > analysis only on those variables that it believes are accessed only by > the current translation unit." Is that what you meant? Yes. For a pure-C program (ie, one that's perfectly specified by just the C standard), this will be the case; IOW, I'm not aware of any corner case in such a setting. But the kernel is doing more than what the C standard covers, so we'll have to check those things. > > > If so, I was probably not the only person thinking that you instead meant > > > analysis across all translation units linked into the program. ;-) > > > > That's roughly what I meant, but not just including translation units > > but truly all parts of the program, including non-C program parts. 
IOW, > > literally the whole program :) > > > > That's why I said that if you indeed do *whole program* analysis, then > > things should be fine (modulo corner cases such as linker scripts, later > > binary rewriting of code produced by the compiler, etc.). Many of the > > things you worried about *prevent* whole-program analysis, which means > > that they do not make it any less reliable. Does that clarify my line > > of thought? > > If my interpretation above is correct, yes. It appears that you are much > more confident than many kernel folks that the compiler will be informed of > everything that might limit its omniscience. Maybe. Nonetheless, if it's just a matter of letting the compiler know that there is Other Stuff when the compiler isn't aware of that, but once doing so all is good because the memory model handles this just fine, then this makes me more optimistic than if the model was insufficient. > > > > > Other programs have similar issues, e.g., via dlopen(). > > > > > > > > > > 4. Some drivers allow user-mode code to mmap() some of their > > > > > state. Any changes undertaken by the user-mode code would > > > > > be invisible to the compiler. > > > > > > > > A good point, but a compiler that doesn't try to (incorrectly) assume > > > > something about the semantics of mmap will simply see that the mmap'ed > > > > data will escape to stuff if can't analyze, so it will not be able to > > > > make a proof. > > > > > > > > This is different from, for example, malloc(), which is guaranteed to > > > > return "fresh" nonaliasing memory. > > > > > > As Peter noted, this is the other end of mmap(). The -user- code sees > > > that there is an mmap(), but the kernel code invokes functions that > > > poke values into hardware registers (or into in-memory page tables) > > > that, as a side effect, cause some of the kernel's memory to be > > > accessible to some user program. 
> > > > > > Presumably the kernel code needs to do something to account for the > > > possibility of usermode access whenever it accesses that memory. > > > Volatile casts, volatile storage class on the declarations, barrier() > > > calls, whatever. > > > > In this case, there should be another option except volatile: If > > userspace code is using the C11 memory model as well and lock-free > > atomics to synchronize, then this should have well-defined semantics > > without using volatile. > > For user-mode programs that have not yet been written, this could be > a reasonable approach. For existing user-mode binaries, C11 won't > help, which leaves things like volatile and assembly (including the > "memory" qualifier as used in barrier() macro). I agree. I also agree that there is the possibility of malicious/incorrect userspace code, and that this shouldn't endanger the kernel. > > On both sides, the compiler will see that mmap() (or similar) is called, > > so that means the data escapes to something unknown, which could create > > threads and so on. So first, it can't do whole-program analysis for > > this state anymore, and has to assume that other C11 threads are > > accessing this memory. Next, lock-free atomics are specified to be > > "address-free", meaning that they must work independent of where in > > memory the atomics are mapped (see C++ (e.g., N3690) 29.4p3; that's a > > "should" and non-normative, but essential IMO). Thus, this then boils > > down to just a simple case of synchronization. (Of course, the rest of > > the ABI has to match too for the data exchange to work.) > > The compiler will see mmap() on the user side, but not on the kernel > side. On the kernel side, something special is required. 
Maybe -- you'll certainly know better :) But maybe it's not that hard: For example, if the memory is in current code made available to userspace via calling some function with an asm implementation that the compiler can't analyze, then this should be sufficient. > Agree that "address-free" would be nice as "shall" rather than "should". > > > > I echo Peter's question about how one tags functions like mmap(). > > > > > > I will also remember this for the next time someone on the committee > > > discounts "volatile". ;-) > > > > > > > > 5. JITed code produced based on BPF: https://lwn.net/Articles/437981/ > > > > > > > > This might be special, or not, depending on how the JITed code gets > > > > access to data. If this is via fixed addresses (e.g., (void*)0x123), > > > > then see above. If this is through function calls that the compiler > > > > can't analyze, then this is like 4. > > > > > > It could well be via the kernel reading its own symbol table, sort of > > > a poor-person's reflection facility. I guess that would be for all > > > intents and purposes equivalent to your (void*)0x123. > > > > If it is replacing code generated by the compiler, then yes. If the JIT > > is just filling in functions that had been undefined yet declared > > before, then the compiler will have seen the data escape through the > > function interfaces, and should be aware that there is other stuff. > > So one other concern would then be things like ftrace, kprobes, > ksplice, and so on. These rewrite the kernel binary at runtime, though > in very limited ways. Yes. Nonetheless, I wouldn't see a problem if they, say, rewrite with C11-compatible code (and same ABI) on a function granularity (and when the function itself isn't executing concurrently) -- this seems to be similar to just having another compiler compile this particular function. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-19 17:55 ` Torvald Riegel @ 2014-02-19 22:12 ` Paul E. McKenney 0 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-19 22:12 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Alec Teal, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Wed, Feb 19, 2014 at 06:55:51PM +0100, Torvald Riegel wrote: > On Wed, 2014-02-19 at 07:14 -0800, Paul E. McKenney wrote: > > On Wed, Feb 19, 2014 at 11:59:08AM +0100, Torvald Riegel wrote: [ . . . ] > > > On both sides, the compiler will see that mmap() (or similar) is called, > > > so that means the data escapes to something unknown, which could create > > > threads and so on. So first, it can't do whole-program analysis for > > > this state anymore, and has to assume that other C11 threads are > > > accessing this memory. Next, lock-free atomics are specified to be > > > "address-free", meaning that they must work independent of where in > > > memory the atomics are mapped (see C++ (e.g., N3690) 29.4p3; that's a > > > "should" and non-normative, but essential IMO). Thus, this then boils > > > down to just a simple case of synchronization. (Of course, the rest of > > > the ABI has to match too for the data exchange to work.) > > > > The compiler will see mmap() on the user side, but not on the kernel > > side. On the kernel side, something special is required. > > Maybe -- you'll certainly know better :) > > But maybe it's not that hard: For example, if the memory is in current > code made available to userspace via calling some function with an asm > implementation that the compiler can't analyze, then this should be > sufficient. The kernel code would need to explicitly tell the compiler what portions of the kernel address space were covered by this. I would not want the compiler to have to work it out based on observing interactions with the page tables. 
;-) > > Agree that "address-free" would be nice as "shall" rather than "should". > > > > > > I echo Peter's question about how one tags functions like mmap(). > > > > > > > > I will also remember this for the next time someone on the committee > > > > discounts "volatile". ;-) > > > > > > > > > > 5. JITed code produced based on BPF: https://lwn.net/Articles/437981/ > > > > > > > > > > This might be special, or not, depending on how the JITed code gets > > > > > access to data. If this is via fixed addresses (e.g., (void*)0x123), > > > > > then see above. If this is through function calls that the compiler > > > > > can't analyze, then this is like 4. > > > > > > > > It could well be via the kernel reading its own symbol table, sort of > > > > a poor-person's reflection facility. I guess that would be for all > > > > intents and purposes equivalent to your (void*)0x123. > > > > > > If it is replacing code generated by the compiler, then yes. If the JIT > > > is just filling in functions that had been undefined yet declared > > > before, then the compiler will have seen the data escape through the > > > function interfaces, and should be aware that there is other stuff. > > > > So one other concern would then be things things like ftrace, kprobes, > > ksplice, and so on. These rewrite the kernel binary at runtime, though > > in very limited ways. > > Yes. Nonetheless, I wouldn't see a problem if they, say, rewrite with > C11-compatible code (and same ABI) on a function granularity (and when > the function itself isn't executing concurrently) -- this seems to be > similar to just having another compiler compile this particular > function. Well, they aren't using C11-compatible code yet. They do patch within functions. And in some cases, they make staged sequences of changes to allow the patching to happen concurrently with other CPUs executing the code being patched. 
Not sure that any of the latter is actually in the kernel at the moment, but it has at least been prototyped and discussed. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 16:49 ` Linus Torvalds 2014-02-18 17:16 ` Paul E. McKenney @ 2014-02-18 21:21 ` Torvald Riegel 2014-02-18 21:40 ` Peter Zijlstra ` (2 more replies) 1 sibling, 3 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-18 21:21 UTC (permalink / raw) To: Linus Torvalds Cc: Alec Teal, Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, 2014-02-18 at 08:49 -0800, Linus Torvalds wrote: > On Tue, Feb 18, 2014 at 7:31 AM, Torvald Riegel <triegel@redhat.com> wrote: > > On Mon, 2014-02-17 at 16:05 -0800, Linus Torvalds wrote: > >> And exactly because I know enough, I would *really* like atomics to be > >> well-defined, and have very clear - and *local* - rules about how they > >> can be combined and optimized. > > > > "Local"? > > Yes. > > So I think that one of the big advantages of atomics over volatile is > that they *can* be optimized, and as such I'm not at all against > trying to generate much better code than for volatile accesses. > > But at the same time, that can go too far. For example, one of the > things we'd want to use atomics for is page table accesses, where it > is very important that we don't generate multiple accesses to the > values, because parts of the values can be changed *by hardware* (ie > accessed and dirty bits). > > So imagine that you have some clever global optimizer that sees that > the program never ever actually sets the dirty bit at all in any > thread, and then uses that kind of non-local knowledge to make > optimization decisions. THAT WOULD BE BAD. > > Do you see what I'm aiming for? Yes, I do. But that seems to be "volatile" territory. It crosses the boundaries of the abstract machine, and thus is input/output. Which fraction of your atomic accesses can read values produced by hardware? I would still suppose that lots of synchronization is not affected by this. 
Do you perhaps want a weaker form of volatile? That is, one that, for example, allows combining of two adjacent loads of the dirty bits, but will make sure that this is treated as if there is some imaginary external thread that it cannot analyze and that may write? I'm trying to be careful here in distinguishing between volatile and synchronization because I believe those are orthogonal, and this is a good thing because it allows for more-than-conservatively optimized code. > Any optimization that tries to prove > anything from more than local state is by definition broken, because > it assumes that everything is described by the program. Well, that's how atomics that aren't volatile are defined in the standard. I can see that you want something else too, but that doesn't mean that the other thing is broken. > But *local* optimizations are fine, as long as they follow the obvious > rule of not actually making changes that are semantically visible. If we assume that there is this imaginary thread called hardware that can write/read to/from such weak-volatile atomics, I believe this should restrict optimizations sufficiently even in the model as specified in the standard. For example, it would prevent a compiler from proving that there is no access by another thread to a variable, so it would prevent the cases in our discussion that you didn't want to get optimized. Yet, I believe the model itself could stay unmodified. Thus, with this "weak volatile", we could let programmers request precisely the semantics that they want, without using volatile too much and without preventing optimizations more than necessary. Thoughts? 
> (In practice, I'd be impressed as hell for that particular example, > and we actually do end up setting the dirty bit by hand in some > situations, so the example is slightly made up, but there are other > cases that might be more realistic in that sometimes we do things that > are hidden from the compiler - in assembly etc - and the compiler > might *think* it knows what is going on, but it doesn't actually see > all accesses). If something is visible to assembly, then the compiler should see this (or at least be aware that it can't prove anything about such data). Or am I missing anything? > > Sorry, but the rules *are* very clear. I *really* suggest to look at > > the formalization by Batty et al. And in these rules, proving that a > > read will always return value X has a well-defined meaning, and you can > > use it. That simply follows from how the model is built. > > What "model"? The C/C++ memory model was what I was referring to. > That's the thing. I have tried to figure out whether the model is some > abstract C model, or a model based on the actual hardware that the > compiler is compiling for, and whether the model is one that assumes > the compiler has complete knowledge of the system (see the example > above). It's a model as specified in the standard. It's not parametrized by the hardware the program will eventually run on (ignoring implementation-defined behavior, timing, etc.). The compiler has complete knowledge of the system unless for "volatiles" and things coming in from the external world or escaping to it (e.g., stuff escaping into asm statements). > And it seems to be a mixture of it all. The definitions of the various > orderings obviously very much imply that the compiler has to insert > the proper barriers or sequence markers for that architecture, but > then there is no apparent way to depend on any *other* architecture > ordering guarantees. 
Like our knowledge that all architectures (with > the exception of alpha, which really doesn't end up being something we > worry about any more) end up having the load dependency ordering > guarantee. That is true, things like those other ordering guarantees aren't part of the model currently. But they might perhaps make good extensions. > > What you seem to want just isn't covered by the model as it is today -- > > you can't infer from that that the model itself would be wrong. The > > dependency chains aren't modeled in the way you envision it (except in > > what consume_mo tries, but that seems to be hard to implement); they are > > there on the level of the program logic as modeled by the abstract > > machine and the respective execution/output rules, but they are not > > built to represent those specific ordering guarantees the HW gives you. > > So this is a problem. It basically means that we cannot do the kinds > of things we do now, which very much involve knowing what the memory > ordering of a particular machine is, and combining that knowledge with > our knowledge of code generation. Yes, I agree that the standard is lacking a "feature" in this case (for lack of a better word -- I wouldn't call it a bug). I do hope that this discussion can lead us to a better understanding of this, and how we might perhaps add it as an extension to the C/C++ model while still keeping all the benefits of the model. > Now, *most* of what we do is protected by locking and is all fine. But > we do have a few rather subtle places in RCU and in the scheduler > where we depend on the actual dependency chain. > > In *practice*, I seriously doubt any reasonable compiler can actually > make a mess of it. The kinds of optimizations that would actually > defeat the dependency chain are simply not realistic. 
And I suspect > that will end up being what we rely on - there being no actual sane > sequence that a compiler would ever do, even if we wouldn't have > guarantees for some of it. Yes, maybe. I would prefer if we could put hard requirements for compilers about this in the standard (so that you get a more stable target to work against), but that might be hard to do in this case. > And I suspect I can live with that. We _have_ lived with that for the > longest time, after all. We very much do things that aren't covered by > any existing C standard, and just basically use tricks to coax the > compiler into never generating code that doesn't work (with our inline > asm barriers etc being a prime example). > > > I would also be cautious claiming that the rules you suggested would be > > very clear and very simple. I haven't seen a memory model spec from you > > that would be viable as the standard model for C/C++, nor have I seen > > proof that this would actually be easier to understand for programmers > > in general. > > So personally, if I were to write the spec, I would have taken a > completely different approach from what the standard obviously does. > > I'd have taken the approach of specifying the required semantics each > atomic op (ie the memory operations would end up having to be > annotated with their ordering constraints), and then said that the > compiler can generate any code that is equivalent to that > _on_the_target_machine_. > > Why? Just to avoid the whole "ok, which set of rules applies now" problem. > > >> For example, CPU people actually do tend to give guarantees for > >> certain things, like stores that are causally related being visible in > >> a particular order. > > > > Yes, but that's not part of the model so far. If you want to exploit > > this, please make a suggestion for how to extend the model to cover > > this. > > See above. This is exactly why I detest the C "model" thing. 
Now you > need ways to describe those CPU guarantees, because if you can't > describe them, you can't express them in the model. That's true, it has to be modeled. The downside of your approach would be that it's not portable code anymore, which would be a big downside for the majority of the C/C++ programmers I guess. > I would *much* have preferred the C standard to say that you have to > generate code that is guaranteed to act the same way - on that machine > - as the "naive direct unoptimized translation". I can see the advantages that has when using C as something like a high-level assembler. But as the discussion of the difficulty of tracking data dependencies indicates (including tracking though non-synchronizing code), this behavior might spread throughout the program and is hard to contain. For many programs, especially on the userspace side or wherever people can't or don't want to perform all optimizations themselves, the optimizations are a real benefit. For example, generic or modular programming becomes so much more practical if a compiler can optimize such code. So if we want both, it's hard to draw the line. > IOW, I would *revel* in the fact that different machines are > different, and basically just describe the "stupid" code generation. > You'd get the guaranteed semantic baseline, but you'd *also* be able > to know that whatever architecture guarantees you have would remain. > Without having to describe those architecture issues. Would you be okay with this preventing lots of optimizations a compiler otherwise could do? Because AFAICT, this spreads into non-synchronizing code via the dependency-tracking, for example. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 21:21 ` Torvald Riegel @ 2014-02-18 21:40 ` Peter Zijlstra 2014-02-18 21:47 ` Torvald Riegel 2014-02-18 21:47 ` Peter Zijlstra 2014-02-18 22:14 ` Linus Torvalds 2 siblings, 1 reply; 285+ messages in thread From: Peter Zijlstra @ 2014-02-18 21:40 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Alec Teal, Paul McKenney, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote: > Well, that's how atomics that aren't volatile are defined in the > standard. I can see that you want something else too, but that doesn't > mean that the other thing is broken. Well that other thing depends on being able to see the entire program at compile time. PaulMck already listed various ways in which this is not feasible even for normal userspace code. In particular; DSOs and JITs were mentioned. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 21:40 ` Peter Zijlstra @ 2014-02-18 21:47 ` Torvald Riegel 2014-02-19 15:23 ` David Lang 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-18 21:47 UTC (permalink / raw) To: Peter Zijlstra Cc: Linus Torvalds, Alec Teal, Paul McKenney, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, 2014-02-18 at 22:40 +0100, Peter Zijlstra wrote: > On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote: > > Well, that's how atomics that aren't volatile are defined in the > > standard. I can see that you want something else too, but that doesn't > > mean that the other thing is broken. > > Well that other thing depends on being able to see the entire program at > compile time. PaulMck already listed various ways in which this is > not feasible even for normal userspace code. > > In particular; DSOs and JITs were mentioned. No it doesn't depend on whole-program analysis being possible. Because if it isn't, then a correct compiler will just not do certain optimizations simply because it can't prove properties required for the optimization to hold. With the exception of access to objects via magic numbers (e.g., fixed and known addresses (see my reply to Paul), which are outside of the semantics specified in the standard), I don't see a correctness problem here. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 21:47 ` Torvald Riegel @ 2014-02-19 15:23 ` David Lang 2014-02-19 18:11 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: David Lang @ 2014-02-19 15:23 UTC (permalink / raw) To: Torvald Riegel Cc: Peter Zijlstra, Linus Torvalds, Alec Teal, Paul McKenney, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, 18 Feb 2014, Torvald Riegel wrote: > On Tue, 2014-02-18 at 22:40 +0100, Peter Zijlstra wrote: >> On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote: >>> Well, that's how atomics that aren't volatile are defined in the >>> standard. I can see that you want something else too, but that doesn't >>> mean that the other thing is broken. >> >> Well that other thing depends on being able to see the entire program at >> compile time. PaulMck already listed various ways in which this is >> not feasible even for normal userspace code. >> >> In particular; DSOs and JITs were mentioned. > > No it doesn't depend on whole-program analysis being possible. Because > if it isn't, then a correct compiler will just not do certain > optimizations simply because it can't prove properties required for the > optimization to hold. With the exception of access to objects via magic > numbers (e.g., fixed and known addresses (see my reply to Paul), which > are outside of the semantics specified in the standard), I don't see a > correctness problem here. Are you really sure that the compiler can figure out every possible thing that a loadable module or JITed code can access? That seems like a pretty strong claim. David Lang ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-19 15:23 ` David Lang @ 2014-02-19 18:11 ` Torvald Riegel 0 siblings, 0 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-19 18:11 UTC (permalink / raw) To: David Lang Cc: Peter Zijlstra, Linus Torvalds, Alec Teal, Paul McKenney, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Wed, 2014-02-19 at 07:23 -0800, David Lang wrote: > On Tue, 18 Feb 2014, Torvald Riegel wrote: > > > On Tue, 2014-02-18 at 22:40 +0100, Peter Zijlstra wrote: > >> On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote: > >>> Well, that's how atomics that aren't volatile are defined in the > >>> standard. I can see that you want something else too, but that doesn't > >>> mean that the other thing is broken. > >> > >> Well that other thing depends on being able to see the entire program at > >> compile time. PaulMck already listed various ways in which this is > >> not feasible even for normal userspace code. > >> > >> In particular; DSOs and JITs were mentioned. > > > > No it doesn't depend on whole-program analysis being possible. Because > > if it isn't, then a correct compiler will just not do certain > > optimizations simply because it can't prove properties required for the > > optimization to hold. With the exception of access to objects via magic > > numbers (e.g., fixed and known addresses (see my reply to Paul), which > > are outside of the semantics specified in the standard), I don't see a > > correctness problem here. > > Are you really sure that the compiler can figure out every possible thing that a > loadable module or JITed code can access? That seems like a pretty strong claim. If the other code can be produced by a C translation unit that is valid to be linked with the rest of the program, then I'm pretty sure the compiler has a well-defined notion of whether it does or does not see all other potential accesses. 
IOW, if the C compiler is dealing with C semantics and mechanisms only (including the C mechanisms for sharing with non-C code!), then it will know what to do. If you're playing tricks behind the C compiler's back using implementation-defined stuff outside of the C specification, then there's nothing the compiler really can do. For example, if you're trying to access a variable on a function's stack from some other function, you better know how the register allocator of the compiler operates. In contrast, if you let this function simply export the address of the variable to some external place, all will be fine. The documentation of GCC's -fwhole-program and -flto might also be interesting for you. GCC wouldn't need to have -fwhole-program if it weren't conservative by default (correctly so). ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 21:21 ` Torvald Riegel 2014-02-18 21:40 ` Peter Zijlstra @ 2014-02-18 21:47 ` Peter Zijlstra 2014-02-19 11:07 ` Torvald Riegel 2014-02-18 22:14 ` Linus Torvalds 2 siblings, 1 reply; 285+ messages in thread From: Peter Zijlstra @ 2014-02-18 21:47 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Alec Teal, Paul McKenney, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote: > Yes, I do. But that seems to be "volatile" territory. It crosses the > boundaries of the abstract machine, and thus is input/output. Which > fraction of your atomic accesses can read values produced by hardware? > I would still suppose that lots of synchronization is not affected by > this. It's not only hardware; also the kernel/user boundary has this same problem. We cannot a priori say what userspace will do; in fact, because we're a general purpose OS, we must assume it will willfully try its bestest to wreck whatever assumptions we make about its behaviour. We also have loadable modules -- much like regular userspace DSOs -- so there too we cannot say what will or will not happen. We also have JITs that generate code on demand. And I'm absolutely sure (with the exception of the JITs, it's not an area I've worked on) that we have atomic usage across all those boundaries. I must agree with Linus, global state driven optimizations are crack-brained; esp. for atomics. We simply cannot know all state at compile time. The best we can hope for are local optimizations. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-18 21:47 ` Peter Zijlstra
@ 2014-02-19 11:07 ` Torvald Riegel
2014-02-19 11:42 ` Peter Zijlstra
0 siblings, 1 reply; 285+ messages in thread
From: Torvald Riegel @ 2014-02-19 11:07 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Linus Torvalds, Alec Teal, Paul McKenney, Will Deacon,
Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Tue, 2014-02-18 at 22:47 +0100, Peter Zijlstra wrote:
> On Tue, Feb 18, 2014 at 10:21:56PM +0100, Torvald Riegel wrote:
> > Yes, I do. But that seems to be "volatile" territory. It crosses the
> > boundaries of the abstract machine, and thus is input/output. Which
> > fraction of your atomic accesses can read values produced by hardware?
> > I would still suppose that lots of synchronization is not affected by
> > this.
>
> It's not only hardware; the kernel/user boundary has this same problem.
> We cannot a priori say what userspace will do; in fact, because
> we're a general purpose OS, we must assume it will willfully try its
> bestest to wreck whatever assumptions we make about its behaviour.

That's a good note, and I think a distinct case from those below,
because here you're saying that you can't assume that the userspace code
follows the C11 semantics ...

> We also have loadable modules -- much like regular userspace DSOs -- so
> there too we cannot say what will or will not happen.
>
> We also have JITs that generate code on demand.

... whereas for those, you might assume that the other code follows C11
semantics and the same ABI, which makes this just a normal case, already
handled (see my other replies nearby in this thread).

> And I'm absolutely sure (with the exception of the JITs, it's not an area
> I've worked on) that we have atomic usage across all those boundaries.

That would be fine as long as all involved parties use the same memory
model and ABI to implement it.
(Of course, I'm assuming here that the compiler is aware of sharing with
other entities, which is always the case except in those corner cases
like accesses to (void*)0x123 magically aliasing with something else.)

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-19 11:07 ` Torvald Riegel
@ 2014-02-19 11:42 ` Peter Zijlstra
0 siblings, 0 replies; 285+ messages in thread
From: Peter Zijlstra @ 2014-02-19 11:42 UTC (permalink / raw)
To: Torvald Riegel
Cc: Linus Torvalds, Alec Teal, Paul McKenney, Will Deacon,
Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Wed, Feb 19, 2014 at 12:07:02PM +0100, Torvald Riegel wrote:
> > It's not only hardware; the kernel/user boundary has this same problem.
> > We cannot a priori say what userspace will do; in fact, because
> > we're a general purpose OS, we must assume it will willfully try its
> > bestest to wreck whatever assumptions we make about its behaviour.
>
> That's a good note, and I think a distinct case from those below,
> because here you're saying that you can't assume that the userspace code
> follows the C11 semantics ...

Right; we can malfunction in those cases though, as long as the
malfunctioning happens on the userspace side. That is, whatever userspace
does should not cause the kernel to crash, but userspace crashing itself,
or getting crap data or whatever, is its own damn fault for not following
expected behaviour.

To stay on topic: if the kernel/user interface requires memory ordering
and userspace explicitly omits the barriers, all malfunctioning should be
on the user. For instance, it might lose a forward-progress guarantee or
data-integrity guarantees.

Specifically, given a kernel/user lockless producer/consumer buffer, if
the user side allows the tail write to happen before its data reads are
complete, the kernel might overwrite the data it's still reading. Or in
the case of futexes, if the user side doesn't use the appropriate
operations, its lock state gets corrupted, but only userspace should
suffer.

But yes, this does require some care and consideration from our side.

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 21:21 ` Torvald Riegel 2014-02-18 21:40 ` Peter Zijlstra 2014-02-18 21:47 ` Peter Zijlstra @ 2014-02-18 22:14 ` Linus Torvalds 2014-02-19 14:40 ` Torvald Riegel 2 siblings, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-18 22:14 UTC (permalink / raw) To: Torvald Riegel Cc: Alec Teal, Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Feb 18, 2014 at 1:21 PM, Torvald Riegel <triegel@redhat.com> wrote: >> >> So imagine that you have some clever global optimizer that sees that >> the program never ever actually sets the dirty bit at all in any >> thread, and then uses that kind of non-local knowledge to make >> optimization decisions. THAT WOULD BE BAD. >> >> Do you see what I'm aiming for? > > Yes, I do. But that seems to be "volatile" territory. It crosses the > boundaries of the abstract machine, and thus is input/output. Which > fraction of your atomic accesses can read values produced by hardware? > I would still suppose that lots of synchronization is not affected by > this. The "hardware can change things" case is indeed pretty rare. But quite frankly, even when it isn't hardware, as far as the compiler is concerned you have the exact same issue - you have TLB faults happening on other CPU's that do the same thing asynchronously using software TLB fault handlers. So *semantically*, it really doesn't make any difference what-so-ever if it's a software TLB handler on another CPU, a microcoded TLB fault, or an actual hardware path. So if the answer for all of the above is "use volatile", then I think that means that the C11 atomics are badly designed. The whole *point* of atomic accesses is that stuff like above should "JustWork(tm)" > Do you perhaps want a weaker form of volatile? 
That is, one that, for > example, allows combining of two adjacent loads of the dirty bits, but > will make sure that this is treated as if there is some imaginary > external thread that it cannot analyze and that may write? Yes, that's basically what I would want. And it is what I would expect an atomic to be. Right now we tend to use "ACCESS_ONCE()", which is a bit of a misnomer, because technically we really generally want "ACCESS_AT_MOST_ONCE()" (but "once" is what we get, because we use volatile, and is a hell of a lot simpler to write ;^). So we obviously use "volatile" for this currently, and generally the semantics we really want are: - the load or store is done as a single access ("atomic") - the compiler must not try to re-materialize the value by reloading it from memory (this is the "at most once" part) and quite frankly, "volatile" is a big hammer for this. In practice it tends to work pretty well, though, because in _most_ cases, there really is just the single access, so there isn't anything that it could be combined with, and the biggest issue is often just the correctness of not re-materializing the value. And I agree - memory ordering is a totally separate issue, and in fact we largely tend to consider it entirely separate. For cases where we have ordering constraints, we either handle those with special accessors (ie "atomic-modify-and-test" helpers tend to have some serialization guarantees built in), or we add explicit fencing. But semantically, C11 atomic accessors *should* generally have the correct behavior for our uses. If we have to add "volatile", that makes atomics basically useless. We already *have* the volatile semantics, if atomics need it, that just means that atomics have zero upside for us. >> But *local* optimizations are fine, as long as they follow the obvious >> rule of not actually making changes that are semantically visible. 
> > If we assume that there is this imaginary thread called hardware that
> > can write/read to/from such weak-volatile atomics, I believe this should
> > restrict optimizations sufficiently even in the model as specified in
> > the standard.

Well, what about *real* threads that do this, but that aren't analyzable
by the C compiler because they are written in another language entirely
(inline asm, asm, perl, INTERCAL, microcode, PAL-code, whatever)?

I really don't think that "hardware" is necessary for this to happen.
What is done by hardware on x86, for example, is done by PAL-code (loaded
at boot-time) on alpha, and done by hand-tuned assembler fault handlers on
Sparc. The *effect* is the same: it's not visible to the compiler. There
is no way in hell that the compiler can understand the hand-tuned Sparc
TLB fault handler, even if it parsed it.

>> IOW, I would *revel* in the fact that different machines are
>> different, and basically just describe the "stupid" code generation.
>> You'd get the guaranteed semantic baseline, but you'd *also* be able
>> to know that whatever architecture guarantees you have would remain.
>> Without having to describe those architecture issues.
>
> Would you be okay with this preventing lots of optimizations a compiler
> otherwise could do? Because AFAICT, this spreads into non-synchronizing
> code via the dependency-tracking, for example.

Actually, it probably wouldn't really hurt code generation. The thing is,
you really have three cases:

 - architectures that have weak memory ordering and fairly stupid
   hardware (power, some day maybe ARM)

 - architectures that expose a fairly strong memory ordering, but reorder
   aggressively in hardware by tracking cacheline lifetimes until
   instruction retirement (x86)

 - slow hardware (current ARM, crap Atom hardware, etc. embedded devices
   that don't really do either)

and let's just agree that anybody who does high-performance work doesn't
really care about the third case, ok?
The compiler really wouldn't be that constrained in ordering on a weakly ordered machine: sure, you'll have to add "lwsync" on powerpc pretty much every time you have an ordered atomic read, but then you could optimize them away for certain cases (like "oh, it was a 'consume' read, and the only consumer is already data-dependent, so now we can remove the lwsync"), so you'd end up with pretty much the code you looked for anyway. And the HPC people who use the "relaxed" model wouldn't see the extra memory barriers anyway. And on a non-weakly ordered machine like x86, there are going to be hardly any barriers in the first place, and the hardware will do the right thing wrt ordering, so it's not like the compiler has gotten prevented from any big optimization wins. So I really think a simpler model that tied to the hardware directly - which you need to do *anyway* since the memory ordering constraints really are hw-dependent - would have been preferable. The people who don't want to take advantage of hardware-specific guarantees would never know the difference (they'll rely purely on pairing acquire/release and fences properly - the way they *already* have to in the current C11 model), and the people who _do_ care about particular hw guarantees would get them by definition. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 22:14 ` Linus Torvalds @ 2014-02-19 14:40 ` Torvald Riegel 2014-02-19 19:49 ` Linus Torvalds 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-19 14:40 UTC (permalink / raw) To: Linus Torvalds Cc: Alec Teal, Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, 2014-02-18 at 14:14 -0800, Linus Torvalds wrote: > On Tue, Feb 18, 2014 at 1:21 PM, Torvald Riegel <triegel@redhat.com> wrote: > >> > >> So imagine that you have some clever global optimizer that sees that > >> the program never ever actually sets the dirty bit at all in any > >> thread, and then uses that kind of non-local knowledge to make > >> optimization decisions. THAT WOULD BE BAD. > >> > >> Do you see what I'm aiming for? > > > > Yes, I do. But that seems to be "volatile" territory. It crosses the > > boundaries of the abstract machine, and thus is input/output. Which > > fraction of your atomic accesses can read values produced by hardware? > > I would still suppose that lots of synchronization is not affected by > > this. > > The "hardware can change things" case is indeed pretty rare. > > But quite frankly, even when it isn't hardware, as far as the compiler > is concerned you have the exact same issue - you have TLB faults > happening on other CPU's that do the same thing asynchronously using > software TLB fault handlers. So *semantically*, it really doesn't make > any difference what-so-ever if it's a software TLB handler on another > CPU, a microcoded TLB fault, or an actual hardware path. I think there are a few semantic differences: * If a SW handler uses the C11 memory model, it will synchronize like any other thread. HW might do something else entirely, including synchronizing differently, not using atomic accesses, etc. (At least that's the constraints I had in mind). 
* If we can treat any interrupt handler like Just Another Thread, then the next question is whether the compiler will be aware that there is another thread. I think that in practice it will be: You'll set up the handler in some way by calling a function the compiler can't analyze, so the compiler will know that stuff accessible to the handler (e.g., global variables) will potentially be accessed by other threads. * Similarly, if the C code is called from some external thing, it also has to assume the presence of other threads. (Perhaps this is what the compiler has to assume in a freestanding implementation anyway...) However, accessibility will be different for, say, stack variables that haven't been shared with other functions yet; those are arguably not reachable by other things, at least not through mechanisms defined by the C standard. So optimizing these should be possible with the assumption that there is no other thread (at least as default -- I'm not saying that this is the only reasonable semantics). > So if the answer for all of the above is "use volatile", then I think > that means that the C11 atomics are badly designed. > > The whole *point* of atomic accesses is that stuff like above should > "JustWork(tm)" I think that it should in the majority of cases. If the other thing potentially accessing can do as much as a valid C11 thread can do, the synchronization itself will work just fine. In most cases except the (void*)0x123 example (or linker scripts etc.) the compiler is aware when data is made visible to other threads or other non-analyzable functions that may spawn other threads (or just by being a plain global variable accessible to other (potentially .S) translation units. > > Do you perhaps want a weaker form of volatile? 
That is, one that, for > > example, allows combining of two adjacent loads of the dirty bits, but > > will make sure that this is treated as if there is some imaginary > > external thread that it cannot analyze and that may write? > > Yes, that's basically what I would want. And it is what I would expect > an atomic to be. Right now we tend to use "ACCESS_ONCE()", which is a > bit of a misnomer, because technically we really generally want > "ACCESS_AT_MOST_ONCE()" (but "once" is what we get, because we use > volatile, and is a hell of a lot simpler to write ;^). > > So we obviously use "volatile" for this currently, and generally the > semantics we really want are: > > - the load or store is done as a single access ("atomic") > > - the compiler must not try to re-materialize the value by reloading > it from memory (this is the "at most once" part) In the presence of other threads performing operations unknown to the compiler, that's what you should get even if the compiler is trying to optimize C11 atomics. The first requirement is clear, and the "at most once" follows from another thread potentially writing to the variable. The only difference I can see right now is that a compiler may be able to *prove* that it doesn't matter whether it reloaded the value or not. But this seems very hard to prove for me, and likely to require whole-program analysis (which won't be possible because we don't know what other threads are doing). I would guess that this isn't a problem in practice. I just wanted to note it because it theoretically does have a different semantics than plain volatiles. > and quite frankly, "volatile" is a big hammer for this. In practice it > tends to work pretty well, though, because in _most_ cases, there > really is just the single access, so there isn't anything that it > could be combined with, and the biggest issue is often just the > correctness of not re-materializing the value. 
> > And I agree - memory ordering is a totally separate issue, and in fact
> > we largely tend to consider it entirely separate. For cases where we
> > have ordering constraints, we either handle those with special
> > accessors (ie "atomic-modify-and-test" helpers tend to have some
> > serialization guarantees built in), or we add explicit fencing.

Good.

> But semantically, C11 atomic accessors *should* generally have the
> correct behavior for our uses.
>
> If we have to add "volatile", that makes atomics basically useless. We
> already *have* the volatile semantics, if atomics need it, that just
> means that atomics have zero upside for us.

I agree, but I don't think it's necessary. atomics should have the right
semantics for you, provided the compiler is aware that there are other
unknown threads accessing the same data.

> >> But *local* optimizations are fine, as long as they follow the obvious
> >> rule of not actually making changes that are semantically visible.
> >
> > If we assume that there is this imaginary thread called hardware that
> > can write/read to/from such weak-volatile atomics, I believe this should
> > restrict optimizations sufficiently even in the model as specified in
> > the standard.
>
> Well, what about *real* threads that do this, but that aren't
> analyzable by the C compiler because they are written in another
> language entirely (inline asm, asm, perl, INTERCAL, microcode,
> PAL-code, whatever)?
>
> I really don't think that "hardware" is necessary for this to happen.
> What is done by hardware on x86, for example, is done by PAL-code
> (loaded at boot-time) on alpha, and done by hand-tuned assembler fault
> handlers on Sparc. The *effect* is the same: it's not visible to the
> compiler. There is no way in hell that the compiler can understand the
> hand-tuned Sparc TLB fault handler, even if it parsed it.

I agree. Let me rephrase it.
If all those other threads written in whichever way use the same memory model and ABI for synchronization (e.g., choice of HW barriers for a certain memory_order), it doesn't matter whether it's a hardware thread, microcode, whatever. In this case, C11 atomics should be fine. (We have this in userspace already, because correct compilers will have to assume that the code generated by them has to properly synchronize with other code generated by different compilers.) If the other threads use a different model, access memory entirely differently, etc, then we might be back to "volatile" because we don't know anything, and the very strict rules about execution steps of the abstract machine (ie, no as-if rule) are probably the safest thing to do. If you agree with this categorization, then I believe we just need to look at whether a compiler is naturally aware of a variable being shared with potentially other threads that follow C11 synchronization semantics but are written in other languages and generally not accessible: * Maybe that's the case anyway when compiling for freestanding optimizations. * In a lot of cases, the compiler will know, because data escapes to non-C / non-analyzable functions, or is global and accessible to other translation units. * Maybe we need some additional mechanism to mark those corner cases where it isn't known (e.g., because of (void*)0x123 fixed-address accesses, or other non-C-semantics issues). That should be a clearer mechanism than weak-volatile; maybe a shared_with_other_threads attribute. But my current gut feeling is that we wouldn't need that often, if ever. Sounds better? ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-19 14:40 ` Torvald Riegel
@ 2014-02-19 19:49 ` Linus Torvalds
0 siblings, 0 replies; 285+ messages in thread
From: Linus Torvalds @ 2014-02-19 19:49 UTC (permalink / raw)
To: Torvald Riegel
Cc: Alec Teal, Paul McKenney, Will Deacon, Peter Zijlstra,
Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Wed, Feb 19, 2014 at 6:40 AM, Torvald Riegel <triegel@redhat.com> wrote:
>
> If all those other threads written in whichever way use the same memory
> model and ABI for synchronization (e.g., choice of HW barriers for a
> certain memory_order), it doesn't matter whether it's a hardware thread,
> microcode, whatever. In this case, C11 atomics should be fine.
> (We have this in userspace already, because correct compilers will have
> to assume that the code generated by them has to properly synchronize
> with other code generated by different compilers.)
>
> If the other threads use a different model, access memory entirely
> differently, etc, then we might be back to "volatile" because we don't
> know anything, and the very strict rules about execution steps of the
> abstract machine (ie, no as-if rule) are probably the safest thing to
> do.

Oh, I don't even care about architectures that don't have real hardware
atomics. So if there's a software protocol for atomics, all bets are off.
The compiler almost certainly has to do atomics with function calls
anyway, and we'll just plug in our own.

And frankly, nobody will ever care, because those architectures aren't
relevant, and never will be. Sure, there are some ancient Sparc platforms
that only support a single-byte "ldstub", and there are some embedded
chips that don't really do SMP, but have some pseudo-smp with special
separate locking. Really, nobody cares.
The C standard has those crazy lock-free atomic tests, and talks about
address-free, but generally we require both lock-free and address-free in
the kernel, because otherwise it's just too painful to do interrupt-safe
locking, or do atomics in user-space (for futexes).

So if your worry is just about software protocols for CPU's that aren't
actually designed for modern SMP, that's pretty much a complete non-issue.

Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-17 20:18 ` Linus Torvalds
2014-02-17 21:21 ` Torvald Riegel
2014-02-17 23:10 ` Alec Teal
@ 2014-02-18  3:00 ` Paul E. McKenney
2014-02-18  3:24 ` Linus Torvalds
2014-02-18 15:56 ` Torvald Riegel
2 siblings, 2 replies; 285+ messages in thread
From: Paul E. McKenney @ 2014-02-18 3:00 UTC (permalink / raw)
To: Linus Torvalds
Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, Feb 17, 2014 at 12:18:21PM -0800, Linus Torvalds wrote:
> On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel <triegel@redhat.com> wrote:
> >
> > Which example do you have in mind here? Haven't we resolved all the
> > debated examples, or did I miss any?
>
> Well, Paul seems to still think that the standard possibly allows
> speculative writes or possibly value speculation in ways that break
> the hardware-guaranteed orderings.

It is not that I know of any specific problems, but rather that I know
I haven't looked under all the rocks. Plus my impression from my few
years on the committee is that the standard will be pushed to the limit
when it comes time to add optimizations.

One example that I learned about last week uses the branch-prediction
hardware to validate value speculation. And no, I am not at all a fan
of value speculation, in case you were curious. However, it is still an
educational example.

This is where you start:

	p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */
	do_something(p->a, p->b, p->c);
	p->d = 1;

Then you leverage branch-prediction hardware as follows:

	p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */
	if (p == GUESS) {
		do_something(GUESS->a, GUESS->b, GUESS->c);
		GUESS->d = 1;
	} else {
		do_something(p->a, p->b, p->c);
		p->d = 1;
	}

The CPU's branch-prediction hardware squashes speculation in the case
where the guess was wrong, and this prevents the speculative store to
->d from ever being visible.
However, the then-clause breaks dependencies, which means that the loads -could- be speculated, so that do_something() gets passed pre-initialization values. Now, I hope and expect that the wording in the standard about dependency ordering prohibits this sort of thing. But I do not yet know for certain. And yes, I am being paranoid. But not unnecessarily paranoid. ;-) Thanx, Paul > And personally, I can't read standards paperwork. It is invariably > written in some basically impossible-to-understand lawyeristic mode, > and then it is read by people (compiler writers) that intentionally > try to mis-use the words and do language-lawyering ("that depends on > what the meaning of 'is' is"). The whole "lvalue vs rvalue expression > vs 'what is a volatile access'" thing for C++ was/is a great example > of that. > > So quite frankly, as a result I refuse to have anything to do with the > process directly. > > Linus > ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 3:00 ` Paul E. McKenney @ 2014-02-18 3:24 ` Linus Torvalds 2014-02-18 3:42 ` Linus Torvalds 2014-02-18 5:01 ` Paul E. McKenney 2014-02-18 15:56 ` Torvald Riegel 1 sibling, 2 replies; 285+ messages in thread From: Linus Torvalds @ 2014-02-18 3:24 UTC (permalink / raw) To: Paul McKenney Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 17, 2014 at 7:00 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > > One example that I learned about last week uses the branch-prediction > hardware to validate value speculation. And no, I am not at all a fan > of value speculation, in case you were curious. Heh. See the example I used in my reply to Alec Teal. It basically broke the same dependency the same way. Yes, value speculation of reads is simply wrong, the same way speculative writes are simply wrong. The dependency chain matters, and is meaningful, and breaking it is actively bad. As far as I can tell, the intent is that you can't do value speculation (except perhaps for the "relaxed", which quite frankly sounds largely useless). But then I do get very very nervous when people talk about "proving" certain values. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
2014-02-18 3:24 ` Linus Torvalds
@ 2014-02-18  3:42 ` Linus Torvalds
2014-02-18  5:22 ` Paul E. McKenney
2014-02-18 16:17 ` Torvald Riegel
2014-02-18  5:01 ` Paul E. McKenney
1 sibling, 2 replies; 285+ messages in thread
From: Linus Torvalds @ 2014-02-18 3:42 UTC (permalink / raw)
To: Paul McKenney
Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, Feb 17, 2014 at 7:24 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> As far as I can tell, the intent is that you can't do value
> speculation (except perhaps for the "relaxed", which quite frankly
> sounds largely useless).

Hmm. The language I see for "consume" is not obvious:

  "Consume operation: no reads in the current thread dependent on the
  value currently loaded can be reordered before this load"

and it could make a compiler writer say that value speculation is still
valid, if you do it like this (with "ptr" being the atomic variable):

	value = ptr->val;

into

	tmp = ptr;
	value = speculated.value;
	if (unlikely(tmp != &speculated))
		value = tmp->value;

which is still bogus. The load of "ptr" does happen before the load of
"value = speculated.value" in the instruction stream, but it would still
result in the CPU possibly moving the value read before the pointer read
at least on ARM and power.

So if you're a compiler person, you think you followed the letter of the
spec - as far as *you* were concerned, no load dependent on the value of
the atomic load moved to before the atomic load. You go home, happy,
knowing you've done your job. Never mind that you generated code that
doesn't actually work.

I dread having to explain to the compiler person that he may be right in
some theoretical virtual machine, but the code is subtly broken and
nobody will ever understand why (and likely not be able to create a
test-case showing the breakage).
But maybe the full standard makes it clear that "reordered before this
load" actually means on the real hardware, not just in the generated
instruction stream. Reading it with an understanding of the *intent*, and
of all the different memory models, that requirement should be obvious
(on alpha, you need an "rmb" instruction after the load), but ...

Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 3:42 ` Linus Torvalds @ 2014-02-18 5:22 ` Paul E. McKenney 2014-02-18 16:17 ` Torvald Riegel 1 sibling, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-18 5:22 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 17, 2014 at 07:42:42PM -0800, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 7:24 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > As far as I can tell, the intent is that you can't do value > > speculation (except perhaps for the "relaxed", which quite frankly > > sounds largely useless). > > Hmm. The language I see for "consume" is not obvious: > > "Consume operation: no reads in the current thread dependent on the > value currently loaded can be reordered before this load" > > and it could make a compiler writer say that value speculation is > still valid, if you do it like this (with "ptr" being the atomic > variable): > > value = ptr->val; > > into > > tmp = ptr; > value = speculated.value; > if (unlikely(tmp != &speculated)) > value = tmp->value; > > which is still bogus. The load of "ptr" does happen before the load of > "value = speculated->value" in the instruction stream, but it would > still result in the CPU possibly moving the value read before the > pointer read at least on ARM and power. > > So if you're a compiler person, you think you followed the letter of > the spec - as far as *you* were concerned, no load dependent on the > value of the atomic load moved to before the atomic load. You go home, > happy, knowing you've done your job. Never mind that you generated > code that doesn't actually work. Agreed, that would be bad. But please see below. 
> I dread having to explain to the compiler person that he may be right
> in some theoretical virtual machine, but the code is subtly broken and
> nobody will ever understand why (and likely not be able to create a
> test-case showing the breakage).

If things go as they usually do, such explanations will be required a
time or two.

> But maybe the full standard makes it clear that "reordered before this
> load" actually means on the real hardware, not just in the generated
> instruction stream. Reading it with understanding of the *intent* and
> understanding all the different memory models that requirement should
> be obvious (on alpha, you need an "rmb" instruction after the load),
> but ...

The key point with memory_order_consume is that it must be paired with
some sort of store-release, a category that includes stores tagged with
memory_order_release (surprise!), memory_order_acq_rel, and
memory_order_seq_cst. This pairing is analogous to the memory-barrier
pairing in the Linux kernel.

So you have something like this for the rcu_assign_pointer() side:

	p = kmalloc(...);
	if (unlikely(!p))
		return -ENOMEM;
	p->a = 1;
	p->b = 2;
	p->c = 3;
	/* The following would be buried within rcu_assign_pointer(). */
	atomic_store_explicit(&gp, p, memory_order_release);

And something like this for the rcu_dereference() side:

	/* The following would be buried within rcu_dereference(). */
	q = atomic_load_explicit(&gp, memory_order_consume);
	do_something_with(q->a);

So, let's look at the C11 draft, section 5.1.2.4 "Multi-threaded
executions and data races".

5.1.2.4p14 says that the atomic_load_explicit() carries a dependency to
the argument of do_something_with().

5.1.2.4p15 says that the atomic_store_explicit() is dependency-ordered
before the atomic_load_explicit().

5.1.2.4p15 also says that the atomic_store_explicit() is
dependency-ordered before the argument of do_something_with().
This is because if A is dependency-ordered before X and X carries a
dependency to B, then A is dependency-ordered before B.

5.1.2.4p16 says that the atomic_store_explicit() inter-thread happens
before the argument of do_something_with().

The assignment to p->a is sequenced before the atomic_store_explicit().
Therefore, combining these last two, the assignment to p->a happens
before the argument of do_something_with(), and that means that
do_something_with() had better see the "1" assigned to p->a or some
later value.

But as far as I know, compiler writers currently take the approach of
treating memory_order_consume as if it were memory_order_acquire.
Which certainly works, as long as ARM and Power people don't mind an
extra memory barrier out of each rcu_dereference().  Which is one thing
that compiler writers are permitted to do according to the standard --
substitute a memory-barrier instruction for any given dependency...

							Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 3:42 ` Linus Torvalds 2014-02-18 5:22 ` Paul E. McKenney @ 2014-02-18 16:17 ` Torvald Riegel 2014-02-18 17:44 ` Linus Torvalds 1 sibling, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-18 16:17 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, 2014-02-17 at 19:42 -0800, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 7:24 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > As far as I can tell, the intent is that you can't do value > > speculation (except perhaps for the "relaxed", which quite frankly > > sounds largely useless). > > Hmm. The language I see for "consume" is not obvious: > > "Consume operation: no reads in the current thread dependent on the > value currently loaded can be reordered before this load" I can't remember seeing that language in the standard (ie, C or C++). Where is this from? > and it could make a compiler writer say that value speculation is > still valid, if you do it like this (with "ptr" being the atomic > variable): > > value = ptr->val; I assume the load from ptr has mo_consume ordering? > into > > tmp = ptr; > value = speculated.value; > if (unlikely(tmp != &speculated)) > value = tmp->value; > > which is still bogus. The load of "ptr" does happen before the load of > "value = speculated->value" in the instruction stream, but it would > still result in the CPU possibly moving the value read before the > pointer read at least on ARM and power. And surprise, in the C/C++ model the load from ptr is sequenced-before the load from speculated, but there's no ordering constraint on the reads-from relation for the value load if you use mo_consume on the ptr load. Thus, the transformed code has less ordering constraints than the original code, and we arrive at the same outcome. 
> So if you're a compiler person, you think you followed the letter of > the spec - as far as *you* were concerned, no load dependent on the > value of the atomic load moved to before the atomic load. No. Because the wobbly sentence you cited(?) above is not what the standard says. Would you please stop making claims about what compiler writers would do or not if you seemingly aren't even familiar with the model that compiler writers would use to reason about transformations? Seriously? > You go home, > happy, knowing you've done your job. Never mind that you generated > code that doesn't actually work. > > I dread having to explain to the compiler person that he may be right > in some theoretical virtual machine, but the code is subtly broken and > nobody will ever understand why (and likely not be able to create a > test-case showing the breakage). > > But maybe the full standard makes it clear that "reordered before this > load" actually means on the real hardware, not just in the generated > instruction stream. Do you think everyone else is stupid? If there's an ordering constraint in the virtual machine, it better be present when executing in the real machine unless it provably cannot result in different output as specified by the language's semantics. > Reading it with understanding of the *intent* and > understanding all the different memory models that requirement should > be obvious (on alpha, you need an "rmb" instruction after the load), > but ... The standard is clear on what's required. I strongly suggest reading the formalization of the memory model by Batty et al. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 16:17 ` Torvald Riegel @ 2014-02-18 17:44 ` Linus Torvalds 2014-02-18 19:40 ` Paul E. McKenney 2014-02-18 19:47 ` Torvald Riegel 0 siblings, 2 replies; 285+ messages in thread From: Linus Torvalds @ 2014-02-18 17:44 UTC (permalink / raw) To: Torvald Riegel Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Feb 18, 2014 at 8:17 AM, Torvald Riegel <triegel@redhat.com> wrote: >> >> "Consume operation: no reads in the current thread dependent on the >> value currently loaded can be reordered before this load" > > I can't remember seeing that language in the standard (ie, C or C++). > Where is this from? That's just for googling for explanations. I do have some old standard draft, but that doesn't have any concise definitions anywhere that I could find. >> and it could make a compiler writer say that value speculation is >> still valid, if you do it like this (with "ptr" being the atomic >> variable): >> >> value = ptr->val; > > I assume the load from ptr has mo_consume ordering? Yes. >> into >> >> tmp = ptr; >> value = speculated.value; >> if (unlikely(tmp != &speculated)) >> value = tmp->value; >> >> which is still bogus. The load of "ptr" does happen before the load of >> "value = speculated->value" in the instruction stream, but it would >> still result in the CPU possibly moving the value read before the >> pointer read at least on ARM and power. > > And surprise, in the C/C++ model the load from ptr is sequenced-before > the load from speculated, but there's no ordering constraint on the > reads-from relation for the value load if you use mo_consume on the ptr > load. Thus, the transformed code has less ordering constraints than the > original code, and we arrive at the same outcome. Ok, good. > The standard is clear on what's required. I strongly suggest reading > the formalization of the memory model by Batty et al. 
Can you point to it? Because I can find a draft standard, and it sure as hell does *not* contain any clarity of the model. It has a *lot* of verbiage, but it's pretty much impossible to actually understand, even for somebody who really understands memory ordering. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 17:44 ` Linus Torvalds @ 2014-02-18 19:40 ` Paul E. McKenney 2014-02-18 19:47 ` Torvald Riegel 1 sibling, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-18 19:40 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Feb 18, 2014 at 09:44:48AM -0800, Linus Torvalds wrote: > On Tue, Feb 18, 2014 at 8:17 AM, Torvald Riegel <triegel@redhat.com> wrote: [ . . . ] > > The standard is clear on what's required. I strongly suggest reading > > the formalization of the memory model by Batty et al. > > Can you point to it? Because I can find a draft standard, and it sure > as hell does *not* contain any clarity of the model. It has a *lot* of > verbiage, but it's pretty much impossible to actually understand, even > for somebody who really understands memory ordering. I suspect he is thinking of the following: "Mathematizing C++ Concurrency." Mark Batty, Scott Owens, Susmit Sarkar, Peter Sewell, and Tjark Weber. https://www.cl.cam.ac.uk/~pes20/cpp/popl085ap-sewell.pdf Even if you don't like the math, it contains some very good examples. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 17:44 ` Linus Torvalds 2014-02-18 19:40 ` Paul E. McKenney @ 2014-02-18 19:47 ` Torvald Riegel 2014-02-20 0:53 ` Linus Torvalds 1 sibling, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-18 19:47 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote: > On Tue, Feb 18, 2014 at 8:17 AM, Torvald Riegel <triegel@redhat.com> wrote: > > The standard is clear on what's required. I strongly suggest reading > > the formalization of the memory model by Batty et al. > > Can you point to it? Because I can find a draft standard, and it sure > as hell does *not* contain any clarity of the model. It has a *lot* of > verbiage, but it's pretty much impossible to actually understand, even > for somebody who really understands memory ordering. http://www.cl.cam.ac.uk/~mjb220/n3132.pdf This has an explanation of the model up front, and then the detailed formulae in Section 6. This is from 2010, and there might have been smaller changes since then, but I'm not aware of any bigger ones. The cppmem tool is based on this, so if you want to play around with a few code examples, it's pretty nice because it shows you all allowed executions, including graphs like in the paper (see elsewhere in this thread for CAS syntax): http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/ ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 19:47 ` Torvald Riegel @ 2014-02-20 0:53 ` Linus Torvalds 2014-02-20 4:01 ` Paul E. McKenney 2014-02-20 17:14 ` Torvald Riegel 0 siblings, 2 replies; 285+ messages in thread From: Linus Torvalds @ 2014-02-20 0:53 UTC (permalink / raw) To: Torvald Riegel Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel <triegel@redhat.com> wrote: > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote: >> >> Can you point to it? Because I can find a draft standard, and it sure >> as hell does *not* contain any clarity of the model. It has a *lot* of >> verbiage, but it's pretty much impossible to actually understand, even >> for somebody who really understands memory ordering. > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf > This has an explanation of the model up front, and then the detailed > formulae in Section 6. This is from 2010, and there might have been > smaller changes since then, but I'm not aware of any bigger ones. Ahh, this is different from what others pointed at. Same people, similar name, but not the same paper. I will read this version too, but from reading the other one and the standard in parallel and trying to make sense of it, it seems that I may have originally misunderstood part of the whole control dependency chain. The fact that the left side of "? :", "&&" and "||" breaks data dependencies made me originally think that the standard tried very hard to break any control dependencies. Which I felt was insane, when then some of the examples literally were about the testing of the value of an atomic read. The data dependency matters quite a bit. The fact that the other "Mathematical" paper then very much talked about consume only in the sense of following a pointer made me think so even more. 
But reading it some more, I now think that the whole "data dependency"
logic (which is where the special left-hand-side rule of the ternary
and logical operators comes in) is basically an exception to the rule
that sequence points end up being also meaningful for ordering (ok, so
C11 seems to have renamed "sequence points" to "sequenced before").

So while an expression like

    atomic_read(p, consume) ? a : b;

doesn't have a data dependency from the atomic read that forces
serialization, writing

    if (atomic_read(p, consume))
        a;
    else
        b;

the standard *does* imply that the atomic read is "happens-before" wrt
"a", and I'm hoping that there is no question that the control
dependency still acts as an ordering point.

THAT was one of my big confusions, the discussion about control
dependencies and the fact that the logical ops broke the data
dependency made me believe that the standard tried to actively avoid
the whole issue with "control dependencies can break ordering
dependencies on some CPU's due to branch prediction and memory
re-ordering by the CPU".

But after all the reading, I'm starting to think that that was never
actually the implication at all, and the "logical ops break the data
dependency" rule is simply an exception to the sequence point rule.
All other sequence points still do exist, and do imply an ordering
that matters for "consume".

Am I now reading it right?

So the clarification is basically to the statement that the
"if (consume(p)) a" version *would* have an ordering guarantee between
the read of "p" and "a", but the "consume(p) ? a : b" would *not* have
such an ordering guarantee.  Yes?

                Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
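The two shapes under discussion, written out as compilable C11 (hypothetical globals; the comments record the claims made in this part of the thread, which Paul's reply goes on to refine, not anything the compiler itself enforces):

```c
#include <stdatomic.h>

atomic_int p;
int a, b, out;

/* Ternary form: the consume load is absorbed into the condition, so no
 * data dependency is carried to the loads of a or b. */
void ternary_form(void)
{
    out = atomic_load_explicit(&p, memory_order_consume) ? a : b;
}

/* if form: the loads of a and b are in the same situation, but a STORE
 * placed inside a branch is control-dependent on the load, which on
 * hardware such as ARM and Power keeps the store from being committed
 * before the load resolves. */
void if_form(void)
{
    if (atomic_load_explicit(&p, memory_order_consume))
        a = 1;      /* store ordered after the load of p */
    else
        b = 1;
}
```

The distinction Paul draws later in the thread is exactly this one: control dependencies can order subsequent stores, but not subsequent loads.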
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 0:53 ` Linus Torvalds @ 2014-02-20 4:01 ` Paul E. McKenney 2014-02-20 4:43 ` Linus Torvalds 2014-02-20 17:26 ` Torvald Riegel 2014-02-20 17:14 ` Torvald Riegel 1 sibling, 2 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-20 4:01 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote: > On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel <triegel@redhat.com> wrote: > > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote: > >> > >> Can you point to it? Because I can find a draft standard, and it sure > >> as hell does *not* contain any clarity of the model. It has a *lot* of > >> verbiage, but it's pretty much impossible to actually understand, even > >> for somebody who really understands memory ordering. > > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf > > This has an explanation of the model up front, and then the detailed > > formulae in Section 6. This is from 2010, and there might have been > > smaller changes since then, but I'm not aware of any bigger ones. > > Ahh, this is different from what others pointed at. Same people, > similar name, but not the same paper. > > I will read this version too, but from reading the other one and the > standard in parallel and trying to make sense of it, it seems that I > may have originally misunderstood part of the whole control dependency > chain. > > The fact that the left side of "? :", "&&" and "||" breaks data > dependencies made me originally think that the standard tried very > hard to break any control dependencies. Which I felt was insane, when > then some of the examples literally were about the testing of the > value of an atomic read. The data dependency matters quite a bit. 
The > fact that the other "Mathematical" paper then very much talked about > consume only in the sense of following a pointer made me think so even > more. > > But reading it some more, I now think that the whole "data dependency" > logic (which is where the special left-hand side rule of the ternary > and logical operators come in) are basically an exception to the rule > that sequence points end up being also meaningful for ordering (ok, so > C11 seems to have renamed "sequence points" to "sequenced before"). > > So while an expression like > > atomic_read(p, consume) ? a : b; > > doesn't have a data dependency from the atomic read that forces > serialization, writing > > if (atomic_read(p, consume)) > a; > else > b; > > the standard *does* imply that the atomic read is "happens-before" wrt > "a", and I'm hoping that there is no question that the control > dependency still acts as an ordering point. The control dependency should order subsequent stores, at least assuming that "a" and "b" don't start off with identical stores that the compiler could pull out of the "if" and merge. The same might also be true for ?: for all I know. (But see below) That said, in this case, you could substitute relaxed for consume and get the same effect. The return value from atomic_read() gets absorbed into the "if" condition, so there is no dependency-ordered-before relationship, so nothing for consume to do. One caution... The happens-before relationship requires you to trace a full path between the two operations of interest. 
This is illustrated by the following example, with both x and y
initially zero:

	T1:	atomic_store_explicit(&x, 1, memory_order_relaxed);
		r1 = atomic_load_explicit(&y, memory_order_relaxed);

	T2:	atomic_store_explicit(&y, 1, memory_order_relaxed);
		r2 = atomic_load_explicit(&x, memory_order_relaxed);

There is a happens-before relationship between T1's store and load,
and another happens-before relationship between T2's store and load,
but there is no happens-before relationship from T1 to T2, and none in
the other direction, either.  And you don't get to assume any ordering
based on reasoning about these two disjoint happens-before
relationships.  So it is quite possible for r1==0&&r2==0 after both
threads complete.

Which should be no surprise:  This misordering can happen even on x86,
which would need a full smp_mb() to prevent it.

> THAT was one of my big confusions, the discussion about control
> dependencies and the fact that the logical ops broke the data
> dependency made me believe that the standard tried to actively avoid
> the whole issue with "control dependencies can break ordering
> dependencies on some CPU's due to branch prediction and memory
> re-ordering by the CPU".
>
> But after all the reading, I'm starting to think that that was never
> actually the implication at all, and the "logical ops breaks the data
> dependency rule" is simply an exception to the sequence point rule.
> All other sequence points still do exist, and do imply an ordering
> that matters for "consume"
>
> Am I now reading it right?

As long as there is an unbroken chain of -data- dependencies from the
consume to the later access in question, and as long as that chain
doesn't go through the excluded operations, yes.

> So the clarification is basically to the statement that the "if
> (consume(p)) a" version *would* have an ordering guarantee between the
> read of "p" and "a", but the "consume(p) ? a : b" would *not* have
> such an ordering guarantee. Yes?
Neither has a data-dependency guarantee, because there is no data dependency from the load to either "a" or "b". After all, the value loaded got absorbed into the "if" condition. However, according to discussions earlier in this thread, the "if" variant would have a control-dependency ordering guarantee for any stores in "a" and "b" (but not loads!). The ?: form might also have a control-dependency guarantee for any stores in "a" and "b" (again, not loads). Why my uncertainty? Well, the standard does not talk explicitly about control dependencies. They currently appear to be a side effect of other requirements in the standard, for example, the prohibition against doing stores to atomics if those stores wouldn't happen in an unoptimized naive compilation of the program. Even then, you have to take this in combination with ordering guarantees of all the hardware that Linux currently runs on to get to the control dependency. I would feel way better if the standard explicitly called out ordering based on control dependencies, but that is something for Torvald Riegel and me to hash out. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20  4:01 ` Paul E. McKenney
@ 2014-02-20  4:43   ` Linus Torvalds
  2014-02-20  8:30     ` Paul E. McKenney
  2014-02-20 17:49     ` Torvald Riegel
  2014-02-20 17:26   ` Torvald Riegel
  1 sibling, 2 replies; 285+ messages in thread
From: Linus Torvalds @ 2014-02-20 4:43 UTC (permalink / raw)
To: Paul McKenney
Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Wed, Feb 19, 2014 at 8:01 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> The control dependency should order subsequent stores, at least assuming
> that "a" and "b" don't start off with identical stores that the compiler
> could pull out of the "if" and merge. The same might also be true for ?:
> for all I know. (But see below)

Stores I don't worry about so much because

 (a) you can't sanely move stores up in a compiler anyway
 (b) no sane CPU moves stores up, since they aren't on the critical path

so a read->cmp->store is actually really hard to make anything sane
re-order. I'm sure it can be done, and I'm sure it's stupid as hell.

But that "it's hard to screw up" is *not* true for a load->cmp->load.

So let's make this really simple: if you have a consume->cmp->read, is
the ordering of the two reads guaranteed?

> As long as there is an unbroken chain of -data- dependencies from the
> consume to the later access in question, and as long as that chain
> doesn't go through the excluded operations, yes.

So let's make it *really* specific, and make it real code doing a real
operation, that is actually realistic and reasonable in a threaded
environment, and may even be in some critical code.
The issue is the read-side ordering guarantee for 'a' and 'b', for this case:

 - Initial state:

    a = b = 0;

 - Thread 1 ("consumer"):

    if (atomic_read(&a, consume))
        return b;
    /* not yet initialized */
    return -1;

 - Thread 2 ("initializer"):

    b = some_value_lets_say_42;
    /* We are now ready to party */
    atomic_write(&a, 1, release);

and quite frankly, if there is no ordering guarantee between the read
of "a" and the read of "b" in the consumer thread, then the C atomics
standard is broken.

Put another way: I claim that if "thread 1" ever sees a return value
other than -1 or 42, then the whole definition of atomics is broken.

Question 2: and what changes if the atomic_read() is turned into an
acquire, and why? Does it start working?

> Neither has a data-dependency guarantee, because there is no data
> dependency from the load to either "a" or "b". After all, the value
> loaded got absorbed into the "if" condition. However, according to
> discussions earlier in this thread, the "if" variant would have a
> control-dependency ordering guarantee for any stores in "a" and "b"
> (but not loads!).

So exactly what part of the standard allows the loads to be
re-ordered, and why? Quite frankly, I'd think that any sane person
will agree that the above code snippet is realistic, and that my
requirement that thread 1 sees either -1 or 42 is valid.

And if the C standards body has said that control dependencies break
the read ordering, then I really think that the C standards committee
has screwed up.

If the consumer of an atomic load isn't a pointer chasing operation,
then the consume should be defined to be the same as acquire. None of
this "conditionals break consumers". No, conditionals on the
dependency path should turn consumers into acquire, because otherwise
the "consume" load is dangerous as hell.
And if the definition of acquire doesn't include the control dependency either, then the C atomic memory model is just completely and utterly broken, since the above *trivial* and clearly useful example is broken. I really think the above example is pretty damn black-and-white. Either it works, or the standard isn't worth wiping your ass with. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 4:43 ` Linus Torvalds @ 2014-02-20 8:30 ` Paul E. McKenney 2014-02-20 9:20 ` Paul E. McKenney ` (2 more replies) 2014-02-20 17:49 ` Torvald Riegel 1 sibling, 3 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-20 8:30 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Wed, Feb 19, 2014 at 08:43:14PM -0800, Linus Torvalds wrote: > On Wed, Feb 19, 2014 at 8:01 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > > > The control dependency should order subsequent stores, at least assuming > > that "a" and "b" don't start off with identical stores that the compiler > > could pull out of the "if" and merge. The same might also be true for ?: > > for all I know. (But see below) > > Stores I don't worry about so much because > > (a) you can't sanely move stores up in a compiler anyway > (b) no sane CPU or moves stores up, since they aren't on the critical path > > so a read->cmp->store is actually really hard to make anything sane > re-order. I'm sure it can be done, and I'm sure it's stupid as hell. > > But that "it's hard to screw up" is *not* true for a load->cmp->load. > > So lets make this really simple: if you have a consume->cmp->read, is > the ordering of the two reads guaranteed? Not as far as I know. Also, as far as I know, there is no difference between consume and relaxed in the consume->cmp->read case. > > As long as there is an unbroken chain of -data- dependencies from the > > consume to the later access in question, and as long as that chain > > doesn't go through the excluded operations, yes. > > So let's make it *really* specific, and make it real code doing a real > operation, that is actually realistic and reasonable in a threaded > environment, and may even be in some critical code. 
> The issue is the read-side ordering guarantee for 'a' and 'b', for this case:
>
>  - Initial state:
>
>     a = b = 0;
>
>  - Thread 1 ("consumer"):
>
>     if (atomic_read(&a, consume))
>         return b;
>     /* not yet initialized */
>     return -1;
>
>  - Thread 2 ("initializer"):
>
>     b = some_value_lets_say_42;
>     /* We are now ready to party */
>     atomic_write(&a, 1, release);
>
> and quite frankly, if there is no ordering guarantee between the read
> of "a" and the read of "b" in the consumer thread, then the C atomics
> standard is broken.
>
> Put another way: I claim that if "thread 1" ever sees a return value
> other than -1 or 42, then the whole definition of atomics is broken.

The above example can have a return value of 0 if translated
straightforwardly into either ARM or Power, right?  Both of these can
speculate a read into a conditional, and both can translate a consume
load into a plain load if data dependencies remain unbroken.

So, if you make one of two changes to your example, then I will agree
with you.  The first change is to have a real data dependency between
the read of "a" and the second read:

 - Initial state:

    a = &c, b = 0; c = 0;

 - Thread 1 ("consumer"):

    if (atomic_read(&a, consume))
        return *a;
    /* not yet initialized */
    return -1;

 - Thread 2 ("initializer"):

    b = some_value_lets_say_42;
    /* We are now ready to party */
    atomic_write(&a, &b, release);

The second change is to make the "consume" be an acquire:

 - Initial state:

    a = b = 0;

 - Thread 1 ("consumer"):

    if (atomic_read(&a, acquire))
        return b;
    /* not yet initialized */
    return -1;

 - Thread 2 ("initializer"):

    b = some_value_lets_say_42;
    /* We are now ready to party */
    atomic_write(&a, 1, release);

In theory, you could also change the "return" to a store, but the
example gets a bit complicated and as far as I can tell you get into
the state where the standard does not explicitly support it, even
though I have a hard time imagining an actual implementation that
fails to support it.
> Question 2: and what changes if the atomic_read() is turned into an > acquire, and why? Does it start working? Yep, that is the second change above. > > Neither has a data-dependency guarantee, because there is no data > > dependency from the load to either "a" or "b". After all, the value > > loaded got absorbed into the "if" condition. However, according to > > discussions earlier in this thread, the "if" variant would have a > > control-dependency ordering guarantee for any stores in "a" and "b" > > (but not loads!). > > So exactly what part of the standard allows the loads to be > re-ordered, and why? Quite frankly, I'd think that any sane person > will agree that the above code snippet is realistic, and that my > requirement that thread 1 sees either -1 or 42 is valid. Unless I am really confused, weakly ordered systems would be just fine returning zero in your original example. In the case of ARM, I believe you need either a data dependency, an ISB after the branch, or a DMB instruction to force the ordering you want. In the case of PowerPC, I believe that you need either a data dependency, an isync after the branch, an lwsync, or a sync. I would not expect a C compiler to generate code with any of these precautions in place. > And if the C standards body has said that control dependencies break > the read ordering, then I really think that the C standards committee > has screwed up. The control dependencies don't break the read ordering, but rather, they are insufficient to preserve the read ordering. > If the consumer of an atomic load isn't a pointer chasing operation, > then the consume should be defined to be the same as acquire. None of > this "conditionals break consumers". No, conditionals on the > dependency path should turn consumers into acquire, because otherwise > the "consume" load is dangerous as hell. Well, all the compilers currently convert consume to acquire, so you have your wish there. 
Of course, that also means that they generate actual unneeded memory-barrier instructions, which seems extremely sub-optimal to me. > And if the definition of acquire doesn't include the control > dependency either, then the C atomic memory model is just completely > and utterly broken, since the above *trivial* and clearly useful > example is broken. The definition of acquire makes the ordering happen whether or not there is a control dependency. > I really think the above example is pretty damn black-and-white. > Either it works, or the standard isn't worth wiping your ass with. Seems to me that your gripe is with ARM's and PowerPC's weak memory ordering rather than the standard. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20  8:30 ` Paul E. McKenney
@ 2014-02-20  9:20   ` Paul E. McKenney
  2014-02-20 17:01   ` Linus Torvalds
  2014-02-20 17:54   ` Torvald Riegel
  2 siblings, 0 replies; 285+ messages in thread
From: Paul E. McKenney @ 2014-02-20 9:20 UTC (permalink / raw)
To: Linus Torvalds
Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Thu, Feb 20, 2014 at 12:30:32AM -0800, Paul E. McKenney wrote:
> On Wed, Feb 19, 2014 at 08:43:14PM -0800, Linus Torvalds wrote:

[ . . . ]

> So, if you make one of two changes to your example, then I will agree
> with you.  The first change is to have a real data dependency between
> the read of "a" and the second read:
>
>  - Initial state:
>
>     a = &c, b = 0; c = 0;
>
>  - Thread 1 ("consumer"):
>
>     if (atomic_read(&a, consume))

And the above should be "if (atomic_read(&a, consume) != &c)".  Sigh!!!

							Thanx, Paul

>         return *a;
>     /* not yet initialized */
>     return -1;
>
>  - Thread 2 ("initializer"):
>
>     b = some_value_lets_say_42;
>     /* We are now ready to party */
>     atomic_write(&a, &b, release);

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 8:30 ` Paul E. McKenney 2014-02-20 9:20 ` Paul E. McKenney @ 2014-02-20 17:01 ` Linus Torvalds 2014-02-20 18:11 ` Paul E. McKenney ` (2 more replies) 2014-02-20 17:54 ` Torvald Riegel 2 siblings, 3 replies; 285+ messages in thread From: Linus Torvalds @ 2014-02-20 17:01 UTC (permalink / raw) To: Paul McKenney Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, Feb 20, 2014 at 12:30 AM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: >> >> So lets make this really simple: if you have a consume->cmp->read, is >> the ordering of the two reads guaranteed? > > Not as far as I know. Also, as far as I know, there is no difference > between consume and relaxed in the consume->cmp->read case. Ok, quite frankly, I think that means that "consume" is misdesigned. > The above example can have a return value of 0 if translated > straightforwardly into either ARM or Power, right? Correct. And I think that is too subtle. It's dangerous, it makes code that *looks* correct work incorrectly, and it actually happens to work on x86 since x86 doesn't have crap-for-brains memory ordering semantics. > So, if you make one of two changes to your example, then I will agree > with you. No. We're not playing games here. I'm fed up with complex examples that make no sense. Nobody sane writes code that does that pointer comparison, and it is entirely immaterial what the compiler can do behind our backs. The C standard semantics need to make sense to the *user* (ie programmer), not to a CPU and not to a compiler. The CPU and compiler are "tools". They don't matter. Their only job is to make the code *work*, dammit. So no idiotic made-up examples that involve code that nobody will ever write and that have subtle issues. 
So the starting point is that (same example as before, but with even
clearer naming):

  Initialization state:
    initialized = 0;
    value = 0;

  Consumer:

    return atomic_read(&initialized, consume) ? value : -1;

  Writer:
    value = 42;
    atomic_write(&initialized, 1, release);

and because the C memory ordering standard is written in such a way
that this is subtly buggy (and can return 0, which is *not* logically
a valid value), then I think the C memory ordering standard is broken.

That "consume" memory ordering is dangerous as hell, and it is
dangerous FOR NO GOOD REASON.

The trivial "fix" to the standard would be to get rid of all the
"carries a dependency" crap, and just say that *anything* that depends
on it is ordered wrt it.

That just means that on alpha, "consume" implies an unconditional read
barrier (well, unless the value is never used and is loaded just
because it is also volatile), on x86, "consume" is the same as
"acquire" which is just a plain load with ordering guarantees, and on
ARM or power you can still avoid the extra synchronization *if* the
value is used just for computation and for following pointers, but if
the value is used for a comparison, there needs to be a
synchronization barrier.

Notice? Getting rid of the stupid "carries-dependency" crap from the
standard actually
 (a) simplifies the standard
 (b) means that the above obvious example *works*
 (c) does not in *any* way make for any less efficient code generation
for the cases that "consume" works correctly for in the current
mis-designed standard.
 (d) is actually a hell of a lot easier to explain to a compiler
writer, and I can guarantee that it is simpler to implement too.

Why do I claim (d) "it is simpler to implement" - because on ARM/power
you can implement it *exactly* as a special "acquire", with just a
trivial peep-hole special case that follows the use chain of the
acquire op to the consume, and then just drop the acquire bit if the
only use is that compute-to-load chain.
In fact, realistically, the *only* thing you need to actually care about for the intended use case of "consume" is the question "is the consuming load immediately consumed as an address (with offset) of a memory operation. So you don't even need to follow any complicated computation chain in a compiler - the only case that matters for the barrier removal optimization is the "oh, I can see that it is only used as an address to a dereference". Seriously. The current standard is broken. It's broken because it mis-compiles code that on the face of it looks logical and works, it's broken because it's overly complex, and it's broken because the complexity doesn't even *buy* you anything. All this complexity for no reason. When much simpler wording and implementation actually WORKS BETTER. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 17:01 ` Linus Torvalds @ 2014-02-20 18:11 ` Paul E. McKenney 2014-02-20 18:32 ` Linus Torvalds 2014-02-20 18:44 ` Torvald Riegel 2014-02-20 18:23 ` Torvald Riegel [not found] ` <CAHWkzRQZ8+gOGMFNyTKjFNzpUv6d_J1G9KL0x_iCa=YCgvEojQ@mail.gmail.com> 2 siblings, 2 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-20 18:11 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, Feb 20, 2014 at 09:01:06AM -0800, Linus Torvalds wrote: > On Thu, Feb 20, 2014 at 12:30 AM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > >> > >> So lets make this really simple: if you have a consume->cmp->read, is > >> the ordering of the two reads guaranteed? > > > > Not as far as I know. Also, as far as I know, there is no difference > > between consume and relaxed in the consume->cmp->read case. > > Ok, quite frankly, I think that means that "consume" is misdesigned. > > > The above example can have a return value of 0 if translated > > straightforwardly into either ARM or Power, right? > > Correct. And I think that is too subtle. It's dangerous, it makes code > that *looks* correct work incorrectly, and it actually happens to work > on x86 since x86 doesn't have crap-for-brains memory ordering > semantics. > > > So, if you make one of two changes to your example, then I will agree > > with you. > > No. We're not playing games here. I'm fed up with complex examples > that make no sense. Hey, your original example didn't do what you wanted given the current standard. Those two modified examples are no more complex than your original, and are the closest approximations that I can come up with right off-hand that provide the result you wanted. > Nobody sane writes code that does that pointer comparison, and it is > entirely immaterial what the compiler can do behind our backs. 
> The C
> standard semantics need to make sense to the *user* (ie programmer),
> not to a CPU and not to a compiler. The CPU and compiler are "tools".
> They don't matter. Their only job is to make the code *work*, dammit.
>
> So no idiotic made-up examples that involve code that nobody will ever
> write and that have subtle issues.

There are places in the Linux kernel that do both pointer comparisons
and dereferences.  Something like the following:

	p = rcu_dereference(gp);
	if (!p)
		p = &default_structure;
	do_something_with(p->a, p->b);

In the typical case where default_structure was initialized early on,
no ordering is needed in the !p case.

> So the starting point is that (same example as before, but with even
> clearer naming):
>
>   Initialization state:
>     initialized = 0;
>     value = 0;
>
>   Consumer:
>
>     return atomic_read(&initialized, consume) ? value : -1;
>
>   Writer:
>     value = 42;
>     atomic_write(&initialized, 1, release);
>
> and because the C memory ordering standard is written in such a way
> that this is subtly buggy (and can return 0, which is *not* logically
> a valid value), then I think the C memory ordering standard is broken.

You really need that "consume" to be "acquire".

> That "consume" memory ordering is dangerous as hell, and it is
> dangerous FOR NO GOOD REASON.
>
> The trivial "fix" to the standard would be to get rid of all the
> "carries a dependency" crap, and just say that *anything* that depends
> on it is ordered wrt it.
>
> That just means that on alpha, "consume" implies an unconditional read
> barrier (well, unless the value is never used and is loaded just
> because it is also volatile), on x86, "consume" is the same as
> "acquire" which is just a plain load with ordering guarantees, and on
> ARM or power you can still avoid the extra synchronization *if* the
> value is used just for computation and for following pointers, but if
> the value is used for a comparison, there needs to be a
> synchronization barrier.
Except in the default-value case, where no barrier is required. People really do comparisons on the pointers that they get back from rcu_dereference(), either to check for NULL (as above) or to check for a specific pointer. Ordering is not required in those cases. The cases that do require ordering from memory_order_consume values being used in an "if" statement should instead be memory_order_acquire. > Notice? Getting rid of the stupid "carries-dependency" crap from the > standard actually > (a) simplifies the standard > (b) means that the above obvious example *works* > (c) does not in *any* way make for any less efficient code generation > for the cases that "consume" works correctly for in the current > mis-designed standard. > (d) is actually a hell of a lot easier to explain to a compiler > writer, and I can guarantee that it is simpler to implement too. > > Why do I claim (d) "it is simpler to implement" - because on ARM/power > you can implement it *exactly* as a special "acquire", with just a > trivial peep-hole special case that follows the use chain of the > acquire op to the consume, and then just drop the acquire bit if the > only use is that compute-to-load chain. I don't believe that it is quite that simple. But yes, the compiler guys would be extremely happy to simply drop memory_order_consume from the standard, as it is the memory order that they most love to hate. Getting them to agree to any sort of peep-hole optimization semantics for memory_order_consume is likely problematic. > In fact, realistically, the *only* thing you need to actually care > about for the intended use case of "consume" is the question "is the > consuming load immediately consumed as an address (with offset) of a > memory operation. So you don't even need to follow any complicated > computation chain in a compiler - the only case that matters for the > barrier removal optimization is the "oh, I can see that it is only > used as an address to a dereference". > > Seriously. 
The current standard is broken. It's broken because it > mis-compiles code that on the face of it looks logical and works, it's > broken because it's overly complex, and it's broken because the > complexity doesn't even *buy* you anything. All this complexity for no > reason. When much simpler wording and implementation actually WORKS > BETTER. I disagree. The use cases you claim are memory_order_consume breakage are really bugs. You should use memory_order_acquire in those cases. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 18:11 ` Paul E. McKenney
@ 2014-02-20 18:32   ` Linus Torvalds
  2014-02-20 18:53     ` Torvald Riegel
  2014-02-20 18:56     ` Paul E. McKenney
  2014-02-20 18:44   ` Torvald Riegel
  1 sibling, 2 replies; 285+ messages in thread
From: Linus Torvalds @ 2014-02-20 18:32 UTC (permalink / raw)
To: Paul McKenney
Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
    David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Thu, Feb 20, 2014 at 10:11 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> You really need that "consume" to be "acquire".

So I think we now all agree that that is what the standard is saying.

And I'm saying that that is wrong, that the standard is badly written,
and should be fixed.

Because before the standard is fixed, I claim that "consume" is
unusable. We cannot trust it. End of story.

The fact that apparently gcc is currently buggy because it got the
dependency calculations *wrong* just reinforces my point.

The gcc bug Torvald pointed at is exactly because the current C
standard is illogical unreadable CRAP. I can guarantee that what
happened is:

 - the compiler saw that the result of the read was used as the left
hand expression of the ternary "? :" operator

 - as a result, the compiler decided that there's no dependency

 - the compiler didn't think about the dependency that comes from the
result of the load *also* being used as the middle part of the ternary
expression, because it had optimized it away, despite the standard not
talking about that at all.

 - so the compiler never saw the dependency that the standard talks about

BECAUSE THE STANDARD LANGUAGE IS PURE AND UTTER SHIT.

My suggested language never had any of these problems, because *my*
suggested semantics are clear, logical, and don't have these kinds of
idiotic pit-falls.

Solution: Fix the f*cking C standard. No excuses, no explanations.
Just get it fixed.

            Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 18:32 ` Linus Torvalds @ 2014-02-20 18:53 ` Torvald Riegel 2014-02-20 19:09 ` Linus Torvalds 2014-02-20 18:56 ` Paul E. McKenney 1 sibling, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-20 18:53 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, 2014-02-20 at 10:32 -0800, Linus Torvalds wrote: > On Thu, Feb 20, 2014 at 10:11 AM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > > > You really need that "consume" to be "acquire". > > So I think we now all agree that that is what the standard is saying. Huh? The standard says that there are two separate things (among many more): mo_acquire and mo_consume. They both influence happens-before in different (and independent!) ways. What Paul is saying is that *you* should have used *acquire* in that example. > And I'm saying that that is wrong, that the standard is badly written, > and should be fixed. > > Because before the standard is fixed, I claim that "consume" is > unusable. We cannot trust it. End of story. Then we still have all the rest. Let's just ignore mo_consume for now, and look at mo_acquire, I suggest. > The fact that apparently gcc is currently buggy because it got the > dependency calculations *wrong* just reinforces my point. Well, I'm pretty sure nobody actually worked on trying to preserve the dependencies at all. IOW, I suspect this fell through the cracks. We can ask the person working on this if you really want to know. > The gcc bug Torvald pointed at is exactly because the current C > standard is illogical unreadable CRAP. It's obviously logically consistent to the extent that it can be represented by a formal specification such as the one by the Cambridge group. Makes sense, or not? 
> I can guarantee that what
> happened is:
>
>  - the compiler saw that the result of the read was used as the left
> hand expression of the ternary "? :" operator
>
>  - as a result, the compiler decided that there's no dependency
>
>  - the compiler didn't think about the dependency that comes from the
> result of the load *also* being used as the middle part of the ternary
> expression, because it had optimized it away, despite the standard not
> talking about that at all.
>
>  - so the compiler never saw the dependency that the standard talks about
>
> BECAUSE THE STANDARD LANGUAGE IS PURE AND UTTER SHIT.

Please, be specific.  Right now you're saying that all of it is
useless.  Which is arguably not true.

> My suggested language never had any of these problems, because *my*
> suggested semantics are clear, logical, and don't have these kinds of
> idiotic pit-falls.

Have you looked at and understood the semantics of the memory model
(e.g. in the formalized form) with mo_consume and related being
ignored (ie, just ignore 6.13 and 6.14 in n3132)?

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 18:53 ` Torvald Riegel @ 2014-02-20 19:09 ` Linus Torvalds 2014-02-22 18:53 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-20 19:09 UTC (permalink / raw) To: Torvald Riegel Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, Feb 20, 2014 at 10:53 AM, Torvald Riegel <triegel@redhat.com> wrote: > On Thu, 2014-02-20 at 10:32 -0800, Linus Torvalds wrote: >> On Thu, Feb 20, 2014 at 10:11 AM, Paul E. McKenney >> <paulmck@linux.vnet.ibm.com> wrote: >> > >> > You really need that "consume" to be "acquire". >> >> So I think we now all agree that that is what the standard is saying. > > Huh? > > The standard says that there are two separate things (among many more): > mo_acquire and mo_consume. They both influence happens-before in > different (and independent!) ways. > > What Paul is saying is that *you* should have used *acquire* in that > example. I understand. And I disagree. I think the standard is wrong, and what I *should* be doing is point out the fact very loudly, and just tell people to NEVER EVER use "consume" as long as it's not reliable and has insane semantics. So what I "should do" is to not accept any C11 atomics use in the kernel. Because with the "acquire", it generates worse code than what we already have, and with the "consume" it's shit. See? Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 19:09 ` Linus Torvalds @ 2014-02-22 18:53 ` Torvald Riegel 2014-02-22 21:53 ` Linus Torvalds 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-22 18:53 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, 2014-02-20 at 11:09 -0800, Linus Torvalds wrote: > On Thu, Feb 20, 2014 at 10:53 AM, Torvald Riegel <triegel@redhat.com> wrote: > > On Thu, 2014-02-20 at 10:32 -0800, Linus Torvalds wrote: > >> On Thu, Feb 20, 2014 at 10:11 AM, Paul E. McKenney > >> <paulmck@linux.vnet.ibm.com> wrote: > >> > > >> > You really need that "consume" to be "acquire". > >> > >> So I think we now all agree that that is what the standard is saying. > > > > Huh? > > > > The standard says that there are two separate things (among many more): > > mo_acquire and mo_consume. They both influence happens-before in > > different (and independent!) ways. > > > > What Paul is saying is that *you* should have used *acquire* in that > > example. > > I understand. > > And I disagree. I think the standard is wrong, and what I *should* be > doing is point out the fact very loudly, and just tell people to NEVER > EVER use "consume" as long as it's not reliable and has insane > semantics. Stating that (1) "the standard is wrong" and (2) that you think that mo_consume semantics are not good is two different things. Making bold statements without a proper context isn't helpful in making this discussion constructive. It's simply not efficient if I (or anybody else reading this) has to wonder whether you actually mean what you said (even if, when reading it literally, is arguably not consistent with the arguments brought up in the discussion) or whether those statements just have to be interpreted in some other way. > So what I "should do" is to not accept any C11 atomics use in the > kernel. 
You're obviously free to do that. > Because with the "acquire", it generates worse code than what > we already have, I would argue that this is still under debate. At least I haven't seen a definition of what you want that is complete and based on the standard (e.g., an example of what a compiler might do in a specific case isn't a definition). From what I've seen, it's not inconceivable that what you want is just an optimized acquire. I'll bring this question up again elsewhere in the thread (where it hopefully fits better). ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-22 18:53 ` Torvald Riegel @ 2014-02-22 21:53 ` Linus Torvalds 2014-02-23 0:39 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-22 21:53 UTC (permalink / raw) To: Torvald Riegel Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sat, Feb 22, 2014 at 10:53 AM, Torvald Riegel <triegel@redhat.com> wrote: > > Stating that (1) "the standard is wrong" and (2) that you think that > mo_consume semantics are not good is two different things. I do agree. They are two independent things. I think the standard is wrong, because it's overly complex, hard to understand, and nigh unimplementable. As shown by the bugzilla example, "carries a dependency" encompasses things that are *not* just synchronizing things just through a pointer, and as a result it's actually very complicated, since they could have been optimized away, or done in non-local code that wasn't even aware of the dependency carrying. That said, I'm reconsidering my suggested stricter semantics, because for RCU we actually do want to test the resulting pointer against NULL _without_ any implied serialization. So I still feel that the standard as written is fragile and confusing (and the bugzilla entry pretty much proves that it is also practically unimplementable as written), but strengthening the serialization may be the wrong thing. Within the kernel, the RCU use for this is literally purely about loading a pointer, and doing either: - testing its value against NULL (without any implied synchronization at all) - using it as a pointer to an object, and expecting that any accesses to that object are ordered wrt the consuming load. So I actually have a suggested *very* different model that people might find more acceptable. 
How about saying that the result of a "atomic_read(&a, mo_consume)" is
required to be a _restricted_ pointer type, and that the consume
ordering guarantees the ordering between that atomic read and the
accesses to the object that the pointer points to.

No "carries a dependency", no nothing.

Now, there's two things to note in there:

 - the "restricted pointer" part means that the compiler does not need
to worry about serialization to that object through other possible
pointers - we have basically promised that the *only* pointer to that
object comes from the mo_consume. So that part makes it clear that the
"consume" ordering really only is valid wrt that particular pointer
load.

 - the "to the object that the pointer points to" makes it clear that
you can't use the pointer to generate arbitrary other values and claim
to serialize that way.

IOW, with those alternate semantics, that gcc bugzilla example is
utterly bogus, and a compiler can ignore it, because while it tries to
synchronize through the "dependency chain" created with that "p-i+i"
expression, that is completely irrelevant when you use the above rules
instead.

In the bugzilla example, the object that "*(p-i+i)" accesses isn't
actually the object pointed to by the pointer, so no serialization is
implied. And if it actually *were* to be the same object, because "p"
happens to have the same value as "i", then the "restrict" part of the
rule pops up and the compiler can again say that there is no ordering
guarantee, since the programmer lied to it and used a restricted
pointer that aliased with another one.

So the above suggestion basically tightens the semantics of "consume"
in a totally different way - it doesn't make it serialize more, in
fact it weakens the serialization guarantees a lot, but it weakens
them in a way that makes the semantics a lot simpler and clearer.

            Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-22 21:53 ` Linus Torvalds @ 2014-02-23 0:39 ` Paul E. McKenney 2014-02-23 3:50 ` Linus Torvalds 0 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-23 0:39 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sat, Feb 22, 2014 at 01:53:30PM -0800, Linus Torvalds wrote: > On Sat, Feb 22, 2014 at 10:53 AM, Torvald Riegel <triegel@redhat.com> wrote: > > > > Stating that (1) "the standard is wrong" and (2) that you think that > > mo_consume semantics are not good is two different things. > > I do agree. They are two independent things. > > I think the standard is wrong, because it's overly complex, hard to > understand, and nigh unimplementable. As shown by the bugzilla > example, "carries a dependency" encompasses things that are *not* just > synchronizing things just through a pointer, and as a result it's > actually very complicated, since they could have been optimized away, > or done in non-local code that wasn't even aware of the dependency > carrying. > > That said, I'm reconsidering my suggested stricter semantics, because > for RCU we actually do want to test the resulting pointer against NULL > _without_ any implied serialization. > > So I still feel that the standard as written is fragile and confusing > (and the bugzilla entry pretty much proves that it is also practically > unimplementable as written), but strengthening the serialization may > be the wrong thing. > > Within the kernel, the RCU use for this is literally purely about > loading a pointer, and doing either: > > - testing its value against NULL (without any implied synchronization at all) > > - using it as a pointer to an object, and expecting that any accesses > to that object are ordered wrt the consuming load. 
Agreed, by far the most frequent use is "->" to dereference and assignment
to store into a local variable.  The other operations where the kernel
expects ordering to be maintained are:

o	Bitwise "&" to strip off low-order bits.  The FIB tree does
	this, for example in fib_table_lookup() in net/ipv4/fib_trie.c.
	The low-order bit is used to distinguish internal nodes from
	leaves -- nodes and leaves are different types of structures.
	(There are a few others.)

o	Uses "?:" to substitute defaults in case of NULL pointers,
	but ordering must be maintained in the non-default case.
	Most, perhaps all, of these could be converted to "if" should
	"?:" prove problematic.

o	Addition and subtraction to adjust both pointers to and indexes
	into RCU-protected arrays.  There are not that many indexes,
	and they could be converted to pointers, but the addition and
	subtraction looks necessary in some cases.

o	Array indexing.  The value from rcu_dereference() is used both
	before and inside the "[]", interestingly enough.

o	Casts along with unary "&" and "*".

That said, I did not see any code that depended on ordering through
the function-call "()", boolean complement "!", comparison (only "=="
and "!="), logical operators ("&&" and "||"), and the "*", "/", and
"%" arithmetic operators.

> So I actually have a suggested *very* different model that people
> might find more acceptable.
>
> How about saying that the result of a "atomic_read(&a, mo_consume)" is
> required to be a _restricted_ pointer type, and that the consume
> ordering guarantees the ordering between that atomic read and the
> accesses to the object that the pointer points to.
>
> No "carries a dependency", no nothing.

In the case of arrays, the object that the pointer points to is
considered to be the full array, right?
> Now, there's two things to note in there:
>
>  - the "restricted pointer" part means that the compiler does not need
> to worry about serialization to that object through other possible
> pointers - we have basically promised that the *only* pointer to that
> object comes from the mo_consume. So that part makes it clear that the
> "consume" ordering really only is valid wrt that particular pointer
> load.

That could work, though there are some cases where a multi-linked
structure is made visible using a single rcu_assign_pointer(), and
rcu_dereference() is used only for the pointer leading to that
multi-linked structure, not for the pointers among the elements
making up that structure.  One way to handle this would be to
require rcu_dereference() to be used within the structure as well
as upon first traversal to the structure.

>  - the "to the object that the pointer points to" makes it clear that
> you can't use the pointer to generate arbitrary other values and claim
> to serialize that way.
>
> IOW, with those alternate semantics, that gcc bugzilla example is
> utterly bogus, and a compiler can ignore it, because while it tries to
> synchronize through the "dependency chain" created with that "p-i+i"
> expression, that is completely irrelevant when you use the above rules
> instead.
>
> In the bugzilla example, the object that "*(p-i+i)" accesses isn't
> actually the object pointed to by the pointer, so no serialization is
> implied. And if it actually *were* to be the same object, because "p"
> happens to have the same value as "i", then the "restrict" part of the
> rule pops up and the compiler can again say that there is no ordering
> guarantee, since the programmer lied to it and used a restricted
> pointer that aliased with another one.
>
> So the above suggestion basically tightens the semantics of "consume"
> in a totally different way - it doesn't make it serialize more, in
> fact it weakens the serialization guarantees a lot, but it weakens
> them in a way that makes the semantics a lot simpler and clearer.

It does look simpler and does look like it handles a large fraction of
the Linux-kernel uses.  But now it is time for some bullshit syntax for
the RCU-protected arrays in the Linux kernel:

	p = atomic_load_explicit(gp, memory_order_consume);

	r1 = *p;	/* Ordering maintained. */
	r2 = p[5];	/* Ordering maintained? */
	r3 = p + 5;	/* Ordering maintained? */
	n = get_an_index();
	r4 = p[n];	/* Ordering maintained? */

If the answer to the three questions is "no", then perhaps some special
function takes care of accesses to RCU-protected arrays.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-23 0:39 ` Paul E. McKenney @ 2014-02-23 3:50 ` Linus Torvalds 2014-02-23 6:34 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-23 3:50 UTC (permalink / raw) To: Paul McKenney Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sat, Feb 22, 2014 at 4:39 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > > Agreed, by far the most frequent use is "->" to dereference and assignment > to store into a local variable. The other operations where the kernel > expects ordering to be maintained are: > > o Bitwise "&" to strip off low-order bits. The FIB tree does > this, for example in fib_table_lookup() in net/ipv4/fib_trie.c. > The low-order bit is used to distinguish internal nodes from > leaves -- nodes and leaves are different types of structures. > (There are a few others.) Note that this is very much outside the scope of the C standard, regardless of 'consume' or not. We'll always do things outside the standard, so I wouldn't worry. > o Uses "?:" to substitute defaults in case of NULL pointers, > but ordering must be maintained in the non-default case. > Most, perhaps all, of these could be converted to "if" should > "?:" prove problematic. Note that this doesn't actually affect the restrict/ordering rule in theory: "?:" isn't special according to those rules. The rules are fairly simple: we guarantee ordering only to the object that the pointer points to, and even that guarantee goes out the window if there is some *other* way to reach the object. ?: is not really relevant, except in the sense that *any* expression that ends up pointing to outside the object will lose the ordering guarantee. ?: can be one such expression, but so can "p-p" or anything like that. 
And in *practice*, the only thing that needs to be sure to generate
special code is alpha, and there you'd just add the "rmb" after the
load.  That is sufficient to fulfill the guarantees.

On ARM and powerpc, the compiler obviously has to guarantee that it
doesn't do value-speculation on the result, but again, that never
really had anything to do with the whole "carries a dependency", it is
really all about the fact that in order to guarantee the ordering, the
compiler mustn't generate that magical aliased pointer value.  But if
the aliased pointer value comes from the *source* code, all bets are
off.

Now, even on alpha, the compiler can obviously move that "rmb" around.
For example, if there is a conditional after the
"atomic_read(mo_consume)", and the compiler can tell that the pointer
that got read by mo_consume is dead along one branch, then the
compiler can move the "rmb" to only exist in the other branch.  Why?
Because we inherently guarantee only the order to any accesses to the
object the pointer pointed to, and that the pointer that got loaded is
the *only* way to get to that object (in this context), so if the
value is dead, then so is the ordering.

In fact, even if the value is *not* dead, but it is NULL, the compiler
can validly say "the NULL pointer cannot point to any object, so I
don't have to guarantee any serialization".

So code like this (writing alpha assembly, since in practice only
alpha will ever care):

	ptr = atomic_read(pp, mo_consume);
	if (ptr) {
		... do something with ptr ..
	}
	return ptr;

can validly be translated to:

	ldq $1,0($2)
	beq $1,branch-over
	rmb
	.. the do-something code using register $1 ..
because the compiler knows that a NULL pointer cannot be dereferenced, so it can decide to put the rmb in the non-NULL path - even though the pointer value is still *live* in the other branch (well, the liveness of a constant value is somewhat debatable, but you get the idea), and may be used by the caller (but since it is NULL, the "use" cannot include accessing any object, only really testing). So note how this is actually very different from the "carries dependency" rule. It's simpler, and it allows much more natural optimizations. > o Addition and subtraction to adjust both pointers to and indexes > into RCU-protected arrays. There are not that many indexes, > and they could be converted to pointers, but the addition and > subtraction looks necessary in some cases. Addition and subtraction is fine, as long as they stay within the same object/array. And realistically, people violate the whole C pointer "same object" rule all the time. Any time you implement a raw memory allocator, you'll violate the C standard and you *will* basically be depending on architecture-specific behavior. So everybody knows that the C "pointer arithmetic has to stay within the object" is really a fairly made-up but convenient shorthand for "sure, we know you'll do tricks on pointer values, but they won't be portable and you may have to take particular machine representations into account". > o Array indexing. The value from rcu_dereference() is used both > before and inside the "[]", interestingly enough. Well, in the C sense, or in the actual "integer index" sense? Because technically, a[b] is nothing but *(a+b), so "inside" the "[]" is strictly speaking meaningless. Inside and outside are just syntactic sugar. That said, I agree that the way I phrased things really limits things to *just* pointers, and if you want to do RCU on integer values (that get turned into pointers some other way), that would be outside the spec. 
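The "syntactic sugar" point about [] is directly checkable in C; a trivial sketch:

```c
#include <assert.h>

/* a[b] is defined as *(a + b), so index and pointer commute and
 * "inside the []" has no special status. */
static int index_sugar(void)
{
	int arr[] = { 10, 20, 30 };
	int n = 2;

	assert(arr[n] == *(arr + n));
	assert(arr[n] == n[arr]);	/* legal, if ugly */
	return arr[n];
}
```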
But in practice, the code generation will *work* for non-pointers too (the exception might be if the compiler actually does the above "NULL is special, so I know I don't need to order wrt it"). Exactly the way that people do arithmetic operations on pointers that aren't really covered by the standard (ie arithmetic 'and' to align them etc), the code still *works*. It may be outside the standard, but everybody does it, and one of the reasons C is so powerful is exactly that you *can* do things like that. They won't be portable, and you have to know what you are doing, but they don't stop working just because they aren't covered by the standard. >> - the "restricted pointer" part means that the compiler does not need >> to worry about serialization to that object through other possible >> pointers - we have basically promised that the *only* pointer to that >> object comes from the mo_consume. So that part makes it clear that the >> "consume" ordering really only is valid wrt that particular pointer >> load. > > That could work, though there are some cases where a multi-linked > structure is made visible using a single rcu_assign_pointer(), and > rcu_dereference() is used only for the pointer leading to that > multi-linked structure, not for the pointers among the elements > making up that structure. One way to handle this would be to > require rcu_dereference() to be used within the structure as well > as upon first traversal to the structure. I don't see that you would need to protect anything but the first read, so I think you need rcu_dereference() only on the initial pointer access. On architectures where following a pointer is sufficient ordering, we're fine. On alpha (and, if anybody ever makes that horrid mistake again), the 'rmb' after the first access will guarantee that all the later accesses will see the new list. 
So basically, 'consume' will end up inserting a barrier (if necessary) before following the pointer for the first time, but once that barrier has been inserted, we are now fully ordered wrt the releasing store, so we're done. No need to order *all* the accesses (even off the same base pointer), only the first one. > But now it is time for some bullshit syntax for the RCU-protected arrays > in the Linux kernel: > > p = atomic_load_explicit(gp, memory_order_consume); > r1 = *p; /* Ordering maintained. */ > r2 = p[5]; /* Ordering maintained? */ Oh yes. It's in the same object. If it wasn't, then "p[5]" itself would be meaningless and outside the scope of the standard to begin with. > r3 = p + 5; /* Ordering maintained? */ > n = get_an_index(); > r4 = p[n]; /* Ordering maintained? */ Yes. By definition. And again, for the very same reason. You would violate *other* parts of the C standard before you violated the suggested rule. Also, remember: a lot of this is about "legalistic guarantees". In practice, just the fact that a data chain *exists* is guarantee enough, so from a code generation standpoint, NONE OF THIS MATTERS. The exception is alpha, where a "mo_consume" load basically needs to be generated with a "rmb" following it (see above how you can relax that a _bit_, but not much). Trivial. So code generation is actually *trivial*. Ordering is maintained *automatically*, with no real effort on the side of the compiler. The only issues really are: - the current C standard language implies *too much* ordering. The whole "carries dependency" is broken. The practical example - even without using "?:" or any conditionals or any function calls - is just that ptr = atomic_read(pp, mo_consume); value = array[ptr-ptr]; and there really isn't any sane ordering there. But the current standard clearly says there is. The current standard is wrong. 
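The ptr-ptr case can be written out as compilable C11 to show why the standard's "carries a dependency" claim is bogus here. A single-threaded sketch with invented names; the point is only that the index always cancels to zero, so there is nothing for the hardware to order:

```c
#include <assert.h>
#include <stdatomic.h>

/* The "fake dependency": ptr - ptr is always 0, so array[ptr - ptr]
 * is array[0] regardless of what the consume load produced.
 * Single-threaded sketch; invented names. */

static int array[4] = { 100, 200, 300, 400 };
static int *_Atomic pp;

static int fake_dependency(void)
{
	int *ptr;

	atomic_store_explicit(&pp, &array[3], memory_order_release);
	ptr = atomic_load_explicit(&pp, memory_order_consume);

	return array[ptr - ptr];	/* always array[0], whatever ptr was */
}
```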
- my trick of saying that it is only ordered wrt accesses to the object the pointer points to gets rid of that whole bogus false dependency. - the "restricted pointer" thing is just legalistic crap, and has no actual meaning for code generation, it is _literally_ just that if the programmer does value speculation by hand: ptr = atomic_read(pp, mo_consume); value = object.val; if (ptr != &object) value = ptr->val; then in this situation the compiler does *not* need to worry about the ordering to the "object.val" access, because it was gotten through an alias. So that "restrict" thing is really just a way to say "the ordering only exists through that single pointer". That's not really what restrict was _designed_ for, but to quote the standard: "An object that is accessed through a restrict-qualified pointer has a special association with that pointer. This association, defined in 6.7.3.1 below, requires that all accesses to that object use, directly or indirectly, the value of that particular pointer" that's not the *definition* of "restrict", and quite frankly, the actual language that talks about "restrict" really talks about being able to know that nobody *modifies* the value through another pointer, so I'm clearly stretching things a bit. But anybody reading the above quote from the standard hopefully agrees that I'm not stretching it all that much. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
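The by-hand value speculation in the message above compiles as ordinary C. A sequential sketch with invented names (no real concurrency here): the "object.val" read reaches the object through an alias rather than through the loaded pointer, which is exactly why no consume ordering can attach to it.

```c
#include <assert.h>

/* By-hand value speculation, sequential sketch with invented names:
 * object.val is reached through an alias, not through ptr, so no
 * consume ordering applies to that access. */

struct obj { int val; };

static struct obj object = { 17 };
static struct obj other = { 99 };

static int speculate(struct obj *ptr)
{
	int value = object.val;		/* speculated read via the alias */

	if (ptr != &object)
		value = ptr->val;	/* speculation failed: use the real pointer */
	return value;
}
```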
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-23 3:50 ` Linus Torvalds @ 2014-02-23 6:34 ` Paul E. McKenney 2014-02-23 19:31 ` Linus Torvalds 0 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-23 6:34 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sat, Feb 22, 2014 at 07:50:35PM -0800, Linus Torvalds wrote: > On Sat, Feb 22, 2014 at 4:39 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > > > Agreed, by far the most frequent use is "->" to dereference and assignment > > to store into a local variable. The other operations where the kernel > > expects ordering to be maintained are: > > > > o Bitwise "&" to strip off low-order bits. The FIB tree does > > this, for example in fib_table_lookup() in net/ipv4/fib_trie.c. > > The low-order bit is used to distinguish internal nodes from > > leaves -- nodes and leaves are different types of structures. > > (There are a few others.) > > Note that this is very much outside the scope of the C standard, > regardless of 'consume' or not. > > We'll always do things outside the standard, so I wouldn't worry. Fair enough. I am going to make examples to try to make sure that we aren't using the same words for two different meanings: struct foo restrict *p; p = atomic_load_explicit(&gp, memory_order_consume); p &= ~0x3; /* OK because p has restrict property. */ return p->a; /* Ordered WRT above load. */ Of course, the compiler would have to work pretty hard to break this ordering, so I am with you in not worrying too much about this one. > > o Uses "?:" to substitute defaults in case of NULL pointers, > > but ordering must be maintained in the non-default case. > > Most, perhaps all, of these could be converted to "if" should > > "?:" prove problematic. 
> > Note that this doesn't actually affect the restrict/ordering rule in > theory: "?:" isn't special according to those rules. The rules are > fairly simple: we guarantee ordering only to the object that the > pointer points to, and even that guarantee goes out the window if > there is some *other* way to reach the object. > > ?: is not really relevant, except in the sense that *any* expression > that ends up pointing to outside the object will lose the ordering > guarantee. ?: can be one such expression, but so can "p-p" or anything > like that. > > And in *practice*, the only thing that needs to be sure to generate > special code is alpha, and there you'd just add the "rmb" after the > load. That is sufficient to fulfill the guarantees. OK, seems reasonable. Reworking the earlier example: struct foo restrict *p; p = atomic_load_explicit(&gp, memory_order_consume); p = p ? p : &default_foo; return p->a; /* Ordered WRT above load if p non-NULL. */ And the ordering makes sense only in the non-NULL case anyway, so this should be fine. > On ARM and powerpc, the compiler obviously has to guarantee that it > doesn't do value-speculation on the result, but again, that never > really had anything to do with the whole "carries a dependency", it is > really all about the fact that in order to guarantee the ordering, the > compiler mustn't generate that magical aliased pointer value. But if > the aliased pointer value comes from the *source* code, all bets are > off. Agreed, compiler-based value speculation is dangerous in any case. (Though the branch-predictor-based trick seems like it should be safe on TSO systems like x86, s390, etc.) > Now, even on alpha, the compiler can obviously move that "rmb" around. > For example, if there is a conditional after the > "atomic_read(mo_consume)", and the compiler can tell that the pointer > that got read by mo_consume is dead along one branch, then the > compiler can move the "rmb" to only exist in the other branch. Why? 
> Because we inherently guarantee only the order to any accesses to the > object the pointer pointed to, and that the pointer that got loaded is > the *only* way to get to that object (in this context), so if the > value is dead, then so is the ordering. Yep! > In fact, even if the value is *not* dead, but it is NULL, the compiler > can validly say "the NULL pointer cannot point to any object, so I > don't have to guarantee any serialization". So code like this (writing > alpha assembly, since in practice only alpha will ever care): > > ptr = atomic_read(pp, mo_consume); > if (ptr) { > ... do something with ptr .. > } > return ptr; > > can validly be translated to: > > ldq $1,0($2) > beq $1,branch-over > rmb > .. the do-something code using register $1 .. > > because the compiler knows that a NULL pointer cannot be dereferenced, > so it can decide to put the rmb in the non-NULL path - even though the > pointer value is still *live* in the other branch (well, the liveness > of a constant value is somewhat debatable, but you get the idea), and > may be used by the caller (but since it is NULL, the "use" can not > include accessing any object, only really testing) Agreed. > So note how this is actually very different from the "carries > dependency" rule. It's simpler, and it allows much more natural > optimizations. I do have a couple more questions about it, but please see below. > > o Addition and subtraction to adjust both pointers to and indexes > > into RCU-protected arrays. There are not that many indexes, > > and they could be converted to pointers, but the addition and > > subtraction looks necessary in a some cases. > > Addition and subtraction is fine, as long as they stay within the same > object/array. > > And realistically, people violate the whole C pointer "same object" > rule all the time. Any time you implement a raw memory allocator, > you'll violate the C standard and you *will* basically be depending on > architecture-specific behavior. 
So everybody knows that the C "pointer > arithmetic has to stay within the object" is really a fairly made-up > but convenient shorthand for "sure, we know you'll do tricks on > pointer values, but they won't be portable and you may have to take > particular machine representations into account". Adding and subtracting integers to/from a RCU-protected pointer makes sense to me. Adding and subtracting integers to/from an RCU-protected integer makes sense in many practical cases, but I worry about the compiler figuring out that the RCU-protected integer cancelled with some other integer. I am beginning to suspect that the few uses of RCU-protected array indexes should be converted to pointer form. I don't feel all that good about subtractions involving an RCU-protected pointer and another pointer, again due to the possibility of arithmetic optimizations cancelling everything. But it is true that doing so is perfectly safe in a number of situations. For example, the following should work, and might even be useful: p = atomic_load_explicit(&gp, memory_order_consume); q = gq + p - gp_base; do_something_with(q->a); /* No ordering, but corresponding element. */ But at that point, the RCU-protected array index seems nicer. Give or take my nervousness about arithmetic optimizations. > > o Array indexing. The value from rcu_dereference() is used both > > before and inside the "[]", interestingly enough. > > Well, in the C sense, or in the actual "integer index" sense? Because > technically, a[b] is nothing but *(a+b), so "inside" the "[]" is > strictly speaking meaningless. Inside and outside are just syntactic > sugar. Yep, understood that a[b] is identical to b[a] in classic C. Just getting nervous about RCU-protected integer indexes interacting with compiler optimizations. Perhaps needlessly so, but... 
> That said, I agree that the way I phrased things really limits things > to *just* pointers, and if you want to do RCU on integer values (that > get turned into pointers some other way), that would be outside the > spec. That was why I asked. ;-) > But in practice, the code generation will *work* for non-pointers too > (the exception might be if the compiler actually does the above "NULL > is special, so I know I don't need to order wrt it". Exactly the way > that people do arithmetic operations on pointers that aren't really > covered by the standard (ie arithmetic 'and' to align them etc), the > code still *works*. It may be outside the standard, but everybody does > it, and one of the reasons C is so powerful is exactly that you *can* > do things like that. They won't be portable, and you have to know what > you are doing, but they don't stop working just because they aren't > covered by the standard. Yep, again modulo my nervousness about arithmetic optimizations on RCU-protected integers. > >> - the "restricted pointer" part means that the compiler does not need > >> to worry about serialization to that object through other possible > >> pointers - we have basically promised that the *only* pointer to that > >> object comes from the mo_consume. So that part makes it clear that the > >> "consume" ordering really only is valid wrt that particular pointer > >> load. > > > > That could work, though there are some cases where a multi-linked > > structure is made visible using a single rcu_assign_pointer(), and > > rcu_dereference() is used only for the pointer leading to that > > multi-linked structure, not for the pointers among the elements > > making up that structure. One way to handle this would be to > > require rcu_dereference() to be used within the structure an well > > as upon first traversal to the structure. 
> > I don't see that you would need to protect anything but the first > read, so I think you need rcu_dereference() only on the initial > pointer access. Let me try an example: struct bar { int a; unsigned b; }; struct foo { struct bar *next; char c; }; struct bar bar1; struct foo foo1; struct foo *foop; T1: bar1.a = -1; bar1.b = 2; foo1.next = &bar1; foo1.c = '*'; atomic_store_explicit(&foop, &foo1, memory_order_release); T2: struct foo restrict *p; struct bar *q; p = atomic_load_explicit(&foop, memory_order_consume); if (p == NULL) return -EDEALWITHIT; q = p->next; /* Ordered with above load. */ do_something(q->b); /* Non-decorated pointer, ordered? */ Yes, any reasonable code generation of the above will produce the ordering on all systems that Linux runs on, understood. But by your rules, if we don't do something special for pointer q, how do we avoid either (1) losing the ordering, (2) being back in the dependency-tracing business, or (3) having to rely on the kindness of compiler writers? > On architectures where following a pointer is sufficient ordering, > we're fine. On alpha (and, if anybody ever makes that horrid mistake > again), the 'rmb' after the first access will guarantee that all the > later accesses will see the new list. > > So basically, 'consume' will end up inserting a barrier (if necessary) > before following the pointer for the first time, but once that barrier > has been inserted, we are now fully ordered wrt the releasing store, > so we're done. No need to order *all* the accesses (even off the same > base pointer), only the first one. I hear what you are saying, but it is sounding like #3 in my list above. ;-) Which is OK, except from the legalistic guarantees viewpoint you mention below. It might also allow compiler writers to do crazy optimizations in general, but not on these restricted pointers. This might improve performance without risking messing up critical orderings. 
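Paul's two-thread example compiles almost as-is under C11 <stdatomic.h> (minus the nonstandard "restrict" decoration on p). The sketch below runs T1 and T2 sequentially in one thread, so it type-checks the shapes and values only -- it cannot demonstrate the ordering question itself, which needs two threads on a weakly ordered machine.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Paul's example as compilable C11, run sequentially (T1 then T2);
 * the asserts check values only, not the consume ordering itself. */

struct bar { int a; unsigned b; };
struct foo { struct bar *next; char c; };

static struct bar bar1;
static struct foo foo1;
static struct foo *_Atomic foop;

static void t1(void)
{
	bar1.a = -1;
	bar1.b = 2;
	foo1.next = &bar1;
	foo1.c = '*';
	atomic_store_explicit(&foop, &foo1, memory_order_release);
}

static unsigned t2(void)
{
	struct foo *p;
	struct bar *q;

	p = atomic_load_explicit(&foop, memory_order_consume);
	if (p == NULL)
		return 0;		/* stands in for -EDEALWITHIT */
	q = p->next;	/* the non-decorated pointer Paul asks about */
	return q->b;
}

static unsigned run(void)
{
	t1();
	return t2();
}
```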
> > But now it is time for some bullshit syntax for the RCU-protected arrays > > in the Linux kernel: > > > > p = atomic_load_explicit(gp, memory_order_consume); > > r1 = *p; /* Ordering maintained. */ > > r2 = p[5]; /* Ordering maintained? */ > > Oh yes. It's in the same object. If it wasn't, then "p[5]" itself > would be meaningless and outside the scope of the standard to begin > with. > > > r3 = p + 5; /* Ordering maintained? */ > > n = get_an_index(); > > r4 = p[n]; /* Ordering maintained? */ > > Yes. By definition. And again, for the very same reason. You would > violate *other* parts of the C standard before you violated the > suggested rule. Good! > Also, remember: a lot of this is about "legalistic guarantees". In > practice, just the fact that a data chain *exists* is guarantee > enough, so from a code generation standpoint, NONE OF THIS MATTERS. Completely understood. > The exception is alpha, where a "mo_consume" load basically needs to > be generated with a "rmb" following it (see above how you can relax > that a _bit_, but not much). Trivial. > > So code generation is actually *trivial*. Ordering is maintained > *automatically*, with no real effort on the side of the compiler. > > The only issues really are: > > - the current C standard language implies *too much* ordering. The > whole "carries dependency" is broken. The practical example - even > without using "?:" or any conditionals or any function calls - is just > that > > ptr = atomic_read(pp, mo_consume); > value = array[ptr-ptr]; > > and there really isn't any sane ordering there. But the current > standard clearly says there is. The current standard is wrong. Yep, not my idea, though I must take the blame for going along with it. > - my trick of saying that it is only ordered wrt accesses to the > object the pointer points to gets rid of that whole bogus false > dependency. So I am still nervous about pointer "q" in my example above. 
> - the "restricted pointer" thing is just legalistic crap, and has no > actual meaning for code generation, it is _literally_ just that if the > programmer does value speculation by hand: > > ptr = atomic_read(pp, mo_consume); > value = object.val; > if (ptr != &object) > value = ptr->val; > > then in this situation the compiler does *not* need to worry about > the ordering to the "object.val" access, because it was gotten through > an alias. And hopefully to tell the compiler to go easy on crazy optimizations in case where ptr->a or some such is really used. > So that "restrict" thing is really just a way to say "the ordering > only exists through that single pointer". That's not really what > restrict was _designed_ for, but to quote the standard: > > "An object that is accessed through a restrict-qualified pointer has a > special association > with that pointer. This association, defined in 6.7.3.1 below, > requires that all accesses to > that object use, directly or indirectly, the value of that particular pointer" > > that's not the *definition* of "restrict", and quite frankly, the > actual language that talks about "restrict" really talks about being > able to know that nobody *modifies* the value through another pointer, > so I'm clearly stretching things a bit. But anybody reading the above > quote from the standard hopefully agrees that I'm not stretching it > all that much. Well, based on my experience, the committee would pick some other spelling anyway... ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-23 6:34 ` Paul E. McKenney @ 2014-02-23 19:31 ` Linus Torvalds 2014-02-24 1:16 ` Paul E. McKenney 2014-02-24 15:57 ` Linus Torvalds 0 siblings, 2 replies; 285+ messages in thread From: Linus Torvalds @ 2014-02-23 19:31 UTC (permalink / raw) To: Paul McKenney Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sat, Feb 22, 2014 at 10:34 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > > Adding and subtracting integers to/from a RCU-protected pointer makes > sense to me. Ack. And that's normal "access to an object" behavior anyway. > Adding and subtracting integers to/from an RCU-protected integer makes > sense in many practical cases, but I worry about the compiler figuring > out that the RCU-protected integer cancelled with some other integer. I suspect that in practice, all the same normal rules apply - assuming that there aren't any "aliasing" integers, and that the compiler doesn't do any value prediction, the end result *should* be exactly the same as for the pointer arithmetic. So even if the standard were to talk about just pointers, I suspect that in practice, there really isn't anything that can go wrong. > I don't feel all that good about subtractions involving an RCU-protected > pointer and another pointer, again due to the possibility of arithmetic > optimizations cancelling everything. Actually, pointer subtraction is a pretty useful operation, even without the "gotcha" case of "p-p" just to force a fake dependency. Getting an index, or indeed just getting an offset within a structure, is valid and common, and people will and should do it. It doesn't really matter for my suggested language: it should be perfectly fine to do something like ptr = atomic_read(pp, mo_consume); index = ptr - array_base; .. pass off 'index' to some function, which then re-generates the ptr using it .. 
and the compiler will have no trouble generating code, and the suggested "ordered wrt that object" guarantee is that the eventual ordering is between the pointer load and the use in the function, regardless of how it got there (ie turning it into an index and back is perfectly fine). So both from a legalistic language wording standpoint, and from a compiler code generation standpoint, this is a non-issue. Now, if you actually lose information (ie some chain there drops enough data from the pointer that it cannot be recovered, partially or fully), and then "regenerate" the object value by faking it, and still end up accessing the right data, but without actually going through the pointer that you loaded, that falls under the "restricted" heading, and you must clearly at least partially have used other information. In which case the standard wording wouldn't guarantee anything at all. >> I don't see that you would need to protect anything but the first >> read, so I think you need rcu_dereference() only on the initial >> pointer access. > > Let me try an example: > > struct bar { > int a; > unsigned b; > }; > > struct foo { > struct bar *next; > char c; > }; > > struct bar bar1; > struct foo foo1; > > struct foo *foop; > > T1: bar1.a = -1; > bar1.b = 2; > foo1.next = &bar1; > foo1.c = '*'; > atomic_store_explicit(&foop, &foo1, memory_order_release); So here, the standard requires that the store with release is an ordering to all preceding writes. So *all* writes to bar and foo are ordered, despite the fact that the pointer just points to foo. > T2: struct foo restrict *p; > struct bar *q; > > p = atomic_load_explicit(&foop, memory_order_consume); > if (p == NULL) > return -EDEALWITHIT; > q = p->next; /* Ordered with above load. */ > do_something(q->b); /* Non-decorated pointer, ordered? 
*/ So the theory is that a compiler *could* do some value speculation now on "q", and with value speculation move the actual load of "q->b" up to before "foop" was even loaded. So in practice, I think we agree that this doesn't really affect compiler writers (because they'd have to do a whole lot of extra code to break it intentionally, considering that they can't do value-prediction on 'p'), and we just need to make sure to close the hole in the language to make this safe, right? Let me think about it some more, but my gut feel is that just tweaking the definition of what "ordered" means is sufficient. So to go back to the suggested ordering rules (ignoring the "restrict" part, which is just to clarify that ordering through other means to get to the object doesn't matter), I suggested: "the consume ordering guarantees the ordering between that atomic read and the accesses to the object that the pointer points to" and I think the solution is to just say that this ordering acts as a fence. It doesn't say exactly *where* the fence is, but it says that there is *some* fence between the load of the pointer and any/all accesses to the object through that pointer. So with that definition, the memory accesses that are dependent on 'q' will obviously be ordered. Now, they will *not* be ordered wrt the load of q itself, but they will be ordered wrt the load from 'foop' - because we've made it clear that there is a fence *somewhere* between that atomic_load and the load of 'q'. So that handles the ordering guarantee you worry about. Now, the other worry is that this "fence" language would impose a *stricter* ordering, and that by saying there is a fence, we'd now constrain code generation on architectures like ARM and power, agreed? And we do not want to do that, other than make it clear that the whole "break the dependency chain through value speculation" cannot break it past the load from 'foop'. Are we in agreement? 
So we do *not* want that fence to limit re-ordering of independent memory operations. All agreed? But let's look at any independent chain, especially around that consuming load from '&foop'. Say we have some other memory access (adding them for visualization into your example): p = atomic_load_explicit(&foop, memory_order_consume); if (p == NULL) return -EDEALWITHIT; q = p->next; /* Ordered with above load. */ + a = *b; *c = d; do_something(q->b); /* Non-decorated pointer, ordered? */ and we are looking to make sure that those memory accesses can still move around freely. I'm claiming that they can, even with the "fence" language, because (a) we've said 'q' is restricted, so there is no aliasing between q and the pointers b/c. So the compiler is free to move those accesses around the "q = p->next" access. (b) once you've moved them to *before* the "q = p->next" access, the fence no longer constrains you, because the guarantee is that there is a fence *somewhere* between the consuming load and the accesses to that object, but it could be *at* the access, so you're done. So I think the "we guarantee there is an ordering *somewhere* between the consuming load and the first ordered access" model actually gives us the semantics we need. (Note that you cannot necessarily move the accesses through the pointers b/c past the consuming load, since the only non-aliasing guarantee is about the pointer 'p', not about 'foop'. But you may use other arguments to move them past that too) But it's possible that the above argument doesn't really work. It is a very off-the-cuff "my intuition says this should work" model. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
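The pointer-to-index round trip discussed earlier in this message (turn the loaded pointer into an index, pass it off, regenerate the pointer) can be spelled out. A sequential C11 sketch with invented names, where the regenerated pointer is derived from the consume-loaded one:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* pointer -> index -> pointer round trip, sequential sketch with
 * invented names.  The regenerated pointer derives from the
 * consume-loaded one, so under the proposed rule the eventual
 * access stays ordered after the load. */

struct obj { int val; };

static struct obj array_base[4] = { {1}, {2}, {3}, {4} };
static struct obj *_Atomic pp;

static int use_index(ptrdiff_t index)
{
	struct obj *regen = array_base + index;	/* regenerate the pointer */

	return regen->val;
}

static int round_trip(void)
{
	struct obj *ptr;
	ptrdiff_t index;

	atomic_store_explicit(&pp, &array_base[2], memory_order_release);
	ptr = atomic_load_explicit(&pp, memory_order_consume);
	index = ptr - array_base;	/* hand off the index, not the pointer */

	return use_index(index);
}
```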
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-23 19:31 ` Linus Torvalds @ 2014-02-24 1:16 ` Paul E. McKenney 2014-02-24 1:35 ` Linus Torvalds 2014-02-24 15:57 ` Linus Torvalds 1 sibling, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-24 1:16 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sun, Feb 23, 2014 at 11:31:25AM -0800, Linus Torvalds wrote: > On Sat, Feb 22, 2014 at 10:34 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > > > Adding and subtracting integers to/from a RCU-protected pointer makes > > sense to me. > > Ack. And that's normal "access to an object" behavior anyway. And covers things like container_of() as well as array indexing and field selection, so good. > > Adding and subtracting integers to/from an RCU-protected integer makes > > sense in many practical cases, but I worry about the compiler figuring > > out that the RCU-protected integer cancelled with some other integer. > > I suspect that in practice, all the same normal rules apply - assuming > that there aren't any "aliasing" integers, and that the compiler > doesn't do any value prediction, the end result *should* be exactly > the same as for the pointer arithmetic. So even if the standard were > to talk about just pointers, I suspect that in practice, there really > isn't anything that can go wrong. Yep, in practice in absence of the "i-i" code, it should work. As long as the integer returned from the memory_order_consume load is always kept in an integer tagged as "restrict", I believe that it should be possible to make the standard guarantee it as well. > > I don't feel all that good about subtractions involving an RCU-protected > > pointer and another pointer, again due to the possibility of arithmetic > > optimizations cancelling everything. 
> > Actually, pointer subtraction is a pretty useful operation, even > without the "gotcha" case of "p-p" just to force a fake dependency. > Getting an index, or indeed just getting an offset within a structure, > is valid and common, and people will and should do it. It doesn't > really matter for my suggested language: it should be perfectly fine > to do something like > > ptr = atomic_read(pp, mo_consume); > index = ptr - array_base; > .. pass off 'index' to some function, which then re-generates the > ptr using it .. I believe that the devil is in the details of the regeneration of the pointer. Yet another example below. > and the compiler will have no trouble generating code, and the > suggested "ordered wrt that object" guarantee is that the eventual > ordering is between the pointer load and the use in the function, > regardless of how it got there (ie turning it into an index and back > is perfectly fine). > > So both from a legalistic language wording standpoint, and from a > compiler code generation standpoint, this is a non-issue. > > Now, if you actually lose information (ie some chain there drops > enough data from the pointer that it cannot be recovered, partially of > fully), and then "regenerate" the object value by faking it, and still > end up accessing the right data, but without actually going through > any of the the pointer that you loaded, that falls under the > "restricted" heading, and you must clearly at least partially have > used other information. In which case the standard wording wouldn't > guarantee anything at all. I agree in general, but believe that the pointer regeneration needs to be done with care. If you are adding the difference back into the original variable ptr read from the RCU-protected pointer, I have no problem with it. My concern is with more elaborate cases. 
For example:

struct foo g1[N];
struct bar g2[N];
struct foo * restrict p1;
struct bar *p2;
int index;

p1 = atomic_load_explicit(&gp1, memory_order_consume);
index = (p1 - g1) / 2; /* Getting this into the BS syntax realm. */
p2 = g2 + index;

The standard as currently worded would force ordering between the read into p1 and the read into p2. But it does so via dependency chains, so this is starting to feel to me like we are getting back into the business of tracing dependency chains. Of course, one can argue that parallel arrays are not all that useful in the Linux kernel, and one can argue that dividing indexes by two is beyond the pale. (Although heaps in dense arrays really do this sort of index arithmetic, I am having some trouble imagining why RCU protection would be useful in that case. Yes, I could make up something about replacing one heap locklessly with another, but...) I might well be in my usual overly paranoid mode, but this is the sort of thing that makes me a bit nervous. I would rather either have it bulletproof safe, or say "don't do that". Hmmm... I don't recall seeing a use case for subtraction involving an RCU-protected pointer and another pointer when I looked through all the rcu_dereference() invocations in the Linux kernel. Is there a use case that I am missing?

> >> I don't see that you would need to protect anything but the first
> >> read, so I think you need rcu_dereference() only on the initial
> >> pointer access.
> >
> > Let me try an example:
> >
> > struct bar {
> > int a;
> > unsigned b;
> > };
> >
> > struct foo {
> > struct bar *next;
> > char c;
> > };
> >
> > struct bar bar1;
> > struct foo foo1;
> >
> > struct foo *foop;
> >
> > T1: bar1.a = -1;
> > bar1.b = 2;
> > foo1.next = &bar1;
> > foo1.c = '*';
> > atomic_store_explicit(&foop, &foo1, memory_order_release);
>
> So here, the standard requires that the store with release is an ordering to all preceding writes. 
> So *all* writes to bar and foo are ordered, despite the fact that the pointer just points to foo.

As long as T1 does some load from foop that extends T1's happens-before chain, agreed.

> > T2: struct foo * restrict p;
> > struct bar *q;
> >
> > p = atomic_load_explicit(&foop, memory_order_consume);
> > if (p == NULL)
> > return -EDEALWITHIT;
> > q = p->next; /* Ordered with above load. */
> > do_something(q->b); /* Non-decorated pointer, ordered? */
>
> So the theory is that a compiler *could* do some value speculation now > on "q", and with value speculation move the actual load of "q->b" up > to before "foop" was even loaded.

That would be one concern. The other concern is that the only reason I believe that this is safe is because I trace the dependency chain from "p" through "p->next" and through "q->b". Also, the more clearly the standard states the rules the Linux kernel relies on, the easier it is to get compiler writers to pay attention to bug reports. It might be that adding "restrict" to the definition of "q" would address my concerns.

> So in practice, I think we agree that this doesn't really affect > compiler writers (because they'd have to do a whole lot extra code to > break it intentionally, considering that they can't do > value-prediction on 'p'), and we just need to make sure to close the > hole in the language to make this safe, right?

Yep. Unless the compiler is doing some "interesting" optimizations on q, this should in practice be safe. After all, we are relying on this right now. ;-)

> Let me think about it some more, but my gut feel is that just tweaking > the definition of what "ordered" means is sufficient. 
> > So to go back to the suggested ordering rules (ignoring the "restrict" > part, which is just to clarify that ordering through other means to > get to the object doesn't matter), I suggested: > > "the consume ordering guarantees the ordering between that > atomic read and the accesses to the object that the pointer > points to" > > and I think the solution is to just say that this ordering acts as a > fence. It doesn't say exactly *where* the fence is, but it says that > there is *some* fence between the load of the pointer and any/all > accesses to the object through that pointer. > > So with that definition, the memory accesses that are dependent on 'q' > will obviously be ordered. Now, they will *not* be ordered wrt the > load of q itself, but they will be ordered wrt the load from 'foop' - > because we've made it clear that there is a fence *somewhere* between > that atomic_load and the load of 'q'. > > So that handles the ordering guarantee you worry about. > > Now, the other worry is that this "fence" language would impose a > *stricter* ordering, and that by saying there is a fence, we'd now > constrain code generation on architectures like ARM and power, agreed? > And we do not want to do that, other than make it clear that the whole > "break the dependency chain through value speculation" cannot break it > past the load from 'foop'. Are we in agreement? > > So we do *not* want that fence to limit re-ordering of independent > memory operations. All agreed? Yep! > But let's look at any independent chain, especially around that > consuming load from '&foop'. Say we have some other memory access > (adding them for visualization into your example): > > p = atomic_load_explicit(&foop, memory_order_consume); > if (p == NULL) > return -EDEALWITHIT; > q = p->next; /* Ordered with above load. */ > + a = *b; *c = d; > do_something(q->b); /* Non-decorated pointer, ordered? 
*/ > > and we are looking to make sure that those memory accesses can still > move around freely. > > I'm claiming that they can, even with the "fence" language, because > > (a) we've said 'q' is restricted, so there is no aliasing between q > and the pointers b/c. So the compiler is free to move those accesses > around the "q = p->next" access. Ah, if I understand you, very good! My example intentionally left "q" -not- restricted. I have to think about it more, but I believe that I am OK requiring that "q" be restricted in order to have the ordering guarantees. In other words, leaving "q" unrestricted would with extremely high probability get you the ordering guarantees in practice, but would not be absolutely guaranteed according to the standard. Changing the above example to mark "q" restricted gets you the guarantee both in practice and according to the standard. Is that where you are going with this? > (b) once you've moved them to *before* the "q = p->next" access, the > fence no longer constrains you, because the guarantee is that there is > a fence *somewhere* between the consuming load and the accesses to > that object, but it could be *at* the access, so you're done. Plus the pointer "p" is restricted, so the reasoning above should also allow moving the assignments to b/c to before the assignment to "p", right? > So I think the "we guarantee there is an ordering *somewhere* between > the consuming load and the first ordered access" model actually gives > us the semantics we need. > > (Note that you cannot necessarily move the accesses through the > pointers b/c past the consuming load, since the only non-aliasing > guarantee is about the pointer 'p', not about 'foop'. But you may use > other arguments to move them past that too) > > But it's possible that the above argument doesn't really work. It is a > very off-the-cuff "my intuition says this should work" model. 
Yes, we definitely will need to run it past Peter Sewell and Mark Batty once we have it fully nailed down to see if it survives formalization. ;-)

But expanding the example one step further:

+ char ch;
p = atomic_load_explicit(&foop, memory_order_consume);
if (p == NULL)
return -EDEALWITHIT;
q = p->next; /* Ordered with above load. */
ch = p->c; /* Ordered with above load. */
a = *b; *c = d;
do_something(q->b); /* Non-decorated pointer, ordered? */
+ r1 = somearray[ch]; /* "ch" is not restricted, no guarantee. */

Again, given current compilers and hardware, the load into r1 would be ordered, and the current unloved standard would guarantee that as well (at great pain to compiler writers, I hasten to add). But I would be OK with this guarantee not being ironclad as far as a new version of the standard is concerned. If you needed the standard to guarantee this ordering, you could write char restrict ch; instead of just plain "char ch". Does that fit into the approach you are thinking of? Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-24 1:16 ` Paul E. McKenney @ 2014-02-24 1:35 ` Linus Torvalds 2014-02-24 4:59 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-24 1:35 UTC (permalink / raw) To: Paul McKenney Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sun, Feb 23, 2014 at 5:16 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: >> >> (a) we've said 'q' is restricted, so there is no aliasing between q >> and the pointers b/c. So the compiler is free to move those accesses >> around the "q = p->next" access. > > Ah, if I understand you, very good! > > My example intentionally left "q" -not- restricted. No, I 100% agree with that. "q" is *not* restricted. But "p" is, since it came from that consuming load. But "q = p->next" is ordered by how something can alias "p->next", not by 'q'! There is no need to restrict anything but 'p' for all of this to work. Btw, it's also worth pointing out that I do *not* in any way expect people to actually write the "restrict" keyword anywhere. So no need to change source code. What you have is a situation where the pointer coming out of the memory_order_consume is restricted. But if you assign it to a non-restricted pointer, that's *fine*. That's perfectly normal C behavior. The "restrict" concept is not something that the programmer needs to worry about or ever even notice, it's basically just a promise to the compiler that "if somebody has another pointer lying around, accesses through that other pointer do not require ordering". So it sounds like you believe that the programmer would mark things "restrict", and I did not mean that at all. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-24 1:35 ` Linus Torvalds @ 2014-02-24 4:59 ` Paul E. McKenney 2014-02-24 5:25 ` Linus Torvalds 0 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-24 4:59 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sun, Feb 23, 2014 at 05:35:28PM -0800, Linus Torvalds wrote: > On Sun, Feb 23, 2014 at 5:16 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > >> > >> (a) we've said 'q' is restricted, so there is no aliasing between q > >> and the pointers b/c. So the compiler is free to move those accesses > >> around the "q = p->next" access. > > > > Ah, if I understand you, very good! > > > > My example intentionally left "q" -not- restricted. > > No, I 100% agree with that. "q" is *not* restricted. But "p" is, since > it came from that consuming load. > > But "q = p->next" is ordered by how something can alias "p->next", not by 'q'! > > There is no need to restrict anything but 'p' for all of this to work. I cannot say I understand this last sentence right now from the viewpoint of the standard, but suspending disbelief for the moment... (And yes, given current compilers and CPUs, I agree that this should all work in practice. My concern is the legality, not the reality.) > Btw, it's also worth pointing out that I do *not* in any way expect > people to actually write the "restrict" keyword anywhere. So no need > to change source code. Understood -- in this variant, you are taking the marking from the fact that there was an assignment from a memory_order_consume load rather than from a keyword on the assigned-to variable's declaration. > What you have is a situation where the pointer coming out of the > memory_order_consume is restricted. But if you assign it to a > non-restricted pointer, that's *fine*. That's perfectly normal C > behavior. 
The "restrict" concept is not something that the programmer > needs to worry about or ever even notice, it's basically just a > promise to the compiler that "if somebody has another pointer lying > around, accesses though that other pointer do not require ordering". > > So it sounds like you believe that the programmer would mark things > "restrict", and I did not mean that at all. Indeed I did believe that. I must confess that I was looking for an easy way to express in standardese -exactly- where the ordering guarantee did and did not propagate. The thing is that the vast majority of the Linux-kernel RCU code is more than happy with the guarantee only applying to fetches via the pointer returned from the memory_order_consume load. There are relatively few places where groups of structures are made visible to RCU readers via a single rcu_assign_pointer(). I guess I need to actually count them. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-24 4:59 ` Paul E. McKenney @ 2014-02-24 5:25 ` Linus Torvalds 0 siblings, 0 replies; 285+ messages in thread From: Linus Torvalds @ 2014-02-24 5:25 UTC (permalink / raw) To: Paul McKenney Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sun, Feb 23, 2014 at 8:59 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > On Sun, Feb 23, 2014 at 05:35:28PM -0800, Linus Torvalds wrote: >> >> But "q = p->next" is ordered by how something can alias "p->next", not by 'q'! >> >> There is no need to restrict anything but 'p' for all of this to work. > > I cannot say I understand this last sentence right new from the viewpoint > of the standard, but suspending disbelief for the moment... So 'p' is what comes from that consuming load that returns a 'restrict' pointer. That doesn't affect 'q' in any way. But the act of initializing 'q' by dereferencing p (in "p->next") is - by virtue of the restrict - something that the compiler can see cannot alias with anything else, so the compiler could re-order other memory accesses freely around it, if you see what I mean. Modulo all the *other* ordering guarantees, of course. So other atomics and volatiles etc may have their own rules, quite apart from any aliasing issues. > Understood -- in this variant, you are taking the marking from the > fact that there was an assignment from a memory_order_consume load > rather than from a keyword on the assigned-to variable's declaration. Yes, and to me, it's really just a legalistic trick to make it clear that any *other* pointer that happens to point to the same object cannot be dereferenced within scope of the result of the atomic_read(mo_consume), at least not if you expect to get the memory ordering semantics. You can do it, but then you violate the guarantee of the restrict, and you get what you get - a potential non-ordering. 
So if somebody just immediately assigns the value to a normal (non-restrict) pointer nothing *really* changes. It's just there to describe the guarantees. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-23 19:31 ` Linus Torvalds 2014-02-24 1:16 ` Paul E. McKenney @ 2014-02-24 15:57 ` Linus Torvalds 2014-02-24 16:27 ` Richard Biener 2014-02-24 17:21 ` Paul E. McKenney 1 sibling, 2 replies; 285+ messages in thread From: Linus Torvalds @ 2014-02-24 15:57 UTC (permalink / raw) To: Paul McKenney Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sun, Feb 23, 2014 at 11:31 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Let me think about it some more, but my gut feel is that just tweaking > the definition of what "ordered" means is sufficient. > > So to go back to the suggested ordering rules (ignoring the "restrict" > part, which is just to clarify that ordering through other means to > get to the object doesn't matter), I suggested: > > "the consume ordering guarantees the ordering between that > atomic read and the accesses to the object that the pointer > points to" > > and I think the solution is to just say that this ordering acts as a > fence. It doesn't say exactly *where* the fence is, but it says that > there is *some* fence between the load of the pointer and any/all > accesses to the object through that pointer. I'm wrong. That doesn't work. At all. There is no ordering except through the pointer chain. So I think saying just that, and nothing else (no magic fences, no nothing) is the right thing: "the consume ordering guarantees the ordering between that atomic read and the accesses to the object that the pointer points to directly or indirectly through a chain of pointers" The thing is, anything but a chain of pointers (and maybe relaxing it to "indexes in tables" in addition to pointers) doesn't really work. 
The current standard tries to break it at "obvious" points that can lose the data dependency (either by turning it into a control dependency, or by just dropping the value, like the left-hand side of a comma-expression), but the fact is, it's broken. It's broken not just because the value can be lost other ways (ie the "p-p" example), it's broken because the value can be turned into a control dependency so many other ways too. Compilers regularly turn arithmetic ops with logical comparisons into branches. So an expression like "a = !!ptr" carries a dependency in the current C standard, but it's entirely possible that a compiler ends up turning it into a compare-and-branch rather than a compare-and-set-conditional, depending on just exactly how "a" ends up being used. That's true even on an architecture like ARM that has a lot of conditional instructions (there are way less if you compile for Thumb, for example, but compilers also do things like "if there are more than N predicated instructions I'll just turn it into a branch-over instead"). So I think the C standard needs to just explicitly say that you can walk a chain of pointers (with that possible "indexes in arrays" extension), and nothing more. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-24 15:57 ` Linus Torvalds @ 2014-02-24 16:27 ` Richard Biener 2014-02-24 16:37 ` Linus Torvalds 2014-02-24 17:21 ` Paul E. McKenney 1 sibling, 1 reply; 285+ messages in thread From: Richard Biener @ 2014-02-24 16:27 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 24, 2014 at 4:57 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sun, Feb 23, 2014 at 11:31 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: >> >> Let me think about it some more, but my gut feel is that just tweaking >> the definition of what "ordered" means is sufficient. >> >> So to go back to the suggested ordering rules (ignoring the "restrict" >> part, which is just to clarify that ordering through other means to >> get to the object doesn't matter), I suggested: >> >> "the consume ordering guarantees the ordering between that >> atomic read and the accesses to the object that the pointer >> points to" >> >> and I think the solution is to just say that this ordering acts as a >> fence. It doesn't say exactly *where* the fence is, but it says that >> there is *some* fence between the load of the pointer and any/all >> accesses to the object through that pointer. > > I'm wrong. That doesn't work. At all. There is no ordering except > through the pointer chain. > > So I think saying just that, and nothing else (no magic fences, no > nothing) is the right thing: > > "the consume ordering guarantees the ordering between that > atomic read and the accesses to the object that the pointer > points to directly or indirectly through a chain of pointers" To me that reads like int i; int *q = &i; int **p = &q; atomic_XXX (p, CONSUME); orders against accesses '*p', '**p', '*q' and 'i'. 
Thus it seems they want to say that it orders against aliased storage - but then go further and include "indirectly through a chain of pointers"?! Thus an atomic read of a int * orders against any 'int' memory operation but not against 'float' memory operations? Eh ... Just jumping in to throw in my weird-2-cents. Richard. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-24 16:27 ` Richard Biener @ 2014-02-24 16:37 ` Linus Torvalds 2014-02-24 16:40 ` Linus Torvalds 2014-02-24 16:55 ` Michael Matz 0 siblings, 2 replies; 285+ messages in thread From: Linus Torvalds @ 2014-02-24 16:37 UTC (permalink / raw) To: Richard Biener Cc: Paul McKenney, Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 24, 2014 at 8:27 AM, Richard Biener <richard.guenther@gmail.com> wrote: > > To me that reads like > > int i; > int *q = &i; > int **p = &q; > > atomic_XXX (p, CONSUME); > > orders against accesses '*p', '**p', '*q' and 'i'. Thus it seems they > want to say that it orders against aliased storage - but then go further > and include "indirectly through a chain of pointers"?! Thus an > atomic read of a int * orders against any 'int' memory operation but > not against 'float' memory operations? No, it's not about type at all, and the "chain of pointers" can be much more complex than that, since the "int *" can point to within an object that contains other things than just that "int" (the "int" can be part of a structure that then has pointers to other structures etc). So in your example, ptr = atomic_read(p, CONSUME); would indeed order against the subsequent access of the chain through *that* pointer (the whole "restrict" thing that I left out as a separate thing, which was probably a mistake), but certainly not against any integer pointer, and certainly not against any aliasing pointer chains. So yes, the atomic_read() would be ordered wrt '*ptr' (getting 'q') _and_ '**ptr' (getting 'i'), but nothing else - including just the aliasing access of dereferencing 'i' directly. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-24 16:37 ` Linus Torvalds @ 2014-02-24 16:40 ` Linus Torvalds 2014-02-24 16:55 ` Michael Matz 1 sibling, 0 replies; 285+ messages in thread From: Linus Torvalds @ 2014-02-24 16:40 UTC (permalink / raw) To: Richard Biener Cc: Paul McKenney, Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 24, 2014 at 8:37 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > So yes, the atomic_read() would be ordered wrt '*ptr' (getting 'q') > _and_ '**ptr' (getting 'i'), but nothing else - including just the > aliasing access of dereferencing 'i' directly. Btw, what CPU architects and memory ordering guys tend to do in documentation is give a number of "litmus test" pseudo-code sequences to show the effects and intent of the language. I think giving those kinds of litmus tests for both "this is ordered" and "this is not ordered" cases like the above is would be a great clarification. Partly because the language is going to be somewhat legalistic and thus hard to wrap your mind around, and partly to really hit home the *intent* of the language, which I think is actually fairly clear to both compiler writers and to programmers. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-24 16:37 ` Linus Torvalds 2014-02-24 16:40 ` Linus Torvalds @ 2014-02-24 16:55 ` Michael Matz 2014-02-24 17:28 ` Paul E. McKenney 2014-02-24 17:38 ` Linus Torvalds 1 sibling, 2 replies; 285+ messages in thread From: Michael Matz @ 2014-02-24 16:55 UTC (permalink / raw) To: Linus Torvalds Cc: Richard Biener, Paul McKenney, Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc Hi, On Mon, 24 Feb 2014, Linus Torvalds wrote: > > To me that reads like > > > > int i; > > int *q = &i; > > int **p = &q; > > > > atomic_XXX (p, CONSUME); > > > > orders against accesses '*p', '**p', '*q' and 'i'. Thus it seems they > > want to say that it orders against aliased storage - but then go further > > and include "indirectly through a chain of pointers"?! Thus an > > atomic read of a int * orders against any 'int' memory operation but > > not against 'float' memory operations? > > No, it's not about type at all, and the "chain of pointers" can be > much more complex than that, since the "int *" can point to within an > object that contains other things than just that "int" (the "int" can > be part of a structure that then has pointers to other structures > etc). So, let me try to poke holes into your definition or increase my understanding :) . You said "chain of pointers"(dereferences I assume), e.g. if p is result of consume load, then access to p->here->there->next->prev->stuff is supposed to be ordered with that load (or only when that last load/store itself is also an atomic load or store?). So, what happens if the pointer deref chain is partly hidden in some functions: A * adjustptr (B *ptr) { return &ptr->here->there->next; } B * p = atomic_XXX (&somewhere, consume); adjustptr(p)->prev->stuff = bla; As far as I understood you, this whole ptrderef chain business would be only an optimization opportunity, right? 
So if the compiler can't be sure how p is actually used (as in my function-using case, assume adjustptr is defined in another unit), then the consume load would simply be transformed into an acquire (or whatever, with some barrier I mean)? Only _if_ the compiler sees all obvious uses of p (indirectly through pointer derefs) can it, yeah, do what with the consume load? Ciao, Michael. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-24 16:55 ` Michael Matz @ 2014-02-24 17:28 ` Paul E. McKenney 2014-02-24 17:57 ` Paul E. McKenney 2014-02-26 17:39 ` Torvald Riegel 2014-02-24 17:38 ` Linus Torvalds 1 sibling, 2 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-24 17:28 UTC (permalink / raw) To: Michael Matz Cc: Linus Torvalds, Richard Biener, Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 24, 2014 at 05:55:50PM +0100, Michael Matz wrote: > Hi, > > On Mon, 24 Feb 2014, Linus Torvalds wrote: > > > > To me that reads like > > > > > > int i; > > > int *q = &i; > > > int **p = &q; > > > > > > atomic_XXX (p, CONSUME); > > > > > > orders against accesses '*p', '**p', '*q' and 'i'. Thus it seems they > > > want to say that it orders against aliased storage - but then go further > > > and include "indirectly through a chain of pointers"?! Thus an > > > atomic read of a int * orders against any 'int' memory operation but > > > not against 'float' memory operations? > > > > No, it's not about type at all, and the "chain of pointers" can be > > much more complex than that, since the "int *" can point to within an > > object that contains other things than just that "int" (the "int" can > > be part of a structure that then has pointers to other structures > > etc). > > So, let me try to poke holes into your definition or increase my > understanding :) . You said "chain of pointers"(dereferences I assume), > e.g. if p is result of consume load, then access to > p->here->there->next->prev->stuff is supposed to be ordered with that load > (or only when that last load/store itself is also an atomic load or > store?). 
>
> So, what happens if the pointer deref chain is partly hidden in some functions:
>
> A * adjustptr (B *ptr) { return &ptr->here->there->next; }
> B * p = atomic_XXX (&somewhere, consume);
> adjustptr(p)->prev->stuff = bla;
>
> As far as I understood you, this whole ptrderef chain business would be > only an optimization opportunity, right? So if the compiler can't be sure > how p is actually used (as in my function-using case, assume adjustptr is > defined in another unit), then the consume load would simply be > transformed into an acquire (or whatever, with some barrier I mean)? Only > _if_ the compiler sees all obvious uses of p (indirectly through pointer > derefs) can it, yeah, do what with the consume load?

Good point, I left that out of my list. Adding it:

13. By default, pointer chains do not propagate into or out of functions.
    In implementations having attributes, a [[carries_dependency]]
    attribute may be used to mark a function argument or return as
    passing a pointer chain into or out of that function.

    If a function does not contain memory_order_consume loads and
    also does not contain [[carries_dependency]] attributes, then
    that function may be compiled using any desired dependency-breaking
    optimizations.

    The ordering effects are implementation defined when a given
    pointer chain passes into or out of a function through a parameter
    or return not marked with a [[carries_dependency]] attribute.

Note that this last paragraph differs from the current standard, which would require ordering regardless. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-24 17:28 ` Paul E. McKenney @ 2014-02-24 17:57 ` Paul E. McKenney 2014-02-26 17:39 ` Torvald Riegel 1 sibling, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-24 17:57 UTC (permalink / raw) To: Michael Matz Cc: Linus Torvalds, Richard Biener, Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 24, 2014 at 09:28:56AM -0800, Paul E. McKenney wrote: > On Mon, Feb 24, 2014 at 05:55:50PM +0100, Michael Matz wrote: > > Hi, > > > > On Mon, 24 Feb 2014, Linus Torvalds wrote: > > > > > > To me that reads like > > > > > > > > int i; > > > > int *q = &i; > > > > int **p = &q; > > > > > > > > atomic_XXX (p, CONSUME); > > > > > > > > orders against accesses '*p', '**p', '*q' and 'i'. Thus it seems they > > > > want to say that it orders against aliased storage - but then go further > > > > and include "indirectly through a chain of pointers"?! Thus an > > > > atomic read of a int * orders against any 'int' memory operation but > > > > not against 'float' memory operations? > > > > > > No, it's not about type at all, and the "chain of pointers" can be > > > much more complex than that, since the "int *" can point to within an > > > object that contains other things than just that "int" (the "int" can > > > be part of a structure that then has pointers to other structures > > > etc). > > > > So, let me try to poke holes into your definition or increase my > > understanding :) . You said "chain of pointers"(dereferences I assume), > > e.g. if p is result of consume load, then access to > > p->here->there->next->prev->stuff is supposed to be ordered with that load > > (or only when that last load/store itself is also an atomic load or > > store?). 
> > > > So, what happens if the pointer deref chain is partly hidden in some > > functions: > > > > A * adjustptr (B *ptr) { return &ptr->here->there->next; } > > B * p = atomic_XXX (&somewhere, consume); > > adjustptr(p)->prev->stuff = bla; > > > > As far as I understood you, this whole ptrderef chain business would be > > only an optimization opportunity, right? So if the compiler can't be sure > > how p is actually used (as in my function-using case, assume adjustptr is > > defined in another unit), then the consume load would simply be > > transformed into an acquire (or whatever, with some barrier I mean)? Only > > _if_ the compiler sees all obvious uses of p (indirectly through pointer > > derefs) can it, yeah, do what with the consume load? > > Good point, I left that out of my list. Adding it: > > 13. By default, pointer chains do not propagate into or out of functions. > In implementations having attributes, a [[carries_dependency]] > may be used to mark a function argument or return as passing > a pointer chain into or out of that function. > > If a function does not contain memory_order_consume loads and > also does not contain [[carries_dependency]] attributes, then > that function may be compiled using any desired dependency-breaking > optimizations. > > The ordering effects are implementation defined when a given > pointer chain passes into or out of a function through a parameter > or return not marked with a [[carries_dependency]] attributed. > > Note that this last paragraph differs from the current standard, which > would require ordering regardless. And there is also kill_dependency(), which needs to be added to the list in #8 of operators that take a chained pointer and return something that is not a chained pointer. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 17:28 ` Paul E. McKenney
  2014-02-24 17:57   ` Paul E. McKenney
@ 2014-02-26 17:39   ` Torvald Riegel
  1 sibling, 0 replies; 285+ messages in thread
From: Torvald Riegel @ 2014-02-26 17:39 UTC (permalink / raw)
To: paulmck
Cc: Michael Matz, Linus Torvalds, Richard Biener, Will Deacon,
    Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch,
    linux-kernel, akpm, mingo, gcc

On Mon, 2014-02-24 at 09:28 -0800, Paul E. McKenney wrote:
> On Mon, Feb 24, 2014 at 05:55:50PM +0100, Michael Matz wrote:
> > Hi,
> >
> > On Mon, 24 Feb 2014, Linus Torvalds wrote:
> > > > To me that reads like
> > > >
> > > >   int i;
> > > >   int *q = &i;
> > > >   int **p = &q;
> > > >
> > > >   atomic_XXX (p, CONSUME);
> > > >
> > > > orders against accesses '*p', '**p', '*q' and 'i'.  Thus it seems they
> > > > want to say that it orders against aliased storage - but then go further
> > > > and include "indirectly through a chain of pointers"?!  Thus an
> > > > atomic read of a int * orders against any 'int' memory operation but
> > > > not against 'float' memory operations?
> > >
> > > No, it's not about type at all, and the "chain of pointers" can be
> > > much more complex than that, since the "int *" can point to within an
> > > object that contains other things than just that "int" (the "int" can
> > > be part of a structure that then has pointers to other structures
> > > etc).
> >
> > So, let me try to poke holes into your definition or increase my
> > understanding :) .  You said "chain of pointers" (dereferences I assume),
> > e.g. if p is result of consume load, then access to
> > p->here->there->next->prev->stuff is supposed to be ordered with that
> > load (or only when that last load/store itself is also an atomic load
> > or store?).
> >
> > So, what happens if the pointer deref chain is partly hidden in some
> > functions:
> >
> >   A * adjustptr (B *ptr) { return &ptr->here->there->next; }
> >   B * p = atomic_XXX (&somewhere, consume);
> >   adjustptr(p)->prev->stuff = bla;
> >
> > As far as I understood you, this whole ptrderef chain business would be
> > only an optimization opportunity, right?  So if the compiler can't be
> > sure how p is actually used (as in my function-using case, assume
> > adjustptr is defined in another unit), then the consume load would
> > simply be transformed into an acquire (or whatever, with some barrier
> > I mean)?  Only _if_ the compiler sees all obvious uses of p (indirectly
> > through pointer derefs) can it, yeah, do what with the consume load?
>
> Good point, I left that out of my list.  Adding it:
>
> 13.	By default, pointer chains do not propagate into or out of
>	functions.  In implementations having attributes, a
>	[[carries_dependency]] attribute may be used to mark a function
>	argument or return as passing a pointer chain into or out of
>	that function.
>
>	If a function does not contain memory_order_consume loads and
>	also does not contain [[carries_dependency]] attributes, then
>	that function may be compiled using any desired
>	dependency-breaking optimizations.
>
>	The ordering effects are implementation defined when a given
>	pointer chain passes into or out of a function through a
>	parameter or return not marked with a [[carries_dependency]]
>	attribute.
>
> Note that this last paragraph differs from the current standard, which
> would require ordering regardless.

I would prefer if we could get rid of [[carries_dependency]] as well;
currently, it's a hint whose effectiveness really depends on how the
particular implementation handles this attribute.  If we still need
something like it in the future, it would be good if it had a clearer
use and performance effects.

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 16:55 ` Michael Matz
  2014-02-24 17:28   ` Paul E. McKenney
@ 2014-02-24 17:38   ` Linus Torvalds
  2014-02-24 18:12     ` Paul E. McKenney
  2014-02-26 17:34     ` Torvald Riegel
  1 sibling, 2 replies; 285+ messages in thread
From: Linus Torvalds @ 2014-02-24 17:38 UTC (permalink / raw)
To: Michael Matz
Cc: Richard Biener, Paul McKenney, Torvald Riegel, Will Deacon,
    Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch,
    linux-kernel, akpm, mingo, gcc

On Mon, Feb 24, 2014 at 8:55 AM, Michael Matz <matz@suse.de> wrote:
>
> So, let me try to poke holes into your definition or increase my
> understanding :) .  You said "chain of pointers" (dereferences I assume),
> e.g. if p is result of consume load, then access to
> p->here->there->next->prev->stuff is supposed to be ordered with that
> load (or only when that last load/store itself is also an atomic load
> or store?).

It's supposed to be ordered wrt the first load (the consuming one), yes.

> So, what happens if the pointer deref chain is partly hidden in some
> functions:

No problem.

The thing is, the ordering is actually handled by the CPU in all
relevant cases.  So the compiler doesn't actually need to *do*
anything.  All this legalistic stuff is just to describe the semantics
and the guarantees.

The problem is two cases:

 (a) alpha (which doesn't really order any accesses at all, not even
dependent loads), but for a compiler alpha is actually trivial: just
add a "rmb" instruction after the load, and you can't really do
anything else (there's a few optimizations you can do wrt the rmb, but
they are very specific and simple).

So (a) is a "problem", but the solution is actually really simple, and
gives very *strong* guarantees: on alpha, a "consume" ends up being
basically the same as a read barrier after the load, with only very
minimal room for optimization.
 (b) ARM and powerpc and similar architectures, that guarantee the
data dependency as long as it is an *actual* data dependency, and
never becomes a control dependency.

On ARM and powerpc, control dependencies do *not* order accesses (the
reasons boil down to essentially: branch prediction breaks the
dependency, and instructions that come after the branch can be happily
executed before the branch).  But it's almost impossible to describe
that in the standard, since compilers can (and very much do) turn a
control dependency into a data dependency and vice versa.

So the current standard tries to describe that "control vs data"
dependency, and tries to limit it to a data dependency.  It fails.  It
fails for multiple reasons - it doesn't allow for trivial
optimizations that just remove the data dependency, and it also
doesn't allow for various trivial cases where the compiler *does* turn
the data dependency into a control dependency.

So I really really think that the current C standard language is
broken.  Unfixably broken.

I'm trying to change the "syntactic data dependency" that the current
standard uses into something that is clearer and correct.

The "chain of pointers" thing is still obviously a data dependency,
but by limiting it to pointers, it simplifies the language, clarifies
the meaning, avoids all syntactic tricks (ie "p-p" is clearly a
syntactic dependency on "p", but does *not* involve in any way
following the pointer) and makes it basically impossible for the
compiler to break the dependency without doing value prediction, and
since value prediction has to be disallowed anyway, that's a feature,
not a bug.

               Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 17:38 ` Linus Torvalds
@ 2014-02-24 18:12   ` Paul E. McKenney
  2014-02-26 17:34   ` Torvald Riegel
  1 sibling, 0 replies; 285+ messages in thread
From: Paul E. McKenney @ 2014-02-24 18:12 UTC (permalink / raw)
To: Linus Torvalds
Cc: Michael Matz, Richard Biener, Torvald Riegel, Will Deacon,
    Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch,
    linux-kernel, akpm, mingo, gcc

On Mon, Feb 24, 2014 at 09:38:46AM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 8:55 AM, Michael Matz <matz@suse.de> wrote:
> >
> > So, let me try to poke holes into your definition or increase my
> > understanding :) .  You said "chain of pointers" (dereferences I assume),
> > e.g. if p is result of consume load, then access to
> > p->here->there->next->prev->stuff is supposed to be ordered with that
> > load (or only when that last load/store itself is also an atomic load
> > or store?).
>
> It's supposed to be ordered wrt the first load (the consuming one), yes.
>
> > So, what happens if the pointer deref chain is partly hidden in some
> > functions:
>
> No problem.
>
> The thing is, the ordering is actually handled by the CPU in all
> relevant cases.  So the compiler doesn't actually need to *do*
> anything.  All this legalistic stuff is just to describe the semantics
> and the guarantees.
>
> The problem is two cases:
>
>  (a) alpha (which doesn't really order any accesses at all, not even
> dependent loads), but for a compiler alpha is actually trivial: just
> add a "rmb" instruction after the load, and you can't really do
> anything else (there's a few optimizations you can do wrt the rmb, but
> they are very specific and simple).
>
> So (a) is a "problem", but the solution is actually really simple, and
> gives very *strong* guarantees: on alpha, a "consume" ends up being
> basically the same as a read barrier after the load, with only very
> minimal room for optimization.
>
>  (b) ARM and powerpc and similar architectures, that guarantee the
> data dependency as long as it is an *actual* data dependency, and
> never becomes a control dependency.
>
> On ARM and powerpc, control dependencies do *not* order accesses (the
> reasons boil down to essentially: branch prediction breaks the
> dependency, and instructions that come after the branch can be happily
> executed before the branch).  But it's almost impossible to describe
> that in the standard, since compilers can (and very much do) turn a
> control dependency into a data dependency and vice versa.
>
> So the current standard tries to describe that "control vs data"
> dependency, and tries to limit it to a data dependency.  It fails.  It
> fails for multiple reasons - it doesn't allow for trivial
> optimizations that just remove the data dependency, and it also
> doesn't allow for various trivial cases where the compiler *does* turn
> the data dependency into a control dependency.
>
> So I really really think that the current C standard language is
> broken.  Unfixably broken.
>
> I'm trying to change the "syntactic data dependency" that the current
> standard uses into something that is clearer and correct.
>
> The "chain of pointers" thing is still obviously a data dependency,
> but by limiting it to pointers, it simplifies the language, clarifies
> the meaning, avoids all syntactic tricks (ie "p-p" is clearly a
> syntactic dependency on "p", but does *not* involve in any way
> following the pointer) and makes it basically impossible for the
> compiler to break the dependency without doing value prediction, and
> since value prediction has to be disallowed anyway, that's a feature,
> not a bug.

OK, good point, please ignore my added thirteenth item in the list.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 17:38 ` Linus Torvalds
  2014-02-24 18:12   ` Paul E. McKenney
@ 2014-02-26 17:34   ` Torvald Riegel
  1 sibling, 0 replies; 285+ messages in thread
From: Torvald Riegel @ 2014-02-26 17:34 UTC (permalink / raw)
To: Linus Torvalds
Cc: Michael Matz, Richard Biener, Paul McKenney, Will Deacon,
    Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch,
    linux-kernel, akpm, mingo, gcc

On Mon, 2014-02-24 at 09:38 -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 8:55 AM, Michael Matz <matz@suse.de> wrote:
> >
> > So, let me try to poke holes into your definition or increase my
> > understanding :) .  You said "chain of pointers" (dereferences I assume),
> > e.g. if p is result of consume load, then access to
> > p->here->there->next->prev->stuff is supposed to be ordered with that
> > load (or only when that last load/store itself is also an atomic load
> > or store?).
>
> It's supposed to be ordered wrt the first load (the consuming one), yes.
>
> > So, what happens if the pointer deref chain is partly hidden in some
> > functions:
>
> No problem.
>
> The thing is, the ordering is actually handled by the CPU in all
> relevant cases.  So the compiler doesn't actually need to *do*
> anything.  All this legalistic stuff is just to describe the semantics
> and the guarantees.
>
> The problem is two cases:
>
>  (a) alpha (which doesn't really order any accesses at all, not even
> dependent loads), but for a compiler alpha is actually trivial: just
> add a "rmb" instruction after the load, and you can't really do
> anything else (there's a few optimizations you can do wrt the rmb, but
> they are very specific and simple).
>
> So (a) is a "problem", but the solution is actually really simple, and
> gives very *strong* guarantees: on alpha, a "consume" ends up being
> basically the same as a read barrier after the load, with only very
> minimal room for optimization.
>
>  (b) ARM and powerpc and similar architectures, that guarantee the
> data dependency as long as it is an *actual* data dependency, and
> never becomes a control dependency.
>
> On ARM and powerpc, control dependencies do *not* order accesses (the
> reasons boil down to essentially: branch prediction breaks the
> dependency, and instructions that come after the branch can be happily
> executed before the branch).  But it's almost impossible to describe
> that in the standard, since compilers can (and very much do) turn a
> control dependency into a data dependency and vice versa.
>
> So the current standard tries to describe that "control vs data"
> dependency, and tries to limit it to a data dependency.  It fails.  It
> fails for multiple reasons - it doesn't allow for trivial
> optimizations that just remove the data dependency, and it also
> doesn't allow for various trivial cases where the compiler *does* turn
> the data dependency into a control dependency.
>
> So I really really think that the current C standard language is
> broken.  Unfixably broken.
>
> I'm trying to change the "syntactic data dependency" that the current
> standard uses into something that is clearer and correct.
>
> The "chain of pointers" thing is still obviously a data dependency,
> but by limiting it to pointers, it simplifies the language, clarifies
> the meaning, avoids all syntactic tricks (ie "p-p" is clearly a
> syntactic dependency on "p", but does *not* involve in any way
> following the pointer) and makes it basically impossible for the
> compiler to break the dependency without doing value prediction, and
> since value prediction has to be disallowed anyway, that's a feature,
> not a bug.

AFAIU, Michael is wondering about how we can separate non-synchronizing
code (ie, in this case, not taking part in any "chain of pointers" used
with mo_consume loads) from code that does.
If we cannot, then we prevent value prediction *everywhere*, unless the
compiler can prove that the code is never part of such a chain (which is
hard due to alias analysis being hard, etc.).

(We can probably argue to which extent value prediction is necessary for
generation of efficient code, but it obviously does work in
non-synchronizing code (or even with acquire barriers with some care) --
so forbidding it entirely might be bad.)

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 15:57 ` Linus Torvalds
  2014-02-24 16:27   ` Richard Biener
@ 2014-02-24 17:21   ` Paul E. McKenney
  2014-02-24 18:14     ` Linus Torvalds
  1 sibling, 1 reply; 285+ messages in thread
From: Paul E. McKenney @ 2014-02-24 17:21 UTC (permalink / raw)
To: Linus Torvalds
Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
    David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, Feb 24, 2014 at 07:57:24AM -0800, Linus Torvalds wrote:
> On Sun, Feb 23, 2014 at 11:31 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Let me think about it some more, but my gut feel is that just tweaking
> > the definition of what "ordered" means is sufficient.
> >
> > So to go back to the suggested ordering rules (ignoring the "restrict"
> > part, which is just to clarify that ordering through other means to
> > get to the object doesn't matter), I suggested:
> >
> >  "the consume ordering guarantees the ordering between that
> >   atomic read and the accesses to the object that the pointer
> >   points to"
> >
> > and I think the solution is to just say that this ordering acts as a
> > fence.  It doesn't say exactly *where* the fence is, but it says that
> > there is *some* fence between the load of the pointer and any/all
> > accesses to the object through that pointer.
>
> I'm wrong.  That doesn't work.  At all.  There is no ordering except
> through the pointer chain.
>
> So I think saying just that, and nothing else (no magic fences, no
> nothing) is the right thing:
>
>  "the consume ordering guarantees the ordering between that
>   atomic read and the accesses to the object that the pointer
>   points to directly or indirectly through a chain of pointers"
>
> The thing is, anything but a chain of pointers (and maybe relaxing it
> to "indexes in tables" in addition to pointers) doesn't really work.
>
> The current standard tries to break it at "obvious" points that can
> lose the data dependency (either by turning it into a control
> dependency, or by just dropping the value, like the left-hand side of
> a comma-expression), but the fact is, it's broken.
>
> It's broken not just because the value can be lost other ways (ie the
> "p-p" example), it's broken because the value can be turned into a
> control dependency so many other ways too.
>
> Compilers regularly turn arithmetic ops with logical comparisons into
> branches.  So an expression like "a = !!ptr" carries a dependency in
> the current C standard, but it's entirely possible that a compiler
> ends up turning it into a compare-and-branch rather than a
> compare-and-set-conditional, depending on just exactly how "a" ends up
> being used.  That's true even on an architecture like ARM that has a
> lot of conditional instructions (there are way less if you compile for
> Thumb, for example, but compilers also do things like "if there are
> more than N predicated instructions I'll just turn it into a
> branch-over instead").
>
> So I think the C standard needs to just explicitly say that you can
> walk a chain of pointers (with that possible "indexes in arrays"
> extension), and nothing more.

I am comfortable with this.  My desire for also marking the later
pointers does not make sense without some automated way of validating
them, which I don't immediately see a way to do.

So let me try laying out the details.  Sticking with pointers for the
moment, if we reach agreement on these, I will try expanding to
integers.

1.	A pointer value obtained from a memory_order_consume load is
	part of a pointer chain.  I am calling the pointer itself a
	"chained pointer" for the moment.

2.	Note that it is the value that qualifies as being chained, not
	the variable.  For example, a given pointer variable might hold
	a chained pointer at one point in the code, then a non-chained
	pointer later.
	Therefore, "q = p", where "q" is a pointer and "p" is a chained
	pointer results in "q" containing a chained pointer.

3.	Adding or subtracting an integer to/from a chained pointer
	results in another chained pointer in that same pointer chain.

4.	Bitwise operators ("&", "|", "^", and I suppose also "~")
	applied to a chained pointer and an integer results in another
	chained pointer in that same pointer chain.

5.	Consider a sequence as follows: dereference operator (unary
	"*", "[]", "->") optionally followed by a series of direct
	selection operators ("."), finally (and unconditionally)
	followed by a unary "&" operator.  Applying such a sequence to
	a chained pointer results in another chained pointer in the
	same chain.  Given a chained pointer "p", examples include
	"&p[3]", "&p->a.b.c.d.e.f.g", and "&*p".

6.	The expression "p->f", where "p" is a chained pointer and "f"
	is a pointer, results in a chained pointer.  FWIW, this means
	that pointer chains can overlap as in this example:

		p = atomic_load_explicit(&gp, memory_order_consume);
		q = atomic_load_explicit(&p->ap, memory_order_consume);
		x = q->a;

	This should be fine, I don't see any problems with this.

7.	Applying a pointer cast to a chained pointer results in a
	chained pointer.

8.	Applying any of the following operators to a chained pointer
	results in something that is not a chained pointer: "()",
	sizeof, "!", "*", "/", "%", ">>", "<<", "<", ">", "<=", ">=",
	"==", "!=", "&&", and "||".

9.	The effect of the compound assignment operators "+=", "-=",
	and so on is the same as the equivalent expression using
	simple assignment.

10.	In a "?:" operator, if the selected one of the rightmost two
	values is a chained pointer, then the result is also a chained
	pointer.

11.	In a "," operator, if the rightmost value is a chained pointer,
	then the result is also a chained pointer.

12.	A memory_order_consume load carries a dependency to any
	dereference operator (unary "*", "[]", and "->") in the
	resulting pointer chain.
I think that covers everything...

							Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 17:21 ` Paul E. McKenney
@ 2014-02-24 18:14   ` Linus Torvalds
  2014-02-24 18:53     ` Paul E. McKenney
  0 siblings, 1 reply; 285+ messages in thread
From: Linus Torvalds @ 2014-02-24 18:14 UTC (permalink / raw)
To: Paul McKenney
Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
    David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, Feb 24, 2014 at 9:21 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> 4.	Bitwise operators ("&", "|", "^", and I suppose also "~")
>	applied to a chained pointer and an integer results in another
>	chained pointer in that same pointer chain.

No.  You cannot define it this way.  Taking the value of a pointer and
doing a bitwise operation that throws away all the bits (or even *most*
of the bits) results in the compiler easily being able to turn the
"chain" into a non-chain.

The obvious example being "val & 0", but things like "val & 1" are in
practice also something that compilers easily turn into control
dependencies instead of data dependencies.

So you can talk about things like "aligning the pointer value to object
boundaries" etc, but it really cannot and *must* not be about the
syntactic operations.

The same goes for "adding and subtracting an integer".  The *syntax*
doesn't matter.  It's about remaining information.  Doing "p-(int)p" or
"p+(-(int)p)" doesn't leave any information despite being "subtracting
and adding an integer" at a syntactic level.

Syntax is meaningless.  Really.

> 8.	Applying any of the following operators to a chained pointer
>	results in something that is not a chained pointer:
>	"()", sizeof, "!", "*", "/", "%", ">>", "<<", "<", ">", "<=",
>	">=", "==", "!=", "&&", and "||".

Parenthesis?  I'm assuming that you mean calling through the chained
pointer.

Also, I think all of /, * and % are perfectly fine, and might be used
for that "aligning the pointer" operation that is fine.
               Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 18:14 ` Linus Torvalds
@ 2014-02-24 18:53   ` Paul E. McKenney
  2014-02-24 19:54     ` Linus Torvalds
  0 siblings, 1 reply; 285+ messages in thread
From: Paul E. McKenney @ 2014-02-24 18:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
    David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, Feb 24, 2014 at 10:14:01AM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 9:21 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > 4.	Bitwise operators ("&", "|", "^", and I suppose also "~")
> >	applied to a chained pointer and an integer results in another
> >	chained pointer in that same pointer chain.
>
> No.  You cannot define it this way.  Taking the value of a pointer and
> doing a bitwise operation that throws away all the bits (or even
> *most* of the bits) results in the compiler easily being able to turn
> the "chain" into a non-chain.
>
> The obvious example being "val & 0", but things like "val & 1" are in
> practice also something that compilers easily turn into control
> dependencies instead of data dependencies.

Indeed, most of the bits need to remain for this to work.

> So you can talk about things like "aligning the pointer value to
> object boundaries" etc, but it really cannot and *must* not be about
> the syntactic operations.
>
> The same goes for "adding and subtracting an integer".  The *syntax*
> doesn't matter.  It's about remaining information.  Doing "p-(int)p" or
> "p+(-(int)p)" doesn't leave any information despite being "subtracting
> and adding an integer" at a syntactic level.
>
> Syntax is meaningless.  Really.

Good points.  How about the following replacements?

3.	Adding or subtracting an integer to/from a chained pointer
	results in another chained pointer in that same pointer chain.
	The results of addition and subtraction operations that cancel
	the chained pointer's value (for example, "p-(long)p" where "p"
	is a pointer to char) are implementation defined.

4.	Bitwise operators ("&", "|", "^", and I suppose also "~")
	applied to a chained pointer and an integer for the purposes
	of alignment and pointer translation results in another
	chained pointer in that same pointer chain.  Other uses
	of bitwise operators on chained pointers (for example,
	"p|~0") are implementation defined.

> > 8.	Applying any of the following operators to a chained pointer
> >	results in something that is not a chained pointer:
> >	"()", sizeof, "!", "*", "/", "%", ">>", "<<", "<", ">", "<=",
> >	">=", "==", "!=", "&&", and "||".
>
> Parenthesis?  I'm assuming that you mean calling through the chained
> pointer.

Yes, good point.  Of course, parentheses for grouping just pass the
value through without affecting the chained-ness.

> Also, I think all of /, * and % are perfectly fine, and might be used
> for that "aligning the pointer" operation that is fine.

Something like this?

	char *p;

	p = p - (unsigned long)p % 8;

I was thinking of this as subtraction -- the "p" gets moduloed by 8,
which loses the chained-pointer designation.  But that is OK because
that designation gets folded back in by the subtraction.  Am I missing
a use case?

That leaves things like this one:

	p = (p / 8) * 8;

I cannot think of any other legitimate use for "/" and "*".  Here is an
updated #8 and a new 8a:

8.	Applying any of the following operators to a chained pointer
	results in something that is not a chained pointer: function
	call "()", sizeof, "!", "%", ">>", "<<", "<", ">", "<=", ">=",
	"==", "!=", "&&", "||", and "kill_dependency()".

8a.	Dividing a chained pointer by an integer and multiplying it
	by that same integer (for example, to align that pointer)
	results in a chained pointer in that same pointer chain.
	The ordering effects of other uses of infix "*" and "/" on
	chained pointers are implementation defined.

Does that capture it?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 18:53 ` Paul E. McKenney
@ 2014-02-24 19:54   ` Linus Torvalds
  2014-02-24 22:37     ` Paul E. McKenney
  2014-02-27 15:37     ` Torvald Riegel
  0 siblings, 2 replies; 285+ messages in thread
From: Linus Torvalds @ 2014-02-24 19:54 UTC (permalink / raw)
To: Paul McKenney
Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
    David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> Good points.  How about the following replacements?
>
> 3.	Adding or subtracting an integer to/from a chained pointer
>	results in another chained pointer in that same pointer chain.
>	The results of addition and subtraction operations that cancel
>	the chained pointer's value (for example, "p-(long)p" where "p"
>	is a pointer to char) are implementation defined.
>
> 4.	Bitwise operators ("&", "|", "^", and I suppose also "~")
>	applied to a chained pointer and an integer for the purposes
>	of alignment and pointer translation results in another
>	chained pointer in that same pointer chain.  Other uses
>	of bitwise operators on chained pointers (for example,
>	"p|~0") are implementation defined.

Quite frankly, I think all of this language that is about the actual
operations is irrelevant and wrong.

It's not going to help compiler writers, and it sure isn't going to
help users that read this.

Why not just talk about "value chains" and that any operations that
restrict the value range severely end up breaking the chain.  There is
no point in listing the operations individually, because every single
operation *can* restrict things.  Listing individual operations and
dependencies is just fundamentally wrong.

For example, let's look at this obvious case:

	int q,*p = atomic_read(&pp, consume);
	.. nothing modifies 'p' ..
Wrong. What if the "nothing modifies 'p'" part looks like this: if (p != &myvariable) return; and now any sane compiler will happily optimize "q = *p" into "q = myvariable", and we're all done - nothing invalid was ever So my earlier suggestion tried to avoid this by having the restrict thing, so the above wouldn't work. But your (and the current C standards) attempt to define this with some kind of syntactic dependency carrying chain will _inevitably_ get this wrong, and/or be too horribly complex to actually be useful. Seriously, don't do it. I claim that all your attempts to do this crazy syntactic "these operations maintain the chained pointers" is broken. The fact that you have changed "carries a dependency" to "chained pointer" changes NOTHING. So just give it up. It's a fundamentally broken model. It's *wrong*, but even more importantly, it's not even *useful*, since it ends up being too complicated for a compiler writer or a programmer to understand. I really really really think you need to do this at a higher conceptual level, get away from all these idiotic "these operations maintain the chain" crap. Because there *is* no such list. Quite frankly, any standards text that has that "[[carries_dependency]]" or "[[kill_dependency]]" or whatever attribute is broken. It's broken because the whole concept is TOTALLY ALIEN to the compiler writer or the programmer. It makes no sense. It's purely legalistic language that has zero reason to exist. It's non-intuitive for everybody. And *any* language that talks about the individual operations only encourages people to play legalistic games that actually defeat the whole purpose (namely that all relevant CPU's are going to implement that consume ordering guarantee natively, with no extra code generation rules AT ALL). So any time you talk about some random detail of some operation, somebody is going to come up with a "trick" that defeats things. So don't do it. 
There is absolutely ZERO difference between any of the arithmetic
operations, be they bitwise, additive, multiplicative, shifts,
whatever.

The *only* thing that matters for all of them is whether they are
"value-preserving", or whether they drop so much information that the
compiler might decide to use a control dependency instead.  That's true
for every single one of them.

Similarly, actual true control dependencies that limit the problem
space sufficiently that the actual pointer value no longer has
significant information in it (see the above example) are also things
that remove information to the point that only a control dependency
remains.  Even when the value itself is not modified in any way at all.

               Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 19:54 ` Linus Torvalds
@ 2014-02-24 22:37 ` Paul E. McKenney
  2014-02-24 23:35   ` Linus Torvalds
  2014-02-27 15:37   ` Torvald Riegel
  1 sibling, 1 reply; 285+ messages in thread
From: Paul E. McKenney @ 2014-02-24 22:37 UTC (permalink / raw)
To: Linus Torvalds
Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, Feb 24, 2014 at 11:54:46AM -0800, Linus Torvalds wrote:
> On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > Good points. How about the following replacements?
> >
> > 3. Adding or subtracting an integer to/from a chained pointer
> >    results in another chained pointer in that same pointer chain.
> >    The results of addition and subtraction operations that cancel
> >    the chained pointer's value (for example, "p-(long)p" where "p"
> >    is a pointer to char) are implementation defined.
> >
> > 4. Bitwise operators ("&", "|", "^", and I suppose also "~")
> >    applied to a chained pointer and an integer for the purposes
> >    of alignment and pointer translation results in another
> >    chained pointer in that same pointer chain. Other uses
> >    of bitwise operators on chained pointers (for example,
> >    "p|~0") are implementation defined.
>
> Quite frankly, I think all of this language that is about the actual
> operations is irrelevant and wrong.
>
> It's not going to help compiler writers, and it sure isn't going to
> help users that read this.
>
> Why not just talk about "value chains" and that any operations that
> restrict the value range severely end up breaking the chain. There is
> no point in listing the operations individually, because every single
> operation *can* restrict things. Listing individual operations and
> dependencies is just fundamentally wrong.
>
> For example, let's look at this obvious case:
>
>     int q,*p = atomic_read(&pp, consume);
>     .. nothing modifies 'p' ..
>     q = *p;
>
> and there are literally *zero* operations that modify the value
> change, so obviously the two operations are ordered, right?
>
> Wrong.
>
> What if the "nothing modifies 'p'" part looks like this:
>
>     if (p != &myvariable)
>         return;
>
> and now any sane compiler will happily optimize "q = *p" into "q =
> myvariable", and we're all done - nothing invalid was ever

Yes, the compiler could do that. But it would still be required to
carry a dependency from the memory_order_consume read to the "*p",
which it could do by compiling "q = *p" rather than "q = myvariable"
on the one hand or by emitting a memory-barrier instruction on the
other. This was the point of #12:

12. A memory_order_consume load carries a dependency to any
    dereference operator (unary "*", "[]", and "->") in the resulting
    pointer chain.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-24 22:37 ` Paul E. McKenney
@ 2014-02-24 23:35   ` Linus Torvalds
  2014-02-25  6:00     ` Paul E. McKenney
  2014-02-25  6:05     ` Linus Torvalds
  0 siblings, 2 replies; 285+ messages in thread
From: Linus Torvalds @ 2014-02-24 23:35 UTC (permalink / raw)
To: Paul McKenney
Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, Feb 24, 2014 at 2:37 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>>
>> What if the "nothing modifies 'p'" part looks like this:
>>
>>     if (p != &myvariable)
>>         return;
>>
>> and now any sane compiler will happily optimize "q = *p" into "q =
>> myvariable", and we're all done - nothing invalid was ever
>
> Yes, the compiler could do that. But it would still be required to
> carry a dependency from the memory_order_consume read to the "*p",

But that's *BS*. You didn't actually listen to the main issue.

Paul, why do you insist on this carries-a-dependency crap?

It's broken. If you don't believe me, then believe the compiler person
who already piped up and told you so.

The "carries a dependency" model is broken. Get over it.

No sane compiler will ever distinguish two different registers that
have the same value from each other. No sane compiler will ever say
"ok, register r1 has the exact same value as register r2, but r2
carries the dependency, so I need to make sure to pass r2 to that
function or use it as a base pointer".

And nobody sane should *expect* a compiler to distinguish two
registers with the same value that way.

So the whole model is broken.

I gave an alternate model (the "restrict"), and you didn't seem to
understand the really fundamental difference. It's not a language
difference, it's a conceptual difference.

In the broken "carries a dependency" model, you have to fight all
those aliases that can have the same value, and it is not a fight you
can win.
We've had the "p-p" examples, we've had the "p&0" examples, but the
fact is, that "p==&myvariable" example IS EXACTLY THE SAME THING.

All three of those things: "p-p", "p&0", and "p==&myvariable" mean
that any compiler worth its salt now knows that "p" carries no
information, and will optimize it away.

So please stop arguing against that. Whenever you argue against that
simple fact, you are arguing against sane compilers.

So *accept* the fact that some operations (and I guarantee that there
are more of those than you can think of, and you can create them with
various tricks using pretty much *any* feature in the C language)
essentially take the data information away. And just accept the fact
that then the ordering goes away too.

So give up on "carries a dependency". Because there will be cases
where that dependency *isn't* carried.

The language of the standard needs to get *away* from the broken
model, because otherwise the standard is broken.

I suggest we instead talk about "litmus tests" and why certain code
sequences are ordered, and others are not.

So the code sequence I already mentioned is *not* ordered:

Litmus test 1:

    p = atomic_read(pp, consume);
    if (p == &variable)
        return p->val;

is *NOT* ordered, because the compiler can trivially turn this into
"return variable.val", and break the data dependency.

This is true *regardless* of any "carries a dependency" language,
because that language is insane, and doesn't work when the different
pieces here may be in different compilation units.

BUT:

Litmus test 2:

    p = atomic_read(pp, consume);
    if (p != &variable)
        return p->val;

*IS* ordered, because while it looks syntactically a lot like "Litmus
test 1", there is no sane way a compiler can use the knowledge that "p
is not a pointer to a particular location" to break the data
dependency.

There is no way in hell that any sane "carries a dependency" model can
get the simple tests above right.

So give up on it already. "Carries a dependency" cannot work. It's a
bad model.
You're trying to describe the problem using the wrong tools. Note that my "restrict+pointer to object" language actually got this *right*. The "restrict" part made Litmus test 1 not ordered, because that "p == &variable" success case means that the pointer wasn't restricted, so the pre-requisite for ordering didn't exist. See? The "carries a dependency" is a broken model for this, but there are _other_ models that can work. You tried to rewrite my model into "carries a dependency". That *CANNOT* work. It's like trying to rewrite quantum physics into the Greek model of the four elements. They are not compatible models, and one of them can be shown to not work. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-24 23:35 ` Linus Torvalds @ 2014-02-25 6:00 ` Paul E. McKenney 2014-02-26 1:47 ` Linus Torvalds 2014-02-25 6:05 ` Linus Torvalds 1 sibling, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-25 6:00 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 24, 2014 at 03:35:04PM -0800, Linus Torvalds wrote: > On Mon, Feb 24, 2014 at 2:37 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > >> > >> What if the "nothing modifies 'p'" part looks like this: > >> > >> if (p != &myvariable) > >> return; > >> > >> and now any sane compiler will happily optimize "q = *p" into "q = > >> myvariable", and we're all done - nothing invalid was ever > > > > Yes, the compiler could do that. But it would still be required to > > carry a dependency from the memory_order_consume read to the "*p", > > But that's *BS*. You didn't actually listen to the main issue. > > Paul, why do you insist on this carries-a-dependency crap? Sigh. Read on... > It's broken. If you don't believe me, then believe the compiler person > who already piped up and told you so. > > The "carries a dependency" model is broken. Get over it. > > No sane compiler will ever distinguish two different registers that > have the same value from each other. No sane compiler will ever say > "ok, register r1 has the exact same value as register r2, but r2 > carries the dependency, so I need to make sure to pass r2 to that > function or use it as a base pointer". > > And nobody sane should *expect* a compiler to distinguish two > registers with the same value that way. > > So the whole model is broken. > > I gave an alternate model (the "restrict"), and you didn't seem to > understand the really fundamental difference. It's not a language > difference, it's a conceptual difference. 
>
> In the broken "carries a dependency" model, you have to fight all
> those aliases that can have the same value, and it is not a fight you
> can win. We've had the "p-p" examples, we've had the "p&0" examples,
> but the fact is, that "p==&myvariable" example IS EXACTLY THE SAME
> THING.
>
> All three of those things: "p-p", "p&0", and "p==&myvariable" mean
> that any compiler worth its salt now knows that "p" carries no
> information, and will optimize it away.
>
> So please stop arguing against that. Whenever you argue against that
> simple fact, you are arguing against sane compilers.

So let me see if I understand your reasoning. My best guess is that it
goes something like this:

1. The Linux kernel contains code that passes pointers from
   rcu_dereference() through external functions.

2. Code in the Linux kernel expects the normal RCU ordering guarantees
   to be in effect even when external functions are involved.

3. When compiling one of these external functions, the C compiler has
   no way of knowing about these RCU ordering guarantees.

4. The C compiler might therefore apply any and all optimizations to
   these external functions.

5. This in turn implies that the only way to prohibit any given
   optimization from being applied to the results obtained from
   rcu_dereference() is to prohibit that optimization globally.

6. We have to be very careful what optimizations are globally
   prohibited, because a poor choice could result in unacceptable
   performance degradation.

7. Therefore, the only operations that can be counted on to maintain
   the needed RCU orderings are those where the compiler really
   doesn't have any choice, in other words, where any reasonable way
   of computing the result will necessarily maintain the needed
   ordering.

Did I get this right, or am I confused?
> So *accept* the fact that some operations (and I guarantee that there > are more of those than you can think of, and you can create them with > various tricks using pretty much *any* feature in the C language) > essentially take the data information away. And just accept the fact > that then the ordering goes away too. Actually, the fact that there are more potential optimizations than I can think of is a big reason for my insistence on the carries-a-dependency crap. My lack of optimization omniscience makes me very nervous about relying on there never ever being a reasonable way of computing a given result without preserving the ordering. > So give up on "carries a dependency". Because there will be cases > where that dependency *isn't* carried. > > The language of the standard needs to get *away* from the broken > model, because otherwise the standard is broken. > > I suggest we instead talk about "litmus tests" and why certain code > sequences are ordered, and others are not. OK... > So the code sequence I already mentioned is *not* ordered: > > Litmus test 1: > > p = atomic_read(pp, consume); > if (p == &variable) > return p->val; > > is *NOT* ordered, because the compiler can trivially turn this into > "return variable.val", and break the data dependency. Right, given your model, the compiler is free to produce code that doesn't order the load from pp against the load from p->val. > This is true *regardless* of any "carries a dependency" language, > because that language is insane, and doesn't work when the different > pieces here may be in different compilation units. Indeed, it won't work across different compilation units unless the compiler is told about it, which is of course the whole point of [[carries_dependency]]. Understood, though, the Linux kernel currently does not have anything that could reasonably automatically generate those [[carries_dependency]] attributes. (Or are there other reasons why you believe [[carries_dependency]] is problematic?) 
> BUT:
>
> Litmus test 2:
>
>     p = atomic_read(pp, consume);
>     if (p != &variable)
>         return p->val;
>
> *IS* ordered, because while it looks syntactically a lot like
> "Litmus test 1", there is no sane way a compiler can use the knowledge
> that "p is not a pointer to a particular location" to break the data
> dependency.
>
> There is no way in hell that any sane "carries a dependency" model can
> get the simple tests above right.
>
> So give up on it already. "Carries a dependency" cannot work. It's a
> bad model. You're trying to describe the problem using the wrong
> tools.
>
> Note that my "restrict+pointer to object" language actually got this
> *right*. The "restrict" part made Litmus test 1 not ordered, because
> that "p == &variable" success case means that the pointer wasn't
> restricted, so the pre-requisite for ordering didn't exist.
>
> See? The "carries a dependency" is a broken model for this, but there
> are _other_ models that can work.
>
> You tried to rewrite my model into "carries a dependency". That
> *CANNOT* work. It's like trying to rewrite quantum physics into the
> Greek model of the four elements. They are not compatible models, and
> one of them can be shown to not work.

Of course, I cannot resist putting forward a third litmus test:

    static struct foo variable1;
    static struct foo variable2;
    static struct foo *pp = &variable1;

    T1: initialize_foo(&variable2);
        atomic_store_explicit(&pp, &variable2, memory_order_release);
        /* The above is the only store to pp in this translation unit,
         * and the address of pp is not exported in any way.
         */

    T2: if (p == &variable1)
            return p->val1;  /* Must be variable1.val1. */
        else
            return p->val2;  /* Must be variable2.val2. */

My guess is that your approach would not provide ordering in this
case, either. Or am I missing something?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-25  6:00 ` Paul E. McKenney
@ 2014-02-26  1:47   ` Linus Torvalds
  2014-02-26  5:12     ` Paul E. McKenney
  0 siblings, 1 reply; 285+ messages in thread
From: Linus Torvalds @ 2014-02-26 1:47 UTC (permalink / raw)
To: Paul McKenney
Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Mon, Feb 24, 2014 at 10:00 PM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> So let me see if I understand your reasoning. My best guess is that it
> goes something like this:
>
> 1. The Linux kernel contains code that passes pointers from
>    rcu_dereference() through external functions.

No, actually, it's not so much Linux-specific at all.

I'm actually thinking about what I'd do as a compiler writer, and as a
defender of the "C is a high-level assembler" concept.

I love C. I'm a huge fan. I think it's a great language, and I think
it's a great language not because of some theoretical issues, but
because it is the only language around that actually maps fairly well
to what machines really do.

And it's a *simple* language. Sure, it's not quite as simple as it
used to be, but look at how thin the "K&R book" is. Which pretty much
describes it - still.

That's the real strength of C, and why it's the only language serious
people use for system programming. Ignore C++ for a while (Jesus
Xavier Christ, I've had to do C++ programming for subsurface), and
just think about what makes _C_ a good language.

I can look at C code, and I can understand what the code generation
is, and what it will really *do*. And I think that's important.
Abstractions that hide what the compiler will actually generate are
bad abstractions.

And ok, so this is obviously Linux-specific in that it's generally
only Linux where I really care about the code generation, but I do
think it's a bigger issue too.

So I want C features to *map* to the hardware features they implement.
The abstractions should match each other, not fight each other.

> Actually, the fact that there are more potential optimizations than I can
> think of is a big reason for my insistence on the carries-a-dependency
> crap. My lack of optimization omniscience makes me very nervous about
> relying on there never ever being a reasonable way of computing a given
> result without preserving the ordering.

But if I can give two clear examples that are basically identical from
a syntactic standpoint, and one clearly can be trivially optimized to
the point where the ordering guarantee goes away, and the other
cannot, and you cannot describe the difference, then I think your
description is seriously lacking.

And I do *not* think the C language should be defined by how it can be
described. Leave that to things like Haskell or LISP, where the goal
is some kind of completeness of the language that is about the
language, not about the machines it will run on.

>> So the code sequence I already mentioned is *not* ordered:
>>
>> Litmus test 1:
>>
>>     p = atomic_read(pp, consume);
>>     if (p == &variable)
>>         return p->val;
>>
>> is *NOT* ordered, because the compiler can trivially turn this into
>> "return variable.val", and break the data dependency.
>
> Right, given your model, the compiler is free to produce code that
> doesn't order the load from pp against the load from p->val.

Yes. Note also that that is what existing compilers would actually do.

And they'd do it "by mistake": they'd load the address of the variable
into a register, and then compare the two registers, and then end up
using _one_ of the registers as the base pointer for the "p->val"
access, but I can almost *guarantee* that there are going to be
sequences where some compiler will choose one register over the other
based on some random detail.

So my model isn't just a "model", it also happens to describe reality.
> Indeed, it won't work across different compilation units unless > the compiler is told about it, which is of course the whole point of > [[carries_dependency]]. Understood, though, the Linux kernel currently > does not have anything that could reasonably automatically generate those > [[carries_dependency]] attributes. (Or are there other reasons why you > believe [[carries_dependency]] is problematic?) So I think carries_dependency is problematic because: - it's not actually in C11 afaik - it requires the programmer to solve the problem of the standard not matching the hardware. - I think it's just insanely ugly, *especially* if it's actually meant to work so that the current carries-a-dependency works even for insane expressions like "a-a". in practice, it's one of those things where I guess nobody actually would ever use it. > Of course, I cannot resist putting forward a third litmus test: > > static struct foo variable1; > static struct foo variable2; > static struct foo *pp = &variable1; > > T1: initialize_foo(&variable2); > atomic_store_explicit(&pp, &variable2, memory_order_release); > /* The above is the only store to pp in this translation unit, > * and the address of pp is not exported in any way. > */ > > T2: if (p == &variable1) > return p->val1; /* Must be variable1.val1. */ > else > return p->val2; /* Must be variable2.val2. */ > > My guess is that your approach would not provide ordering in this > case, either. Or am I missing something? I actually agree. If you write insane code to "trick" the compiler into generating optimizations that break the dependency, then you get what you deserve. Now, realistically, I doubt a compiler will notice, but if it does, I'd go "well, that's your own fault for writing code that makes no sense". Basically, the above uses a pointer as a boolean flag. The compiler noticed it was really a boolean flag, and "consume" doesn't work on boolean flags. Tough. 
Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-26 1:47 ` Linus Torvalds @ 2014-02-26 5:12 ` Paul E. McKenney 0 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-26 5:12 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Feb 25, 2014 at 05:47:03PM -0800, Linus Torvalds wrote: > On Mon, Feb 24, 2014 at 10:00 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > > > So let me see if I understand your reasoning. My best guess is that it > > goes something like this: > > > > 1. The Linux kernel contains code that passes pointers from > > rcu_dereference() through external functions. > > No, actually, it's not so much Linux-specific at all. > > I'm actually thinking about what I'd do as a compiler writer, and as a > defender the "C is a high-level assembler" concept. > > I love C. I'm a huge fan. I think it's a great language, and I think > it's a great language not because of some theoretical issues, but > because it is the only language around that actually maps fairly well > to what machines really do. > > And it's a *simple* language. Sure, it's not quite as simple as it > used to be, but look at how thin the "K&R book" is. Which pretty much > describes it - still. > > That's the real strength of C, and why it's the only language serious > people use for system programming. Ignore C++ for a while (Jesus > Xavier Christ, I've had to do C++ programming for subsurface), and > just think about what makes _C_ a good language. The last time I used C++ for a project was in 1990. It was a lot smaller then. > I can look at C code, and I can understand what the code generation > is, and what it will really *do*. And I think that's important. > Abstractions that hide what the compiler will actually generate are > bad abstractions. 
> > And ok, so this is obviously Linux-specific in that it's generally > only Linux where I really care about the code generation, but I do > think it's a bigger issue too. > > So I want C features to *map* to the hardware features they implement. > The abstractions should match each other, not fight each other. OK... > > Actually, the fact that there are more potential optimizations than I can > > think of is a big reason for my insistence on the carries-a-dependency > > crap. My lack of optimization omniscience makes me very nervous about > > relying on there never ever being a reasonable way of computing a given > > result without preserving the ordering. > > But if I can give two clear examples that are basically identical from > a syntactic standpoint, and one clearly can be trivially optimized to > the point where the ordering guarantee goes away, and the other > cannot, and you cannot describe the difference, then I think your > description is seriously lacking. In my defense, my plan was to constrain the compiler to retain the ordering guarantee in either case. Yes, I did notice that you find that unacceptable. > And I do *not* think the C language should be defined by how it can be > described. Leave that to things like Haskell or LISP, where the goal > is some kind of completeness of the language that is about the > language, not about the machines it will run on. I am with you up to the point that the fancy optimizers start kicking in. I don't know how to describe what the optimizers are and are not permitted to do strictly in terms of the underlying hardware. > >> So the code sequence I already mentioned is *not* ordered: > >> > >> Litmus test 1: > >> > >> p = atomic_read(pp, consume); > >> if (p == &variable) > >> return p->val; > >> > >> is *NOT* ordered, because the compiler can trivially turn this into > >> "return variable.val", and break the data dependency. 
> > > > Right, given your model, the compiler is free to produce code that > > doesn't order the load from pp against the load from p->val. > > Yes. Note also that that is what existing compilers would actually do. > > And they'd do it "by mistake": they'd load the address of the variable > into a register, and then compare the two registers, and then end up > using _one_ of the registers as the base pointer for the "p->val" > access, but I can almost *guarantee* that there are going to be > sequences where some compiler will choose one register over the other > based on some random detail. > > So my model isn't just a "model", it also happens to descibe reality. Sounds to me like your model -is- reality. I believe that it is useful to constrain reality from time to time, but understand that you vehemently disagree. > > Indeed, it won't work across different compilation units unless > > the compiler is told about it, which is of course the whole point of > > [[carries_dependency]]. Understood, though, the Linux kernel currently > > does not have anything that could reasonably automatically generate those > > [[carries_dependency]] attributes. (Or are there other reasons why you > > believe [[carries_dependency]] is problematic?) > > So I think carries_dependency is problematic because: > > - it's not actually in C11 afaik Indeed it is not, but I bet that gcc will implement it like it does the other attributes that are not part of C11. > - it requires the programmer to solve the problem of the standard not > matching the hardware. The programmer in this instance being the compiler writer? > - I think it's just insanely ugly, *especially* if it's actually > meant to work so that the current carries-a-dependency works even for > insane expressions like "a-a". 
And left to myself, I would prune down the carries-a-dependency trees to reflect what is actually used, excluding "a-a", comparison operators, and so on, allowing the developer to know when ordering is preserved and when it is not, without having to fully understand all the optimizations that might ever be used. Yes, I understand that you hate that thought as well. > in practice, it's one of those things where I guess nobody actually > would ever use it. Well, I believe that this thread has demonstrated that -you- won't ever use it. ;-) > > Of course, I cannot resist putting forward a third litmus test: > > > > static struct foo variable1; > > static struct foo variable2; > > static struct foo *pp = &variable1; > > > > T1: initialize_foo(&variable2); > > atomic_store_explicit(&pp, &variable2, memory_order_release); > > /* The above is the only store to pp in this translation unit, > > * and the address of pp is not exported in any way. > > */ > > > > T2: if (p == &variable1) > > return p->val1; /* Must be variable1.val1. */ > > else > > return p->val2; /* Must be variable2.val2. */ > > > > My guess is that your approach would not provide ordering in this > > case, either. Or am I missing something? > > I actually agree. > > If you write insane code to "trick" the compiler into generating > optimizations that break the dependency, then you get what you > deserve. > > Now, realistically, I doubt a compiler will notice, but if it does, > I'd go "well, that's your own fault for writing code that makes no > sense". > > Basically, the above uses a pointer as a boolean flag. The compiler > noticed it was really a boolean flag, and "consume" doesn't work on > boolean flags. Tough. So the places in the Linux kernel that currently compare the value returned from rcu_dereference() against an address of a variable need to do that comparison in an external function where the compiler cannot see it? Or inline assembly, I suppose. 
There are probably also "interesting" situations involving structures reached by multiple pointers, some RCU-protected and some not... Ouch. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-24 23:35 ` Linus Torvalds 2014-02-25 6:00 ` Paul E. McKenney @ 2014-02-25 6:05 ` Linus Torvalds 2014-02-26 0:15 ` Paul E. McKenney 1 sibling, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-25 6:05 UTC (permalink / raw) To: Paul McKenney Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 24, 2014 at 3:35 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > > Litmus test 1: > > p = atomic_read(pp, consume); > if (p == &variable) > return p->val; > > is *NOT* ordered Btw, don't get me wrong. I don't _like_ it not being ordered, and I actually did spend some time thinking about my earlier proposal on strengthening the 'consume' ordering. I have for the last several years been 100% convinced that the Intel memory ordering is the right thing, and that people who like weak memory ordering are wrong and should try to avoid reproducing if at all possible. But given that we have memory orderings like power and ARM, I don't actually see a sane way to get a good strong ordering. You can teach compilers about cases like the above when they actually see all the code and they could poison the value chain etc. But it would be fairly painful, and once you cross object files (or even just functions in the same compilation unit, for that matter), it goes from painful to just "ridiculously not worth it". So I think the C semantics should mirror what the hardware gives us - and do so even in the face of reasonable optimizations - not try to do something else that requires compilers to treat "consume" very differently. If people made me king of the world, I'd outlaw weak memory ordering. You can re-order as much as you want in hardware with speculation etc, but you should always *check* your speculation and make it *look* like you did everything in order. 
Which is pretty much the intel memory ordering (ignoring the write buffering). Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-25 6:05 ` Linus Torvalds @ 2014-02-26 0:15 ` Paul E. McKenney 2014-02-26 3:32 ` Jeff Law 0 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-26 0:15 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 24, 2014 at 10:05:52PM -0800, Linus Torvalds wrote: > On Mon, Feb 24, 2014 at 3:35 PM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > Litmus test 1: > > > > p = atomic_read(pp, consume); > > if (p == &variable) > > return p->val; > > > > is *NOT* ordered > > Btw, don't get me wrong. I don't _like_ it not being ordered, and I > actually did spend some time thinking about my earlier proposal on > strengthening the 'consume' ordering. Understood. > I have for the last several years been 100% convinced that the Intel > memory ordering is the right thing, and that people who like weak > memory ordering are wrong and should try to avoid reproducing if at > all possible. But given that we have memory orderings like power and > ARM, I don't actually see a sane way to get a good strong ordering. > You can teach compilers about cases like the above when they actually > see all the code and they could poison the value chain etc. But it > would be fairly painful, and once you cross object files (or even just > functions in the same compilation unit, for that matter), it goes from > painful to just "ridiculously not worth it". And I have indeed seen a post or two from you favoring stronger memory ordering over the past few years. ;-) > So I think the C semantics should mirror what the hardware gives us - > and do so even in the face of reasonable optimizations - not try to do > something else that requires compilers to treat "consume" very > differently. 
I am sure that a great many people would jump for joy at the chance to drop any and all RCU-related verbiage from the C11 and C++11 standards. (I know, you aren't necessarily advocating this, but given what you say above, I cannot think what verbiage would remain.)

The thing that makes me very nervous is how much the definition of "reasonable optimization" has changed. For example, before the 2.6.10 Linux kernel, we didn't even apply volatile semantics to fetches of RCU-protected pointers -- and as far as I know, never needed to. But since then, there have been several cases where the compiler happily hoisted a normal load out of a surprisingly large loop.

Hardware advances can come into play as well. For example, my very first RCU work back in the early 90s was on a parallel system whose CPUs had no branch-prediction hardware (80386 or 80486, I don't remember which). Now people talk about compilers using branch-prediction hardware to implement value-speculation optimizations. Five or ten years from now, who knows what crazy optimizations might be considered completely reasonable?

Are ARM and Power really the bad boys here? Or are they instead playing the role of the canary in the coal mine?

> If people made me king of the world, I'd outlaw weak memory ordering.
> You can re-order as much as you want in hardware with speculation etc,
> but you should always *check* your speculation and make it *look* like
> you did everything in order. Which is pretty much the intel memory
> ordering (ignoring the write buffering).

Speaking as someone who got whacked over the head with DEC Alpha when first presenting RCU to the Digital UNIX folks long ago, I do have some sympathy with this line of thought. But as you say, it is not the world we currently live in.

Of course, in the final analysis, your kernel, your call.

Thanx, Paul
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-26 0:15 ` Paul E. McKenney @ 2014-02-26 3:32 ` Jeff Law 2014-02-26 5:23 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Jeff Law @ 2014-02-26 3:32 UTC (permalink / raw) To: paulmck, Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On 02/25/14 17:15, Paul E. McKenney wrote: >> I have for the last several years been 100% convinced that the Intel >> memory ordering is the right thing, and that people who like weak >> memory ordering are wrong and should try to avoid reproducing if at >> all possible. But given that we have memory orderings like power and >> ARM, I don't actually see a sane way to get a good strong ordering. >> You can teach compilers about cases like the above when they actually >> see all the code and they could poison the value chain etc. But it >> would be fairly painful, and once you cross object files (or even just >> functions in the same compilation unit, for that matter), it goes from >> painful to just "ridiculously not worth it". > > And I have indeed seen a post or two from you favoring stronger memory > ordering over the past few years. ;-) I couldn't agree more. > > Are ARM and Power really the bad boys here? Or are they instead playing > the role of the canary in the coal mine? That's a question I've been struggling with recently as well. I suspect they (arm, power) are going to be the outliers rather than the canary. While the weaker model may give them some advantages WRT scalability, I don't think it'll ultimately be enough to overcome the difficulty in writing correct low level code for them. Regardless, they're here and we have to deal with them. Jeff ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-26 3:32 ` Jeff Law @ 2014-02-26 5:23 ` Paul E. McKenney 0 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-26 5:23 UTC (permalink / raw) To: Jeff Law Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Feb 25, 2014 at 08:32:38PM -0700, Jeff Law wrote: > On 02/25/14 17:15, Paul E. McKenney wrote: > >>I have for the last several years been 100% convinced that the Intel > >>memory ordering is the right thing, and that people who like weak > >>memory ordering are wrong and should try to avoid reproducing if at > >>all possible. But given that we have memory orderings like power and > >>ARM, I don't actually see a sane way to get a good strong ordering. > >>You can teach compilers about cases like the above when they actually > >>see all the code and they could poison the value chain etc. But it > >>would be fairly painful, and once you cross object files (or even just > >>functions in the same compilation unit, for that matter), it goes from > >>painful to just "ridiculously not worth it". > > > >And I have indeed seen a post or two from you favoring stronger memory > >ordering over the past few years. ;-) > I couldn't agree more. > > > > >Are ARM and Power really the bad boys here? Or are they instead playing > >the role of the canary in the coal mine? > That's a question I've been struggling with recently as well. I > suspect they (arm, power) are going to be the outliers rather than > the canary. While the weaker model may give them some advantages WRT > scalability, I don't think it'll ultimately be enough to overcome > the difficulty in writing correct low level code for them. > > Regardless, they're here and we have to deal with them. Agreed... Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-24 19:54 ` Linus Torvalds 2014-02-24 22:37 ` Paul E. McKenney @ 2014-02-27 15:37 ` Torvald Riegel 2014-02-27 17:01 ` Linus Torvalds 2014-02-27 17:50 ` Paul E. McKenney 1 sibling, 2 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-27 15:37 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote: > On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > > > Good points. How about the following replacements? > > > > 3. Adding or subtracting an integer to/from a chained pointer > > results in another chained pointer in that same pointer chain. > > The results of addition and subtraction operations that cancel > > the chained pointer's value (for example, "p-(long)p" where "p" > > is a pointer to char) are implementation defined. > > > > 4. Bitwise operators ("&", "|", "^", and I suppose also "~") > > applied to a chained pointer and an integer for the purposes > > of alignment and pointer translation results in another > > chained pointer in that same pointer chain. Other uses > > of bitwise operators on chained pointers (for example, > > "p|~0") are implementation defined. > > Quite frankly, I think all of this language that is about the actual > operations is irrelevant and wrong. > > It's not going to help compiler writers, and it sure isn't going to > help users that read this. > > Why not just talk about "value chains" and that any operations that > restrict the value range severely end up breaking the chain. There is > no point in listing the operations individually, because every single > operation *can* restrict things. Listing individual operations and > depdendencies is just fundamentally wrong. [...] 
> The *only* thing that matters for all of them is whether they are > "value-preserving", or whether they drop so much information that the > compiler might decide to use a control dependency instead. That's true > for every single one of them. > > Similarly, actual true control dependencies that limit the problem > space sufficiently that the actual pointer value no longer has > significant information in it (see the above example) are also things > that remove information to the point that only a control dependency > remains. Even when the value itself is not modified in any way at all. I agree that just considering syntactic properties of the program seems to be insufficient. Making it instead depend on whether there is a "semantic" dependency due to a value being "necessary" to compute a result seems better. However, whether a value is "necessary" might not be obvious, and I understand Paul's argument that he does not want to have to reason about all potential compiler optimizations. Thus, I believe we need to specify when a value is "necessary". I have a suggestion for a somewhat different formulation of the feature that you seem to have in mind, which I'll discuss below. Excuse the verbosity of the following, but I'd rather like to avoid misunderstandings than save a few words. What we'd like to capture is that a value originating from a mo_consume load is "necessary" for a computation (e.g., it "cannot" be replaced with value predictions and/or control dependencies); if that's the case in the program, we can reasonably assume that a compiler implementation will transform this into a data dependency, which will then lead to ordering guarantees by the HW. However, we need to specify when a value is "necessary". We could say that this is implementation-defined, and use a set of litmus tests (e.g., like those discussed in the thread) to roughly carve out what a programmer could expect. 
This may even be practical for a project like the Linux kernel that follows strict project-internal rules and pays a lot of attention to what the particular implementations of compilers expected to compile the kernel are doing. However, I think this approach would be too vague for the standard and for many other programs/projects. One way to understand "necessary" would be to say that if a mo_consume load can result in more than V different values, then the actual value is "unknown", and thus "necessary" to compute anything based on it. (But this is flawed, as discussed below.) However, how big should V be? If it's larger than 1, atomic bool cannot be used with mo_consume, which seems weird. If V is 1, then Linus' litmus tests work (but Paul's doesn't; see below), but the compiler must not try to predict more than one value. This holds for any choice of V, so there always is an *additional* constraint on code generation for operations that are meant to take part in such "value dependencies". The bigger V might be, the less likely it should be for this to actually constrain a particular compiler's optimizations (e.g., while it might be meaningful to use value prediction for two or three values, it's probably not for 1000s). Nonetheless, if we don't want to predict the future, we need to specify V. Given that we always have some constraint for code generation anyway, and given that V > 1 might be an arbitrary-looking constraint and disallows use on atomic bool, I believe V should be 1. Furthermore, there is a problem in saying "a load can result in more than one value" because in a deterministic program/operation, it will result in exactly one value. Paul's (counter-)litmus test is a similar example: the compiler saw all stores to a particular variable, so it was able to show that a particular pointer variable actually just has two possible values (ie, it's like a bool). How do we avoid this problem? 
And do we actually want to require programmers to reason about this at all (ie, issues like Paul's example)?

The only solution that I currently see is to specify the allowed scope of the knowledge about the values a load can result in. Based on these thoughts, we could specify the new mo_consume guarantees roughly as follows:

  An evaluation E (in an execution) has a value dependency to an
  atomic and mo_consume load L (in an execution) iff:
  * L's type holds more than one value (ruling out constants etc.),
  * L is sequenced-before E,
  * L's result is used by the abstract machine to compute E,
  * E is value-dependency-preserving code (defined below), and
  * at the time of execution of E, L can possibly have returned at
    least two different values under the assumption that L itself
    could have returned any value allowed by L's type.

  If a memory access A's targeted memory location has a value
  dependency on a mo_consume load L, and an action X
  inter-thread-happens-before L, then X happens-before A.

While this attempt at a definition certainly needs work, I hope that at least the first three requirements for value dependencies are clear. The fifth requirement tries to capture both Linus' "value chains" concept as well as the point that we need to specify the scope of knowledge about values.

Regarding the latter, we make a fresh start at each mo_consume load (ie, we assume we know nothing -- L could have returned any possible value); I believe this is easier to reason about than other scopes like function granularities (what happens on inlining?), or translation units. It should also be simple to implement for compilers, and would hopefully not constrain optimization too much. The rest of the requirement then forbids "breaking value chains".
For example (for now, let's ignore the fourth requirement in the examples):

  int *p = atomic_load(&pp, mo_consume);
  if (p == &myvariable)
    q = *p;  // (1)

When (1) is executed, the load can only have returned the value &myvariable (otherwise, the branch wouldn't be executed), so this evaluation of p is not value-dependent on the mo_consume load anymore.

OTOH, the following does work (ie, uses a value dependency for ordering), because int* can have more than two values:

  int *p = atomic_load(&pp, mo_consume);
  if (p != &myvariable)
    q = *p;  // (1)

But it would not work using bool:

  bool b = atomic_load(&foo, mo_consume);
  if (b != false)
    q = arr[b];  // only one value ("true") is left for b

Paul's litmus test would work, because we guarantee to the programmer that it can assume that the mo_consume load would return any value allowed by the type; effectively, this forbids the compiler analysis Paul thought about:

  static struct foo variable1;
  static struct foo variable2;
  static struct foo *pp = &variable1;

  T1: initialize_foo(&variable2);
      atomic_store_explicit(&pp, &variable2, memory_order_release);
      /* The above is the only store to pp in this translation unit,
       * and the address of pp is not exported in any way. */

  T2: int *p = atomic_load(&pp, mo_consume);
      if (p == &variable1)
        return p->val1;  /* Must be variable1.val1. No value dependency. */
      else
        return p->val2;  /* Will read variable2.val2, but value
                          * dependency is guaranteed. */

The fourth requirement (value-dependency-preserving code) might look as if I'm re-introducing the problematic carries-a-dependency again, but that's *not* the case. We do need to demarcate value-dependency-carrying code from other code so that the compiler knows when it can do value prediction and similar transformations (see above). However, note that this fourth requirement is necessary but not sufficient for a value dependency -- the fifth requirement (ie, that there must be a "value chain") always has the last word, so to speak.
IOW, this does not try to define dependencies based on syntax.

I suggest using data types to demarcate code that is actually meant to preserve value dependencies. The current standard tries to do it on the granularity of functions (ie, using [[carries_dependency]]), and I believe we agree that this isn't great. For example, it's too constraining for large functions, it requires annotations, calls from non-annotated to annotated functions might result in further barriers, function pointers are problematic (can't put [[carries_dependency]] on a function-pointer type?), etc.

What I have in mind is roughly the following (totally made-up syntax -- suggestions for how to do this properly are very welcome):

* Have a type modifier (eg, like restrict), that specifies that
  operations on data of this type are preserving value dependencies:

    int value_dep_preserving *foo;
    bool value_dep_preserving bar;

* mo_consume loads return value_dep_preserving types.
* One can cast from value_dep_preserving to the normal type, but not
  vice versa.
* Operations that take a value_dep_preserving-typed operand will
  usually produce a value_dep_preserving type. One can argue whether
  "?:" should be an exception to this, at least if only the left/first
  operand is value_dep_preserving. Unlike in the current standard, this
  is *not* critical here because in the end, it's all down to the fifth
  requirement (see above).

This has a couple of drawbacks:

* The programmer needs to use the types.
* It extends the type system (but [[carries_dependency]] effectively
  extends the type system as well, depending on how it's implemented).

This has a few advantages over function granularity or just hoping that the compiler gets it right by automatic analysis:

* The programmer can use it exactly for the variables that need
  value-dep-preservation. This won't slow down other code.
* If an mo_consume load's result is used directly (or in C++ with
  auto), no special type annotations / extra code are required.
* The compiler knows exactly which code has constraints on code
  generation (e.g., no value prediction), due to the types. No
  points-to analysis necessary.
* Programmers could even use special macros that only load from
  value_dep_preserving-typed addresses (based on type checks); this
  would obviously not catch all errors, but many. For example:

    int *foo, *bar;
    int value_dep_preserving *p = atomic_load(&pp, mo_consume);
    x = VALUE_DEP_DEREF(p != NULL ? &foo : &bar); // Error!
    if (p == &myvariable)
      q = VALUE_DEP_DEREF(p); // Wouldn't be guaranteed to be caught,
                              // but compiler might be able to emit
                              // a warning.

* In C++, we might be able to provide this with just a special template:

    std::value_dep_preserving<int *> p;

  Suitably defined operators for this template should be able to take
  care of the rest, and the implementation might use
  implementation-defined mechanisms internally to constrain code
  generation.

What do you think?

Is this meaningful regarding what current hardware offers, or will it do (or might do in the future) value prediction on its own?
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-27 15:37 ` Torvald Riegel @ 2014-02-27 17:01 ` Linus Torvalds 2014-02-27 19:06 ` Paul E. McKenney 2014-03-03 15:36 ` Torvald Riegel 2014-02-27 17:50 ` Paul E. McKenney 1 sibling, 2 replies; 285+ messages in thread From: Linus Torvalds @ 2014-02-27 17:01 UTC (permalink / raw) To: Torvald Riegel Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel <triegel@redhat.com> wrote: > > I agree that just considering syntactic properties of the program seems > to be insufficient. Making it instead depend on whether there is a > "semantic" dependency due to a value being "necessary" to compute a > result seems better. However, whether a value is "necessary" might not > be obvious, and I understand Paul's argument that he does not want to > have to reason about all potential compiler optimizations. Thus, I > believe we need to specify when a value is "necessary". I suspect it's hard to really strictly define, but at the same time I actually think that compiler writers (and users, for that matter) have little problem understanding the concept and intent. I do think that listing operations might be useful to give good examples of what is a "necessary" value, and - perhaps more importantly - what can break the value from being "necessary". Especially the gotchas. > I have a suggestion for a somewhat different formulation of the feature > that you seem to have in mind, which I'll discuss below. Excuse the > verbosity of the following, but I'd rather like to avoid > misunderstandings than save a few words. Ok, I'm going to cut most of the verbiage since it's long and I'm not commenting on most of it. 
But

> Based on these thoughts, we could specify the new mo_consume guarantees
> roughly as follows:
>
> An evaluation E (in an execution) has a value dependency to an
> atomic and mo_consume load L (in an execution) iff:
> * L's type holds more than one value (ruling out constants etc.),
> * L is sequenced-before E,
> * L's result is used by the abstract machine to compute E,
> * E is value-dependency-preserving code (defined below), and
> * at the time of execution of E, L can possibly have returned at
>   least two different values under the assumption that L itself
>   could have returned any value allowed by L's type.
>
> If a memory access A's targeted memory location has a value
> dependency on a mo_consume load L, and an action X
> inter-thread-happens-before L, then X happens-before A.

I think this mostly works.

> Regarding the latter, we make a fresh start at each mo_consume load (ie,
> we assume we know nothing -- L could have returned any possible value);
> I believe this is easier to reason about than other scopes like function
> granularities (what happens on inlining?), or translation units. It
> should also be simple to implement for compilers, and would hopefully
> not constrain optimization too much.
>
> [...]
>
> Paul's litmus test would work, because we guarantee to the programmer
> that it can assume that the mo_consume load would return any value
> allowed by the type; effectively, this forbids the compiler analysis
> Paul thought about:

So realistically, since with the new wording we can ignore the silly cases (ie "p-p") and we can ignore the trivial-to-optimize compiler cases ("if (p == &variable) .. use p"), and you would forbid the "global value range optimization" case that Paul brought up, what remains would seem to be just really subtle compiler transformations of data dependencies to control dependencies.
And the only such thing I can think of is basically compiler-initiated value prediction, presumably directed by PGO (since now if the value prediction is in the source code, it's considered to break the value chain).

The good thing is that, afaik, value prediction is largely not used in real life. There are lots of papers on it, but I don't think anybody actually does it (although I can easily see some specint-specific optimization pattern that is built up around it).

And even value prediction is actually fine, as long as the compiler can see the memory *source* of the value prediction (and it isn't a mo_consume). So it really ends up limiting your value prediction in very simple ways: you cannot do it to function arguments if they are registers. But you can still do value prediction on values you loaded from memory, if you can actually *see* that memory op.

Of course, on more strongly ordered CPUs, even that "register argument" limitation goes away.

So I agree that there is basically no real optimization constraint. Value prediction is of dubious value to begin with, and the actual constraint on its use, if some compiler writer really wants to do it, is not onerous.

> What I have in mind is roughly the following (totally made-up syntax --
> suggestions for how to do this properly are very welcome):
> * Have a type modifier (eg, like restrict), that specifies that
>   operations on data of this type are preserving value dependencies:

So I'm not violently opposed, but I think the upsides are not great. Note that my earlier suggestion to use "restrict" wasn't because I believed the annotation itself would be visible, but basically just as a legalistic promise to the compiler that *if* it found an alias, then it didn't need to worry about ordering. So to me, that type modifier was about conceptual guarantees, not about actual value chains.
Anyway, the reason I don't believe any type modifier (and "[[carries_dependency]]" is basically just that) is worth it is simply that it adds a real burden on the programmer, without actually giving the programmer any real upside: Within a single function, the compiler already sees that mo_consume source, and so doing a type-based restriction doesn't really help. The information is already there, without any burden on the programmer. And across functions, the compiler has already - by definition - mostly lost sight of all the things it could use to reduce the value space. Even Paul's example doesn't really work if the use of the "mo_consume" value has been passed to another function, because inside a separate function, the compiler couldn't see that the value it uses comes from only two possible values. And as mentioned, even *if* the compiler wants to do value prediction that turns a data dependency into a control dependency, the limitation to say "no, you can't do it unless you saw where the value got loaded" really isn't that onerous. I bet that if you ask actual production compiler people (as opposed to perhaps academia), none of them actually really believe in value prediction to begin with. > What do you think? > > Is this meaningful regarding what current hardware offers, or will it do > (or might do in the future) value prediction on it's own? I can pretty much guarantee that when/if hardware does value prediction on its own, it will do so without exposing it as breaking the data dependency. The thing is, a CPU is actually *much* better situated at doing speculative memory accesses, because a CPU already has all the infrastructure to do speculation in general. 
And for a CPU, once you do value speculation, guaranteeing the memory ordering is *trivial*: all you need to do is to track the "speculated" memory instruction until you check the value (which you obviously have to do anyway, otherwise you're not doing value _prediction_, you're just doing "value wild guessing" ;^), and when you check the value you also check that the cacheline hasn't been evicted out-of-order. This is all stuff that CPU people already do. If you have transactional memory, you already have all the resources to do this. Or, even without transactional memory, if like Intel you have a memory model that says "loads are done in order" but you actually wildly speculate loads and just check before retiring instructions that the cachelines didn't get evicted out of order, you already have all the hardware to do value prediction *without* making it visible in the memory order. This, btw, is one reason why people who think that compilers should be overly smart and do fancy tricks are incompetent. People who thought that Itanium was a great idea ("Let's put the complexity in the compiler, and make a simple CPU") are simply objectively *wrong*. People who think that value prediction by a compiler is a good idea are not people you should really care about. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-27 17:01 ` Linus Torvalds @ 2014-02-27 19:06 ` Paul E. McKenney 2014-02-27 19:47 ` Linus Torvalds 2014-03-03 15:36 ` Torvald Riegel 1 sibling, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-27 19:06 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, Feb 27, 2014 at 09:01:40AM -0800, Linus Torvalds wrote: > On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel <triegel@redhat.com> wrote: > > > > I agree that just considering syntactic properties of the program seems > > to be insufficient. Making it instead depend on whether there is a > > "semantic" dependency due to a value being "necessary" to compute a > > result seems better. However, whether a value is "necessary" might not > > be obvious, and I understand Paul's argument that he does not want to > > have to reason about all potential compiler optimizations. Thus, I > > believe we need to specify when a value is "necessary". > > I suspect it's hard to really strictly define, but at the same time I > actually think that compiler writers (and users, for that matter) have > little problem understanding the concept and intent. > > I do think that listing operations might be useful to give good > examples of what is a "necessary" value, and - perhaps more > importantly - what can break the value from being "necessary". > Especially the gotchas. > > > I have a suggestion for a somewhat different formulation of the feature > > that you seem to have in mind, which I'll discuss below. Excuse the > > verbosity of the following, but I'd rather like to avoid > > misunderstandings than save a few words. > > Ok, I'm going to cut most of the verbiage since it's long and I'm not > commenting on most of it. 
> > But > > > Based on these thoughts, we could specify the new mo_consume guarantees > > roughly as follows: > > > > An evaluation E (in an execution) has a value dependency to an > > atomic and mo_consume load L (in an execution) iff: > > * L's type holds more than one value (ruling out constants > > etc.), > > * L is sequenced-before E, > > * L's result is used by the abstract machine to compute E, > > * E is value-dependency-preserving code (defined below), and > > * at the time of execution of E, L can possibly have returned at > > least two different values under the assumption that L itself > > could have returned any value allowed by L's type. > > > > If a memory access A's targeted memory location has a value > > dependency on a mo_consume load L, and an action X > > inter-thread-happens-before L, then X happens-before A. > > I think this mostly works. > > > Regarding the latter, we make a fresh start at each mo_consume load (ie, > > we assume we know nothing -- L could have returned any possible value); > > I believe this is easier to reason about than other scopes like function > > granularities (what happens on inlining?), or translation units. It > > should also be simple to implement for compilers, and would hopefully > > not constrain optimization too much. > > > > [...] > > > > Paul's litmus test would work, because we guarantee to the programmer > > that it can assume that the mo_consume load would return any value > > allowed by the type; effectively, this forbids the compiler analysis > > Paul thought about: > > So realistically, since with the new wording we can ignore the silly > cases (ie "p-p") and we can ignore the trivial-to-optimize compiler > cases ("if (p == &variable) .. use p"), and you would forbid the > "global value range optimization case" that Paul bright up, what > remains would seem to be just really subtle compiler transformations > of data dependencies to control dependencies. 
FWIW, I am looking through the kernel for instances of your first "if (p == &variable) .. use p" litmus test. All the ones I have found thus far are OK for one of the following reasons:

1. The comparison was against NULL, so you don't get to dereference the pointer anyway. About 80% are in this category.

2. The comparison was against another pointer, but there were no dereferences afterwards. Here is an example of what these can look like:

	list_for_each_entry_rcu(p, &head, next)
		if (p == &variable)
			return; /* "p" goes out of scope. */

3. The comparison was against another RCU-protected pointer, where that other pointer was properly fetched using one of the RCU primitives. Here it doesn't matter which pointer you use. At least as long as the rcu_assign_pointer() for that other pointer happened after the last update to the pointed-to structure.

I am a bit nervous about #3. Any thoughts on it?

Some other reasons why it would be OK to dereference after a comparison:

4. The pointed-to data is constant: (a) it was initialized at boot time, (b) the update-side lock is held, (c) we are running in a kthread and the data was initialized before the kthread was created, or (d) we are running in a module, and the data was initialized during or before module-init time for that module. And many more besides, involving pretty much every kernel primitive that makes something run later.

5. All subsequent dereferences are stores, so that a control dependency is in effect.

Thoughts?

FWIW, no arguments with the following.

Thanx, Paul

> And the only such thing I can think of is basically compiler-initiated
> value-prediction, presumably directed by PGO (since now if the value
> prediction is in the source code, it's considered to break the value
> chain).
>
> The good thing is that afaik, value-prediction is largely not used in
> real life, afaik.
> There are lots of papers on it, but I don't think anybody actually
> does it (although I can easily see some specint-specific optimization
> pattern that is built up around it).
>
> And even value prediction is actually fine, as long as the compiler
> can see the memory *source* of the value prediction (and it isn't a
> mo_consume).  So it really ends up limiting your value prediction in
> very simple ways: you cannot do it to function arguments if they are
> registers.  But you can still do value prediction on values you loaded
> from memory, if you can actually *see* that memory op.
>
> Of course, on more strongly ordered CPUs, even that "register
> argument" limitation goes away.
>
> So I agree that there is basically no real optimization constraint.
> Value-prediction is of dubious value to begin with, and the actual
> constraint on its use if some compiler writer really wants to is not
> onerous.
>
> > What I have in mind is roughly the following (totally made-up syntax --
> > suggestions for how to do this properly are very welcome):
> > * Have a type modifier (eg, like restrict), that specifies that
> >   operations on data of this type are preserving value dependencies:
>
> So I'm not violently opposed, but I think the upsides are not great.
> Note that my earlier suggestion to use "restrict" wasn't because I
> believed the annotation itself would be visible, but basically just as
> a legalistic promise to the compiler that *if* it found an alias, then
> it didn't need to worry about ordering.  So to me, that type modifier
> was about conceptual guarantees, not about actual value chains.
>
> Anyway, the reason I don't believe any type modifier (and
> "[[carries_dependency]]" is basically just that) is worth it is simply
> that it adds a real burden on the programmer, without actually giving
> the programmer any real upside:
>
> Within a single function, the compiler already sees that mo_consume
> source, and so doing a type-based restriction doesn't really help.  The
> information is already there, without any burden on the programmer.
>
> And across functions, the compiler has already - by definition -
> mostly lost sight of all the things it could use to reduce the value
> space.  Even Paul's example doesn't really work if the use of the
> "mo_consume" value has been passed to another function, because inside
> a separate function, the compiler couldn't see that the value it uses
> comes from only two possible values.
>
> And as mentioned, even *if* the compiler wants to do value prediction
> that turns a data dependency into a control dependency, the limitation
> to say "no, you can't do it unless you saw where the value got loaded"
> really isn't that onerous.
>
> I bet that if you ask actual production compiler people (as opposed to
> perhaps academia), none of them actually really believe in value
> prediction to begin with.
>
> > What do you think?
> >
> > Is this meaningful regarding what current hardware offers, or will it
> > do (or might do in the future) value prediction on its own?
>
> I can pretty much guarantee that when/if hardware does value
> prediction on its own, it will do so without exposing it as breaking
> the data dependency.
>
> The thing is, a CPU is actually *much* better situated at doing
> speculative memory accesses, because a CPU already has all the
> infrastructure to do speculation in general.
>
> And for a CPU, once you do value speculation, guaranteeing the memory
> ordering is *trivial*: all you need to do is to track the "speculated"
> memory instruction until you check the value (which you obviously have
> to do anyway, otherwise you're not doing value _prediction_, you're
> just doing "value wild guessing" ;^), and when you check the value you
> also check that the cacheline hasn't been evicted out-of-order.
>
> This is all stuff that CPU people already do.  If you have
> transactional memory, you already have all the resources to do this.
> Or, even without transactional memory, if like Intel you have a memory
> model that says "loads are done in order" but you actually wildly
> speculate loads and just check before retiring instructions that the
> cachelines didn't get evicted out of order, you already have all the
> hardware to do value prediction *without* making it visible in the
> memory order.
>
> This, btw, is one reason why people who think that compilers should be
> overly smart and do fancy tricks are incompetent.  People who thought
> that Itanium was a great idea ("Let's put the complexity in the
> compiler, and make a simple CPU") are simply objectively *wrong*.
> People who think that value prediction by a compiler is a good idea
> are not people you should really care about.
>
> Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-27 19:06             ` Paul E. McKenney
@ 2014-02-27 19:47               ` Linus Torvalds
  2014-02-27 20:53                 ` Paul E. McKenney
  2014-03-03 18:59                 ` Torvald Riegel
  0 siblings, 2 replies; 285+ messages in thread
From: Linus Torvalds @ 2014-02-27 19:47 UTC (permalink / raw)
To: Paul McKenney
Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
>
> 3.	The comparison was against another RCU-protected pointer,
>	where that other pointer was properly fetched using one
>	of the RCU primitives.  Here it doesn't matter which pointer
>	you use.  At least as long as the rcu_assign_pointer() for
>	that other pointer happened after the last update to the
>	pointed-to structure.
>
>	I am a bit nervous about #3.  Any thoughts on it?

I think that it might be worth pointing out as an example, and saying
that code like

	p = atomic_read(consume);
	X;
	q = atomic_read(consume);
	Y;
	if (p == q)
		data = p->val;

then the access of "p->val" is constrained to be data-dependent on
*either* p or q, but you can't really tell which, since the compiler
can decide that the values are interchangeable.

I cannot for the life of me come up with a situation where this would
matter, though.  If "X" contains a fence, then that fence will be a
stronger ordering than anything the consume through "p" would
guarantee anyway.  And if "X" does *not* contain a fence, then the
atomic reads of p and q are unordered *anyway*, so then whether the
ordering to the access through "p" is through p or q is kind of
irrelevant.  No?

	Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-27 19:47               ` Linus Torvalds
@ 2014-02-27 20:53                 ` Paul E. McKenney
  2014-03-01  0:50                   ` Paul E. McKenney
  2014-03-03 18:59                   ` Torvald Riegel
  1 sibling, 1 reply; 285+ messages in thread
From: Paul E. McKenney @ 2014-02-27 20:53 UTC (permalink / raw)
To: Linus Torvalds
Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > 3.	The comparison was against another RCU-protected pointer,
> >	where that other pointer was properly fetched using one
> >	of the RCU primitives.  Here it doesn't matter which pointer
> >	you use.  At least as long as the rcu_assign_pointer() for
> >	that other pointer happened after the last update to the
> >	pointed-to structure.
> >
> >	I am a bit nervous about #3.  Any thoughts on it?
>
> I think that it might be worth pointing out as an example, and saying
> that code like
>
>	p = atomic_read(consume);
>	X;
>	q = atomic_read(consume);
>	Y;
>	if (p == q)
>		data = p->val;
>
> then the access of "p->val" is constrained to be data-dependent on
> *either* p or q, but you can't really tell which, since the compiler
> can decide that the values are interchangeable.
>
> I cannot for the life of me come up with a situation where this would
> matter, though.  If "X" contains a fence, then that fence will be a
> stronger ordering than anything the consume through "p" would
> guarantee anyway.  And if "X" does *not* contain a fence, then the
> atomic reads of p and q are unordered *anyway*, so then whether the
> ordering to the access through "p" is through p or q is kind of
> irrelevant.  No?
I can make a contrived litmus test for it, but you are right, the only
time you can see it happen is when X has no barriers, in which case
you don't have any ordering anyway -- both the compiler and the CPU can
reorder the loads into p and q, and the read from p->val can, as you say,
come from either pointer.

For whatever it is worth, here is the litmus test:

	T1:	p = kmalloc(...);
		if (p == NULL)
			deal_with_it();
		p->a = 42;  /* Each field in its own cache line. */
		p->b = 43;
		p->c = 44;
		atomic_store_explicit(&gp1, p, memory_order_release);
		p->b = 143;
		p->c = 144;
		atomic_store_explicit(&gp2, p, memory_order_release);

	T2:	p = atomic_load_explicit(&gp2, memory_order_consume);
		r1 = p->b;  /* Guaranteed to get 143. */
		q = atomic_load_explicit(&gp1, memory_order_consume);
		if (p == q) {
			/* The compiler decides that q->c is same as p->c. */
			r2 = p->c;  /* Could get 44 on weakly ordered system. */
		}

The loads from gp1 and gp2 are, as you say, unordered, so you get what
you get.

And publishing a structure via one RCU-protected pointer, updating it,
then publishing it via another pointer seems to me to be asking for
trouble anyway.  If you really want to do something like that and still
see consistency across all the fields in the structure, please put a lock
in the structure and use it to guard updates and accesses to those fields.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-27 20:53                 ` Paul E. McKenney
@ 2014-03-01  0:50                   ` Paul E. McKenney
  2014-03-01 10:06                     ` Peter Sewell
  2014-03-03 18:55                     ` Torvald Riegel
  0 siblings, 2 replies; 285+ messages in thread
From: Paul E. McKenney @ 2014-03-01  0:50 UTC (permalink / raw)
To: Linus Torvalds
Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
	David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
> > <paulmck@linux.vnet.ibm.com> wrote:
> > >
> > > 3.	The comparison was against another RCU-protected pointer,
> > >	where that other pointer was properly fetched using one
> > >	of the RCU primitives.  Here it doesn't matter which pointer
> > >	you use.  At least as long as the rcu_assign_pointer() for
> > >	that other pointer happened after the last update to the
> > >	pointed-to structure.
> > >
> > >	I am a bit nervous about #3.  Any thoughts on it?
> >
> > I think that it might be worth pointing out as an example, and saying
> > that code like
> >
> >	p = atomic_read(consume);
> >	X;
> >	q = atomic_read(consume);
> >	Y;
> >	if (p == q)
> >		data = p->val;
> >
> > then the access of "p->val" is constrained to be data-dependent on
> > *either* p or q, but you can't really tell which, since the compiler
> > can decide that the values are interchangeable.
> >
> > I cannot for the life of me come up with a situation where this would
> > matter, though.  If "X" contains a fence, then that fence will be a
> > stronger ordering than anything the consume through "p" would
> > guarantee anyway.  And if "X" does *not* contain a fence, then the
> > atomic reads of p and q are unordered *anyway*, so then whether the
> > ordering to the access through "p" is through p or q is kind of
> > irrelevant.  No?
>
> I can make a contrived litmus test for it, but you are right, the only
> time you can see it happen is when X has no barriers, in which case
> you don't have any ordering anyway -- both the compiler and the CPU can
> reorder the loads into p and q, and the read from p->val can, as you say,
> come from either pointer.
>
> For whatever it is worth, here is the litmus test:
>
>	T1:	p = kmalloc(...);
>		if (p == NULL)
>			deal_with_it();
>		p->a = 42;  /* Each field in its own cache line. */
>		p->b = 43;
>		p->c = 44;
>		atomic_store_explicit(&gp1, p, memory_order_release);
>		p->b = 143;
>		p->c = 144;
>		atomic_store_explicit(&gp2, p, memory_order_release);
>
>	T2:	p = atomic_load_explicit(&gp2, memory_order_consume);
>		r1 = p->b;  /* Guaranteed to get 143. */
>		q = atomic_load_explicit(&gp1, memory_order_consume);
>		if (p == q) {
>			/* The compiler decides that q->c is same as p->c. */
>			r2 = p->c;  /* Could get 44 on weakly ordered system. */
>		}
>
> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> you get.
>
> And publishing a structure via one RCU-protected pointer, updating it,
> then publishing it via another pointer seems to me to be asking for
> trouble anyway.  If you really want to do something like that and still
> see consistency across all the fields in the structure, please put a lock
> in the structure and use it to guard updates and accesses to those fields.

And here is a patch documenting the restrictions for the current Linux
kernel.  The rules change a bit due to rcu_dereference() acting a bit
differently than atomic_load_explicit(&p, memory_order_consume).

Thoughts?
							Thanx, Paul

------------------------------------------------------------------------

documentation: Record rcu_dereference() value mishandling

Recent LKML discussions (see http://lwn.net/Articles/586838/ and
http://lwn.net/Articles/588300/ for the LWN writeups) brought out
some ways of misusing the return value from rcu_dereference() that
are not necessarily completely intuitive.  This commit therefore
documents what can and cannot safely be done with these values.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX
index fa57139f50bf..f773a264ae02 100644
--- a/Documentation/RCU/00-INDEX
+++ b/Documentation/RCU/00-INDEX
@@ -12,6 +12,8 @@ lockdep-splat.txt
 	- RCU Lockdep splats explained.
 NMI-RCU.txt
 	- Using RCU to Protect Dynamic NMI Handlers
+rcu_dereference.txt
+	- Proper care and feeding of return values from rcu_dereference()
 rcubarrier.txt
 	- RCU and Unloadable Modules
 rculist_nulls.txt
diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
index 9d10d1db16a5..877947130ebe 100644
--- a/Documentation/RCU/checklist.txt
+++ b/Documentation/RCU/checklist.txt
@@ -114,12 +114,16 @@ over a rather long period of time, but improvements are always welcome!
 	http://www.openvms.compaq.com/wizard/wiz_2637.html
 
 	The rcu_dereference() primitive is also an excellent
-	documentation aid, letting the person reading the code
-	know exactly which pointers are protected by RCU.
+	documentation aid, letting the person reading the
+	code know exactly which pointers are protected by RCU.
 	Please note that compilers can also reorder code, and
 	they are becoming increasingly aggressive about doing
-	just that.  The rcu_dereference() primitive therefore
-	also prevents destructive compiler optimizations.
+	just that.  The rcu_dereference() primitive therefore also
+	prevents destructive compiler optimizations.  However,
+	with a bit of devious creativity, it is possible to
+	mishandle the return value from rcu_dereference().
+	Please see rcu_dereference.txt in this directory for
+	more information.
 
 	The rcu_dereference() primitive is used by the various
 	"_rcu()" list-traversal primitives, such
diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt
new file mode 100644
index 000000000000..6e72cd8622df
--- /dev/null
+++ b/Documentation/RCU/rcu_dereference.txt
@@ -0,0 +1,365 @@
+PROPER CARE AND FEEDING OF RETURN VALUES FROM rcu_dereference()
+
+Most of the time, you can use values from rcu_dereference() or one of
+the similar primitives without worries.  Dereferencing (prefix "*"),
+field selection ("->"), assignment ("="), address-of ("&"), addition and
+subtraction of constants, and casts all work quite naturally and safely.
+
+It is nevertheless possible to get into trouble with other operations.
+Follow these rules to keep your RCU code working properly:
+
+o	You must use one of the rcu_dereference() family of primitives
+	to load an RCU-protected pointer, otherwise CONFIG_PROVE_RCU
+	will complain.  Worse yet, your code can see random
+	memory-corruption bugs due to games that compilers and DEC
+	Alpha can play.  Without one of the rcu_dereference()
+	primitives, compilers can reload the value, and won't your code
+	have fun with two different values for a single pointer!
+	Without rcu_dereference(), DEC Alpha can load a pointer,
+	dereference that pointer, and return data preceding
+	initialization that preceded the store of the pointer.
+
+	In addition, the volatile cast in rcu_dereference() prevents the
+	compiler from deducing the resulting pointer value.  Please see
+	the section entitled "EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH"
+	for an example where the compiler can in fact deduce the exact
+	value of the pointer, and thus cause misordering.
+
+o	Do not use single-element RCU-protected arrays.  The compiler
+	is within its rights to assume that the value of an index into
+	such an array must necessarily evaluate to zero.  The compiler
+	could then substitute the constant zero for the computation, so
+	that the array index no longer depended on the value returned
+	by rcu_dereference().  If the array index no longer depends
+	on rcu_dereference(), then both the compiler and the CPU
+	are within their rights to order the array access before the
+	rcu_dereference(), which can cause the array access to return
+	garbage.
+
+o	Avoid cancellation when using the "+" and "-" infix arithmetic
+	operators.  For example, for a given variable "x", avoid
+	"(x-x)".  There are similar arithmetic pitfalls from other
+	arithmetic operators, such as "(x*0)", "(x/(x+1))" or "(x%1)".
+	The compiler is within its rights to substitute zero for all of
+	these expressions, so that subsequent accesses no longer depend
+	on the rcu_dereference(), again possibly resulting in bugs due
+	to misordering.
+
+	Of course, if "p" is a pointer from rcu_dereference(), and "a"
+	and "b" are integers that happen to be equal, the expression
+	"p+a-b" is safe because its value still necessarily depends on
+	the rcu_dereference(), thus maintaining proper ordering.
+
+o	Avoid all-zero operands to the bitwise "&" operator, and
+	similarly avoid all-ones operands to the bitwise "|" operator.
+	If the compiler is able to deduce the value of such operands,
+	it is within its rights to substitute the corresponding constant
+	for the bitwise operation.  Once again, this causes subsequent
+	accesses to no longer depend on the rcu_dereference(), causing
+	bugs due to misordering.
+
+	Please note that single-bit operands to bitwise "&" can also
+	be dangerous.  At this point, the compiler knows that the
+	resulting value can only take on one of two possible values.
+	Therefore, a very small amount of additional information will
+	allow the compiler to deduce the exact value, which again can
+	result in misordering.
+
+o	If you are using RCU to protect JITed functions, so that the
+	"()" function-invocation operator is applied to a value obtained
+	(directly or indirectly) from rcu_dereference(), you may need to
+	interact directly with the hardware to flush instruction caches.
+	This issue arises on some systems when a newly JITed function is
+	using the same memory that was used by an earlier JITed function.
+
+o	Do not use the results from the boolean "&&" and "||" when
+	dereferencing.  For example, the following (rather improbable)
+	code is buggy:
+
+		int a[2];
+		int index;
+		int force_zero_index = 1;
+
+		...
+
+		r1 = rcu_dereference(i1);
+		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
+
+	The reason this is buggy is that "&&" and "||" are often
+	compiled using branches.  While weak-memory machines such as
+	ARM or PowerPC do order stores after such branches, they can
+	speculate loads, which can result in misordering bugs.
+
+o	Do not use the results from relational operators ("==", "!=",
+	">", ">=", "<", or "<=") when dereferencing.  For example,
+	the following (quite strange) code is buggy:
+
+		int a[2];
+		int index;
+		int flip_index = 0;
+
+		...
+
+		r1 = rcu_dereference(i1);
+		r2 = a[r1 != flip_index];  /* BUGGY!!! */
+
+	As before, the reason this is buggy is that relational operators
+	are often compiled using branches.  And as before, although
+	weak-memory machines such as ARM or PowerPC do order stores
+	after such branches, they can speculate loads, which can again
+	result in misordering bugs.
+
+o	Be very careful about comparing pointers obtained from
+	rcu_dereference() against non-NULL values.  As Linus Torvalds
+	explained, if the two pointers are equal, the compiler could
+	substitute the pointer you are comparing against for the pointer
+	obtained from rcu_dereference().  For example:
+
+		p = rcu_dereference(gp);
+		if (p == &default_struct)
+			do_default(p->a);
+
+	Because the compiler now knows that the value of "p" is exactly
+	the address of the variable "default_struct", it is free to
+	transform this code into the following:
+
+		p = rcu_dereference(gp);
+		if (p == &default_struct)
+			do_default(default_struct.a);
+
+	On ARM and Power hardware, the load from "default_struct.a"
+	can now be speculated, such that it might happen before the
+	rcu_dereference().  This could result in bugs due to misordering.
+
+	However, comparisons are OK in the following cases:
+
+	o	The comparison was against the NULL pointer.  If the
+		compiler knows that the pointer is NULL, you had better
+		not be dereferencing it anyway.  If the comparison is
+		non-equal, the compiler is none the wiser.  Therefore,
+		it is safe to compare pointers from rcu_dereference()
+		against NULL pointers.
+
+	o	The pointer is never dereferenced after being compared.
+		Since there are no subsequent dereferences, the compiler
+		cannot use anything it learned from the comparison
+		to reorder the non-existent subsequent dereferences.
+		This sort of comparison occurs frequently when scanning
+		RCU-protected circular linked lists.
+
+	o	The comparison is against a pointer that references
+		memory that was initialized "a long time ago."  The
+		reason this is safe is that even if misordering occurs,
+		the misordering will not affect the accesses that
+		follow the comparison.  So exactly how long ago is
+		"a long time ago"?  Here are some possibilities:
+
+		o	Compile time.
+
+		o	Boot time.
+
+		o	Module-init time for module code.
+
+		o	Prior to kthread creation for kthread code.
+
+		o	During some prior acquisition of the lock that
+			we now hold.
+
+		o	Before mod_timer() time for a timer handler.
+
+		There are many other possibilities involving the Linux
+		kernel's wide array of primitives that cause code to
+		be invoked at a later time.
+
+	o	The pointer being compared against also came from
+		rcu_dereference().  In this case, both pointers depend
+		on one rcu_dereference() or another, so you get proper
+		ordering either way.
+
+		That said, this situation can make certain RCU usage
+		bugs more likely to happen.  Which can be a good thing,
+		at least if they happen during testing.  An example
+		of such an RCU usage bug is shown in the section titled
+		"EXAMPLE OF AMPLIFIED RCU-USAGE BUG".
+
+	o	All of the accesses following the comparison are stores,
+		so that a control dependency preserves the needed
+		ordering.  That said, it is easy to get control
+		dependencies wrong.  Please see the "CONTROL
+		DEPENDENCIES" section of Documentation/memory-barriers.txt
+		for more details.
+
+	o	The pointers compared not-equal -and- the compiler does
+		not have enough information to deduce the value of the
+		pointer.  Note that the volatile cast in rcu_dereference()
+		will normally prevent the compiler from knowing too much.
+
+o	Disable any value-speculation optimizations that your compiler
+	might provide, especially if you are making use of feedback-based
+	optimizations that take data collected from prior runs.  Such
+	value-speculation optimizations reorder operations by design.
+
+	There is one exception to this rule: Value-speculation
+	optimizations that leverage the branch-prediction hardware are
+	safe on strongly ordered systems (such as x86), but not on weakly
+	ordered systems (such as ARM or Power).  Choose your compiler
+	command-line options wisely!
+
+
+EXAMPLE OF AMPLIFIED RCU-USAGE BUG
+
+Because updaters can run concurrently with RCU readers, RCU readers can
+see stale and/or inconsistent values.  If RCU readers need fresh or
+consistent values, which they sometimes do, they need to take proper
+precautions.  To see this, consider the following code fragment:
+
+	struct foo {
+		int a;
+		int b;
+		int c;
+	};
+	struct foo *gp1;
+	struct foo *gp2;
+
+	void updater(void)
+	{
+		struct foo *p;
+
+		p = kmalloc(...);
+		if (p == NULL)
+			deal_with_it();
+		p->a = 42;  /* Each field in its own cache line. */
+		p->b = 43;
+		p->c = 44;
+		rcu_assign_pointer(gp1, p);
+		p->b = 143;
+		p->c = 144;
+		rcu_assign_pointer(gp2, p);
+	}
+
+	void reader(void)
+	{
+		struct foo *p;
+		struct foo *q;
+		int r1, r2;
+
+		p = rcu_dereference(gp2);
+		r1 = p->b;  /* Guaranteed to get 143. */
+		q = rcu_dereference(gp1);
+		if (p == q) {
+			/* The compiler decides that q->c is same as p->c. */
+			r2 = p->c;  /* Could get 44 on weakly ordered system. */
+		}
+	}
+
+You might be surprised that the outcome (r1 == 143 && r2 == 44) is
+possible, but you should not be.  After all, the updater might have been
+invoked a second time between the time reader() loaded into "r1" and
+the time that it loaded into "r2".  The fact that this same result can
+occur due to some reordering from the compiler and CPUs is beside the
+point.
+
+But suppose that the reader needs a consistent view?
+
+Then one approach is to use locking, for example, as follows:
+
+	struct foo {
+		int a;
+		int b;
+		int c;
+		spinlock_t lock;
+	};
+	struct foo *gp1;
+	struct foo *gp2;
+
+	void updater(void)
+	{
+		struct foo *p;
+
+		p = kmalloc(...);
+		if (p == NULL)
+			deal_with_it();
+		spin_lock(&p->lock);
+		p->a = 42;  /* Each field in its own cache line. */
+		p->b = 43;
+		p->c = 44;
+		spin_unlock(&p->lock);
+		rcu_assign_pointer(gp1, p);
+		spin_lock(&p->lock);
+		p->b = 143;
+		p->c = 144;
+		spin_unlock(&p->lock);
+		rcu_assign_pointer(gp2, p);
+	}
+
+	void reader(void)
+	{
+		struct foo *p;
+		struct foo *q;
+		int r1, r2;
+
+		p = rcu_dereference(gp2);
+		spin_lock(&p->lock);
+		r1 = p->b;  /* Guaranteed to get 143. */
+		q = rcu_dereference(gp1);
+		if (p == q) {
+			/* The compiler decides that q->c is same as p->c. */
+			r2 = p->c;  /* Could get 44 on weakly ordered system. */
+		}
+		spin_unlock(&p->lock);
+	}
+
+As always, use the right tool for the job!
+
+
+EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH
+
+If a pointer obtained from rcu_dereference() compares not-equal to some
+other pointer, the compiler normally has no clue what the value of the
+first pointer might be.  This lack of knowledge prevents the compiler
+from carrying out optimizations that otherwise might destroy the ordering
+guarantees that RCU depends on.  And the volatile cast in rcu_dereference()
+should prevent the compiler from guessing the value.
+
+But without rcu_dereference(), the compiler knows more than you might
+expect.  Consider the following code fragment:
+
+	struct foo {
+		int a;
+		int b;
+	};
+	static struct foo variable1;
+	static struct foo variable2;
+	static struct foo *gp = &variable1;
+
+	void updater(void)
+	{
+		initialize_foo(&variable2);
+		rcu_assign_pointer(gp, &variable2);
+		/*
+		 * The above is the only store to gp in this translation unit,
+		 * and the address of gp is not exported in any way.
+		 */
+	}
+
+	int reader(void)
+	{
+		struct foo *p;
+
+		p = gp;
+		barrier();
+		if (p == &variable1)
+			return p->a;  /* Must be variable1.a. */
+		else
+			return p->b;  /* Must be variable2.b. */
+	}
+
+Because the compiler can see all stores to "gp", it knows that the only
+possible values of "gp" are "variable1" on the one hand and "variable2"
+on the other.  The comparison in reader() therefore tells the compiler
+the exact value of "p" even in the not-equals case.  This allows the
+compiler to make the return values independent of the load from "gp",
+in turn destroying the ordering between this load and the loads of the
+return values.  This can result in "p->b" returning pre-initialization
+garbage values.
+
+In short, rcu_dereference() is -not- optional when you are going to
+dereference the resulting pointer.

^ permalink raw reply related	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-03-01  0:50                   ` Paul E. McKenney
@ 2014-03-01 10:06                     ` Peter Sewell
  2014-03-01 14:03                       ` Paul E. McKenney
  2014-03-03 18:55                       ` Torvald Riegel
  1 sibling, 1 reply; 285+ messages in thread
From: Peter Sewell @ 2014-03-01 10:06 UTC (permalink / raw)
To: Paul McKenney
Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra,
	Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
	akpm, mingo, gcc

Hi Paul,

On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote:
> On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote:
>> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote:
>> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney
>> > <paulmck@linux.vnet.ibm.com> wrote:
>> > >
>> > > 3.	The comparison was against another RCU-protected pointer,
>> > >	where that other pointer was properly fetched using one
>> > >	of the RCU primitives.  Here it doesn't matter which pointer
>> > >	you use.  At least as long as the rcu_assign_pointer() for
>> > >	that other pointer happened after the last update to the
>> > >	pointed-to structure.
>> > >
>> > >	I am a bit nervous about #3.  Any thoughts on it?
>> >
>> > I think that it might be worth pointing out as an example, and saying
>> > that code like
>> >
>> >	p = atomic_read(consume);
>> >	X;
>> >	q = atomic_read(consume);
>> >	Y;
>> >	if (p == q)
>> >		data = p->val;
>> >
>> > then the access of "p->val" is constrained to be data-dependent on
>> > *either* p or q, but you can't really tell which, since the compiler
>> > can decide that the values are interchangeable.
>> >
>> > I cannot for the life of me come up with a situation where this would
>> > matter, though.  If "X" contains a fence, then that fence will be a
>> > stronger ordering than anything the consume through "p" would
>> > guarantee anyway.  And if "X" does *not* contain a fence, then the
>> > atomic reads of p and q are unordered *anyway*, so then whether the
>> > ordering to the access through "p" is through p or q is kind of
>> > irrelevant.  No?
>>
>> I can make a contrived litmus test for it, but you are right, the only
>> time you can see it happen is when X has no barriers, in which case
>> you don't have any ordering anyway -- both the compiler and the CPU can
>> reorder the loads into p and q, and the read from p->val can, as you say,
>> come from either pointer.
>>
>> For whatever it is worth, here is the litmus test:
>>
>>	T1:	p = kmalloc(...);
>>		if (p == NULL)
>>			deal_with_it();
>>		p->a = 42;  /* Each field in its own cache line. */
>>		p->b = 43;
>>		p->c = 44;
>>		atomic_store_explicit(&gp1, p, memory_order_release);
>>		p->b = 143;
>>		p->c = 144;
>>		atomic_store_explicit(&gp2, p, memory_order_release);
>>
>>	T2:	p = atomic_load_explicit(&gp2, memory_order_consume);
>>		r1 = p->b;  /* Guaranteed to get 143. */
>>		q = atomic_load_explicit(&gp1, memory_order_consume);
>>		if (p == q) {
>>			/* The compiler decides that q->c is same as p->c. */
>>			r2 = p->c;  /* Could get 44 on weakly ordered system. */
>>		}
>>
>> The loads from gp1 and gp2 are, as you say, unordered, so you get what
>> you get.
>>
>> And publishing a structure via one RCU-protected pointer, updating it,
>> then publishing it via another pointer seems to me to be asking for
>> trouble anyway.  If you really want to do something like that and still
>> see consistency across all the fields in the structure, please put a lock
>> in the structure and use it to guard updates and accesses to those fields.
>
> And here is a patch documenting the restrictions for the current Linux
> kernel.  The rules change a bit due to rcu_dereference() acting a bit
> differently than atomic_load_explicit(&p, memory_order_consume).
>
> Thoughts?
That might serve as informal documentation for Linux kernel
programmers about the bounds on the optimisations that you expect
compilers to do for common-case RCU code - and I guess that's what you
intend it to be for.  But I don't see how one can make it precise
enough to serve as a language definition, so that compiler people
could confidently say "yes, we respect that", which I guess is what
you really need.  As a useful criterion, we should aim for something
precise enough that in a verified-compiler context you can
mathematically prove that the compiler will satisfy it (even though
that won't happen anytime soon for GCC), and that analysis tool
authors can actually know what they're working with.  All this stuff
about "you should avoid cancellation", and "avoid masking with just a
small number of bits" is just too vague.

The basic problem is that the compiler may be doing sophisticated
reasoning with a bunch of non-local knowledge that it's deduced from
the code, neither of which is well understood, and here we have to
identify some envelope, expressive enough for RCU idioms, in which
that reasoning doesn't allow data/address dependencies to be removed
(and hence the hardware guarantee about them will be maintained at the
source level).

The C11 syntactic notion of dependency, whatever its faults, was at
least precise, could be reasoned about locally (just looking at the
syntactic code in question), and did do that.  The fact that current
compilers do optimisations that remove dependencies and will likely
have many bugs at present is beside the point - this was surely
intended as a *new* constraint on what they are allowed to do.  The
interesting question is really whether the compiler writers think that
they *could* implement it in a reasonable way - I'd like to hear
Torvald and his colleagues' opinion on that.

What you're doing above seems to be basically a very cut-down version
of that, but with a fuzzy boundary.
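[Editorial aside: the "avoid cancellation" guidance under discussion can be made concrete with a minimal single-threaded sketch (illustrative names, not from the thread). The two functions below are observably identical to a sequential program, which is exactly why a compiler may rewrite the first into the second, discarding the address dependency that weakly ordered hardware would otherwise honor.]

```c
#include <stdatomic.h>

struct foo { int a; };
static _Atomic(struct foo *) gp;

/* As written: the array index is computed from the consume load, so the
 * arr[] access carries a syntactic dependency on the load of gp. */
int reader_with_dependency(int *arr)
{
	struct foo *p = atomic_load_explicit(&gp, memory_order_consume);
	int x = p->a;

	return arr[x - x];	/* always index 0, but written to depend on x */
}

/* What the compiler may legally produce: x - x folded to 0, the
 * dependency gone, and the arr[] load free to be hoisted before the
 * load of gp on weakly ordered hardware. */
int reader_after_cancellation(int *arr)
{
	(void)atomic_load_explicit(&gp, memory_order_consume);
	return arr[0];		/* constant index; no dependency remains */
}
```

No single-threaded test can distinguish the two, which is the crux of the vagueness complaint: the envelope has to be defined in terms of what transformations are permitted, not in terms of observable sequential behaviour.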
If you want it to be precise, maybe it
needs to be much simpler (which might force you into ruling out some
current code idioms).

best,
Peter

>							Thanx, Paul
>
> ------------------------------------------------------------------------
>
> documentation: Record rcu_dereference() value mishandling
>
> Recent LKML discussions (see http://lwn.net/Articles/586838/ and
> http://lwn.net/Articles/588300/ for the LWN writeups) brought out
> some ways of misusing the return value from rcu_dereference() that
> are not necessarily completely intuitive.  This commit therefore
> documents what can and cannot safely be done with these values.
>
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
> diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX
> index fa57139f50bf..f773a264ae02 100644
> --- a/Documentation/RCU/00-INDEX
> +++ b/Documentation/RCU/00-INDEX
> @@ -12,6 +12,8 @@ lockdep-splat.txt
>  	- RCU Lockdep splats explained.
>  NMI-RCU.txt
>  	- Using RCU to Protect Dynamic NMI Handlers
> +rcu_dereference.txt
> +	- Proper care and feeding of return values from rcu_dereference()
>  rcubarrier.txt
>  	- RCU and Unloadable Modules
>  rculist_nulls.txt
> diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt
> index 9d10d1db16a5..877947130ebe 100644
> --- a/Documentation/RCU/checklist.txt
> +++ b/Documentation/RCU/checklist.txt
> @@ -114,12 +114,16 @@ over a rather long period of time, but improvements are always welcome!
>  	http://www.openvms.compaq.com/wizard/wiz_2637.html
>
>  	The rcu_dereference() primitive is also an excellent
> -	documentation aid, letting the person reading the code
> -	know exactly which pointers are protected by RCU.
> +	documentation aid, letting the person reading the
> +	code know exactly which pointers are protected by RCU.
>  	Please note that compilers can also reorder code, and
>  	they are becoming increasingly aggressive about doing
> -	just that.
The rcu_dereference() primitive therefore
> -	also prevents destructive compiler optimizations.
> +	just that.  The rcu_dereference() primitive therefore also
> +	prevents destructive compiler optimizations.  However,
> +	with a bit of devious creativity, it is possible to
> +	mishandle the return value from rcu_dereference().
> +	Please see rcu_dereference.txt in this directory for
> +	more information.
>
>  	The rcu_dereference() primitive is used by the
>  	various "_rcu()" list-traversal primitives, such
> diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt
> new file mode 100644
> index 000000000000..6e72cd8622df
> --- /dev/null
> +++ b/Documentation/RCU/rcu_dereference.txt
> @@ -0,0 +1,365 @@
> +PROPER CARE AND FEEDING OF RETURN VALUES FROM rcu_dereference()
> +
> +Most of the time, you can use values from rcu_dereference() or one of
> +the similar primitives without worries.  Dereferencing (prefix "*"),
> +field selection ("->"), assignment ("="), address-of ("&"), addition and
> +subtraction of constants, and casts all work quite naturally and safely.
> +
> +It is nevertheless possible to get into trouble with other operations.
> +Follow these rules to keep your RCU code working properly:
> +
> +o	You must use one of the rcu_dereference() family of primitives
> +	to load an RCU-protected pointer, otherwise CONFIG_PROVE_RCU
> +	will complain.  Worse yet, your code can see random memory-corruption
> +	bugs due to games that compilers and DEC Alpha can play.
> +	Without one of the rcu_dereference() primitives, compilers
> +	can reload the value, and won't your code have fun with two
> +	different values for a single pointer!  Without rcu_dereference(),
> +	DEC Alpha can load a pointer, dereference that pointer, and
> +	return data preceding initialization that preceded the store of
> +	the pointer.
> +
> +	In addition, the volatile cast in rcu_dereference() prevents the
> +	compiler from deducing the resulting pointer value.
Please see
> +	the section entitled "EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH"
> +	for an example where the compiler can in fact deduce the exact
> +	value of the pointer, and thus cause misordering.
> +
> +o	Do not use single-element RCU-protected arrays.  The compiler
> +	is within its rights to assume that the value of an index into
> +	such an array must necessarily evaluate to zero.  The compiler
> +	could then substitute the constant zero for the computation, so
> +	that the array index no longer depends on the value returned
> +	by rcu_dereference().  If the array index no longer depends
> +	on rcu_dereference(), then both the compiler and the CPU
> +	are within their rights to order the array access before the
> +	rcu_dereference(), which can cause the array access to return
> +	garbage.
> +
> +o	Avoid cancellation when using the "+" and "-" infix arithmetic
> +	operators.  For example, for a given variable "x", avoid
> +	"(x-x)".  There are similar arithmetic pitfalls from other
> +	arithmetic operators, such as "(x*0)", "(x/(x+1))" or "(x%1)".
> +	The compiler is within its rights to substitute zero for all of
> +	these expressions, so that subsequent accesses no longer depend
> +	on the rcu_dereference(), again possibly resulting in bugs due
> +	to misordering.
> +
> +	Of course, if "p" is a pointer from rcu_dereference(), and "a"
> +	and "b" are integers that happen to be equal, the expression
> +	"p+a-b" is safe because its value still necessarily depends on
> +	the rcu_dereference(), thus maintaining proper ordering.
> +
> +o	Avoid all-zero operands to the bitwise "&" operator, and
> +	similarly avoid all-ones operands to the bitwise "|" operator.
> +	If the compiler is able to deduce the value of such operands,
> +	it is within its rights to substitute the corresponding constant
> +	for the bitwise operation.  Once again, this causes subsequent
> +	accesses to no longer depend on the rcu_dereference(), causing
> +	bugs due to misordering.
> +
> +	Please note that single-bit operands to bitwise "&" can also
> +	be dangerous.  At this point, the compiler knows that the
> +	resulting value can only take on one of two possible values.
> +	Therefore, a very small amount of additional information will
> +	allow the compiler to deduce the exact value, which again can
> +	result in misordering.
> +
> +o	If you are using RCU to protect JITed functions, so that the
> +	"()" function-invocation operator is applied to a value obtained
> +	(directly or indirectly) from rcu_dereference(), you may need to
> +	interact directly with the hardware to flush instruction caches.
> +	This issue arises on some systems when a newly JITed function is
> +	using the same memory that was used by an earlier JITed function.
> +
> +o	Do not use the results from the boolean "&&" and "||" when
> +	dereferencing.  For example, the following (rather improbable)
> +	code is buggy:
> +
> +		int a[2];
> +		int index;
> +		int force_zero_index = 1;
> +
> +		...
> +
> +		r1 = rcu_dereference(i1);
> +		r2 = a[r1 && force_zero_index];  /* BUGGY!!! */
> +
> +	The reason this is buggy is that "&&" and "||" are often compiled
> +	using branches.  While weak-memory machines such as ARM or PowerPC
> +	do order stores after such branches, they can speculate loads,
> +	which can result in misordering bugs.
> +
> +o	Do not use the results from relational operators ("==", "!=",
> +	">", ">=", "<", or "<=") when dereferencing.  For example,
> +	the following (quite strange) code is buggy:
> +
> +		int a[2];
> +		int index;
> +		int flip_index = 0;
> +
> +		...
> +
> +		r1 = rcu_dereference(i1);
> +		r2 = a[r1 != flip_index];  /* BUGGY!!! */
> +
> +	As before, the reason this is buggy is that relational operators
> +	are often compiled using branches.  And as before, although
> +	weak-memory machines such as ARM or PowerPC do order stores
> +	after such branches, they can speculate loads, which can again
> +	result in misordering bugs.
> +
> +o	Be very careful about comparing pointers obtained from
> +	rcu_dereference() against non-NULL values.  As Linus Torvalds
> +	explained, if the two pointers are equal, the compiler could
> +	substitute the pointer you are comparing against for the pointer
> +	obtained from rcu_dereference().  For example:
> +
> +		p = rcu_dereference(gp);
> +		if (p == &default_struct)
> +			do_default(p->a);
> +
> +	Because the compiler now knows that the value of "p" is exactly
> +	the address of the variable "default_struct", it is free to
> +	transform this code into the following:
> +
> +		p = rcu_dereference(gp);
> +		if (p == &default_struct)
> +			do_default(default_struct.a);
> +
> +	On ARM and Power hardware, the load from "default_struct.a"
> +	can now be speculated, such that it might happen before the
> +	rcu_dereference().  This could result in bugs due to misordering.
> +
> +	However, comparisons are OK in the following cases:
> +
> +	o	The comparison was against the NULL pointer.  If the
> +		compiler knows that the pointer is NULL, you had better
> +		not be dereferencing it anyway.  If the comparison is
> +		non-equal, the compiler is none the wiser.  Therefore,
> +		it is safe to compare pointers from rcu_dereference()
> +		against NULL pointers.
> +
> +	o	The pointer is never dereferenced after being compared.
> +		Since there are no subsequent dereferences, the compiler
> +		cannot use anything it learned from the comparison
> +		to reorder the non-existent subsequent dereferences.
> +		This sort of comparison occurs frequently when scanning
> +		RCU-protected circular linked lists.
> +
> +	o	The comparison is against a pointer that references
> +		memory that was initialized "a long time ago."  The
> +		reason this is safe is that even if misordering
> +		occurs, the misordering will not affect the accesses
> +		that follow the comparison.  So exactly how long ago is
> +		"a long time ago"?  Here are some possibilities:
> +
> +		o	Compile time.
> +
> +		o	Boot time.
> +
> +		o	Module-init time for module code.
> +
> +		o	Prior to kthread creation for kthread code.
> +
> +		o	During some prior acquisition of the lock that
> +			we now hold.
> +
> +		o	Before mod_timer() time for a timer handler.
> +
> +		There are many other possibilities involving the Linux
> +		kernel's wide array of primitives that cause code to
> +		be invoked at a later time.
> +
> +	o	The pointer being compared against also came from
> +		rcu_dereference().  In this case, both pointers depend
> +		on one rcu_dereference() or another, so you get proper
> +		ordering either way.
> +
> +		That said, this situation can make certain RCU usage
> +		bugs more likely to happen, which can be a good thing,
> +		at least if they happen during testing.  An example
> +		of such an RCU usage bug is shown in the section titled
> +		"EXAMPLE OF AMPLIFIED RCU-USAGE BUG".
> +
> +	o	All of the accesses following the comparison are stores,
> +		so that a control dependency preserves the needed ordering.
> +		That said, it is easy to get control dependencies wrong.
> +		Please see the "CONTROL DEPENDENCIES" section of
> +		Documentation/memory-barriers.txt for more details.
> +
> +	o	The pointers compared not-equal -and- the compiler does
> +		not have enough information to deduce the value of the
> +		pointer.  Note that the volatile cast in rcu_dereference()
> +		will normally prevent the compiler from knowing too much.
> +
> +o	Disable any value-speculation optimizations that your compiler
> +	might provide, especially if you are making use of feedback-based
> +	optimizations that take data collected from prior runs.  Such
> +	value-speculation optimizations reorder operations by design.
> +
> +	There is one exception to this rule:  Value-speculation
> +	optimizations that leverage the branch-prediction hardware are
> +	safe on strongly ordered systems (such as x86), but not on weakly
> +	ordered systems (such as ARM or Power).  Choose your compiler
> +	command-line options wisely!
> +
> +
> +EXAMPLE OF AMPLIFIED RCU-USAGE BUG
> +
> +Because updaters can run concurrently with RCU readers, RCU readers can
> +see stale and/or inconsistent values.  If RCU readers need fresh or
> +consistent values, which they sometimes do, they need to take proper
> +precautions.  To see this, consider the following code fragment:
> +
> +	struct foo {
> +		int a;
> +		int b;
> +		int c;
> +	};
> +	struct foo *gp1;
> +	struct foo *gp2;
> +
> +	void updater(void)
> +	{
> +		struct foo *p;
> +
> +		p = kmalloc(...);
> +		if (p == NULL)
> +			deal_with_it();
> +		p->a = 42;  /* Each field in its own cache line. */
> +		p->b = 43;
> +		p->c = 44;
> +		rcu_assign_pointer(gp1, p);
> +		p->b = 143;
> +		p->c = 144;
> +		rcu_assign_pointer(gp2, p);
> +	}
> +
> +	void reader(void)
> +	{
> +		struct foo *p;
> +		struct foo *q;
> +		int r1, r2;
> +
> +		p = rcu_dereference(gp2);
> +		r1 = p->b;  /* Guaranteed to get 143. */
> +		q = rcu_dereference(gp1);
> +		if (p == q) {
> +			/* The compiler decides that q->c is same as p->c. */
> +			r2 = p->c; /* Could get 44 on weakly ordered system. */
> +		}
> +	}
> +
> +You might be surprised that the outcome (r1 == 143 && r2 == 44) is possible,
> +but you should not be.  After all, the updater might have been invoked
> +a second time between the time reader() loaded into "r1" and the time
> +that it loaded into "r2".  The fact that this same result can occur due
> +to some reordering from the compiler and CPUs is beside the point.
> +
> +But suppose that the reader needs a consistent view?
> +
> +Then one approach is to use locking, for example, as follows:
> +
> +	struct foo {
> +		int a;
> +		int b;
> +		int c;
> +		spinlock_t lock;
> +	};
> +	struct foo *gp1;
> +	struct foo *gp2;
> +
> +	void updater(void)
> +	{
> +		struct foo *p;
> +
> +		p = kmalloc(...);
> +		if (p == NULL)
> +			deal_with_it();
> +		spin_lock(&p->lock);
> +		p->a = 42;  /* Each field in its own cache line. */
> +		p->b = 43;
> +		p->c = 44;
> +		spin_unlock(&p->lock);
> +		rcu_assign_pointer(gp1, p);
> +		spin_lock(&p->lock);
> +		p->b = 143;
> +		p->c = 144;
> +		spin_unlock(&p->lock);
> +		rcu_assign_pointer(gp2, p);
> +	}
> +
> +	void reader(void)
> +	{
> +		struct foo *p;
> +		struct foo *q;
> +		int r1, r2;
> +
> +		p = rcu_dereference(gp2);
> +		spin_lock(&p->lock);
> +		r1 = p->b;  /* Guaranteed to get 143. */
> +		q = rcu_dereference(gp1);
> +		if (p == q) {
> +			/* The compiler decides that q->c is same as p->c. */
> +			r2 = p->c; /* Could get 44 on weakly ordered system. */
> +		}
> +		spin_unlock(&p->lock);
> +	}
> +
> +As always, use the right tool for the job!
> +
> +
> +EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH
> +
> +If a pointer obtained from rcu_dereference() compares not-equal to some
> +other pointer, the compiler normally has no clue what the value of the
> +first pointer might be.  This lack of knowledge prevents the compiler
> +from carrying out optimizations that otherwise might destroy the ordering
> +guarantees that RCU depends on.  And the volatile cast in rcu_dereference()
> +should prevent the compiler from guessing the value.
> +
> +But without rcu_dereference(), the compiler knows more than you might
> +expect.  Consider the following code fragment:
> +
> +	struct foo {
> +		int a;
> +		int b;
> +	};
> +	static struct foo variable1;
> +	static struct foo variable2;
> +	static struct foo *gp = &variable1;
> +
> +	void updater(void)
> +	{
> +		initialize_foo(&variable2);
> +		rcu_assign_pointer(gp, &variable2);
> +		/*
> +		 * The above is the only store to gp in this translation unit,
> +		 * and the address of gp is not exported in any way.
> +		 */
> +	}
> +
> +	int reader(void)
> +	{
> +		struct foo *p;
> +
> +		p = gp;
> +		barrier();
> +		if (p == &variable1)
> +			return p->a; /* Must be variable1.a. */
> +		else
> +			return p->b; /* Must be variable2.b. */
> +	}
> +
> +Because the compiler can see all stores to "gp", it knows that the only
> +possible values of "gp" are "&variable1" on the one hand and "&variable2"
> +on the other.  The comparison in reader() therefore tells the compiler
> +the exact value of "p" even in the not-equals case.  This allows the
> +compiler to make the return values independent of the load from "gp",
> +in turn destroying the ordering between this load and the loads of the
> +return values.  This can result in "p->b" returning pre-initialization
> +garbage values.
> +
> +In short, rcu_dereference() is -not- optional when you are going to
> +dereference the resulting pointer.
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-01 10:06 ` Peter Sewell @ 2014-03-01 14:03 ` Paul E. McKenney 2014-03-02 10:05 ` Peter Sewell 0 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-03-01 14:03 UTC (permalink / raw) To: Peter Sewell Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: > Hi Paul, > > On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: > >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: > >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney > >> > <paulmck@linux.vnet.ibm.com> wrote: > >> > > > >> > > 3. The comparison was against another RCU-protected pointer, > >> > > where that other pointer was properly fetched using one > >> > > of the RCU primitives. Here it doesn't matter which pointer > >> > > you use. At least as long as the rcu_assign_pointer() for > >> > > that other pointer happened after the last update to the > >> > > pointed-to structure. > >> > > > >> > > I am a bit nervous about #3. Any thoughts on it? > >> > > >> > I think that it might be worth pointing out as an example, and saying > >> > that code like > >> > > >> > p = atomic_read(consume); > >> > X; > >> > q = atomic_read(consume); > >> > Y; > >> > if (p == q) > >> > data = p->val; > >> > > >> > then the access of "p->val" is constrained to be data-dependent on > >> > *either* p or q, but you can't really tell which, since the compiler > >> > can decide that the values are interchangeable. > >> > > >> > I cannot for the life of me come up with a situation where this would > >> > matter, though. If "X" contains a fence, then that fence will be a > >> > stronger ordering than anything the consume through "p" would > >> > guarantee anyway. 
And if "X" does *not* contain a fence, then the > >> > atomic reads of p and q are unordered *anyway*, so then whether the > >> > ordering to the access through "p" is through p or q is kind of > >> > irrelevant. No? > >> > >> I can make a contrived litmus test for it, but you are right, the only > >> time you can see it happen is when X has no barriers, in which case > >> you don't have any ordering anyway -- both the compiler and the CPU can > >> reorder the loads into p and q, and the read from p->val can, as you say, > >> come from either pointer. > >> > >> For whatever it is worth, hear is the litmus test: > >> > >> T1: p = kmalloc(...); > >> if (p == NULL) > >> deal_with_it(); > >> p->a = 42; /* Each field in its own cache line. */ > >> p->b = 43; > >> p->c = 44; > >> atomic_store_explicit(&gp1, p, memory_order_release); > >> p->b = 143; > >> p->c = 144; > >> atomic_store_explicit(&gp2, p, memory_order_release); > >> > >> T2: p = atomic_load_explicit(&gp2, memory_order_consume); > >> r1 = p->b; /* Guaranteed to get 143. */ > >> q = atomic_load_explicit(&gp1, memory_order_consume); > >> if (p == q) { > >> /* The compiler decides that q->c is same as p->c. */ > >> r2 = p->c; /* Could get 44 on weakly order system. */ > >> } > >> > >> The loads from gp1 and gp2 are, as you say, unordered, so you get what > >> you get. > >> > >> And publishing a structure via one RCU-protected pointer, updating it, > >> then publishing it via another pointer seems to me to be asking for > >> trouble anyway. If you really want to do something like that and still > >> see consistency across all the fields in the structure, please put a lock > >> in the structure and use it to guard updates and accesses to those fields. > > > > And here is a patch documenting the restrictions for the current Linux > > kernel. The rules change a bit due to rcu_dereference() acting a bit > > differently than atomic_load_explicit(&p, memory_order_consume). > > > > Thoughts? 
> > That might serve as informal documentation for linux kernel > programmers about the bounds on the optimisations that you expect > compilers to do for common-case RCU code - and I guess that's what you > intend it to be for. But I don't see how one can make it precise > enough to serve as a language definition, so that compiler people > could confidently say "yes, we respect that", which I guess is what > you really need. As a useful criterion, we should aim for something > precise enough that in a verified-compiler context you can > mathematically prove that the compiler will satisfy it (even though > that won't happen anytime soon for GCC), and that analysis tool > authors can actually know what they're working with. All this stuff > about "you should avoid cancellation", and "avoid masking with just a > small number of bits" is just too vague. Understood, and yes, this is intended to document current compiler behavior for the Linux kernel community. It would not make sense to show it to the C11 or C++11 communities, except perhaps as an informational piece on current practice. > The basic problem is that the compiler may be doing sophisticated > reasoning with a bunch of non-local knowledge that it's deduced from > the code, neither of which are well-understood, and here we have to > identify some envelope, expressive enough for RCU idioms, in which > that reasoning doesn't allow data/address dependencies to be removed > (and hence the hardware guarantee about them will be maintained at the > source level). > > The C11 syntactic notion of dependency, whatever its faults, was at > least precise, could be reasoned about locally (just looking at the > syntactic code in question), and did do that. The fact that current > compilers do optimisations that remove dependencies and will likely > have many bugs at present is besides the point - this was surely > intended as a *new* constraint on what they are allowed to do. 
The > interesting question is really whether the compiler writers think that > they *could* implement it in a reasonable way - I'd like to hear > Torvald and his colleagues' opinion on that. > > What you're doing above seems to be basically a very cut-down version > of that, but with a fuzzy boundary. If you want it to be precise, > maybe it needs to be much simpler (which might force you into ruling > out some current code idioms). I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806) can be developed to serve this purpose. Thanx, Paul > best, > Peter > > > > > Thanx, Paul > > > > ------------------------------------------------------------------------ > > > > documentation: Record rcu_dereference() value mishandling > > > > Recent LKML discussings (see http://lwn.net/Articles/586838/ and > > http://lwn.net/Articles/588300/ for the LWN writeups) brought out > > some ways of misusing the return value from rcu_dereference() that > > are not necessarily completely intuitive. This commit therefore > > documents what can and cannot safely be done with these values. > > > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > > > > diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX > > index fa57139f50bf..f773a264ae02 100644 > > --- a/Documentation/RCU/00-INDEX > > +++ b/Documentation/RCU/00-INDEX > > @@ -12,6 +12,8 @@ lockdep-splat.txt > > - RCU Lockdep splats explained. > > NMI-RCU.txt > > - Using RCU to Protect Dynamic NMI Handlers > > +rcu_dereference.txt > > + - Proper care and feeding of return values from rcu_dereference() > > rcubarrier.txt > > - RCU and Unloadable Modules > > rculist_nulls.txt > > diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt > > index 9d10d1db16a5..877947130ebe 100644 > > --- a/Documentation/RCU/checklist.txt > > +++ b/Documentation/RCU/checklist.txt > > @@ -114,12 +114,16 @@ over a rather long period of time, but improvements are always welcome! 
> > http://www.openvms.compaq.com/wizard/wiz_2637.html > > > > The rcu_dereference() primitive is also an excellent > > - documentation aid, letting the person reading the code > > - know exactly which pointers are protected by RCU. > > + documentation aid, letting the person reading the > > + code know exactly which pointers are protected by RCU. > > Please note that compilers can also reorder code, and > > they are becoming increasingly aggressive about doing > > - just that. The rcu_dereference() primitive therefore > > - also prevents destructive compiler optimizations. > > + just that. The rcu_dereference() primitive therefore also > > + prevents destructive compiler optimizations. However, > > + with a bit of devious creativity, it is possible to > > + mishandle the return value from rcu_dereference(). > > + Please see rcu_dereference.txt in this directory for > > + more information. > > > > The rcu_dereference() primitive is used by the > > various "_rcu()" list-traversal primitives, such > > diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt > > new file mode 100644 > > index 000000000000..6e72cd8622df > > --- /dev/null > > +++ b/Documentation/RCU/rcu_dereference.txt > > @@ -0,0 +1,365 @@ > > +PROPER CARE AND FEEDING OF RETURN VALUES FROM rcu_dereference() > > + > > +Most of the time, you can use values from rcu_dereference() or one of > > +the similar primitives without worries. Dereferencing (prefix "*"), > > +field selection ("->"), assignment ("="), address-of ("&"), addition and > > +subtraction of constants, and casts all work quite naturally and safely. > > + > > +It is nevertheless possible to get into trouble with other operations. > > +Follow these rules to keep your RCU code working properly: > > + > > +o You must use one of the rcu_dereference() family of primitives > > + to load an RCU-protected pointer, otherwise CONFIG_PROVE_RCU > > + will complain. 
Worse yet, your code can see random memory-corruption > > + bugs due to games that compilers and DEC Alpha can play. > > + Without one of the rcu_dereference() primitives, compilers > > + can reload the value, and won't your code have fun with two > > + different values for a single pointer! Without rcu_dereference(), > > + DEC Alpha can load a pointer, dereference that pointer, and > > + return data preceding initialization that preceded the store of > > + the pointer. > > + > > + In addition, the volatile cast in rcu_dereference() prevents the > > + compiler from deducing the resulting pointer value. Please see > > + the section entitled "EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH" > > + for an example where the compiler can in fact deduce the exact > > + value of the pointer, and thus cause misordering. > > + > > +o Do not use single-element RCU-protected arrays. The compiler > > + is within its right to assume that the value of an index into > > + such an array must necessarily evaluate to zero. The compiler > > + could then substitute the constant zero for the computation, so > > + that the array index no longer depended on the value returned > > + by rcu_dereference(). If the array index no longer depends > > + on rcu_dereference(), then both the compiler and the CPU > > + are within their rights to order the array access before the > > + rcu_dereference(), which can cause the array access to return > > + garbage. > > + > > +o Avoid cancellation when using the "+" and "-" infix arithmetic > > + operators. For example, for a given variable "x", avoid > > + "(x-x)". There are similar arithmetic pitfalls from other > > + arithmetic operatiors, such as "(x*0)", "(x/(x+1))" or "(x%1)". > > + The compiler is within its rights to substitute zero for all of > > + these expressions, so that subsequent accesses no longer depend > > + on the rcu_dereference(), again possibly resulting in bugs due > > + to misordering. 
> > + > > + Of course, if "p" is a pointer from rcu_dereference(), and "a" > > + and "b" are integers that happen to be equal, the expression > > + "p+a-b" is safe because its value still necessarily depends on > > + the rcu_dereference(), thus maintaining proper ordering. > > + > > +o Avoid all-zero operands to the bitwise "&" operator, and > > + similarly avoid all-ones operands to the bitwise "|" operator. > > + If the compiler is able to deduce the value of such operands, > > + it is within its rights to substitute the corresponding constant > > + for the bitwise operation. Once again, this causes subsequent > > + accesses to no longer depend on the rcu_dereference(), causing > > + bugs due to misordering. > > + > > + Please note that single-bit operands to bitwise "&" can also > > + be dangerous. At this point, the compiler knows that the > > + resulting value can only take on one of two possible values. > > + Therefore, a very small amount of additional information will > > + allow the compiler to deduce the exact value, which again can > > + result in misordering. > > + > > +o If you are using RCU to protect JITed functions, so that the > > + "()" function-invocation operator is applied to a value obtained > > + (directly or indirectly) from rcu_dereference(), you may need to > > + interact directly with the hardware to flush instruction caches. > > + This issue arises on some systems when a newly JITed function is > > + using the same memory that was used by an earlier JITed function. > > + > > +o Do not use the results from the boolean "&&" and "||" when > > + dereferencing. For example, the following (rather improbable) > > + code is buggy: > > + > > + int a[2]; > > + int index; > > + int force_zero_index = 1; > > + > > + ... > > + > > + r1 = rcu_dereference(i1) > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > + > > + The reason this is buggy is that "&&" and "||" are often compiled > > + using branches. 
While weak-memory machines such as ARM or PowerPC > > + do order stores after such branches, they can speculate loads, > > + which can result in misordering bugs. > > + > > +o Do not use the results from relational operators ("==", "!=", > > + ">", ">=", "<", or "<=") when dereferencing. For example, > > + the following (quite strange) code is buggy: > > + > > + int a[2]; > > + int index; > > + int flip_index = 0; > > + > > + ... > > + > > + r1 = rcu_dereference(i1); > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > + > > + As before, the reason this is buggy is that relational operators > > + are often compiled using branches. And as before, although > > + weak-memory machines such as ARM or PowerPC do order stores > > + after such branches, they can speculate loads, which can again > > + result in misordering bugs. > > + > > +o Be very careful about comparing pointers obtained from > > + rcu_dereference() against non-NULL values. As Linus Torvalds > > + explained, if the two pointers are equal, the compiler could > > + substitute the pointer you are comparing against for the pointer > > + obtained from rcu_dereference(). For example: > > + > > + p = rcu_dereference(gp); > > + if (p == &default_struct) > > + do_default(p->a); > > + > > + Because the compiler now knows that the value of "p" is exactly > > + the address of the variable "default_struct", it is free to > > + transform this code into the following: > > + > > + p = rcu_dereference(gp); > > + if (p == &default_struct) > > + do_default(default_struct.a); > > + > > + On ARM and Power hardware, the load from "default_struct.a" > > + can now be speculated, such that it might happen before the > > + rcu_dereference(). This could result in bugs due to misordering. > > + > > + However, comparisons are OK in the following cases: > > + > > + o The comparison was against the NULL pointer. If the > > + compiler knows that the pointer is NULL, you had better > > + not be dereferencing it anyway.
If the comparison is > > + non-equal, the compiler is none the wiser. Therefore, > > + it is safe to compare pointers from rcu_dereference() > > + against NULL pointers. > > + > > + o The pointer is never dereferenced after being compared. > > + Since there are no subsequent dereferences, the compiler > > + cannot use anything it learned from the comparison > > + to reorder the non-existent subsequent dereferences. > > + This sort of comparison occurs frequently when scanning > > + RCU-protected circular linked lists. > > + > > + o The comparison is against a pointer that > > + references memory that was initialized "a long time ago." > > + The reason this is safe is that even if misordering > > + occurs, the misordering will not affect the accesses > > + that follow the comparison. So exactly how long ago is > > + "a long time ago"? Here are some possibilities: > > + > > + o Compile time. > > + > > + o Boot time. > > + > > + o Module-init time for module code. > > + > > + o Prior to kthread creation for kthread code. > > + > > + o During some prior acquisition of the lock that > > + we now hold. > > + > > + o Before mod_timer() time for a timer handler. > > + > > + There are many other possibilities involving the Linux > > + kernel's wide array of primitives that cause code to > > + be invoked at a later time. > > + > > + o The pointer being compared against also came from > > + rcu_dereference(). In this case, both pointers depend > > + on one rcu_dereference() or another, so you get proper > > + ordering either way. > > + > > + That said, this situation can make certain RCU usage > > + bugs more likely to happen, which can be a good thing, > > + at least if they happen during testing. An example > > + of such an RCU usage bug is shown in the section titled > > + "EXAMPLE OF AMPLIFIED RCU-USAGE BUG". > > + > > + o All of the accesses following the comparison are stores, > > + so that a control dependency preserves the needed ordering.
> > + That said, it is easy to get control dependencies wrong. > > + Please see the "CONTROL DEPENDENCIES" section of > > + Documentation/memory-barriers.txt for more details. > > + > > + o The pointers compared not-equal -and- the compiler does > > + not have enough information to deduce the value of the > > + pointer. Note that the volatile cast in rcu_dereference() > > + will normally prevent the compiler from knowing too much. > > + > > +o Disable any value-speculation optimizations that your compiler > > + might provide, especially if you are making use of feedback-based > > + optimizations that take data collected from prior runs. Such > > + value-speculation optimizations reorder operations by design. > > + > > + There is one exception to this rule: Value-speculation > > + optimizations that leverage the branch-prediction hardware are > > + safe on strongly ordered systems (such as x86), but not on weakly > > + ordered systems (such as ARM or Power). Choose your compiler > > + command-line options wisely! > > + > > + > > +EXAMPLE OF AMPLIFIED RCU-USAGE BUG > > + > > +Because updaters can run concurrently with RCU readers, RCU readers can > > +see stale and/or inconsistent values. If RCU readers need fresh or > > +consistent values, which they sometimes do, they need to take proper > > +precautions. To see this, consider the following code fragment: > > + > > + struct foo { > > + int a; > > + int b; > > + int c; > > + }; > > + struct foo *gp1; > > + struct foo *gp2; > > + > > + void updater(void) > > + { > > + struct foo *p; > > + > > + p = kmalloc(...); > > + if (p == NULL) > > + deal_with_it(); > > + p->a = 42; /* Each field in its own cache line. 
*/ > > + p->b = 43; > > + p->c = 44; > > + rcu_assign_pointer(gp1, p); > > + p->b = 143; > > + p->c = 144; > > + rcu_assign_pointer(gp2, p); > > + } > > + > > + void reader(void) > > + { > > + struct foo *p; > > + struct foo *q; > > + int r1, r2; > > + > > + p = rcu_dereference(gp2); > > + r1 = p->b; /* Guaranteed to get 143. */ > > + q = rcu_dereference(gp1); > > + if (p == q) { > > + /* The compiler decides that q->c is the same as p->c. */ > > + r2 = p->c; /* Could get 44 on weakly ordered system. */ > > + } > > + } > > + > > +You might be surprised that the outcome (r1 == 143 && r2 == 44) is possible, > > +but you should not be. After all, the updater might have been invoked > > +a second time between the time reader() loaded into "r1" and the time > > +that it loaded into "r2". The fact that this same result can occur due > > +to some reordering from the compiler and CPUs is beside the point. > > + > > +But suppose that the reader needs a consistent view? > > + > > +Then one approach is to use locking, for example, as follows: > > + > > + struct foo { > > + int a; > > + int b; > > + int c; > > + spinlock_t lock; > > + }; > > + struct foo *gp1; > > + struct foo *gp2; > > + > > + void updater(void) > > + { > > + struct foo *p; > > + > > + p = kmalloc(...); > > + if (p == NULL) > > + deal_with_it(); > > + spin_lock(&p->lock); > > + p->a = 42; /* Each field in its own cache line. */ > > + p->b = 43; > > + p->c = 44; > > + spin_unlock(&p->lock); > > + rcu_assign_pointer(gp1, p); > > + spin_lock(&p->lock); > > + p->b = 143; > > + p->c = 144; > > + spin_unlock(&p->lock); > > + rcu_assign_pointer(gp2, p); > > + } > > + > > + void reader(void) > > + { > > + struct foo *p; > > + struct foo *q; > > + int r1, r2; > > + > > + p = rcu_dereference(gp2); > > + spin_lock(&p->lock); > > + r1 = p->b; /* Guaranteed to get 143. */ > > + q = rcu_dereference(gp1); > > + if (p == q) { > > + /* The compiler decides that q->c is the same as p->c.
*/ > > + r2 = p->c; /* Could get 44 on weakly ordered system. */ > > + } > > + spin_unlock(&p->lock); > > + } > > + > > +As always, use the right tool for the job! > > + > > + > > +EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH > > + > > +If a pointer obtained from rcu_dereference() compares not-equal to some > > +other pointer, the compiler normally has no clue what the value of the > > +first pointer might be. This lack of knowledge prevents the compiler > > +from carrying out optimizations that otherwise might destroy the ordering > > +guarantees that RCU depends on. And the volatile cast in rcu_dereference() > > +should prevent the compiler from guessing the value. > > + > > +But without rcu_dereference(), the compiler knows more than you might > > +expect. Consider the following code fragment: > > + > > + struct foo { > > + int a; > > + int b; > > + }; > > + static struct foo variable1; > > + static struct foo variable2; > > + static struct foo *gp = &variable1; > > + > > + void updater(void) > > + { > > + initialize_foo(&variable2); > > + rcu_assign_pointer(gp, &variable2); > > + /* > > + * The above is the only store to gp in this translation unit, > > + * and the address of gp is not exported in any way. > > + */ > > + } > > + > > + int reader(void) > > + { > > + struct foo *p; > > + > > + p = gp; > > + barrier(); > > + if (p == &variable1) > > + return p->a; /* Must be variable1.a. */ > > + else > > + return p->b; /* Must be variable2.b. */ > > + } > > + > > +Because the compiler can see all stores to "gp", it knows that the only > > +possible values of "gp" are "variable1" on the one hand and "variable2" > > +on the other. The comparison in reader() therefore tells the compiler > > +the exact value of "p" even in the not-equals case. This allows the > > +compiler to make the return values independent of the load from "gp", > > +in turn destroying the ordering between this load and the loads of the > > +return values.
This can result in "p->b" returning pre-initialization > > +garbage values. > > + > > +In short, rcu_dereference() is -not- optional when you are going to > > +dereference the resulting pointer. > > > > -- > > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > > the body of a message to majordomo@vger.kernel.org > > More majordomo info at http://vger.kernel.org/majordomo-info.html > > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-01 14:03 ` Paul E. McKenney @ 2014-03-02 10:05 ` Peter Sewell 2014-03-02 23:20 ` Paul E. McKenney 2014-03-03 20:44 ` Torvald Riegel 0 siblings, 2 replies; 285+ messages in thread From: Peter Sewell @ 2014-03-02 10:05 UTC (permalink / raw) To: Paul McKenney Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On 1 March 2014 08:03, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: >> Hi Paul, >> >> On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney >> >> > <paulmck@linux.vnet.ibm.com> wrote: >> >> > > >> >> > > 3. The comparison was against another RCU-protected pointer, >> >> > > where that other pointer was properly fetched using one >> >> > > of the RCU primitives. Here it doesn't matter which pointer >> >> > > you use. At least as long as the rcu_assign_pointer() for >> >> > > that other pointer happened after the last update to the >> >> > > pointed-to structure. >> >> > > >> >> > > I am a bit nervous about #3. Any thoughts on it? >> >> > >> >> > I think that it might be worth pointing out as an example, and saying >> >> > that code like >> >> > >> >> > p = atomic_read(consume); >> >> > X; >> >> > q = atomic_read(consume); >> >> > Y; >> >> > if (p == q) >> >> > data = p->val; >> >> > >> >> > then the access of "p->val" is constrained to be data-dependent on >> >> > *either* p or q, but you can't really tell which, since the compiler >> >> > can decide that the values are interchangeable. >> >> > >> >> > I cannot for the life of me come up with a situation where this would >> >> > matter, though. 
If "X" contains a fence, then that fence will be a >> >> > stronger ordering than anything the consume through "p" would >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the >> >> > atomic reads of p and q are unordered *anyway*, so then whether the >> >> > ordering to the access through "p" is through p or q is kind of >> >> > irrelevant. No? >> >> >> >> I can make a contrived litmus test for it, but you are right, the only >> >> time you can see it happen is when X has no barriers, in which case >> >> you don't have any ordering anyway -- both the compiler and the CPU can >> >> reorder the loads into p and q, and the read from p->val can, as you say, >> >> come from either pointer. >> >> >> >> For whatever it is worth, here is the litmus test: >> >> >> >> T1: p = kmalloc(...); >> >> if (p == NULL) >> >> deal_with_it(); >> >> p->a = 42; /* Each field in its own cache line. */ >> >> p->b = 43; >> >> p->c = 44; >> >> atomic_store_explicit(&gp1, p, memory_order_release); >> >> p->b = 143; >> >> p->c = 144; >> >> atomic_store_explicit(&gp2, p, memory_order_release); >> >> >> >> T2: p = atomic_load_explicit(&gp2, memory_order_consume); >> >> r1 = p->b; /* Guaranteed to get 143. */ >> >> q = atomic_load_explicit(&gp1, memory_order_consume); >> >> if (p == q) { >> >> /* The compiler decides that q->c is the same as p->c. */ >> >> r2 = p->c; /* Could get 44 on weakly ordered system. */ >> >> } >> >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what >> >> you get. >> >> >> >> And publishing a structure via one RCU-protected pointer, updating it, >> >> then publishing it via another pointer seems to me to be asking for >> >> trouble anyway. If you really want to do something like that and still >> >> see consistency across all the fields in the structure, please put a lock >> >> in the structure and use it to guard updates and accesses to those fields.
>> > >> > And here is a patch documenting the restrictions for the current Linux >> > kernel. The rules change a bit due to rcu_dereference() acting a bit >> > differently than atomic_load_explicit(&p, memory_order_consume). >> > >> > Thoughts? >> >> That might serve as informal documentation for linux kernel >> programmers about the bounds on the optimisations that you expect >> compilers to do for common-case RCU code - and I guess that's what you >> intend it to be for. But I don't see how one can make it precise >> enough to serve as a language definition, so that compiler people >> could confidently say "yes, we respect that", which I guess is what >> you really need. As a useful criterion, we should aim for something >> precise enough that in a verified-compiler context you can >> mathematically prove that the compiler will satisfy it (even though >> that won't happen anytime soon for GCC), and that analysis tool >> authors can actually know what they're working with. All this stuff >> about "you should avoid cancellation", and "avoid masking with just a >> small number of bits" is just too vague. > > Understood, and yes, this is intended to document current compiler > behavior for the Linux kernel community. It would not make sense to show > it to the C11 or C++11 communities, except perhaps as an informational > piece on current practice. > >> The basic problem is that the compiler may be doing sophisticated >> reasoning with a bunch of non-local knowledge that it's deduced from >> the code, neither of which are well-understood, and here we have to >> identify some envelope, expressive enough for RCU idioms, in which >> that reasoning doesn't allow data/address dependencies to be removed >> (and hence the hardware guarantee about them will be maintained at the >> source level). 
>> >> The C11 syntactic notion of dependency, whatever its faults, was at >> least precise, could be reasoned about locally (just looking at the >> syntactic code in question), and did do that. The fact that current >> compilers do optimisations that remove dependencies and will likely >> have many bugs at present is beside the point - this was surely >> intended as a *new* constraint on what they are allowed to do. The >> interesting question is really whether the compiler writers think that >> they *could* implement it in a reasonable way - I'd like to hear >> Torvald and his colleagues' opinion on that. >> >> What you're doing above seems to be basically a very cut-down version >> of that, but with a fuzzy boundary. If you want it to be precise, >> maybe it needs to be much simpler (which might force you into ruling >> out some current code idioms). > > I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806) > can be developed to serve this purpose. (I missed that mail when it first came past, sorry) That's also going to be tricky, I'm afraid. The key condition there is: "* at the time of execution of E, L [PS: I assume that L is a typo and should be E] can possibly have returned at least two different values under the assumption that L itself could have returned any value allowed by L's type." First, the evaluation of E might be nondeterministic - e.g., for an artificial example, if it's just a nondeterministic value obtained from the result of a race on SC atomics. The above doesn't distinguish between that (which doesn't have a real dependency on L) and that XOR'd with L (which does). And it does so in the wrong direction: it'll say that the former has a dependency on L. Second, it involves reasoning about counterfactual executions. That doesn't necessarily make it wrong, per se, but probably makes it hard to work with.
For example, suppose that in all the actual whole-program executions, a runtime occurrence of L only ever returns one particular value (perhaps because of some simple #define'd configuration), and that the code used in the evaluation of E depends on some invariant which is related to that configuration. The hypothetical execution used above in which a different value is used is one in which the code is being run in a situation with broken invariants. Then there will be technical difficulties in using the definition: I don't see how one would persuade oneself that a compiler always satisfies it, because these hypothetical executions are far removed from what it's actually working on. (Aside: The notion of a thread "observing" another thread's load, dating back a long time and adopted in the Power and ARM architecture texts, relies on counterfactual executions in a broadly similar way; we're happy to have escaped that now :-) Peter > Thanx, Paul > >> best, >> Peter >> >> >> >> > Thanx, Paul >> > >> > ------------------------------------------------------------------------ >> > >> > documentation: Record rcu_dereference() value mishandling >> > >> > Recent LKML discussions (see http://lwn.net/Articles/586838/ and >> > http://lwn.net/Articles/588300/ for the LWN writeups) brought out >> > some ways of misusing the return value from rcu_dereference() that >> > are not necessarily completely intuitive. This commit therefore >> > documents what can and cannot safely be done with these values. >> > >> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> >> > >> > diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX >> > index fa57139f50bf..f773a264ae02 100644 >> > --- a/Documentation/RCU/00-INDEX >> > +++ b/Documentation/RCU/00-INDEX >> > @@ -12,6 +12,8 @@ lockdep-splat.txt >> > - RCU Lockdep splats explained.
>> > NMI-RCU.txt >> > - Using RCU to Protect Dynamic NMI Handlers >> > +rcu_dereference.txt >> > + - Proper care and feeding of return values from rcu_dereference() >> > rcubarrier.txt >> > - RCU and Unloadable Modules >> > rculist_nulls.txt >> > diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt >> > index 9d10d1db16a5..877947130ebe 100644 >> > --- a/Documentation/RCU/checklist.txt >> > +++ b/Documentation/RCU/checklist.txt >> > @@ -114,12 +114,16 @@ over a rather long period of time, but improvements are always welcome! >> > http://www.openvms.compaq.com/wizard/wiz_2637.html >> > >> > The rcu_dereference() primitive is also an excellent >> > - documentation aid, letting the person reading the code >> > - know exactly which pointers are protected by RCU. >> > + documentation aid, letting the person reading the >> > + code know exactly which pointers are protected by RCU. >> > Please note that compilers can also reorder code, and >> > they are becoming increasingly aggressive about doing >> > - just that. The rcu_dereference() primitive therefore >> > - also prevents destructive compiler optimizations. >> > + just that. The rcu_dereference() primitive therefore also >> > + prevents destructive compiler optimizations. However, >> > + with a bit of devious creativity, it is possible to >> > + mishandle the return value from rcu_dereference(). >> > + Please see rcu_dereference.txt in this directory for >> > + more information. 
>> > >> > The rcu_dereference() primitive is used by the >> > various "_rcu()" list-traversal primitives, such >> > diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt >> > new file mode 100644 >> > index 000000000000..6e72cd8622df >> > --- /dev/null >> > +++ b/Documentation/RCU/rcu_dereference.txt >> > @@ -0,0 +1,365 @@ >> > +PROPER CARE AND FEEDING OF RETURN VALUES FROM rcu_dereference() >> > + >> > +Most of the time, you can use values from rcu_dereference() or one of >> > +the similar primitives without worries. Dereferencing (prefix "*"), >> > +field selection ("->"), assignment ("="), address-of ("&"), addition and >> > +subtraction of constants, and casts all work quite naturally and safely. >> > + >> > +It is nevertheless possible to get into trouble with other operations. >> > +Follow these rules to keep your RCU code working properly: >> > + >> > +o You must use one of the rcu_dereference() family of primitives >> > + to load an RCU-protected pointer, otherwise CONFIG_PROVE_RCU >> > + will complain. Worse yet, your code can see random memory-corruption >> > + bugs due to games that compilers and DEC Alpha can play. >> > + Without one of the rcu_dereference() primitives, compilers >> > + can reload the value, and won't your code have fun with two >> > + different values for a single pointer! Without rcu_dereference(), >> > + DEC Alpha can load a pointer, dereference that pointer, and >> > + return data preceding initialization that preceded the store of >> > + the pointer. >> > + >> > + In addition, the volatile cast in rcu_dereference() prevents the >> > + compiler from deducing the resulting pointer value. Please see >> > + the section entitled "EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH" >> > + for an example where the compiler can in fact deduce the exact >> > + value of the pointer, and thus cause misordering. >> > + >> > +o Do not use single-element RCU-protected arrays. 
The compiler >> > + is within its rights to assume that the value of an index into >> > + such an array must necessarily evaluate to zero. The compiler >> > + could then substitute the constant zero for the computation, so >> > + that the array index no longer depended on the value returned >> > + by rcu_dereference(). If the array index no longer depends >> > + on rcu_dereference(), then both the compiler and the CPU >> > + are within their rights to order the array access before the >> > + rcu_dereference(), which can cause the array access to return >> > + garbage. >> > + >> > +o Avoid cancellation when using the "+" and "-" infix arithmetic >> > + operators. For example, for a given variable "x", avoid >> > + "(x-x)". There are similar arithmetic pitfalls from other >> > + arithmetic operators, such as "(x*0)", "(x/(x+1))" or "(x%1)". >> > + The compiler is within its rights to substitute zero for all of >> > + these expressions, so that subsequent accesses no longer depend >> > + on the rcu_dereference(), again possibly resulting in bugs due >> > + to misordering. >> > + >> > + Of course, if "p" is a pointer from rcu_dereference(), and "a" >> > + and "b" are integers that happen to be equal, the expression >> > + "p+a-b" is safe because its value still necessarily depends on >> > + the rcu_dereference(), thus maintaining proper ordering. >> > + >> > +o Avoid all-zero operands to the bitwise "&" operator, and >> > + similarly avoid all-ones operands to the bitwise "|" operator. >> > + If the compiler is able to deduce the value of such operands, >> > + it is within its rights to substitute the corresponding constant >> > + for the bitwise operation. Once again, this causes subsequent >> > + accesses to no longer depend on the rcu_dereference(), causing >> > + bugs due to misordering. >> > + >> > + Please note that single-bit operands to bitwise "&" can also >> > + be dangerous.
At this point, the compiler knows that the >> > + resulting value can only take on one of two possible values. >> > + Therefore, a very small amount of additional information will >> > + allow the compiler to deduce the exact value, which again can >> > + result in misordering. >> > + >> > +o If you are using RCU to protect JITed functions, so that the >> > + "()" function-invocation operator is applied to a value obtained >> > + (directly or indirectly) from rcu_dereference(), you may need to >> > + interact directly with the hardware to flush instruction caches. >> > + This issue arises on some systems when a newly JITed function is >> > + using the same memory that was used by an earlier JITed function. >> > + >> > +o Do not use the results from the boolean "&&" and "||" operators when >> > + dereferencing. For example, the following (rather improbable) >> > + code is buggy: >> > + >> > + int a[2]; >> > + int index; >> > + int force_zero_index = 1; >> > + >> > + ... >> > + >> > + r1 = rcu_dereference(i1); >> > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ >> > + >> > + The reason this is buggy is that "&&" and "||" are often compiled >> > + using branches. While weak-memory machines such as ARM or PowerPC >> > + do order stores after such branches, they can speculate loads, >> > + which can result in misordering bugs. >> > + >> > +o Do not use the results from relational operators ("==", "!=", >> > + ">", ">=", "<", or "<=") when dereferencing. For example, >> > + the following (quite strange) code is buggy: >> > + >> > + int a[2]; >> > + int index; >> > + int flip_index = 0; >> > + >> > + ... >> > + >> > + r1 = rcu_dereference(i1); >> > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ >> > + >> > + As before, the reason this is buggy is that relational operators >> > + are often compiled using branches.
And as before, although >> > + weak-memory machines such as ARM or PowerPC do order stores >> > + after such branches, they can speculate loads, which can again >> > + result in misordering bugs. >> > + >> > +o Be very careful about comparing pointers obtained from >> > + rcu_dereference() against non-NULL values. As Linus Torvalds >> > + explained, if the two pointers are equal, the compiler could >> > + substitute the pointer you are comparing against for the pointer >> > + obtained from rcu_dereference(). For example: >> > + >> > + p = rcu_dereference(gp); >> > + if (p == &default_struct) >> > + do_default(p->a); >> > + >> > + Because the compiler now knows that the value of "p" is exactly >> > + the address of the variable "default_struct", it is free to >> > + transform this code into the following: >> > + >> > + p = rcu_dereference(gp); >> > + if (p == &default_struct) >> > + do_default(default_struct.a); >> > + >> > + On ARM and Power hardware, the load from "default_struct.a" >> > + can now be speculated, such that it might happen before the >> > + rcu_dereference(). This could result in bugs due to misordering. >> > + >> > + However, comparisons are OK in the following cases: >> > + >> > + o The comparison was against the NULL pointer. If the >> > + compiler knows that the pointer is NULL, you had better >> > + not be dereferencing it anyway. If the comparison is >> > + non-equal, the compiler is none the wiser. Therefore, >> > + it is safe to compare pointers from rcu_dereference() >> > + against NULL pointers. >> > + >> > + o The pointer is never dereferenced after being compared. >> > + Since there are no subsequent dereferences, the compiler >> > + cannot use anything it learned from the comparison >> > + to reorder the non-existent subsequent dereferences. >> > + This sort of comparison occurs frequently when scanning >> > + RCU-protected circular linked lists.
>> > + >> > + o The comparison is against a pointer that >> > + references memory that was initialized "a long time ago." >> > + The reason this is safe is that even if misordering >> > + occurs, the misordering will not affect the accesses >> > + that follow the comparison. So exactly how long ago is >> > + "a long time ago"? Here are some possibilities: >> > + >> > + o Compile time. >> > + >> > + o Boot time. >> > + >> > + o Module-init time for module code. >> > + >> > + o Prior to kthread creation for kthread code. >> > + >> > + o During some prior acquisition of the lock that >> > + we now hold. >> > + >> > + o Before mod_timer() time for a timer handler. >> > + >> > + There are many other possibilities involving the Linux >> > + kernel's wide array of primitives that cause code to >> > + be invoked at a later time. >> > + >> > + o The pointer being compared against also came from >> > + rcu_dereference(). In this case, both pointers depend >> > + on one rcu_dereference() or another, so you get proper >> > + ordering either way. >> > + >> > + That said, this situation can make certain RCU usage >> > + bugs more likely to happen, which can be a good thing, >> > + at least if they happen during testing. An example >> > + of such an RCU usage bug is shown in the section titled >> > + "EXAMPLE OF AMPLIFIED RCU-USAGE BUG". >> > + >> > + o All of the accesses following the comparison are stores, >> > + so that a control dependency preserves the needed ordering. >> > + That said, it is easy to get control dependencies wrong. >> > + Please see the "CONTROL DEPENDENCIES" section of >> > + Documentation/memory-barriers.txt for more details. >> > + >> > + o The pointers compared not-equal -and- the compiler does >> > + not have enough information to deduce the value of the >> > + pointer. Note that the volatile cast in rcu_dereference() >> > + will normally prevent the compiler from knowing too much.
>> > + >> > +o Disable any value-speculation optimizations that your compiler >> > + might provide, especially if you are making use of feedback-based >> > + optimizations that take data collected from prior runs. Such >> > + value-speculation optimizations reorder operations by design. >> > + >> > + There is one exception to this rule: Value-speculation >> > + optimizations that leverage the branch-prediction hardware are >> > + safe on strongly ordered systems (such as x86), but not on weakly >> > + ordered systems (such as ARM or Power). Choose your compiler >> > + command-line options wisely! >> > + >> > + >> > +EXAMPLE OF AMPLIFIED RCU-USAGE BUG >> > + >> > +Because updaters can run concurrently with RCU readers, RCU readers can >> > +see stale and/or inconsistent values. If RCU readers need fresh or >> > +consistent values, which they sometimes do, they need to take proper >> > +precautions. To see this, consider the following code fragment: >> > + >> > + struct foo { >> > + int a; >> > + int b; >> > + int c; >> > + }; >> > + struct foo *gp1; >> > + struct foo *gp2; >> > + >> > + void updater(void) >> > + { >> > + struct foo *p; >> > + >> > + p = kmalloc(...); >> > + if (p == NULL) >> > + deal_with_it(); >> > + p->a = 42; /* Each field in its own cache line. */ >> > + p->b = 43; >> > + p->c = 44; >> > + rcu_assign_pointer(gp1, p); >> > + p->b = 143; >> > + p->c = 144; >> > + rcu_assign_pointer(gp2, p); >> > + } >> > + >> > + void reader(void) >> > + { >> > + struct foo *p; >> > + struct foo *q; >> > + int r1, r2; >> > + >> > + p = rcu_dereference(gp2); >> > + r1 = p->b; /* Guaranteed to get 143. */ >> > + q = rcu_dereference(gp1); >> > + if (p == q) { >> > + /* The compiler decides that q->c is the same as p->c. */ >> > + r2 = p->c; /* Could get 44 on weakly ordered system. */ >> > + } >> > + } >> > + >> > +You might be surprised that the outcome (r1 == 143 && r2 == 44) is possible, >> > +but you should not be.
After all, the updater might have been invoked >> > +a second time between the time reader() loaded into "r1" and the time >> > +that it loaded into "r2". The fact that this same result can occur due >> > +to some reordering from the compiler and CPUs is beside the point. >> > + >> > +But suppose that the reader needs a consistent view? >> > + >> > +Then one approach is to use locking, for example, as follows: >> > + >> > + struct foo { >> > + int a; >> > + int b; >> > + int c; >> > + spinlock_t lock; >> > + }; >> > + struct foo *gp1; >> > + struct foo *gp2; >> > + >> > + void updater(void) >> > + { >> > + struct foo *p; >> > + >> > + p = kmalloc(...); >> > + if (p == NULL) >> > + deal_with_it(); >> > + spin_lock(&p->lock); >> > + p->a = 42; /* Each field in its own cache line. */ >> > + p->b = 43; >> > + p->c = 44; >> > + spin_unlock(&p->lock); >> > + rcu_assign_pointer(gp1, p); >> > + spin_lock(&p->lock); >> > + p->b = 143; >> > + p->c = 144; >> > + spin_unlock(&p->lock); >> > + rcu_assign_pointer(gp2, p); >> > + } >> > + >> > + void reader(void) >> > + { >> > + struct foo *p; >> > + struct foo *q; >> > + int r1, r2; >> > + >> > + p = rcu_dereference(gp2); >> > + spin_lock(&p->lock); >> > + r1 = p->b; /* Guaranteed to get 143. */ >> > + q = rcu_dereference(gp1); >> > + if (p == q) { >> > + /* The compiler decides that q->c is same as p->c. */ >> > + r2 = p->c; /* Could get 44 on weakly ordered system. */ >> > + } >> > + spin_unlock(&p->lock); >> > + } >> > + >> > +As always, use the right tool for the job! >> > + >> > + >> > +EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH >> > + >> > +If a pointer obtained from rcu_dereference() compares not-equal to some >> > +other pointer, the compiler normally has no clue what the value of the >> > +first pointer might be. This lack of knowledge prevents the compiler >> > +from carrying out optimizations that otherwise might destroy the ordering >> > +guarantees that RCU depends on.
And the volatile cast in rcu_dereference() >> > +should prevent the compiler from guessing the value. >> > + >> > +But without rcu_dereference(), the compiler knows more than you might >> > +expect. Consider the following code fragment: >> > + >> > + struct foo { >> > + int a; >> > + int b; >> > + }; >> > + static struct foo variable1; >> > + static struct foo variable2; >> > + static struct foo *gp = &variable1; >> > + >> > + void updater(void) >> > + { >> > + initialize_foo(&variable2); >> > + rcu_assign_pointer(gp, &variable2); >> > + /* >> > + * The above is the only store to gp in this translation unit, >> > + * and the address of gp is not exported in any way. >> > + */ >> > + } >> > + >> > + int reader(void) >> > + { >> > + struct foo *p; >> > + >> > + p = gp; >> > + barrier(); >> > + if (p == &variable1) >> > + return p->a; /* Must be variable1.a. */ >> > + else >> > + return p->b; /* Must be variable2.b. */ >> > + } >> > + >> > +Because the compiler can see all stores to "gp", it knows that the only >> > +possible values of "gp" are "variable1" on the one hand and "variable2" >> > +on the other. The comparison in reader() therefore tells the compiler >> > +the exact value of "p" even in the not-equals case. This allows the >> > +compiler to make the return values independent of the load from "gp", >> > +in turn destroying the ordering between this load and the loads of the >> > +return values. This can result in "p->b" returning pre-initialization >> > +garbage values. >> > + >> > +In short, rcu_dereference() is -not- optional when you are going to >> > +dereference the resulting pointer. >> > >> > -- >> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >> > the body of a message to majordomo@vger.kernel.org >> > More majordomo info at http://vger.kernel.org/majordomo-info.html >> > Please read the FAQ at http://www.tux.org/lkml/ >> > ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-02 10:05 ` Peter Sewell @ 2014-03-02 23:20 ` Paul E. McKenney 2014-03-02 23:44 ` Peter Sewell 2014-03-03 20:44 ` Torvald Riegel 1 sibling, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-03-02 23:20 UTC (permalink / raw) To: Peter Sewell Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote: > On 1 March 2014 08:03, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: > >> Hi Paul, > >> > >> On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: > >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: > >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney > >> >> > <paulmck@linux.vnet.ibm.com> wrote: > >> >> > > > >> >> > > 3. The comparison was against another RCU-protected pointer, > >> >> > > where that other pointer was properly fetched using one > >> >> > > of the RCU primitives. Here it doesn't matter which pointer > >> >> > > you use. At least as long as the rcu_assign_pointer() for > >> >> > > that other pointer happened after the last update to the > >> >> > > pointed-to structure. > >> >> > > > >> >> > > I am a bit nervous about #3. Any thoughts on it? > >> >> > > >> >> > I think that it might be worth pointing out as an example, and saying > >> >> > that code like > >> >> > > >> >> > p = atomic_read(consume); > >> >> > X; > >> >> > q = atomic_read(consume); > >> >> > Y; > >> >> > if (p == q) > >> >> > data = p->val; > >> >> > > >> >> > then the access of "p->val" is constrained to be data-dependent on > >> >> > *either* p or q, but you can't really tell which, since the compiler > >> >> > can decide that the values are interchangeable. 
> >> >> > > >> >> > I cannot for the life of me come up with a situation where this would > >> >> > matter, though. If "X" contains a fence, then that fence will be a > >> >> > stronger ordering than anything the consume through "p" would > >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the > >> >> > atomic reads of p and q are unordered *anyway*, so then whether the > >> >> > ordering to the access through "p" is through p or q is kind of > >> >> > irrelevant. No? > >> >> > >> >> I can make a contrived litmus test for it, but you are right, the only > >> >> time you can see it happen is when X has no barriers, in which case > >> >> you don't have any ordering anyway -- both the compiler and the CPU can > >> >> reorder the loads into p and q, and the read from p->val can, as you say, > >> >> come from either pointer. > >> >> > >> >> For whatever it is worth, here is the litmus test: > >> >> > >> >> T1: p = kmalloc(...); > >> >> if (p == NULL) > >> >> deal_with_it(); > >> >> p->a = 42; /* Each field in its own cache line. */ > >> >> p->b = 43; > >> >> p->c = 44; > >> >> atomic_store_explicit(&gp1, p, memory_order_release); > >> >> p->b = 143; > >> >> p->c = 144; > >> >> atomic_store_explicit(&gp2, p, memory_order_release); > >> >> > >> >> T2: p = atomic_load_explicit(&gp2, memory_order_consume); > >> >> r1 = p->b; /* Guaranteed to get 143. */ > >> >> q = atomic_load_explicit(&gp1, memory_order_consume); > >> >> if (p == q) { > >> >> /* The compiler decides that q->c is same as p->c. */ > >> >> r2 = p->c; /* Could get 44 on weakly ordered system. */ > >> >> } > >> >> > >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what > >> >> you get. > >> >> > >> >> And publishing a structure via one RCU-protected pointer, updating it, > >> >> then publishing it via another pointer seems to me to be asking for > >> >> trouble anyway.
If you really want to do something like that and still > >> >> see consistency across all the fields in the structure, please put a lock > >> >> in the structure and use it to guard updates and accesses to those fields. > >> > > >> > And here is a patch documenting the restrictions for the current Linux > >> > kernel. The rules change a bit due to rcu_dereference() acting a bit > >> > differently than atomic_load_explicit(&p, memory_order_consume). > >> > > >> > Thoughts? > >> > >> That might serve as informal documentation for linux kernel > >> programmers about the bounds on the optimisations that you expect > >> compilers to do for common-case RCU code - and I guess that's what you > >> intend it to be for. But I don't see how one can make it precise > >> enough to serve as a language definition, so that compiler people > >> could confidently say "yes, we respect that", which I guess is what > >> you really need. As a useful criterion, we should aim for something > >> precise enough that in a verified-compiler context you can > >> mathematically prove that the compiler will satisfy it (even though > >> that won't happen anytime soon for GCC), and that analysis tool > >> authors can actually know what they're working with. All this stuff > >> about "you should avoid cancellation", and "avoid masking with just a > >> small number of bits" is just too vague. > > > > Understood, and yes, this is intended to document current compiler > > behavior for the Linux kernel community. It would not make sense to show > > it to the C11 or C++11 communities, except perhaps as an informational > > piece on current practice. 
> > > >> The basic problem is that the compiler may be doing sophisticated > >> reasoning with a bunch of non-local knowledge that it's deduced from > >> the code, neither of which are well-understood, and here we have to > >> identify some envelope, expressive enough for RCU idioms, in which > >> that reasoning doesn't allow data/address dependencies to be removed > >> (and hence the hardware guarantee about them will be maintained at the > >> source level). > >> > >> The C11 syntactic notion of dependency, whatever its faults, was at > >> least precise, could be reasoned about locally (just looking at the > >> syntactic code in question), and did do that. The fact that current > >> compilers do optimisations that remove dependencies and will likely > >> have many bugs at present is besides the point - this was surely > >> intended as a *new* constraint on what they are allowed to do. The > >> interesting question is really whether the compiler writers think that > >> they *could* implement it in a reasonable way - I'd like to hear > >> Torvald and his colleagues' opinion on that. > >> > >> What you're doing above seems to be basically a very cut-down version > >> of that, but with a fuzzy boundary. If you want it to be precise, > >> maybe it needs to be much simpler (which might force you into ruling > >> out some current code idioms). > > > > I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806) > > can be developed to serve this purpose. > > (I missed that mail when it first came past, sorry) No worries! > That's also going to be tricky, I'm afraid. The key condition there is: > > "* at the time of execution of E, L [PS: I assume that L is a > typo and should be E] I believe it really is "L". 
As I understand it (and Torvald will correct me if I am wrong), the idea is that the implementation is prohibited from guessing the value of "L" -- it must assume that any value from L's type might be returned, regardless of what it might otherwise know. However, after L's value is loaded, the implementation -is- permitted to learn constraints on this value based on "if" statements and the like between the load from "L" and the execution of "E". Does that help? > can possibly have returned at > least two different values under the assumption that L itself > could have returned any value allowed by L's type." > > First, the evaluation of E might be nondeterministic - e.g., for an > artificial example, if it's just a nondeterministic value obtained > from the result of a race on SC atomics. The above doesn't > distinguish between that (which doesn't have a real dependency on L) > and that XOR'd with L (which does). And it does so in the wrong > direction: it'll say that the former has a dependency on L. Right, it is only any dependency that E has on L that would be constrained. If E also depends on other quantities obtained some other way than a memory_order_consume load into a value_dep_preserving variable, then as I understand it, the compiler is within its rights to optimize these other quantities to within an inch of their lives. It is quite possible that E depends on L only sometimes. For example: p = atomic_load_explicit(&gp, memory_order_consume); p = random() & 0x8 ? p : &default_structure; E(p); My guess is that in this case, the ordering would be guaranteed only for those executions where there is a value dependency. In my naive view, this should be no different than something like this: if (random() & 0x10) p = atomic_load_explicit(&gp, memory_order_acquire); else p = &default_structure; E(p); Or am I missing your point? > Second, it involves reasoning about counterfactual executions.
That > doesn't necessarily make it wrong, per se, but probably makes it hard > to work with. For example, suppose that in all the actual > whole-program executions, a runtime occurrence of L only ever returns > one particular value (perhaps because of some simple #define'd > configuration), and that the code used in the evaluation of E depends > on some invariant which is related to that configuration. The > hypothetical execution used above in which a different value is used > is one in which the code is being run in a situation with broken invariants. > Then there will be technical difficulties in using the definition: > I don't see how one would persuade oneself that a compiler always > satisfies it, because these hypothetical executions are far removed > from what it's actually working on. The developer answer would be something like "all it really means is that the implementation is required to actually emit the memory_order_consume load and actually use the value," which is probably not much comfort to someone trying to model it. Maybe there is a better way of wording this constraint so as to avoid the counterfactuals? > (Aside: The notion of a thread "observing" another thread's load, > dating back a long time and adopted in the Power and ARM architecture > texts, relies on counterfactual executions in a broadly similar way; > we're happy to have escaped that now :-) Here is hoping that there is a way to escape it in this case as well.
;-) Thanx, Paul > Peter > > > > > Thanx, Paul > > >> best, >> Peter > >> > >> > >> > Thanx, Paul > >> > > >> > ------------------------------------------------------------------------ > >> > > >> > documentation: Record rcu_dereference() value mishandling > >> > > >> > Recent LKML discussions (see http://lwn.net/Articles/586838/ and > >> > http://lwn.net/Articles/588300/ for the LWN writeups) brought out > >> > some ways of misusing the return value from rcu_dereference() that > >> > are not necessarily completely intuitive. This commit therefore > >> > documents what can and cannot safely be done with these values. > >> > > >> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > >> > > >> > diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX > >> > index fa57139f50bf..f773a264ae02 100644 > >> > --- a/Documentation/RCU/00-INDEX > >> > +++ b/Documentation/RCU/00-INDEX > >> > @@ -12,6 +12,8 @@ lockdep-splat.txt > >> > - RCU Lockdep splats explained. > >> > NMI-RCU.txt > >> > - Using RCU to Protect Dynamic NMI Handlers > >> > +rcu_dereference.txt > >> > + - Proper care and feeding of return values from rcu_dereference() > >> > rcubarrier.txt > >> > - RCU and Unloadable Modules > >> > rculist_nulls.txt > >> > diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt > >> > index 9d10d1db16a5..877947130ebe 100644 > >> > --- a/Documentation/RCU/checklist.txt > >> > +++ b/Documentation/RCU/checklist.txt > >> > @@ -114,12 +114,16 @@ over a rather long period of time, but improvements are always welcome! > >> > http://www.openvms.compaq.com/wizard/wiz_2637.html > >> > > >> > The rcu_dereference() primitive is also an excellent > >> > - documentation aid, letting the person reading the code > >> > - know exactly which pointers are protected by RCU.
> >> > Please note that compilers can also reorder code, and > >> > they are becoming increasingly aggressive about doing > >> > - just that. The rcu_dereference() primitive therefore > >> > - also prevents destructive compiler optimizations. > >> > + just that. The rcu_dereference() primitive therefore also > >> > + prevents destructive compiler optimizations. However, > >> > + with a bit of devious creativity, it is possible to > >> > + mishandle the return value from rcu_dereference(). > >> > + Please see rcu_dereference.txt in this directory for > >> > + more information. > >> > > >> > The rcu_dereference() primitive is used by the > >> > various "_rcu()" list-traversal primitives, such > >> > diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt > >> > new file mode 100644 > >> > index 000000000000..6e72cd8622df > >> > --- /dev/null > >> > +++ b/Documentation/RCU/rcu_dereference.txt > >> > @@ -0,0 +1,365 @@ > >> > +PROPER CARE AND FEEDING OF RETURN VALUES FROM rcu_dereference() > >> > + > >> > +Most of the time, you can use values from rcu_dereference() or one of > >> > +the similar primitives without worries. Dereferencing (prefix "*"), > >> > +field selection ("->"), assignment ("="), address-of ("&"), addition and > >> > +subtraction of constants, and casts all work quite naturally and safely. > >> > + > >> > +It is nevertheless possible to get into trouble with other operations. > >> > +Follow these rules to keep your RCU code working properly: > >> > + > >> > +o You must use one of the rcu_dereference() family of primitives > >> > + to load an RCU-protected pointer, otherwise CONFIG_PROVE_RCU > >> > + will complain. Worse yet, your code can see random memory-corruption > >> > + bugs due to games that compilers and DEC Alpha can play. > >> > + Without one of the rcu_dereference() primitives, compilers > >> > + can reload the value, and won't your code have fun with two > >> > + different values for a single pointer! 
Without rcu_dereference(), > >> > + DEC Alpha can load a pointer, dereference that pointer, and > >> > + return data preceding initialization that preceded the store of > >> > + the pointer. > >> > + > >> > + In addition, the volatile cast in rcu_dereference() prevents the > >> > + compiler from deducing the resulting pointer value. Please see > >> > + the section entitled "EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH" > >> > + for an example where the compiler can in fact deduce the exact > >> > + value of the pointer, and thus cause misordering. > >> > + > >> > +o Do not use single-element RCU-protected arrays. The compiler > >> > + is within its rights to assume that the value of an index into > >> > + such an array must necessarily evaluate to zero. The compiler > >> > + could then substitute the constant zero for the computation, so > >> > + that the array index no longer depended on the value returned > >> > + by rcu_dereference(). If the array index no longer depends > >> > + on rcu_dereference(), then both the compiler and the CPU > >> > + are within their rights to order the array access before the > >> > + rcu_dereference(), which can cause the array access to return > >> > + garbage. > >> > + > >> > +o Avoid cancellation when using the "+" and "-" infix arithmetic > >> > + operators. For example, for a given variable "x", avoid > >> > + "(x-x)". There are similar arithmetic pitfalls from other > >> > + arithmetic operators, such as "(x*0)", "(x/(x+1))" or "(x%1)". > >> > + The compiler is within its rights to substitute zero for all of > >> > + these expressions, so that subsequent accesses no longer depend > >> > + on the rcu_dereference(), again possibly resulting in bugs due > >> > + to misordering.
> >> > + > >> > + Of course, if "p" is a pointer from rcu_dereference(), and "a" > >> > + and "b" are integers that happen to be equal, the expression > >> > + "p+a-b" is safe because its value still necessarily depends on > >> > + the rcu_dereference(), thus maintaining proper ordering. > >> > + > >> > +o Avoid all-zero operands to the bitwise "&" operator, and > >> > + similarly avoid all-ones operands to the bitwise "|" operator. > >> > + If the compiler is able to deduce the value of such operands, > >> > + it is within its rights to substitute the corresponding constant > >> > + for the bitwise operation. Once again, this causes subsequent > >> > + accesses to no longer depend on the rcu_dereference(), causing > >> > + bugs due to misordering. > >> > + > >> > + Please note that single-bit operands to bitwise "&" can also > >> > + be dangerous. At this point, the compiler knows that the > >> > + resulting value can only take on one of two possible values. > >> > + Therefore, a very small amount of additional information will > >> > + allow the compiler to deduce the exact value, which again can > >> > + result in misordering. > >> > + > >> > +o If you are using RCU to protect JITed functions, so that the > >> > + "()" function-invocation operator is applied to a value obtained > >> > + (directly or indirectly) from rcu_dereference(), you may need to > >> > + interact directly with the hardware to flush instruction caches. > >> > + This issue arises on some systems when a newly JITed function is > >> > + using the same memory that was used by an earlier JITed function. > >> > + > >> > +o Do not use the results from the boolean "&&" and "||" when > >> > + dereferencing. For example, the following (rather improbable) > >> > + code is buggy: > >> > + > >> > + int a[2]; > >> > + int index; > >> > + int force_zero_index = 1; > >> > + > >> > + ... > >> > + > >> > + r1 = rcu_dereference(i1); > >> > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! 
*/ > >> > + > >> > + The reason this is buggy is that "&&" and "||" are often compiled > >> > + using branches. While weak-memory machines such as ARM or PowerPC > >> > + do order stores after such branches, they can speculate loads, > >> > + which can result in misordering bugs. > >> > + > >> > +o Do not use the results from relational operators ("==", "!=", > >> > + ">", ">=", "<", or "<=") when dereferencing. For example, > >> > + the following (quite strange) code is buggy: > >> > + > >> > + int a[2]; > >> > + int index; > >> > + int flip_index = 0; > >> > + > >> > + ... > >> > + > >> > + r1 = rcu_dereference(i1); > >> > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > >> > + > >> > + As before, the reason this is buggy is that relational operators > >> > + are often compiled using branches. And as before, although > >> > + weak-memory machines such as ARM or PowerPC do order stores > >> > + after such branches, they can speculate loads, which can again > >> > + result in misordering bugs. > >> > + > >> > +o Be very careful about comparing pointers obtained from > >> > + rcu_dereference() against non-NULL values. As Linus Torvalds > >> > + explained, if the two pointers are equal, the compiler could > >> > + substitute the pointer you are comparing against for the pointer > >> > + obtained from rcu_dereference(). For example: > >> > + > >> > + p = rcu_dereference(gp); > >> > + if (p == &default_struct) > >> > + do_default(p->a); > >> > + > >> > + Because the compiler now knows that the value of "p" is exactly > >> > + the address of the variable "default_struct", it is free to > >> > + transform this code into the following: > >> > + > >> > + p = rcu_dereference(gp); > >> > + if (p == &default_struct) > >> > + do_default(default_struct.a); > >> > + > >> > + On ARM and Power hardware, the load from "default_struct.a" > >> > + can now be speculated, such that it might happen before the > >> > + rcu_dereference(). This could result in bugs due to misordering.
> >> > + > >> > + However, comparisons are OK in the following cases: > >> > + > >> > + o The comparison was against the NULL pointer. If the > >> > + compiler knows that the pointer is NULL, you had better > >> > + not be dereferencing it anyway. If the comparison is > >> > + non-equal, the compiler is none the wiser. Therefore, > >> > + it is safe to compare pointers from rcu_dereference() > >> > + against NULL pointers. > >> > + > >> > + o The pointer is never dereferenced after being compared. > >> > + Since there are no subsequent dereferences, the compiler > >> > + cannot use anything it learned from the comparison > >> > + to reorder the non-existent subsequent dereferences. > >> > + This sort of comparison occurs frequently when scanning > >> > + RCU-protected circular linked lists. > >> > + > >> > + o The comparison is against a pointer that > >> > + references memory that was initialized "a long time ago." > >> > + The reason this is safe is that even if misordering > >> > + occurs, the misordering will not affect the accesses > >> > + that follow the comparison. So exactly how long ago is > >> > + "a long time ago"? Here are some possibilities: > >> > + > >> > + o Compile time. > >> > + > >> > + o Boot time. > >> > + > >> > + o Module-init time for module code. > >> > + > >> > + o Prior to kthread creation for kthread code. > >> > + > >> > + o During some prior acquisition of the lock that > >> > + we now hold. > >> > + > >> > + o Before mod_timer() time for a timer handler. > >> > + > >> > + There are many other possibilities involving the Linux > >> > + kernel's wide array of primitives that cause code to > >> > + be invoked at a later time. > >> > + > >> > + o The pointer being compared against also came from > >> > + rcu_dereference(). In this case, both pointers depend > >> > + on one rcu_dereference() or another, so you get proper > >> > + ordering either way.
> >> > + > >> > + That said, this situation can make certain RCU usage > >> > + bugs more likely to happen. Which can be a good thing, > >> > + at least if they happen during testing. An example > >> > + of such an RCU usage bug is shown in the section titled > >> > + "EXAMPLE OF AMPLIFIED RCU-USAGE BUG". > >> > + > >> > + o All of the accesses following the comparison are stores, > >> > + so that a control dependency preserves the needed ordering. > >> > + That said, it is easy to get control dependencies wrong. > >> > + Please see the "CONTROL DEPENDENCIES" section of > >> > + Documentation/memory-barriers.txt for more details. > >> > + > >> > + o The pointers compared not-equal -and- the compiler does > >> > + not have enough information to deduce the value of the > >> > + pointer. Note that the volatile cast in rcu_dereference() > >> > + will normally prevent the compiler from knowing too much. > >> > + > >> > +o Disable any value-speculation optimizations that your compiler > >> > + might provide, especially if you are making use of feedback-based > >> > + optimizations that take data collected from prior runs. Such > >> > + value-speculation optimizations reorder operations by design. > >> > + > >> > + There is one exception to this rule: Value-speculation > >> > + optimizations that leverage the branch-prediction hardware are > >> > + safe on strongly ordered systems (such as x86), but not on weakly > >> > + ordered systems (such as ARM or Power). Choose your compiler > >> > + command-line options wisely! > >> > + > >> > + > >> > +EXAMPLE OF AMPLIFIED RCU-USAGE BUG > >> > + > >> > +Because updaters can run concurrently with RCU readers, RCU readers can > >> > +see stale and/or inconsistent values. If RCU readers need fresh or > >> > +consistent values, which they sometimes do, they need to take proper > >> > +precautions. 
To see this, consider the following code fragment: > >> > + > >> > + struct foo { > >> > + int a; > >> > + int b; > >> > + int c; > >> > + }; > >> > + struct foo *gp1; > >> > + struct foo *gp2; > >> > + > >> > + void updater(void) > >> > + { > >> > + struct foo *p; > >> > + > >> > + p = kmalloc(...); > >> > + if (p == NULL) > >> > + deal_with_it(); > >> > + p->a = 42; /* Each field in its own cache line. */ > >> > + p->b = 43; > >> > + p->c = 44; > >> > + rcu_assign_pointer(gp1, p); > >> > + p->b = 143; > >> > + p->c = 144; > >> > + rcu_assign_pointer(gp2, p); > >> > + } > >> > + > >> > + void reader(void) > >> > + { > >> > + struct foo *p; > >> > + struct foo *q; > >> > + int r1, r2; > >> > + > >> > + p = rcu_dereference(gp2); > >> > + r1 = p->b; /* Guaranteed to get 143. */ > >> > + q = rcu_dereference(gp1); > >> > + if (p == q) { > >> > + /* The compiler decides that q->c is same as p->c. */ > >> > + r2 = p->c; /* Could get 44 on weakly ordered system. */ > >> > + } > >> > + } > >> > + > >> > +You might be surprised that the outcome (r1 == 143 && r2 == 44) is possible, > >> > +but you should not be. After all, the updater might have been invoked > >> > +a second time between the time reader() loaded into "r1" and the time > >> > +that it loaded into "r2". The fact that this same result can occur due > >> > +to some reordering from the compiler and CPUs is beside the point. > >> > + > >> > +But suppose that the reader needs a consistent view? > >> > + > >> > +Then one approach is to use locking, for example, as follows: > >> > + > >> > + struct foo { > >> > + int a; > >> > + int b; > >> > + int c; > >> > + spinlock_t lock; > >> > + }; > >> > + struct foo *gp1; > >> > + struct foo *gp2; > >> > + > >> > + void updater(void) > >> > + { > >> > + struct foo *p; > >> > + > >> > + p = kmalloc(...); > >> > + if (p == NULL) > >> > + deal_with_it(); > >> > + spin_lock(&p->lock); > >> > + p->a = 42; /* Each field in its own cache line. 
*/ > >> > + p->b = 43; > >> > + p->c = 44; > >> > + spin_unlock(&p->lock); > >> > + rcu_assign_pointer(gp1, p); > >> > + spin_lock(&p->lock); > >> > + p->b = 143; > >> > + p->c = 144; > >> > + spin_unlock(&p->lock); > >> > + rcu_assign_pointer(gp2, p); > >> > + } > >> > + > >> > + void reader(void) > >> > + { > >> > + struct foo *p; > >> > + struct foo *q; > >> > + int r1, r2; > >> > + > >> > + p = rcu_dereference(gp2); > >> > + spin_lock(&p->lock); > >> > + r1 = p->b; /* Guaranteed to get 143. */ > >> > + q = rcu_dereference(gp1); > >> > + if (p == q) { > >> > + /* The compiler decides that q->c is same as p->c. */ > >> > + r2 = p->c; /* Could get 44 on weakly ordered system. */ > >> > + } > >> > + spin_unlock(&p->lock); > >> > + } > >> > + > >> > +As always, use the right tool for the job! > >> > + > >> > + > >> > +EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH > >> > + > >> > +If a pointer obtained from rcu_dereference() compares not-equal to some > >> > +other pointer, the compiler normally has no clue what the value of the > >> > +first pointer might be. This lack of knowledge prevents the compiler > >> > +from carrying out optimizations that otherwise might destroy the ordering > >> > +guarantees that RCU depends on. And the volatile cast in rcu_dereference() > >> > +should prevent the compiler from guessing the value. > >> > + > >> > +But without rcu_dereference(), the compiler knows more than you might > >> > +expect. Consider the following code fragment: > >> > + > >> > + struct foo { > >> > + int a; > >> > + int b; > >> > + }; > >> > + static struct foo variable1; > >> > + static struct foo variable2; > >> > + static struct foo *gp = &variable1; > >> > + > >> > + void updater(void) > >> > + { > >> > + initialize_foo(&variable2); > >> > + rcu_assign_pointer(gp, &variable2); > >> > + /* > >> > + * The above is the only store to gp in this translation unit, > >> > + * and the address of gp is not exported in any way. 
> >> > + */ > >> > + } > >> > + > >> > + int reader(void) > >> > + { > >> > + struct foo *p; > >> > + > >> > + p = gp; > >> > + barrier(); > >> > + if (p == &variable1) > >> > + return p->a; /* Must be variable1.a. */ > >> > + else > >> > + return p->b; /* Must be variable2.b. */ > >> > + } > >> > + > >> > +Because the compiler can see all stores to "gp", it knows that the only > >> > +possible values of "gp" are "variable1" on the one hand and "variable2" > >> > +on the other. The comparison in reader() therefore tells the compiler > >> > +the exact value of "p" even in the not-equals case. This allows the > >> > +compiler to make the return values independent of the load from "gp", > >> > +in turn destroying the ordering between this load and the loads of the > >> > +return values. This can result in "p->b" returning pre-initialization > >> > +garbage values. > >> > + > >> > +In short, rcu_dereference() is -not- optional when you are going to > >> > +dereference the resulting pointer. > >> > > >> > -- > >> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > >> > the body of a message to majordomo@vger.kernel.org > >> > More majordomo info at http://vger.kernel.org/majordomo-info.html > >> > Please read the FAQ at http://www.tux.org/lkml/ > >> > > > ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-02 23:20 ` Paul E. McKenney @ 2014-03-02 23:44 ` Peter Sewell 2014-03-03 4:25 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Peter Sewell @ 2014-03-02 23:44 UTC (permalink / raw) To: Paul McKenney Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On 2 March 2014 23:20, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote: >> On 1 March 2014 08:03, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: >> >> Hi Paul, >> >> >> >> On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney >> >> >> > <paulmck@linux.vnet.ibm.com> wrote: >> >> >> > > >> >> >> > > 3. The comparison was against another RCU-protected pointer, >> >> >> > > where that other pointer was properly fetched using one >> >> >> > > of the RCU primitives. Here it doesn't matter which pointer >> >> >> > > you use. At least as long as the rcu_assign_pointer() for >> >> >> > > that other pointer happened after the last update to the >> >> >> > > pointed-to structure. >> >> >> > > >> >> >> > > I am a bit nervous about #3. Any thoughts on it? 
>> >> >> > >> >> >> > I think that it might be worth pointing out as an example, and saying >> >> >> > that code like >> >> >> > >> >> >> > p = atomic_read(consume); >> >> >> > X; >> >> >> > q = atomic_read(consume); >> >> >> > Y; >> >> >> > if (p == q) >> >> >> > data = p->val; >> >> >> > >> >> >> > then the access of "p->val" is constrained to be data-dependent on >> >> >> > *either* p or q, but you can't really tell which, since the compiler >> >> >> > can decide that the values are interchangeable. >> >> >> > >> >> >> > I cannot for the life of me come up with a situation where this would >> >> >> > matter, though. If "X" contains a fence, then that fence will be a >> >> >> > stronger ordering than anything the consume through "p" would >> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the >> >> >> > ordering to the access through "p" is through p or q is kind of >> >> >> > irrelevant. No? >> >> >> >> >> >> I can make a contrived litmus test for it, but you are right, the only >> >> >> time you can see it happen is when X has no barriers, in which case >> >> >> you don't have any ordering anyway -- both the compiler and the CPU can >> >> >> reorder the loads into p and q, and the read from p->val can, as you say, >> >> >> come from either pointer. >> >> >> >> >> >> For whatever it is worth, here is the litmus test: >> >> >> >> >> >> T1: p = kmalloc(...); >> >> >> if (p == NULL) >> >> >> deal_with_it(); >> >> >> p->a = 42; /* Each field in its own cache line. */ >> >> >> p->b = 43; >> >> >> p->c = 44; >> >> >> atomic_store_explicit(&gp1, p, memory_order_release); >> >> >> p->b = 143; >> >> >> p->c = 144; >> >> >> atomic_store_explicit(&gp2, p, memory_order_release); >> >> >> >> >> >> T2: p = atomic_load_explicit(&gp2, memory_order_consume); >> >> >> r1 = p->b; /* Guaranteed to get 143.
*/ >> >> >> q = atomic_load_explicit(&gp1, memory_order_consume); >> >> >> if (p == q) { >> >> >> /* The compiler decides that q->c is the same as p->c. */ >> >> >> r2 = p->c; /* Could get 44 on weakly ordered system. */ >> >> >> } >> >> >> >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what >> >> >> you get. >> >> >> >> >> >> And publishing a structure via one RCU-protected pointer, updating it, >> >> >> then publishing it via another pointer seems to me to be asking for >> >> >> trouble anyway. If you really want to do something like that and still >> >> >> see consistency across all the fields in the structure, please put a lock >> >> >> in the structure and use it to guard updates and accesses to those fields. >> >> > >> >> > And here is a patch documenting the restrictions for the current Linux >> >> > kernel. The rules change a bit due to rcu_dereference() acting a bit >> >> > differently than atomic_load_explicit(&p, memory_order_consume). >> >> > >> >> > Thoughts? >> >> >> >> That might serve as informal documentation for Linux kernel >> >> programmers about the bounds on the optimisations that you expect >> >> compilers to do for common-case RCU code - and I guess that's what you >> >> intend it to be for. But I don't see how one can make it precise >> >> enough to serve as a language definition, so that compiler people >> >> could confidently say "yes, we respect that", which I guess is what >> >> you really need. As a useful criterion, we should aim for something >> >> precise enough that in a verified-compiler context you can >> >> mathematically prove that the compiler will satisfy it (even though >> >> that won't happen anytime soon for GCC), and that analysis tool >> >> authors can actually know what they're working with. All this stuff >> >> about "you should avoid cancellation", and "avoid masking with just a >> >> small number of bits" is just too vague.
>> > >> > Understood, and yes, this is intended to document current compiler >> > behavior for the Linux kernel community. It would not make sense to show >> > it to the C11 or C++11 communities, except perhaps as an informational >> > piece on current practice. >> > >> >> The basic problem is that the compiler may be doing sophisticated >> >> reasoning with a bunch of non-local knowledge that it's deduced from >> >> the code, neither of which are well-understood, and here we have to >> >> identify some envelope, expressive enough for RCU idioms, in which >> >> that reasoning doesn't allow data/address dependencies to be removed >> >> (and hence the hardware guarantee about them will be maintained at the >> >> source level). >> >> >> >> The C11 syntactic notion of dependency, whatever its faults, was at >> >> least precise, could be reasoned about locally (just looking at the >> >> syntactic code in question), and did do that. The fact that current >> >> compilers do optimisations that remove dependencies and will likely >> >> have many bugs at present is besides the point - this was surely >> >> intended as a *new* constraint on what they are allowed to do. The >> >> interesting question is really whether the compiler writers think that >> >> they *could* implement it in a reasonable way - I'd like to hear >> >> Torvald and his colleagues' opinion on that. >> >> >> >> What you're doing above seems to be basically a very cut-down version >> >> of that, but with a fuzzy boundary. If you want it to be precise, >> >> maybe it needs to be much simpler (which might force you into ruling >> >> out some current code idioms). >> > >> > I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806) >> > can be developed to serve this purpose. >> >> (I missed that mail when it first came past, sorry) > > No worries! > >> That's also going to be tricky, I'm afraid. 
The key condition there is: >> >> "* at the time of execution of E, L [PS: I assume that L is a >> typo and should be E] > > I believe it really is "L". As I understand it (and Torvald will correct > me if I am wrong), the idea is that the implementation is prohibited > from guessing the value of "L" -- it must assume that any value from > L's type might be returned, regardless of what it might otherwise know. > > However, after L's value is loaded, the implementation -is- permitted > to learn constraints on this value based on "if" statements and the > like between the load from "L" and the execution of "E". > > Does that help? Not sure (i.e., not really :-). I thought Torvald wanted to say that "E really-depends on L if there exist two different values that (just according to typing) might be read for L that give rise to two different values for E". >> can possibly have returned at >> least two different values under the assumption that L itself >> could have returned any value allowed by L's type." >> >> First, the evaluation of E might be nondeterministic - e.g., for an >> artificial example, if it's just a nondeterministic value obtained >> from the result of a race on SC atomics. The above doesn't >> distinguish between that (which doesn't have a real dependency on L) >> and that XOR'd with L (which does). And it does so in the wrong >> direction: it'll say that the former has a dependency on L. > > Right, it is only any dependency that E has on L that would be > constrained. If E also depends on other quantities obtained some > other way than a memory_order_consume load into a value_dep_preserving > variable, then as I understand it, the compiler is within its rights > to optimize these other quantities to within an inch of their lives. > > It is quite possible that E depends on L only sometimes. For example: > > p = atomic_load_explicit(&gp, memory_order_consume); > p = random() & 0x8 ?
p : &default_structure; > E(p); > > My guess is that in this case, the ordering would be guaranteed only > for those executions where there is a value dependency. In my naive > view, this should be no different than something like this: > > if (random() & 0x10) > p = atomic_load_explicit(&gp, memory_order_acquire); > else > p = &default_structure; > E(p); all this is fine, but... > Or am I missing your point? ...if the idea was to identify "real dependencies" as cases where two values of E are possible based on different values of L, then if two values of E are possible *just anyway* (e.g. because of nondeterminism), the definition gets confused. >> Second, it involves reasoning about counterfactual executions. That >> doesn't necessarily make it wrong, per se, but probably makes it hard >> to work with. For example, suppose that in all the actual >> whole-program executions, a runtime occurrence of L only ever returns >> one particular value (perhaps because of some simple #define'd >> configuration), and that the code used in the evaluation of E depends >> on some invariant which is related to that configuration. The >> hypothetical execution used above in which a different value is used >> is one in which the code is being run in a situation with broken invariants. >> Then there will be technical difficulties in using the definition: >> I don't see how one would persuade oneself that a compiler always >> satisfies it, because these hypothetical executions are far removed >> from what it's actually working on. > > The developer answer would be something like "all it really means is that > the implementation is required to actually emit the memory_order_consume > load and actually use the value," which is probably not much comfort > to someone trying to model it. Maybe there is a better way of wording > this constraint so as to avoid the counterfactuals? maybe. I don't have one right now, though.
>> (Aside: The notion of a thread "observing" another thread's load, >> dating back a long time and adopted in the Power and ARM architecture >> texts, relies on counterfactual executions in a broadly similar way; >> we're happy to have escaped that now :-) > > Here is hoping that there is a way to escape it in this case as well. ;-) ta, Peter > Thanx, Paul > >> Peter >> >> >> >> >> > Thanx, Paul >> > >> >> best, >> >> Peter >> >> >> >> >> >> >> >> > Thanx, Paul >> >> > >> >> > ------------------------------------------------------------------------ >> >> > >> >> > documentation: Record rcu_dereference() value mishandling >> >> > >> >> > Recent LKML discussions (see http://lwn.net/Articles/586838/ and >> >> > http://lwn.net/Articles/588300/ for the LWN writeups) brought out >> >> > some ways of misusing the return value from rcu_dereference() that >> >> > are not necessarily completely intuitive. This commit therefore >> >> > documents what can and cannot safely be done with these values. >> >> > >> >> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> >> >> > >> >> > diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX >> >> > index fa57139f50bf..f773a264ae02 100644 >> >> > --- a/Documentation/RCU/00-INDEX >> >> > +++ b/Documentation/RCU/00-INDEX >> >> > @@ -12,6 +12,8 @@ lockdep-splat.txt >> >> > - RCU Lockdep splats explained. >> >> > NMI-RCU.txt >> >> > - Using RCU to Protect Dynamic NMI Handlers >> >> > +rcu_dereference.txt >> >> > + - Proper care and feeding of return values from rcu_dereference() >> >> > rcubarrier.txt >> >> > - RCU and Unloadable Modules >> >> > rculist_nulls.txt >> >> > diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt >> >> > index 9d10d1db16a5..877947130ebe 100644 >> >> > --- a/Documentation/RCU/checklist.txt >> >> > +++ b/Documentation/RCU/checklist.txt >> >> > @@ -114,12 +114,16 @@ over a rather long period of time, but improvements are always welcome!
>> >> > http://www.openvms.compaq.com/wizard/wiz_2637.html >> >> > >> >> > The rcu_dereference() primitive is also an excellent >> >> > - documentation aid, letting the person reading the code >> >> > - know exactly which pointers are protected by RCU. >> >> > + documentation aid, letting the person reading the >> >> > + code know exactly which pointers are protected by RCU. >> >> > Please note that compilers can also reorder code, and >> >> > they are becoming increasingly aggressive about doing >> >> > - just that. The rcu_dereference() primitive therefore >> >> > - also prevents destructive compiler optimizations. >> >> > + just that. The rcu_dereference() primitive therefore also >> >> > + prevents destructive compiler optimizations. However, >> >> > + with a bit of devious creativity, it is possible to >> >> > + mishandle the return value from rcu_dereference(). >> >> > + Please see rcu_dereference.txt in this directory for >> >> > + more information. >> >> > >> >> > The rcu_dereference() primitive is used by the >> >> > various "_rcu()" list-traversal primitives, such >> >> > diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt >> >> > new file mode 100644 >> >> > index 000000000000..6e72cd8622df >> >> > --- /dev/null >> >> > +++ b/Documentation/RCU/rcu_dereference.txt >> >> > @@ -0,0 +1,365 @@ >> >> > +PROPER CARE AND FEEDING OF RETURN VALUES FROM rcu_dereference() >> >> > + >> >> > +Most of the time, you can use values from rcu_dereference() or one of >> >> > +the similar primitives without worries. Dereferencing (prefix "*"), >> >> > +field selection ("->"), assignment ("="), address-of ("&"), addition and >> >> > +subtraction of constants, and casts all work quite naturally and safely. >> >> > + >> >> > +It is nevertheless possible to get into trouble with other operations. 
>> >> > +Follow these rules to keep your RCU code working properly: >> >> > + >> >> > +o You must use one of the rcu_dereference() family of primitives >> >> > + to load an RCU-protected pointer, otherwise CONFIG_PROVE_RCU >> >> > + will complain. Worse yet, your code can see random memory-corruption >> >> > + bugs due to games that compilers and DEC Alpha can play. >> >> > + Without one of the rcu_dereference() primitives, compilers >> >> > + can reload the value, and won't your code have fun with two >> >> > + different values for a single pointer! Without rcu_dereference(), >> >> > + DEC Alpha can load a pointer, dereference that pointer, and >> >> > + return data preceding initialization that preceded the store of >> >> > + the pointer. >> >> > + >> >> > + In addition, the volatile cast in rcu_dereference() prevents the >> >> > + compiler from deducing the resulting pointer value. Please see >> >> > + the section entitled "EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH" >> >> > + for an example where the compiler can in fact deduce the exact >> >> > + value of the pointer, and thus cause misordering. >> >> > + >> >> > +o Do not use single-element RCU-protected arrays. The compiler >> >> > + is within its rights to assume that the value of an index into >> >> > + such an array must necessarily evaluate to zero. The compiler >> >> > + could then substitute the constant zero for the computation, so >> >> > + that the array index no longer depended on the value returned >> >> > + by rcu_dereference(). If the array index no longer depends >> >> > + on rcu_dereference(), then both the compiler and the CPU >> >> > + are within their rights to order the array access before the >> >> > + rcu_dereference(), which can cause the array access to return >> >> > + garbage. >> >> > + >> >> > +o Avoid cancellation when using the "+" and "-" infix arithmetic >> >> > + operators. For example, for a given variable "x", avoid >> >> > + "(x-x)".
There are similar arithmetic pitfalls from other >> >> > + arithmetic operators, such as "(x*0)", "(x/(x+1))" or "(x%1)". >> >> > + The compiler is within its rights to substitute zero for all of >> >> > + these expressions, so that subsequent accesses no longer depend >> >> > + on the rcu_dereference(), again possibly resulting in bugs due >> >> > + to misordering. >> >> > + >> >> > + Of course, if "p" is a pointer from rcu_dereference(), and "a" >> >> > + and "b" are integers that happen to be equal, the expression >> >> > + "p+a-b" is safe because its value still necessarily depends on >> >> > + the rcu_dereference(), thus maintaining proper ordering. >> >> > + >> >> > +o Avoid all-zero operands to the bitwise "&" operator, and >> >> > + similarly avoid all-ones operands to the bitwise "|" operator. >> >> > + If the compiler is able to deduce the value of such operands, >> >> > + it is within its rights to substitute the corresponding constant >> >> > + for the bitwise operation. Once again, this causes subsequent >> >> > + accesses to no longer depend on the rcu_dereference(), causing >> >> > + bugs due to misordering. >> >> > + >> >> > + Please note that single-bit operands to bitwise "&" can also >> >> > + be dangerous. At this point, the compiler knows that the >> >> > + resulting value can only take on one of two possible values. >> >> > + Therefore, a very small amount of additional information will >> >> > + allow the compiler to deduce the exact value, which again can >> >> > + result in misordering. >> >> > + >> >> > +o If you are using RCU to protect JITed functions, so that the >> >> > + "()" function-invocation operator is applied to a value obtained >> >> > + (directly or indirectly) from rcu_dereference(), you may need to >> >> > + interact directly with the hardware to flush instruction caches. >> >> > + This issue arises on some systems when a newly JITed function is >> >> > + using the same memory that was used by an earlier JITed function.
>> >> > + >> >> > +o Do not use the results from the boolean "&&" and "||" when >> >> > + dereferencing. For example, the following (rather improbable) >> >> > + code is buggy: >> >> > + >> >> > + int a[2]; >> >> > + int index; >> >> > + int force_zero_index = 1; >> >> > + >> >> > + ... >> >> > + >> >> > + r1 = rcu_dereference(i1); >> >> > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ >> >> > + >> >> > + The reason this is buggy is that "&&" and "||" are often compiled >> >> > + using branches. While weak-memory machines such as ARM or PowerPC >> >> > + do order stores after such branches, they can speculate loads, >> >> > + which can result in misordering bugs. >> >> > + >> >> > +o Do not use the results from relational operators ("==", "!=", >> >> > + ">", ">=", "<", or "<=") when dereferencing. For example, >> >> > + the following (quite strange) code is buggy: >> >> > + >> >> > + int a[2]; >> >> > + int index; >> >> > + int flip_index = 0; >> >> > + >> >> > + ... >> >> > + >> >> > + r1 = rcu_dereference(i1); >> >> > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ >> >> > + >> >> > + As before, the reason this is buggy is that relational operators >> >> > + are often compiled using branches. And as before, although >> >> > + weak-memory machines such as ARM or PowerPC do order stores >> >> > + after such branches, they can speculate loads, which can again >> >> > + result in misordering bugs. >> >> > + >> >> > +o Be very careful about comparing pointers obtained from >> >> > + rcu_dereference() against non-NULL values. As Linus Torvalds >> >> > + explained, if the two pointers are equal, the compiler could >> >> > + substitute the pointer you are comparing against for the pointer >> >> > + obtained from rcu_dereference().
For example: >> >> > + >> >> > + p = rcu_dereference(gp); >> >> > + if (p == &default_struct) >> >> > + do_default(p->a); >> >> > + >> >> > + Because the compiler now knows that the value of "p" is exactly >> >> > + the address of the variable "default_struct", it is free to >> >> > + transform this code into the following: >> >> > + >> >> > + p = rcu_dereference(gp); >> >> > + if (p == &default_struct) >> >> > + do_default(default_struct.a); >> >> > + >> >> > + On ARM and Power hardware, the load from "default_struct.a" >> >> > + can now be speculated, such that it might happen before the >> >> > + rcu_dereference(). This could result in bugs due to misordering. >> >> > + >> >> > + However, comparisons are OK in the following cases: >> >> > + >> >> > + o The comparison was against the NULL pointer. If the >> >> > + compiler knows that the pointer is NULL, you had better >> >> > + not be dereferencing it anyway. If the comparison is >> >> > + non-equal, the compiler is none the wiser. Therefore, >> >> > + it is safe to compare pointers from rcu_dereference() >> >> > + against NULL pointers. >> >> > + >> >> > + o The pointer is never dereferenced after being compared. >> >> > + Since there are no subsequent dereferences, the compiler >> >> > + cannot use anything it learned from the comparison >> >> > + to reorder the non-existent subsequent dereferences. >> >> > + This sort of comparison occurs frequently when scanning >> >> > + RCU-protected circular linked lists. >> >> > + >> >> > + o The comparison is against a pointer that >> >> > + references memory that was initialized "a long time ago." >> >> > + The reason this is safe is that even if misordering >> >> > + occurs, the misordering will not affect the accesses >> >> > + that follow the comparison. So exactly how long ago is >> >> > + "a long time ago"? Here are some possibilities: >> >> > + >> >> > + o Compile time. >> >> > + >> >> > + o Boot time.
>> >> > + >> >> > + o Module-init time for module code. >> >> > + >> >> > + o Prior to kthread creation for kthread code. >> >> > + >> >> > + o During some prior acquisition of the lock that >> >> > + we now hold. >> >> > + >> >> > + o Before mod_timer() time for a timer handler. >> >> > + >> >> > + There are many other possibilities involving the Linux >> >> > + kernel's wide array of primitives that cause code to >> >> > + be invoked at a later time. >> >> > + >> >> > + o The pointer being compared against also came from >> >> > + rcu_dereference(). In this case, both pointers depend >> >> > + on one rcu_dereference() or another, so you get proper >> >> > + ordering either way. >> >> > + >> >> > + That said, this situation can make certain RCU usage >> >> > + bugs more likely to happen. Which can be a good thing, >> >> > + at least if they happen during testing. An example >> >> > + of such an RCU usage bug is shown in the section titled >> >> > + "EXAMPLE OF AMPLIFIED RCU-USAGE BUG". >> >> > + >> >> > + o All of the accesses following the comparison are stores, >> >> > + so that a control dependency preserves the needed ordering. >> >> > + That said, it is easy to get control dependencies wrong. >> >> > + Please see the "CONTROL DEPENDENCIES" section of >> >> > + Documentation/memory-barriers.txt for more details. >> >> > + >> >> > + o The pointers compared not-equal -and- the compiler does >> >> > + not have enough information to deduce the value of the >> >> > + pointer. Note that the volatile cast in rcu_dereference() >> >> > + will normally prevent the compiler from knowing too much. >> >> > + >> >> > +o Disable any value-speculation optimizations that your compiler >> >> > + might provide, especially if you are making use of feedback-based >> >> > + optimizations that take data collected from prior runs. Such >> >> > + value-speculation optimizations reorder operations by design. 
>> >> > + >> >> > + There is one exception to this rule: Value-speculation >> >> > + optimizations that leverage the branch-prediction hardware are >> >> > + safe on strongly ordered systems (such as x86), but not on weakly >> >> > + ordered systems (such as ARM or Power). Choose your compiler >> >> > + command-line options wisely! >> >> > + >> >> > + >> >> > +EXAMPLE OF AMPLIFIED RCU-USAGE BUG >> >> > + >> >> > +Because updaters can run concurrently with RCU readers, RCU readers can >> >> > +see stale and/or inconsistent values. If RCU readers need fresh or >> >> > +consistent values, which they sometimes do, they need to take proper >> >> > +precautions. To see this, consider the following code fragment: >> >> > + >> >> > + struct foo { >> >> > + int a; >> >> > + int b; >> >> > + int c; >> >> > + }; >> >> > + struct foo *gp1; >> >> > + struct foo *gp2; >> >> > + >> >> > + void updater(void) >> >> > + { >> >> > + struct foo *p; >> >> > + >> >> > + p = kmalloc(...); >> >> > + if (p == NULL) >> >> > + deal_with_it(); >> >> > + p->a = 42; /* Each field in its own cache line. */ >> >> > + p->b = 43; >> >> > + p->c = 44; >> >> > + rcu_assign_pointer(gp1, p); >> >> > + p->b = 143; >> >> > + p->c = 144; >> >> > + rcu_assign_pointer(gp2, p); >> >> > + } >> >> > + >> >> > + void reader(void) >> >> > + { >> >> > + struct foo *p; >> >> > + struct foo *q; >> >> > + int r1, r2; >> >> > + >> >> > + p = rcu_dereference(gp2); >> >> > + r1 = p->b; /* Guaranteed to get 143. */ >> >> > + q = rcu_dereference(gp1); >> >> > + if (p == q) { >> >> > + /* The compiler decides that q->c is same as p->c. */ >> >> > + r2 = p->c; /* Could get 44 on weakly order system. */ >> >> > + } >> >> > + } >> >> > + >> >> > +You might be surprised that the outcome (r1 == 143 && r2 == 44) is possible, >> >> > +but you should not be. After all, the updater might have been invoked >> >> > +a second time between the time reader() loaded into "r1" and the time >> >> > +that it loaded into "r2". 
The fact that this same result can occur due >> >> > +to some reordering from the compiler and CPUs is beside the point. >> >> > + >> >> > +But suppose that the reader needs a consistent view? >> >> > + >> >> > +Then one approach is to use locking, for example, as follows: >> >> > + >> >> > + struct foo { >> >> > + int a; >> >> > + int b; >> >> > + int c; >> >> > + spinlock_t lock; >> >> > + }; >> >> > + struct foo *gp1; >> >> > + struct foo *gp2; >> >> > + >> >> > + void updater(void) >> >> > + { >> >> > + struct foo *p; >> >> > + >> >> > + p = kmalloc(...); >> >> > + if (p == NULL) >> >> > + deal_with_it(); >> >> > + spin_lock(&p->lock); >> >> > + p->a = 42; /* Each field in its own cache line. */ >> >> > + p->b = 43; >> >> > + p->c = 44; >> >> > + spin_unlock(&p->lock); >> >> > + rcu_assign_pointer(gp1, p); >> >> > + spin_lock(&p->lock); >> >> > + p->b = 143; >> >> > + p->c = 144; >> >> > + spin_unlock(&p->lock); >> >> > + rcu_assign_pointer(gp2, p); >> >> > + } >> >> > + >> >> > + void reader(void) >> >> > + { >> >> > + struct foo *p; >> >> > + struct foo *q; >> >> > + int r1, r2; >> >> > + >> >> > + p = rcu_dereference(gp2); >> >> > + spin_lock(&p->lock); >> >> > + r1 = p->b; /* Guaranteed to get 143. */ >> >> > + q = rcu_dereference(gp1); >> >> > + if (p == q) { >> >> > + /* The compiler decides that q->c is the same as p->c. */ >> >> > + r2 = p->c; /* Locking guarantees r2 == 144. */ >> >> > + } >> >> > + spin_unlock(&p->lock); >> >> > + } >> >> > + >> >> > +As always, use the right tool for the job! >> >> > + >> >> > + >> >> > +EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH >> >> > + >> >> > +If a pointer obtained from rcu_dereference() compares not-equal to some >> >> > +other pointer, the compiler normally has no clue what the value of the >> >> > +first pointer might be. This lack of knowledge prevents the compiler >> >> > +from carrying out optimizations that otherwise might destroy the ordering >> >> > +guarantees that RCU depends on.
And the volatile cast in rcu_dereference() >> >> > +should prevent the compiler from guessing the value. >> >> > + >> >> > +But without rcu_dereference(), the compiler knows more than you might >> >> > +expect. Consider the following code fragment: >> >> > + >> >> > + struct foo { >> >> > + int a; >> >> > + int b; >> >> > + }; >> >> > + static struct foo variable1; >> >> > + static struct foo variable2; >> >> > + static struct foo *gp = &variable1; >> >> > + >> >> > + void updater(void) >> >> > + { >> >> > + initialize_foo(&variable2); >> >> > + rcu_assign_pointer(gp, &variable2); >> >> > + /* >> >> > + * The above is the only store to gp in this translation unit, >> >> > + * and the address of gp is not exported in any way. >> >> > + */ >> >> > + } >> >> > + >> >> > + int reader(void) >> >> > + { >> >> > + struct foo *p; >> >> > + >> >> > + p = gp; >> >> > + barrier(); >> >> > + if (p == &variable1) >> >> > + return p->a; /* Must be variable1.a. */ >> >> > + else >> >> > + return p->b; /* Must be variable2.b. */ >> >> > + } >> >> > + >> >> > +Because the compiler can see all stores to "gp", it knows that the only >> >> > +possible values of "gp" are "variable1" on the one hand and "variable2" >> >> > +on the other. The comparison in reader() therefore tells the compiler >> >> > +the exact value of "p" even in the not-equals case. This allows the >> >> > +compiler to make the return values independent of the load from "gp", >> >> > +in turn destroying the ordering between this load and the loads of the >> >> > +return values. This can result in "p->b" returning pre-initialization >> >> > +garbage values. >> >> > + >> >> > +In short, rcu_dereference() is -not- optional when you are going to >> >> > +dereference the resulting pointer. 
>> >> > >> >> > -- >> >> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in >> >> > the body of a message to majordomo@vger.kernel.org >> >> > More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> > Please read the FAQ at http://www.tux.org/lkml/ >> >> >> > >> > ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-02 23:44 ` Peter Sewell @ 2014-03-03 4:25 ` Paul E. McKenney 0 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-03-03 4:25 UTC (permalink / raw) To: Peter Sewell Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sun, Mar 02, 2014 at 11:44:52PM +0000, Peter Sewell wrote: > On 2 March 2014 23:20, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > > On Sun, Mar 02, 2014 at 04:05:52AM -0600, Peter Sewell wrote: > >> On 1 March 2014 08:03, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: > >> >> Hi Paul, > >> >> > >> >> On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: > >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: > >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney > >> >> >> > <paulmck@linux.vnet.ibm.com> wrote: > >> >> >> > > > >> >> >> > > 3. The comparison was against another RCU-protected pointer, > >> >> >> > > where that other pointer was properly fetched using one > >> >> >> > > of the RCU primitives. Here it doesn't matter which pointer > >> >> >> > > you use. At least as long as the rcu_assign_pointer() for > >> >> >> > > that other pointer happened after the last update to the > >> >> >> > > pointed-to structure. > >> >> >> > > > >> >> >> > > I am a bit nervous about #3. Any thoughts on it? 
> >> >> >> > > >> >> >> > I think that it might be worth pointing out as an example, and saying > >> >> >> > that code like > >> >> >> > > >> >> >> > p = atomic_read(consume); > >> >> >> > X; > >> >> >> > q = atomic_read(consume); > >> >> >> > Y; > >> >> >> > if (p == q) > >> >> >> > data = p->val; > >> >> >> > > >> >> >> > then the access of "p->val" is constrained to be data-dependent on > >> >> >> > *either* p or q, but you can't really tell which, since the compiler > >> >> >> > can decide that the values are interchangeable. > >> >> >> > > >> >> >> > I cannot for the life of me come up with a situation where this would > >> >> >> > matter, though. If "X" contains a fence, then that fence will be a > >> >> >> > stronger ordering than anything the consume through "p" would > >> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the > >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the > >> >> >> > ordering to the access through "p" is through p or q is kind of > >> >> >> > irrelevant. No? > >> >> >> > >> >> >> I can make a contrived litmus test for it, but you are right, the only > >> >> >> time you can see it happen is when X has no barriers, in which case > >> >> >> you don't have any ordering anyway -- both the compiler and the CPU can > >> >> >> reorder the loads into p and q, and the read from p->val can, as you say, > >> >> >> come from either pointer. > >> >> >> > >> >> >> For whatever it is worth, here is the litmus test: > >> >> >> > >> >> >> T1: p = kmalloc(...); > >> >> >> if (p == NULL) > >> >> >> deal_with_it(); > >> >> >> p->a = 42; /* Each field in its own cache line. 
*/ > >> >> >> p->b = 43; > >> >> >> p->c = 44; > >> >> >> atomic_store_explicit(&gp1, p, memory_order_release); > >> >> >> p->b = 143; > >> >> >> p->c = 144; > >> >> >> atomic_store_explicit(&gp2, p, memory_order_release); > >> >> >> > >> >> >> T2: p = atomic_load_explicit(&gp2, memory_order_consume); > >> >> >> r1 = p->b; /* Guaranteed to get 143. */ > >> >> >> q = atomic_load_explicit(&gp1, memory_order_consume); > >> >> >> if (p == q) { > >> >> >> /* The compiler decides that q->c is same as p->c. */ > >> >> >> r2 = p->c; /* Could get 44 on weakly order system. */ > >> >> >> } > >> >> >> > >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what > >> >> >> you get. > >> >> >> > >> >> >> And publishing a structure via one RCU-protected pointer, updating it, > >> >> >> then publishing it via another pointer seems to me to be asking for > >> >> >> trouble anyway. If you really want to do something like that and still > >> >> >> see consistency across all the fields in the structure, please put a lock > >> >> >> in the structure and use it to guard updates and accesses to those fields. > >> >> > > >> >> > And here is a patch documenting the restrictions for the current Linux > >> >> > kernel. The rules change a bit due to rcu_dereference() acting a bit > >> >> > differently than atomic_load_explicit(&p, memory_order_consume). > >> >> > > >> >> > Thoughts? > >> >> > >> >> That might serve as informal documentation for linux kernel > >> >> programmers about the bounds on the optimisations that you expect > >> >> compilers to do for common-case RCU code - and I guess that's what you > >> >> intend it to be for. But I don't see how one can make it precise > >> >> enough to serve as a language definition, so that compiler people > >> >> could confidently say "yes, we respect that", which I guess is what > >> >> you really need. 
As a useful criterion, we should aim for something > >> >> precise enough that in a verified-compiler context you can > >> >> mathematically prove that the compiler will satisfy it (even though > >> >> that won't happen anytime soon for GCC), and that analysis tool > >> >> authors can actually know what they're working with. All this stuff > >> about "you should avoid cancellation", and "avoid masking with just a > >> >> small number of bits" is just too vague. > >> > > >> > Understood, and yes, this is intended to document current compiler > >> > behavior for the Linux kernel community. It would not make sense to show > >> > it to the C11 or C++11 communities, except perhaps as an informational > >> > piece on current practice. > >> > > >> >> The basic problem is that the compiler may be doing sophisticated > >> >> reasoning with a bunch of non-local knowledge that it's deduced from > >> >> the code, neither of which are well-understood, and here we have to > >> >> identify some envelope, expressive enough for RCU idioms, in which > >> >> that reasoning doesn't allow data/address dependencies to be removed > >> >> (and hence the hardware guarantee about them will be maintained at the > >> >> source level). > >> >> > >> >> The C11 syntactic notion of dependency, whatever its faults, was at > >> >> least precise, could be reasoned about locally (just looking at the > >> >> syntactic code in question), and did do that. The fact that current > >> >> compilers do optimisations that remove dependencies and will likely > >> >> have many bugs at present is besides the point - this was surely > >> >> intended as a *new* constraint on what they are allowed to do. The > >> >> interesting question is really whether the compiler writers think that > >> >> they *could* implement it in a reasonable way - I'd like to hear > >> >> Torvald and his colleagues' opinion on that. 
> >> >> > >> >> What you're doing above seems to be basically a very cut-down version > >> >> of that, but with a fuzzy boundary. If you want it to be precise, > >> >> maybe it needs to be much simpler (which might force you into ruling > >> >> out some current code idioms). > >> > > >> > I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806) > >> > can be developed to serve this purpose. > >> > >> (I missed that mail when it first came past, sorry) > > > > No worries! > > > >> That's also going to be tricky, I'm afraid. The key condition there is: > >> > >> "* at the time of execution of E, L [PS: I assume that L is a > >> typo and should be E] > > > > I believe it really is "L". As I understand it (and Torvald will correct > > me if I am wrong), the idea is that the implementation is prohibited > > from guessing the value of "L" -- it must assume that any value from > > L's type might be returned, regardless of what it might otherwise know. > > > > However, after L's value is loaded, the implementation -is- permitted > > to learn constraints on this value based on "if" statements and the > > like between the load from "L" and the execution of "E". > > > > Does that help? > > Not sure (i.e., not really :-). I thought Torvald wanted to say that > "E really-depends on L if there exist two different values that (just > according to typing) might be read for L that give rise to two > different values for E". My interpretation was that for there to be a value dependency from L to E, L must have the possibility of taking on at least two values at the point in the code where E resides, but that the computation of E might well result in only one possible value. I guess we need Torvald to tell us which he meant. ;-) > >> can possibly have returned at > >> least two different values under the assumption that L itself > >> could have returned any value allowed by L's type." 
> >> > >> First, the evaluation of E might be nondeterministic - e.g., for an > >> artificial example, if it's just a nondeterministic value obtained > >> from the result of a race on SC atomics. The above doesn't > >> distinguish between that (which doesn't have a real dependency on L) > >> and that XOR'd with L (which does). And it does so in the wrong > >> direction: it'll say there the former has a dependency on L. > > > > Right, it is only any dependency that E has on L that would be > > constrained. If E also depends on other quantities obtained some > > other way than a memory_order_consume load into a value_dep_preserving, > > variable, then as I understand it, the compiler is within its rights > > to optimize these other quantities to within an inch of their lives. > > > > It is quite possible that E depends on L only sometimes. For example: > > > > p = atomic_load_explicit(&gp, memory_order_consume); > > p = random() & 0x8 ? p : &default_structure; > > E(p); > > > > My guess is that in this case, the ordering would be guaranteed only > > for those executions where there is a value dependency. In my naive > > view, this should be no different than something like this: > > > > if (random() & 0x10) > > p = atomic_load_explicit(&gp, memory_order_acquire); > > else > > p = &default_structure; > > E(p); > > all this is fine, but... > > > Or am I missing your point? > > ...if the idea was to identify "real dependencies" as cases where two > values of E are possible based on different values of L, then if two > values of E are possible *just anyway* (e.g. because of > nondeterminism), the definition gets confused. Understood. Does this confusion persist in the case where it is only L that is required to have the possibility of taking on two or more values? > >> Second, it involves reasoning about counterfactual executions. That > >> doesn't necessarily make it wrong, per se, but probably makes it hard > >> to work with. 
For example, suppose that in all the actual > >> whole-program executions, a runtime occurrence of L only ever returns > >> one particular value (perhaps because of some simple #define'd > >> configuration), and that the code used in the evaluation of E depends > >> on some invariant which is related to that configuration. The > >> hypothetical execution used above in which a different value is used > >> is one in the code is being run in a situation with broken invariants. > >> Then there will be technical difficulties in using the definition: > >> I don't see how one would persuade oneself that a compiler always > >> satisfies it, because these hypothetical executions are far removed > >> from what it's actually working on. > > > > The developer answer would be something like "all it really means is that > > the implementation is required to actually emit the memory_order_consume > > load and actually use the value," which is probably not much comfort > > to someone trying to model it. Maybe there is a better way of wording > > this constraint so as to avoid the counterfactuals? > > maybe. I don't have one right now, though. > > >> (Aside: The notion of a thread "observing" another thread's load, > >> dating back a long time and adopted in the Power and ARM architecture > >> texts, relies on counterfactual executions in a broadly similar way; > >> we're happy to have escaped that now :-) > > > > Here is hoping that there is a way to escape it in this case as well. 
;-) > > > ta, > Peter Thanx, Paul > >> Peter > >> > >> > >> > >> > >> > Thanx, Paul > >> > > >> >> best, > >> >> Peter > >> >> > >> >> > >> >> > >> >> > Thanx, Paul > >> >> > > >> >> > ------------------------------------------------------------------------ > >> >> > > >> >> > documentation: Record rcu_dereference() value mishandling > >> >> > > >> >> > Recent LKML discussings (see http://lwn.net/Articles/586838/ and > >> >> > http://lwn.net/Articles/588300/ for the LWN writeups) brought out > >> >> > some ways of misusing the return value from rcu_dereference() that > >> >> > are not necessarily completely intuitive. This commit therefore > >> >> > documents what can and cannot safely be done with these values. > >> >> > > >> >> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > >> >> > > >> >> > diff --git a/Documentation/RCU/00-INDEX b/Documentation/RCU/00-INDEX > >> >> > index fa57139f50bf..f773a264ae02 100644 > >> >> > --- a/Documentation/RCU/00-INDEX > >> >> > +++ b/Documentation/RCU/00-INDEX > >> >> > @@ -12,6 +12,8 @@ lockdep-splat.txt > >> >> > - RCU Lockdep splats explained. > >> >> > NMI-RCU.txt > >> >> > - Using RCU to Protect Dynamic NMI Handlers > >> >> > +rcu_dereference.txt > >> >> > + - Proper care and feeding of return values from rcu_dereference() > >> >> > rcubarrier.txt > >> >> > - RCU and Unloadable Modules > >> >> > rculist_nulls.txt > >> >> > diff --git a/Documentation/RCU/checklist.txt b/Documentation/RCU/checklist.txt > >> >> > index 9d10d1db16a5..877947130ebe 100644 > >> >> > --- a/Documentation/RCU/checklist.txt > >> >> > +++ b/Documentation/RCU/checklist.txt > >> >> > @@ -114,12 +114,16 @@ over a rather long period of time, but improvements are always welcome! > >> >> > http://www.openvms.compaq.com/wizard/wiz_2637.html > >> >> > > >> >> > The rcu_dereference() primitive is also an excellent > >> >> > - documentation aid, letting the person reading the code > >> >> > - know exactly which pointers are protected by RCU. 
> >> >> > + documentation aid, letting the person reading the > >> >> > + code know exactly which pointers are protected by RCU. > >> >> > Please note that compilers can also reorder code, and > >> >> > they are becoming increasingly aggressive about doing > >> >> > - just that. The rcu_dereference() primitive therefore > >> >> > - also prevents destructive compiler optimizations. > >> >> > + just that. The rcu_dereference() primitive therefore also > >> >> > + prevents destructive compiler optimizations. However, > >> >> > + with a bit of devious creativity, it is possible to > >> >> > + mishandle the return value from rcu_dereference(). > >> >> > + Please see rcu_dereference.txt in this directory for > >> >> > + more information. > >> >> > > >> >> > The rcu_dereference() primitive is used by the > >> >> > various "_rcu()" list-traversal primitives, such > >> >> > diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt > >> >> > new file mode 100644 > >> >> > index 000000000000..6e72cd8622df > >> >> > --- /dev/null > >> >> > +++ b/Documentation/RCU/rcu_dereference.txt > >> >> > @@ -0,0 +1,365 @@ > >> >> > +PROPER CARE AND FEEDING OF RETURN VALUES FROM rcu_dereference() > >> >> > + > >> >> > +Most of the time, you can use values from rcu_dereference() or one of > >> >> > +the similar primitives without worries. Dereferencing (prefix "*"), > >> >> > +field selection ("->"), assignment ("="), address-of ("&"), addition and > >> >> > +subtraction of constants, and casts all work quite naturally and safely. > >> >> > + > >> >> > +It is nevertheless possible to get into trouble with other operations. > >> >> > +Follow these rules to keep your RCU code working properly: > >> >> > + > >> >> > +o You must use one of the rcu_dereference() family of primitives > >> >> > + to load an RCU-protected pointer, otherwise CONFIG_PROVE_RCU > >> >> > + will complain. 
Worse yet, your code can see random memory-corruption > >> >> > + bugs due to games that compilers and DEC Alpha can play. > >> >> > + Without one of the rcu_dereference() primitives, compilers > >> >> > + can reload the value, and won't your code have fun with two > >> >> > + different values for a single pointer! Without rcu_dereference(), > >> >> > + DEC Alpha can load a pointer, dereference that pointer, and > >> >> > + return data preceding initialization that preceded the store of > >> >> > + the pointer. > >> >> > + > >> >> > + In addition, the volatile cast in rcu_dereference() prevents the > >> >> > + compiler from deducing the resulting pointer value. Please see > >> >> > + the section entitled "EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH" > >> >> > + for an example where the compiler can in fact deduce the exact > >> >> > + value of the pointer, and thus cause misordering. > >> >> > + > >> >> > +o Do not use single-element RCU-protected arrays. The compiler > >> >> > + is within its rights to assume that the value of an index into > >> >> > + such an array must necessarily evaluate to zero. The compiler > >> >> > + could then substitute the constant zero for the computation, so > >> >> > + that the array index no longer depended on the value returned > >> >> > + by rcu_dereference(). If the array index no longer depends > >> >> > + on rcu_dereference(), then both the compiler and the CPU > >> >> > + are within their rights to order the array access before the > >> >> > + rcu_dereference(), which can cause the array access to return > >> >> > + garbage. > >> >> > + > >> >> > +o Avoid cancellation when using the "+" and "-" infix arithmetic > >> >> > + operators. For example, for a given variable "x", avoid > >> >> > + "(x-x)". There are similar arithmetic pitfalls from other > >> >> > + arithmetic operators, such as "(x*0)", "(x/(x+1))" or "(x%1)". 
> >> >> > + The compiler is within its rights to substitute zero for all of > >> >> > + these expressions, so that subsequent accesses no longer depend > >> >> > + on the rcu_dereference(), again possibly resulting in bugs due > >> >> > + to misordering. > >> >> > + > >> >> > + Of course, if "p" is a pointer from rcu_dereference(), and "a" > >> >> > + and "b" are integers that happen to be equal, the expression > >> >> > + "p+a-b" is safe because its value still necessarily depends on > >> >> > + the rcu_dereference(), thus maintaining proper ordering. > >> >> > + > >> >> > +o Avoid all-zero operands to the bitwise "&" operator, and > >> >> > + similarly avoid all-ones operands to the bitwise "|" operator. > >> >> > + If the compiler is able to deduce the value of such operands, > >> >> > + it is within its rights to substitute the corresponding constant > >> >> > + for the bitwise operation. Once again, this causes subsequent > >> >> > + accesses to no longer depend on the rcu_dereference(), causing > >> >> > + bugs due to misordering. > >> >> > + > >> >> > + Please note that single-bit operands to bitwise "&" can also > >> >> > + be dangerous. At this point, the compiler knows that the > >> >> > + resulting value can only take on one of two possible values. > >> >> > + Therefore, a very small amount of additional information will > >> >> > + allow the compiler to deduce the exact value, which again can > >> >> > + result in misordering. > >> >> > + > >> >> > +o If you are using RCU to protect JITed functions, so that the > >> >> > + "()" function-invocation operator is applied to a value obtained > >> >> > + (directly or indirectly) from rcu_dereference(), you may need to > >> >> > + interact directly with the hardware to flush instruction caches. > >> >> > + This issue arises on some systems when a newly JITed function is > >> >> > + using the same memory that was used by an earlier JITed function. 
> >> >> > + > >> >> > +o Do not use the results from the boolean "&&" and "||" when > >> >> > + dereferencing. For example, the following (rather improbable) > >> >> > + code is buggy: > >> >> > + > >> >> > + int a[2]; > >> >> > + int index; > >> >> > + int force_zero_index = 1; > >> >> > + > >> >> > + ... > >> >> > + > >> >> > + r1 = rcu_dereference(i1); > >> >> > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > >> >> > + > >> >> > + The reason this is buggy is that "&&" and "||" are often compiled > >> >> > + using branches. While weak-memory machines such as ARM or PowerPC > >> >> > + do order stores after such branches, they can speculate loads, > >> >> > + which can result in misordering bugs. > >> >> > + > >> >> > +o Do not use the results from relational operators ("==", "!=", > >> >> > + ">", ">=", "<", or "<=") when dereferencing. For example, > >> >> > + the following (quite strange) code is buggy: > >> >> > + > >> >> > + int a[2]; > >> >> > + int index; > >> >> > + int flip_index = 0; > >> >> > + > >> >> > + ... > >> >> > + > >> >> > + r1 = rcu_dereference(i1); > >> >> > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > >> >> > + > >> >> > + As before, the reason this is buggy is that relational operators > >> >> > + are often compiled using branches. And as before, although > >> >> > + weak-memory machines such as ARM or PowerPC do order stores > >> >> > + after such branches, they can speculate loads, which can again > >> >> > + result in misordering bugs. > >> >> > + > >> >> > +o Be very careful about comparing pointers obtained from > >> >> > + rcu_dereference() against non-NULL values. As Linus Torvalds > >> >> > + explained, if the two pointers are equal, the compiler could > >> >> > + substitute the pointer you are comparing against for the pointer > >> >> > + obtained from rcu_dereference(). 
For example: > >> >> > + > >> >> > + p = rcu_dereference(gp); > >> >> > + if (p == &default_struct) > >> >> > + do_default(p->a); > >> >> > + > >> >> > + Because the compiler now knows that the value of "p" is exactly > >> >> > + the address of the variable "default_struct", it is free to > >> >> > + transform this code into the following: > >> >> > + > >> >> > + p = rcu_dereference(gp); > >> >> > + if (p == &default_struct) > >> >> > + do_default(default_struct.a); > >> >> > + > >> >> > + On ARM and Power hardware, the load from "default_struct.a" > >> >> > + can now be speculated, such that it might happen before the > >> >> > + rcu_dereference(). This could result in bugs due to misordering. > >> >> > + > >> >> > + However, comparisons are OK in the following cases: > >> >> > + > >> >> > + o The comparison was against the NULL pointer. If the > >> >> > + compiler knows that the pointer is NULL, you had better > >> >> > + not be dereferencing it anyway. If the comparison is > >> >> > + non-equal, the compiler is none the wiser. Therefore, > >> >> > + it is safe to compare pointers from rcu_dereference() > >> >> > + against NULL pointers. > >> >> > + > >> >> > + o The pointer is never dereferenced after being compared. > >> >> > + Since there are no subsequent dereferences, the compiler > >> >> > + cannot use anything it learned from the comparison > >> >> > + to reorder the non-existent subsequent dereferences. > >> >> > + This sort of comparison occurs frequently when scanning > >> >> > + RCU-protected circular linked lists. > >> >> > + > >> >> > + o The comparison is against a pointer that > >> >> > + references memory that was initialized "a long time ago." > >> >> > + The reason this is safe is that even if misordering > >> >> > + occurs, the misordering will not affect the accesses > >> >> > + that follow the comparison. So exactly how long ago is > >> >> > + "a long time ago"? Here are some possibilities: > >> >> > + > >> >> > + o Compile time. 
> >> >> > + > >> >> > + o Boot time. > >> >> > + > >> >> > + o Module-init time for module code. > >> >> > + > >> >> > + o Prior to kthread creation for kthread code. > >> >> > + > >> >> > + o During some prior acquisition of the lock that > >> >> > + we now hold. > >> >> > + > >> >> > + o Before mod_timer() time for a timer handler. > >> >> > + > >> >> > + There are many other possibilities involving the Linux > >> >> > + kernel's wide array of primitives that cause code to > >> >> > + be invoked at a later time. > >> >> > + > >> >> > + o The pointer being compared against also came from > >> >> > + rcu_dereference(). In this case, both pointers depend > >> >> > + on one rcu_dereference() or another, so you get proper > >> >> > + ordering either way. > >> >> > + > >> >> > + That said, this situation can make certain RCU usage > >> >> > + bugs more likely to happen. Which can be a good thing, > >> >> > + at least if they happen during testing. An example > >> >> > + of such an RCU usage bug is shown in the section titled > >> >> > + "EXAMPLE OF AMPLIFIED RCU-USAGE BUG". > >> >> > + > >> >> > + o All of the accesses following the comparison are stores, > >> >> > + so that a control dependency preserves the needed ordering. > >> >> > + That said, it is easy to get control dependencies wrong. > >> >> > + Please see the "CONTROL DEPENDENCIES" section of > >> >> > + Documentation/memory-barriers.txt for more details. > >> >> > + > >> >> > + o The pointers compared not-equal -and- the compiler does > >> >> > + not have enough information to deduce the value of the > >> >> > + pointer. Note that the volatile cast in rcu_dereference() > >> >> > + will normally prevent the compiler from knowing too much. > >> >> > + > >> >> > +o Disable any value-speculation optimizations that your compiler > >> >> > + might provide, especially if you are making use of feedback-based > >> >> > + optimizations that take data collected from prior runs. 
Such > >> >> > + value-speculation optimizations reorder operations by design. > >> >> > + > >> >> > + There is one exception to this rule: Value-speculation > >> >> > + optimizations that leverage the branch-prediction hardware are > >> >> > + safe on strongly ordered systems (such as x86), but not on weakly > >> >> > + ordered systems (such as ARM or Power). Choose your compiler > >> >> > + command-line options wisely! > >> >> > + > >> >> > + > >> >> > +EXAMPLE OF AMPLIFIED RCU-USAGE BUG > >> >> > + > >> >> > +Because updaters can run concurrently with RCU readers, RCU readers can > >> >> > +see stale and/or inconsistent values. If RCU readers need fresh or > >> >> > +consistent values, which they sometimes do, they need to take proper > >> >> > +precautions. To see this, consider the following code fragment: > >> >> > + > >> >> > + struct foo { > >> >> > + int a; > >> >> > + int b; > >> >> > + int c; > >> >> > + }; > >> >> > + struct foo *gp1; > >> >> > + struct foo *gp2; > >> >> > + > >> >> > + void updater(void) > >> >> > + { > >> >> > + struct foo *p; > >> >> > + > >> >> > + p = kmalloc(...); > >> >> > + if (p == NULL) > >> >> > + deal_with_it(); > >> >> > + p->a = 42; /* Each field in its own cache line. */ > >> >> > + p->b = 43; > >> >> > + p->c = 44; > >> >> > + rcu_assign_pointer(gp1, p); > >> >> > + p->b = 143; > >> >> > + p->c = 144; > >> >> > + rcu_assign_pointer(gp2, p); > >> >> > + } > >> >> > + > >> >> > + void reader(void) > >> >> > + { > >> >> > + struct foo *p; > >> >> > + struct foo *q; > >> >> > + int r1, r2; > >> >> > + > >> >> > + p = rcu_dereference(gp2); > >> >> > + r1 = p->b; /* Guaranteed to get 143. */ > >> >> > + q = rcu_dereference(gp1); > >> >> > + if (p == q) { > >> >> > + /* The compiler decides that q->c is same as p->c. */ > >> >> > + r2 = p->c; /* Could get 44 on weakly order system. 
*/ > >> >> > + } > >> >> > + } > >> >> > + > >> >> > +You might be surprised that the outcome (r1 == 143 && r2 == 44) is possible, > >> >> > +but you should not be. After all, the updater might have been invoked > >> >> > +a second time between the time reader() loaded into "r1" and the time > >> >> > +that it loaded into "r2". The fact that this same result can occur due > >> >> > +to some reordering from the compiler and CPUs is beside the point. > >> >> > + > >> >> > +But suppose that the reader needs a consistent view? > >> >> > + > >> >> > +Then one approach is to use locking, for example, as follows: > >> >> > + > >> >> > + struct foo { > >> >> > + int a; > >> >> > + int b; > >> >> > + int c; > >> >> > + spinlock_t lock; > >> >> > + }; > >> >> > + struct foo *gp1; > >> >> > + struct foo *gp2; > >> >> > + > >> >> > + void updater(void) > >> >> > + { > >> >> > + struct foo *p; > >> >> > + > >> >> > + p = kmalloc(...); > >> >> > + if (p == NULL) > >> >> > + deal_with_it(); > >> >> > + spin_lock(&p->lock); > >> >> > + p->a = 42; /* Each field in its own cache line. */ > >> >> > + p->b = 43; > >> >> > + p->c = 44; > >> >> > + spin_unlock(&p->lock); > >> >> > + rcu_assign_pointer(gp1, p); > >> >> > + spin_lock(&p->lock); > >> >> > + p->b = 143; > >> >> > + p->c = 144; > >> >> > + spin_unlock(&p->lock); > >> >> > + rcu_assign_pointer(gp2, p); > >> >> > + } > >> >> > + > >> >> > + void reader(void) > >> >> > + { > >> >> > + struct foo *p; > >> >> > + struct foo *q; > >> >> > + int r1, r2; > >> >> > + > >> >> > + p = rcu_dereference(gp2); > >> >> > + spin_lock(&p->lock); > >> >> > + r1 = p->b; /* Guaranteed to get 143. */ > >> >> > + q = rcu_dereference(gp1); > >> >> > + if (p == q) { > >> >> > + /* The compiler decides that q->c is same as p->c. */ > >> >> > + r2 = p->c; /* Could get 44 on weakly order system. */ > >> >> > + } > >> >> > + spin_unlock(&p->lock); > >> >> > + } > >> >> > + > >> >> > +As always, use the right tool for the job! 
> >> >> > + > >> >> > + > >> >> > +EXAMPLE WHERE THE COMPILER KNOWS TOO MUCH > >> >> > + > >> >> > +If a pointer obtained from rcu_dereference() compares not-equal to some > >> >> > +other pointer, the compiler normally has no clue what the value of the > >> >> > +first pointer might be. This lack of knowledge prevents the compiler > >> >> > +from carrying out optimizations that otherwise might destroy the ordering > >> >> > +guarantees that RCU depends on. And the volatile cast in rcu_dereference() > >> >> > +should prevent the compiler from guessing the value. > >> >> > + > >> >> > +But without rcu_dereference(), the compiler knows more than you might > >> >> > +expect. Consider the following code fragment: > >> >> > + > >> >> > + struct foo { > >> >> > + int a; > >> >> > + int b; > >> >> > + }; > >> >> > + static struct foo variable1; > >> >> > + static struct foo variable2; > >> >> > + static struct foo *gp = &variable1; > >> >> > + > >> >> > + void updater(void) > >> >> > + { > >> >> > + initialize_foo(&variable2); > >> >> > + rcu_assign_pointer(gp, &variable2); > >> >> > + /* > >> >> > + * The above is the only store to gp in this translation unit, > >> >> > + * and the address of gp is not exported in any way. > >> >> > + */ > >> >> > + } > >> >> > + > >> >> > + int reader(void) > >> >> > + { > >> >> > + struct foo *p; > >> >> > + > >> >> > + p = gp; > >> >> > + barrier(); > >> >> > + if (p == &variable1) > >> >> > + return p->a; /* Must be variable1.a. */ > >> >> > + else > >> >> > + return p->b; /* Must be variable2.b. */ > >> >> > + } > >> >> > + > >> >> > +Because the compiler can see all stores to "gp", it knows that the only > >> >> > +possible values of "gp" are "variable1" on the one hand and "variable2" > >> >> > +on the other. The comparison in reader() therefore tells the compiler > >> >> > +the exact value of "p" even in the not-equals case. 
This allows the > >> >> > +compiler to make the return values independent of the load from "gp", > >> >> > +in turn destroying the ordering between this load and the loads of the > >> >> > +return values. This can result in "p->b" returning pre-initialization > >> >> > +garbage values. > >> >> > + > >> >> > +In short, rcu_dereference() is -not- optional when you are going to > >> >> > +dereference the resulting pointer. > >> >> > > >> >> > -- > >> >> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > >> >> > the body of a message to majordomo@vger.kernel.org > >> >> > More majordomo info at http://vger.kernel.org/majordomo-info.html > >> >> > Please read the FAQ at http://www.tux.org/lkml/ > >> >> > >> > > >> > > > ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-02 10:05 ` Peter Sewell 2014-03-02 23:20 ` Paul E. McKenney @ 2014-03-03 20:44 ` Torvald Riegel 2014-03-04 22:11 ` Peter Sewell 1 sibling, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-03-03 20:44 UTC (permalink / raw) To: Peter.Sewell Cc: Paul McKenney, Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote: > On 1 March 2014 08:03, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: > >> Hi Paul, > >> > >> On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: > >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: > >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney > >> >> > <paulmck@linux.vnet.ibm.com> wrote: > >> >> > > > >> >> > > 3. The comparison was against another RCU-protected pointer, > >> >> > > where that other pointer was properly fetched using one > >> >> > > of the RCU primitives. Here it doesn't matter which pointer > >> >> > > you use. At least as long as the rcu_assign_pointer() for > >> >> > > that other pointer happened after the last update to the > >> >> > > pointed-to structure. > >> >> > > > >> >> > > I am a bit nervous about #3. Any thoughts on it? > >> >> > > >> >> > I think that it might be worth pointing out as an example, and saying > >> >> > that code like > >> >> > > >> >> > p = atomic_read(consume); > >> >> > X; > >> >> > q = atomic_read(consume); > >> >> > Y; > >> >> > if (p == q) > >> >> > data = p->val; > >> >> > > >> >> > then the access of "p->val" is constrained to be data-dependent on > >> >> > *either* p or q, but you can't really tell which, since the compiler > >> >> > can decide that the values are interchangeable. 
> >> >> >
> >> >> > I cannot for the life of me come up with a situation where this would
> >> >> > matter, though. If "X" contains a fence, then that fence will be a
> >> >> > stronger ordering than anything the consume through "p" would
> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the
> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the
> >> >> > ordering to the access through "p" is through p or q is kind of
> >> >> > irrelevant. No?
> >> >>
> >> >> I can make a contrived litmus test for it, but you are right, the only
> >> >> time you can see it happen is when X has no barriers, in which case
> >> >> you don't have any ordering anyway -- both the compiler and the CPU can
> >> >> reorder the loads into p and q, and the read from p->val can, as you say,
> >> >> come from either pointer.
> >> >>
> >> >> For whatever it is worth, here is the litmus test:
> >> >>
> >> >> T1: p = kmalloc(...);
> >> >>     if (p == NULL)
> >> >>         deal_with_it();
> >> >>     p->a = 42; /* Each field in its own cache line. */
> >> >>     p->b = 43;
> >> >>     p->c = 44;
> >> >>     atomic_store_explicit(&gp1, p, memory_order_release);
> >> >>     p->b = 143;
> >> >>     p->c = 144;
> >> >>     atomic_store_explicit(&gp2, p, memory_order_release);
> >> >>
> >> >> T2: p = atomic_load_explicit(&gp2, memory_order_consume);
> >> >>     r1 = p->b; /* Guaranteed to get 143. */
> >> >>     q = atomic_load_explicit(&gp1, memory_order_consume);
> >> >>     if (p == q) {
> >> >>         /* The compiler decides that q->c is same as p->c. */
> >> >>         r2 = p->c; /* Could get 44 on a weakly ordered system. */
> >> >>     }
> >> >>
> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what
> >> >> you get.
> >> >>
> >> >> And publishing a structure via one RCU-protected pointer, updating it,
> >> >> then publishing it via another pointer seems to me to be asking for
> >> >> trouble anyway. If you really want to do something like that and still
> >> >> see consistency across all the fields in the structure, please put a lock
> >> >> in the structure and use it to guard updates and accesses to those fields.
> >> >
> >> > And here is a patch documenting the restrictions for the current Linux
> >> > kernel. The rules change a bit due to rcu_dereference() acting a bit
> >> > differently than atomic_load_explicit(&p, memory_order_consume).
> >> >
> >> > Thoughts?
> >>
> >> That might serve as informal documentation for Linux kernel
> >> programmers about the bounds on the optimisations that you expect
> >> compilers to do for common-case RCU code - and I guess that's what you
> >> intend it to be for. But I don't see how one can make it precise
> >> enough to serve as a language definition, so that compiler people
> >> could confidently say "yes, we respect that", which I guess is what
> >> you really need. As a useful criterion, we should aim for something
> >> precise enough that in a verified-compiler context you can
> >> mathematically prove that the compiler will satisfy it (even though
> >> that won't happen anytime soon for GCC), and that analysis tool
> >> authors can actually know what they're working with. All this stuff
> >> about "you should avoid cancellation", and "avoid masking with just a
> >> small number of bits" is just too vague.
> >
> > Understood, and yes, this is intended to document current compiler
> > behavior for the Linux kernel community. It would not make sense to show
> > it to the C11 or C++11 communities, except perhaps as an informational
> > piece on current practice.
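Paul's litmus test above can be transcribed into standard C11 with <stdatomic.h>. This is a sketch only: run sequentially, as below, it cannot exhibit the weak-ordering outcome (r2 == 44); it merely gives the code a compilable shape. Names gp1/gp2 and the field values follow the message; malloc() stands in for kmalloc():

```c
#include <stdatomic.h>
#include <stdlib.h>

struct obj {
    int a, b, c;  /* in the kernel test, each field in its own cache line */
};

static _Atomic(struct obj *) gp1, gp2;

/* T1: publish the structure via gp1, update it, publish via gp2. */
static void t1(void)
{
    struct obj *p = malloc(sizeof *p);
    if (p == NULL)
        abort();
    p->a = 42;
    p->b = 43;
    p->c = 44;
    atomic_store_explicit(&gp1, p, memory_order_release);
    p->b = 143;
    p->c = 144;
    atomic_store_explicit(&gp2, p, memory_order_release);
}

/* T2: returns r1 * 1000 + r2, or -1 if the pointers differ.
 * On a weakly ordered machine, if the compiler rewrites p->c as q->c
 * inside the p == q branch, r2 could legitimately come back as 44. */
static int t2(void)
{
    struct obj *p = atomic_load_explicit(&gp2, memory_order_consume);
    int r1 = p->b;  /* guaranteed 143 once the gp2 store is visible */
    struct obj *q = atomic_load_explicit(&gp1, memory_order_consume);
    int r2 = (p == q) ? p->c : -1;
    return (r2 < 0) ? -1 : r1 * 1000 + r2;
}
```

In a sequential run t2() must observe the final values (143 and 144); demonstrating the anomaly would require two real threads on weakly ordered hardware.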
> > > >> The basic problem is that the compiler may be doing sophisticated > >> reasoning with a bunch of non-local knowledge that it's deduced from > >> the code, neither of which are well-understood, and here we have to > >> identify some envelope, expressive enough for RCU idioms, in which > >> that reasoning doesn't allow data/address dependencies to be removed > >> (and hence the hardware guarantee about them will be maintained at the > >> source level). > >> > >> The C11 syntactic notion of dependency, whatever its faults, was at > >> least precise, could be reasoned about locally (just looking at the > >> syntactic code in question), and did do that. The fact that current > >> compilers do optimisations that remove dependencies and will likely > >> have many bugs at present is besides the point - this was surely > >> intended as a *new* constraint on what they are allowed to do. The > >> interesting question is really whether the compiler writers think that > >> they *could* implement it in a reasonable way - I'd like to hear > >> Torvald and his colleagues' opinion on that. > >> > >> What you're doing above seems to be basically a very cut-down version > >> of that, but with a fuzzy boundary. If you want it to be precise, > >> maybe it needs to be much simpler (which might force you into ruling > >> out some current code idioms). > > > > I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806) > > can be developed to serve this purpose. > > (I missed that mail when it first came past, sorry) > > That's also going to be tricky, I'm afraid. The key condition there is: > > "* at the time of execution of E, L [PS: I assume that L is a > typo and should be E] No, L was intended. (But see below.) > can possibly have returned at > least two different values under the assumption that L itself > could have returned any value allowed by L's type." 
>
> First, the evaluation of E might be nondeterministic - e.g., for an
> artificial example, if it's just a nondeterministic value obtained
> from the result of a race on SC atomics. The above doesn't
> distinguish between that (which doesn't have a real dependency on L)
> and that XOR'd with L (which does). And it does so in the wrong
> direction: it'll say that the former has a dependency on L.

I'm not quite sure I understand the examples you want to point out
(could you add brief code snippets, perhaps?) -- but the informal
definition I proposed also says that E must have used L in some way to
make its computation. That's pretty vague, and the way I phrased the
requirement is probably not optimal.

So let me expand a bit on the background first. What I tried to capture
with the rules is that an evaluation (ie, E) really uses the value
returned by L, and not just, for example, a constant value that can be
inferred from whatever was executed between L and E. This (is intended
to) prevent value prediction and such by programmers. The compiler in
turn knows what a real program is allowed to do that still wants to rely
on the ordering by the consume. Basing all of this on the values seemed
to be helpful because basing it on *purely* syntax didn't seem
implementable; for the latter, we'd need to disallow cases such as x-x
or require compilers to include artificial dependencies if a program
indeed does value speculation. IOW, it didn't seem possible to draw a
clear distinction between a data dependency defined purely based on
syntax and control dependencies.

Another way to perhaps phrase the requirement might be to construct it
inductively, so that "E uses the value returned by L" becomes clearer.
However, I currently don't really know how to do that without any holes.
For example, let's say that an atomic mo_consume load has a value dependency to itself; the dependency has an associated set of values, which is equal to all values allowed by the type (but see also below). Then, an evaluation A that uses B as operand has a value dependency on B if the associated set of values can still have more than one element. Whether that's the case depends on the operation, and the other operands, including their values at the time of execution. However, it's not just results of evaluations that effectively constrain the set of values. Conditional execution does as well, because the condition being true might establish a constraint on L, which might remove any dependency when L (or a result of a computation involving L) is used. So I'm not sure the inductive approach would work. You said you thought I might have wanted to say that: "E really-depends on L if there exist two different values that (just according to typing) might be read for L that give rise to two different values for E". Which should follow from the above, or at least should not conflict with it. My "definition" tried to capture that the program must not establish so many constraints between E and L that the value of L is clear in the sense of having exactly one value. If we dereference such an E whose execution in fact constrains L to one value, there is no real dependency anymore. However, by itself, this doesn't cover that E must use L in it's computation -- which your formulation does, I think. Another way might be to try to define which constraints established by an execution (based on the executed code's semantics) actually remove value dependencies on L. Do you have any suggestions for how to define this (ie, true value dependencies) in a better way? > Second, it involves reasoning about counterfactual executions. That > doesn't necessarily make it wrong, per se, but probably makes it hard > to work with. 
> For example, suppose that in all the actual
> whole-program executions, a runtime occurrence of L only ever returns
> one particular value (perhaps because of some simple #define'd
> configuration)

Right, or just because it's a deterministic program.

This is a problem for the definition of value dependencies, AFAICT,
because where do you put the line for what the compiler is allowed to
speculate about? When does the program indeed "reveal" the value of a
load, and when does it not? Is a compiler to do out-of-band tracking
for certain memory locations, and thus predict the value?

Adding the assumption that L, at least conceptually, might return any
value seemed to be a simple way to remove that uncertainty regarding
what the compiler might know about.

> , and that the code used in the evaluation of E depends
> on some invariant which is related to that configuration.

A related issue that I've been thinking about is whether things like
divide-by-zero (ie, operations that give rise to undefined behavior)
should be considered as constraints on the values or not. I guess they
should, but then this seems closer to what you mention, invariants
established by the whole program.

> The
> hypothetical execution used above in which a different value is used
> is one in which the code is being run in a situation with broken invariants.
> Then there will be technical difficulties in using the definition:
> I don't see how one would persuade oneself that a compiler always
> satisfies it, because these hypothetical executions are far removed
> from what it's actually working on.

I think the above is easy to implement for a compiler. At the
mo_consume load, you simply forget any information you have about this
value / memory location; for operations on value_dep_preserving types,
you ignore any out-of-band information that might originate from code
run before the load (IOW, you take a fresh start on each load in terms
of analysis info).
Having the requirement that the program indeed needs to have a value dependency should allow the compiler to run normal optimizations on the code. I'd really appreciate any feedback or alternative suggestions for how to formulate the requirements more precisely. I think many of us kind of agree about the general approach we'd like to try, but specifying this precisely is another step. ^ permalink raw reply [flat|nested] 285+ messages in thread
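Torvald's point that a purely syntactic notion of dependency would have to disallow cases such as x-x can be made concrete. In this sketch (names are illustrative), both functions index an array through the value returned by a consume load, but only one of them carries a value dependency in the sense of the proposed rules:

```c
#include <stdatomic.h>

static _Atomic int L;
static int arr[2] = { 10, 20 };

/* Syntactic dependency only: x - x is 0 for every value L could have
 * returned, so a compiler may fold the index to a constant and hoist
 * the array access above the consume load -- the "dependency" is dead
 * and provides no hardware ordering. */
int dead_dependency(void)
{
    int x = atomic_load_explicit(&L, memory_order_consume);
    return arr[x - x];
}

/* Value dependency in the proposed sense: at least two values the
 * load could return (even and odd) give two different results, so the
 * evaluation genuinely uses what the load returned. */
int live_dependency(void)
{
    int x = atomic_load_explicit(&L, memory_order_consume);
    return arr[x & 1];
}
```

Note that the second case still masks down to a single bit, which is exactly the sort of thing the kernel documentation patch warns may let a compiler enumerate the possible values; the proposed value-dependency rule instead asks only whether more than one outcome remains possible.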
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-03 20:44 ` Torvald Riegel @ 2014-03-04 22:11 ` Peter Sewell 2014-03-05 17:15 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Peter Sewell @ 2014-03-04 22:11 UTC (permalink / raw) To: Torvald Riegel Cc: Paul McKenney, Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On 3 March 2014 20:44, Torvald Riegel <triegel@redhat.com> wrote: > On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote: >> On 1 March 2014 08:03, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: >> >> Hi Paul, >> >> >> >> On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney >> >> >> > <paulmck@linux.vnet.ibm.com> wrote: >> >> >> > > >> >> >> > > 3. The comparison was against another RCU-protected pointer, >> >> >> > > where that other pointer was properly fetched using one >> >> >> > > of the RCU primitives. Here it doesn't matter which pointer >> >> >> > > you use. At least as long as the rcu_assign_pointer() for >> >> >> > > that other pointer happened after the last update to the >> >> >> > > pointed-to structure. >> >> >> > > >> >> >> > > I am a bit nervous about #3. Any thoughts on it? 
>> >> >> > >> >> >> > I think that it might be worth pointing out as an example, and saying >> >> >> > that code like >> >> >> > >> >> >> > p = atomic_read(consume); >> >> >> > X; >> >> >> > q = atomic_read(consume); >> >> >> > Y; >> >> >> > if (p == q) >> >> >> > data = p->val; >> >> >> > >> >> >> > then the access of "p->val" is constrained to be data-dependent on >> >> >> > *either* p or q, but you can't really tell which, since the compiler >> >> >> > can decide that the values are interchangeable. >> >> >> > >> >> >> > I cannot for the life of me come up with a situation where this would >> >> >> > matter, though. If "X" contains a fence, then that fence will be a >> >> >> > stronger ordering than anything the consume through "p" would >> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the >> >> >> > ordering to the access through "p" is through p or q is kind of >> >> >> > irrelevant. No? >> >> >> >> >> >> I can make a contrived litmus test for it, but you are right, the only >> >> >> time you can see it happen is when X has no barriers, in which case >> >> >> you don't have any ordering anyway -- both the compiler and the CPU can >> >> >> reorder the loads into p and q, and the read from p->val can, as you say, >> >> >> come from either pointer. >> >> >> >> >> >> For whatever it is worth, hear is the litmus test: >> >> >> >> >> >> T1: p = kmalloc(...); >> >> >> if (p == NULL) >> >> >> deal_with_it(); >> >> >> p->a = 42; /* Each field in its own cache line. */ >> >> >> p->b = 43; >> >> >> p->c = 44; >> >> >> atomic_store_explicit(&gp1, p, memory_order_release); >> >> >> p->b = 143; >> >> >> p->c = 144; >> >> >> atomic_store_explicit(&gp2, p, memory_order_release); >> >> >> >> >> >> T2: p = atomic_load_explicit(&gp2, memory_order_consume); >> >> >> r1 = p->b; /* Guaranteed to get 143. 
*/ >> >> >> q = atomic_load_explicit(&gp1, memory_order_consume); >> >> >> if (p == q) { >> >> >> /* The compiler decides that q->c is same as p->c. */ >> >> >> r2 = p->c; /* Could get 44 on weakly order system. */ >> >> >> } >> >> >> >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what >> >> >> you get. >> >> >> >> >> >> And publishing a structure via one RCU-protected pointer, updating it, >> >> >> then publishing it via another pointer seems to me to be asking for >> >> >> trouble anyway. If you really want to do something like that and still >> >> >> see consistency across all the fields in the structure, please put a lock >> >> >> in the structure and use it to guard updates and accesses to those fields. >> >> > >> >> > And here is a patch documenting the restrictions for the current Linux >> >> > kernel. The rules change a bit due to rcu_dereference() acting a bit >> >> > differently than atomic_load_explicit(&p, memory_order_consume). >> >> > >> >> > Thoughts? >> >> >> >> That might serve as informal documentation for linux kernel >> >> programmers about the bounds on the optimisations that you expect >> >> compilers to do for common-case RCU code - and I guess that's what you >> >> intend it to be for. But I don't see how one can make it precise >> >> enough to serve as a language definition, so that compiler people >> >> could confidently say "yes, we respect that", which I guess is what >> >> you really need. As a useful criterion, we should aim for something >> >> precise enough that in a verified-compiler context you can >> >> mathematically prove that the compiler will satisfy it (even though >> >> that won't happen anytime soon for GCC), and that analysis tool >> >> authors can actually know what they're working with. All this stuff >> >> about "you should avoid cancellation", and "avoid masking with just a >> >> small number of bits" is just too vague. 
>> > >> > Understood, and yes, this is intended to document current compiler >> > behavior for the Linux kernel community. It would not make sense to show >> > it to the C11 or C++11 communities, except perhaps as an informational >> > piece on current practice. >> > >> >> The basic problem is that the compiler may be doing sophisticated >> >> reasoning with a bunch of non-local knowledge that it's deduced from >> >> the code, neither of which are well-understood, and here we have to >> >> identify some envelope, expressive enough for RCU idioms, in which >> >> that reasoning doesn't allow data/address dependencies to be removed >> >> (and hence the hardware guarantee about them will be maintained at the >> >> source level). >> >> >> >> The C11 syntactic notion of dependency, whatever its faults, was at >> >> least precise, could be reasoned about locally (just looking at the >> >> syntactic code in question), and did do that. The fact that current >> >> compilers do optimisations that remove dependencies and will likely >> >> have many bugs at present is besides the point - this was surely >> >> intended as a *new* constraint on what they are allowed to do. The >> >> interesting question is really whether the compiler writers think that >> >> they *could* implement it in a reasonable way - I'd like to hear >> >> Torvald and his colleagues' opinion on that. >> >> >> >> What you're doing above seems to be basically a very cut-down version >> >> of that, but with a fuzzy boundary. If you want it to be precise, >> >> maybe it needs to be much simpler (which might force you into ruling >> >> out some current code idioms). >> > >> > I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806) >> > can be developed to serve this purpose. >> >> (I missed that mail when it first came past, sorry) >> >> That's also going to be tricky, I'm afraid. 
>> The key condition there is:
>>
>>   "* at the time of execution of E, L [PS: I assume that L is a
>>      typo and should be E]
>
> No, L was intended. (But see below.)

then I misunderstood...

>>      can possibly have returned at
>>      least two different values under the assumption that L itself
>>      could have returned any value allowed by L's type."
>>
>> First, the evaluation of E might be nondeterministic - e.g., for an
>> artificial example, if it's just a nondeterministic value obtained
>> from the result of a race on SC atomics. The above doesn't
>> distinguish between that (which doesn't have a real dependency on L)
>> and that XOR'd with L (which does). And it does so in the wrong
>> direction: it'll say that the former has a dependency on L.
>
> I'm not quite sure I understand the examples you want to point out
> (could you add brief code snippets, perhaps?) -- but the informal
> definition I proposed also says that E must have used L in some way to
> make its computation. That's pretty vague, and the way I phrased the
> requirement is probably not optimal.

...so this example isn't so relevant, but it's maybe interesting
anyway. In free-wheeling pseudocode:

  x = read_consume(L)  // a memory_order_consume read of a 1-bit value from L
  y = nondet()         // some code that nondeterministically produces another
                       // 1-bit value, eg by spawning two threads that each do
                       // an SC-atomic write to some other location, returning
                       // 0 or 1 depending on which wins

then evaluate either just

  z = y

or

  z = x XOR y

In my misinterpretation of what you wrote, your definition would say
there's a dependency from the load of L to the evaluation of y, even
though there isn't.

> So let me expand a bit on the background first. What I tried to capture
> with the rules is that an evaluation (ie, E) really uses the value
> returned by L, and not just, for example, a constant value that can be
> inferred from whatever was executed between L and E.
This (is intended > to) prevent value prediction and such by programmers. The compiler in > turn knows what a real program is allowed to do that still wants to rely > on the ordering by the consume. Basing all of this on the values seemed > to be helpful because basing it on *purely* syntax didn't seem > implementable; for the latter, we'd need to disallow cases such as x-x > or require compilers to include artificial dependencies if a program > indeed does value speculation. IOW, it didn't seem possible to draw a > clear distinction between a data dependency defined purely based on > syntax and control dependencies. > > Another way to perhaps phrase the requirement might be to construct it > inductively, so that "E uses the value returned by L" becomes clearer. > However, I currently don't really know how to do that without any holes. > > For example, let's say that an atomic mo_consume load has a value > dependency to itself; the dependency has an associated set of values, > which is equal to all values allowed by the type (but see also below). > Then, an evaluation A that uses B as operand has a value dependency on B > if the associated set of values can still have more than one element. > Whether that's the case depends on the operation, and the other > operands, including their values at the time of execution. > However, it's not just results of evaluations that effectively constrain > the set of values. Conditional execution does as well, because the > condition being true might establish a constraint on L, which might > remove any dependency when L (or a result of a computation involving L) > is used. So I'm not sure the inductive approach would work. > > You said you thought I might have wanted to say that: > "E really-depends on L if there exist two different values that (just > according to typing) might be read for L that give rise to two > different values for E". > > Which should follow from the above, or at least should not conflict with > it. 
> My "definition" tried to capture that the program must not
> establish so many constraints between E and L that the value of L is
> clear in the sense of having exactly one value. If we dereference such
> an E whose execution in fact constrains L to one value, there is no real
> dependency anymore.
> However, by itself, this doesn't cover that E must use L in its
> computation -- which your formulation does, I think.

unfortunately not - that's what the example above shows.

> Another way might be to try to define which constraints established by
> an execution (based on the executed code's semantics) actually remove
> value dependencies on L.
>
> Do you have any suggestions for how to define this (ie, true value
> dependencies) in a better way?

not at the moment. I need to think some more about your vdps, though.
As I understand it, you're agreeing with the general intent of the
original C11 design that some new mechanism to tell the compiler not to
optimise in certain places is required, but differing in the way you
identify those?

>> Second, it involves reasoning about counterfactual executions. That
>> doesn't necessarily make it wrong, per se, but probably makes it hard
>> to work with. For example, suppose that in all the actual
>> whole-program executions, a runtime occurrence of L only ever returns
>> one particular value (perhaps because of some simple #define'd
>> configuration)
>
> Right, or just because it's a deterministic program.
>
> This is a problem for the definition of value dependencies, AFAICT,
> because where do you put the line for what the compiler is allowed to
> speculate about? When does the program indeed "reveal" the value of a
> load, and when does it not? Is a compiler to do out-of-band tracking
> for certain memory locations, and thus predict the value?
>
> Adding the assumption that L, at least conceptually, might return any
> value seemed to be a simple way to remove that uncertainty regarding
> what the compiler might know about.
...but it brings in a lot of baggage. >> , and that the code used in the evaluation of E depends >> on some invariant which is related to that configuration. > > A related issue that I've been thinking about is whether things like > divide-by-zero (ie, operations that give rise to undefined behavior) > should be considered as constraints on the values or not. I guess they > should, but then this seems closer what you mention, invariants > established by the whole program. > >> The >> hypothetical execution used above in which a different value is used >> is one in the code is being run in a situation with broken invariants. >> Then there will be technical difficulties in using the definition: >> I don't see how one would persuade oneself that a compiler always >> satisfies it, because these hypothetical executions are far removed >> from what it's actually working on. > > I think the above is easy to implement for a compiler. At the > mo_consume load, you simply forget any information you have about this > value / memory location; for operations on value_dep_preserving types, are all the subexpressions likewise forced to be of vdp types? > you ignore any out-of-band information that might originate from code > run before the load (IOW, you take a fresh start on each load in terms > of analysis info). Having the requirement that the program indeed needs > to have a value dependency should allow the compiler to run normal > optimizations on the code. > > I'd really appreciate any feedback or alternative suggestions for how to > formulate the requirements more precisely. I think many of us kind of > agree about the general approach we'd like to try, but specifying this > precisely is another step. > ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-04 22:11 ` Peter Sewell @ 2014-03-05 17:15 ` Torvald Riegel 2014-03-05 18:37 ` Peter Sewell 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-03-05 17:15 UTC (permalink / raw) To: Peter.Sewell Cc: Paul McKenney, Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, 2014-03-04 at 22:11 +0000, Peter Sewell wrote: > On 3 March 2014 20:44, Torvald Riegel <triegel@redhat.com> wrote: > > On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote: > >> On 1 March 2014 08:03, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: > >> >> Hi Paul, > >> >> > >> >> On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: > >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: > >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney > >> >> >> > <paulmck@linux.vnet.ibm.com> wrote: > >> >> >> > > > >> >> >> > > 3. The comparison was against another RCU-protected pointer, > >> >> >> > > where that other pointer was properly fetched using one > >> >> >> > > of the RCU primitives. Here it doesn't matter which pointer > >> >> >> > > you use. At least as long as the rcu_assign_pointer() for > >> >> >> > > that other pointer happened after the last update to the > >> >> >> > > pointed-to structure. > >> >> >> > > > >> >> >> > > I am a bit nervous about #3. Any thoughts on it? 
> >> >> >> > > >> >> >> > I think that it might be worth pointing out as an example, and saying > >> >> >> > that code like > >> >> >> > > >> >> >> > p = atomic_read(consume); > >> >> >> > X; > >> >> >> > q = atomic_read(consume); > >> >> >> > Y; > >> >> >> > if (p == q) > >> >> >> > data = p->val; > >> >> >> > > >> >> >> > then the access of "p->val" is constrained to be data-dependent on > >> >> >> > *either* p or q, but you can't really tell which, since the compiler > >> >> >> > can decide that the values are interchangeable. > >> >> >> > > >> >> >> > I cannot for the life of me come up with a situation where this would > >> >> >> > matter, though. If "X" contains a fence, then that fence will be a > >> >> >> > stronger ordering than anything the consume through "p" would > >> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the > >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the > >> >> >> > ordering to the access through "p" is through p or q is kind of > >> >> >> > irrelevant. No? > >> >> >> > >> >> >> I can make a contrived litmus test for it, but you are right, the only > >> >> >> time you can see it happen is when X has no barriers, in which case > >> >> >> you don't have any ordering anyway -- both the compiler and the CPU can > >> >> >> reorder the loads into p and q, and the read from p->val can, as you say, > >> >> >> come from either pointer. > >> >> >> > >> >> >> For whatever it is worth, hear is the litmus test: > >> >> >> > >> >> >> T1: p = kmalloc(...); > >> >> >> if (p == NULL) > >> >> >> deal_with_it(); > >> >> >> p->a = 42; /* Each field in its own cache line. 
*/ > >> >> >> p->b = 43; > >> >> >> p->c = 44; > >> >> >> atomic_store_explicit(&gp1, p, memory_order_release); > >> >> >> p->b = 143; > >> >> >> p->c = 144; > >> >> >> atomic_store_explicit(&gp2, p, memory_order_release); > >> >> >> > >> >> >> T2: p = atomic_load_explicit(&gp2, memory_order_consume); > >> >> >> r1 = p->b; /* Guaranteed to get 143. */ > >> >> >> q = atomic_load_explicit(&gp1, memory_order_consume); > >> >> >> if (p == q) { > >> >> >> /* The compiler decides that q->c is same as p->c. */ > >> >> >> r2 = p->c; /* Could get 44 on weakly order system. */ > >> >> >> } > >> >> >> > >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what > >> >> >> you get. > >> >> >> > >> >> >> And publishing a structure via one RCU-protected pointer, updating it, > >> >> >> then publishing it via another pointer seems to me to be asking for > >> >> >> trouble anyway. If you really want to do something like that and still > >> >> >> see consistency across all the fields in the structure, please put a lock > >> >> >> in the structure and use it to guard updates and accesses to those fields. > >> >> > > >> >> > And here is a patch documenting the restrictions for the current Linux > >> >> > kernel. The rules change a bit due to rcu_dereference() acting a bit > >> >> > differently than atomic_load_explicit(&p, memory_order_consume). > >> >> > > >> >> > Thoughts? > >> >> > >> >> That might serve as informal documentation for linux kernel > >> >> programmers about the bounds on the optimisations that you expect > >> >> compilers to do for common-case RCU code - and I guess that's what you > >> >> intend it to be for. But I don't see how one can make it precise > >> >> enough to serve as a language definition, so that compiler people > >> >> could confidently say "yes, we respect that", which I guess is what > >> >> you really need. 
As a useful criterion, we should aim for something > >> >> precise enough that in a verified-compiler context you can > >> >> mathematically prove that the compiler will satisfy it (even though > >> >> that won't happen anytime soon for GCC), and that analysis tool > >> >> authors can actually know what they're working with. All this stuff > >> >> about "you should avoid cancellation", and "avoid masking with just a > >> >> small number of bits" is just too vague. > >> > > >> > Understood, and yes, this is intended to document current compiler > >> > behavior for the Linux kernel community. It would not make sense to show > >> > it to the C11 or C++11 communities, except perhaps as an informational > >> > piece on current practice. > >> > > >> >> The basic problem is that the compiler may be doing sophisticated > >> >> reasoning with a bunch of non-local knowledge that it's deduced from > >> >> the code, neither of which are well-understood, and here we have to > >> >> identify some envelope, expressive enough for RCU idioms, in which > >> >> that reasoning doesn't allow data/address dependencies to be removed > >> >> (and hence the hardware guarantee about them will be maintained at the > >> >> source level). > >> >> > >> >> The C11 syntactic notion of dependency, whatever its faults, was at > >> >> least precise, could be reasoned about locally (just looking at the > >> >> syntactic code in question), and did do that. The fact that current > >> >> compilers do optimisations that remove dependencies and will likely > >> >> have many bugs at present is besides the point - this was surely > >> >> intended as a *new* constraint on what they are allowed to do. The > >> >> interesting question is really whether the compiler writers think that > >> >> they *could* implement it in a reasonable way - I'd like to hear > >> >> Torvald and his colleagues' opinion on that. 
> >> >> > >> >> What you're doing above seems to be basically a very cut-down version > >> >> of that, but with a fuzzy boundary. If you want it to be precise, > >> >> maybe it needs to be much simpler (which might force you into ruling > >> >> out some current code idioms). > >> > > >> > I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806) > >> > can be developed to serve this purpose. > >> > >> (I missed that mail when it first came past, sorry) > >> > >> That's also going to be tricky, I'm afraid. The key condition there is: > >> > >> "* at the time of execution of E, L [PS: I assume that L is a > >> typo and should be E] > > > > No, L was intended. (But see below.) > > then I misunderstood... > > >> can possibly have returned at > >> least two different values under the assumption that L itself > >> could have returned any value allowed by L's type." > >> > >> First, the evaluation of E might be nondeterministic - e.g., for an > >> artificial example, if it's just a nondeterministic value obtained > >> from the result of a race on SC atomics. The above doesn't > >> distinguish between that (which doesn't have a real dependency on L) > >> and that XOR'd with L (which does). And it does so in the wrong > >> direction: it'll say there the former has a dependency on L. > > > > I'm not quite sure I understand the examples you want to point out > > (could you add brief code snippets, perhaps?) -- but the informal > > definition I proposed also says that E must have used L in some way to > > make it's computation. That's pretty vague, and the way I phrased the > > requirement is probably not optimal. > > ...so this example isn't so relevant, but it's maybe interesting > anyway. 
In free-wheeling pseudocode: > > x = read_consume(L) // a memory_order_consume read of a 1-bit value from L > y = nondet() // some code that > nondeterministically produces another 1-bit value, eg by spawning two > threads that each do an SC-atomic write to some other location, > returning 0 or 1 depending on which wins > > then evaluate either just > > z=y > > or > > z = x XOR y > > In my misinterpretation of what you wrote, your definition would say > there's a dependency from the load of L to the evaluation of y, even > though there isn't. The evaluations "z==y" or "z=y" wouldn't depend on L, assuming nondet() doesn't depend on L (e.g., by using x). They don't use x's value. (And I don't have a precise definition for this, unfortunately, but I hope the intent is clear.) "x XOR y" would depend on L, because this does take x into account, and x's value remains relevant irrespective of which value y has. y would be non-vdp, and the compiler could have specialized the code into two branches for both 1 and 0 values of y; but it would still need x to compute z in the last evaluation. > > > So let me expand a bit on the background first. What I tried to capture > > with the rules is that an evaluation (ie, E), really uses the value > > returned by L, and not just a, for example, constant value that can be > > inferred from whatever was executed between L and E. This (is intended > > to) prevent value prediction and such by programmers. The compiler in > > turn knows what a real program is allowed to do that still wants to rely > > on the ordering by the consume. Basing all of this on the values seemed > > to be helpful because basing it on *purely* syntax didn't seem > > implementable; for the latter, we'd need to disallow cases such as x-x > > or require compilers to include artificial dependencies if a program > > indeed does value speculation.
IOW, it didn't seem possible to draw a > > clear distinction between a data dependency defined purely based on > > syntax and control dependencies. > > > > Another way to perhaps phrase the requirement might be to construct it > > inductively, so that "E uses the value returned by L" becomes clearer. > > However, I currently don't really know how to do that without any holes. > > > > For example, let's say that an atomic mo_consume load has a value > > dependency on itself; the dependency has an associated set of values, > > which is equal to all values allowed by the type (but see also below). > > Then, an evaluation A that uses B as operand has a value dependency on B > > if the associated set of values can still have more than one element. > > Whether that's the case depends on the operation, and the other > > operands, including their values at the time of execution. > > However, it's not just results of evaluations that effectively constrain > > the set of values. Conditional execution does as well, because the > > condition being true might establish a constraint on L, which might > > remove any dependency when L (or a result of a computation involving L) > > is used. So I'm not sure the inductive approach would work. > > > > You said you thought I might have wanted to say that: > > "E really-depends on L if there exist two different values that (just > > according to typing) might be read for L that give rise to two > > different values for E". > > > > Which should follow from the above, or at least should not conflict with > > it. My "definition" tried to capture that the program must not > > establish so many constraints between E and L that the value of L is > > clear in the sense of having exactly one value. If we dereference such > > an E whose execution in fact constrains L to one value, there is no real > > dependency anymore. > > However, by itself, this doesn't cover that E must use L in its > > computation -- which your formulation does, I think.
> > unfortunately not - that's what the example above shows. Now I think I understand. > > Another way might be to try to define which constraints established by > > an execution (based on the executed code's semantics) actually remove > > value dependencies on L. > > > > Do you have any suggestions for how to define this (ie, true value > > dependencies) in a better way? > > not at the moment. I need to think some more about your vdps, though. > As I understand it, you're agreeing with the general intent of the > original C11 design that some new mechanism to tell the compiler not > to optimise in certain places is required, but differing in the way > you identify those? Yes, we still need such a mechanism because we need to prevent the compiler from doing value prediction or something similar that uses control dependencies to avoid data dependencies. For example, it could use a large switch statement to add code specialized for each possible value returned from an mo_consume load. Thus, at least for the standard, we need some mechanism that prevents certain compiler optimizations. Where it differs from C11 is that we're not trying to use just such a mechanism built on syntax to try to define when we actually have a real value dependency. > > >> Second, it involves reasoning about counterfactual executions. That > >> doesn't necessarily make it wrong, per se, but probably makes it hard > >> to work with. For example, suppose that in all the actual > >> whole-program executions, a runtime occurrence of L only ever returns > >> one particular value (perhaps because of some simple #define'd > >> configuration) > > > > Right, or just because it's a deterministic program. > > > > This is a problem for the definition of value dependencies, AFAICT, > > because where do you put the line for what the compiler is allowed to > > speculate about? When does the program indeed "reveal" the value of a > > load, and when does it not?
Is a compiler to do out-of-band tracking > > for certain memory locations, and thus predict the value? > > > > Adding the assumption that L, at least conceptually, might return any > > value seemed to be a simple way to remove that uncertainty regarding > > what the compiler might know about. > > ...but it brings in a lot of baggage. Which baggage do you mean? In terms of verification, or definitions in the standard, or effects on compilers? > >> , and that the code used in the evaluation of E depends > >> on some invariant which is related to that configuration. > > > > A related issue that I've been thinking about is whether things like > > divide-by-zero (ie, operations that give rise to undefined behavior) > > should be considered as constraints on the values or not. I guess they > > should, but then this seems closer to what you mention, invariants > > established by the whole program. > > > >> The > >> hypothetical execution used above in which a different value is used > >> is one in which the code is being run in a situation with broken invariants. > >> Then there will be technical difficulties in using the definition: > >> I don't see how one would persuade oneself that a compiler always > >> satisfies it, because these hypothetical executions are far removed > >> from what it's actually working on. > > > > I think the above is easy to implement for a compiler. At the > > mo_consume load, you simply forget any information you have about this > > value / memory location; for operations on value_dep_preserving types, > > are all the subexpressions likewise forced to be of vdp types? I'd like to avoid that if possible. Things like "vdp-pointer + int" should Just Work in the sense of being still vdp-typed without requiring an explicit cast for the int operand. Operators ||, &&, ?: might justify more fine-grained rules, but I guess it's better to just make more stuff vdp if that's easier.
vdp is not a sufficient condition for there being a value-dependency anyway, and the few optimizations it prevents on expressions containing some vdp-typed operands probably don't matter much. Also, as I mentioned in the reply to Paul, I believe that if the compiler can prove that there's no value-dependency, it can ignore the vdp-annotation (IOW, as-if is still possible). ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-05 17:15 ` Torvald Riegel @ 2014-03-05 18:37 ` Peter Sewell 0 siblings, 0 replies; 285+ messages in thread From: Peter Sewell @ 2014-03-05 18:37 UTC (permalink / raw) To: Torvald Riegel Cc: Paul McKenney, Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On 5 March 2014 17:15, Torvald Riegel <triegel@redhat.com> wrote: > On Tue, 2014-03-04 at 22:11 +0000, Peter Sewell wrote: >> On 3 March 2014 20:44, Torvald Riegel <triegel@redhat.com> wrote: >> > On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote: >> >> On 1 March 2014 08:03, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: >> >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: >> >> >> Hi Paul, >> >> >> >> >> >> On 28 February 2014 18:50, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: >> >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: >> >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: >> >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney >> >> >> >> > <paulmck@linux.vnet.ibm.com> wrote: >> >> >> >> > > >> >> >> >> > > 3. The comparison was against another RCU-protected pointer, >> >> >> >> > > where that other pointer was properly fetched using one >> >> >> >> > > of the RCU primitives. Here it doesn't matter which pointer >> >> >> >> > > you use. At least as long as the rcu_assign_pointer() for >> >> >> >> > > that other pointer happened after the last update to the >> >> >> >> > > pointed-to structure. >> >> >> >> > > >> >> >> >> > > I am a bit nervous about #3. Any thoughts on it? 
>> >> >> >> > >> >> >> >> > I think that it might be worth pointing out as an example, and saying >> >> >> >> > that code like >> >> >> >> > >> >> >> >> > p = atomic_read(consume); >> >> >> >> > X; >> >> >> >> > q = atomic_read(consume); >> >> >> >> > Y; >> >> >> >> > if (p == q) >> >> >> >> > data = p->val; >> >> >> >> > >> >> >> >> > then the access of "p->val" is constrained to be data-dependent on >> >> >> >> > *either* p or q, but you can't really tell which, since the compiler >> >> >> >> > can decide that the values are interchangeable. >> >> >> >> > >> >> >> >> > I cannot for the life of me come up with a situation where this would >> >> >> >> > matter, though. If "X" contains a fence, then that fence will be a >> >> >> >> > stronger ordering than anything the consume through "p" would >> >> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the >> >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the >> >> >> >> > ordering to the access through "p" is through p or q is kind of >> >> >> >> > irrelevant. No? >> >> >> >> >> >> >> >> I can make a contrived litmus test for it, but you are right, the only >> >> >> >> time you can see it happen is when X has no barriers, in which case >> >> >> >> you don't have any ordering anyway -- both the compiler and the CPU can >> >> >> >> reorder the loads into p and q, and the read from p->val can, as you say, >> >> >> >> come from either pointer. >> >> >> >> >> >> >> >> For whatever it is worth, here is the litmus test: >> >> >> >> >> >> >> >> T1: p = kmalloc(...); >> >> >> >> if (p == NULL) >> >> >> >> deal_with_it(); >> >> >> >> p->a = 42; /* Each field in its own cache line.
*/ >> >> >> >> p->b = 43; >> >> >> >> p->c = 44; >> >> >> >> atomic_store_explicit(&gp1, p, memory_order_release); >> >> >> >> p->b = 143; >> >> >> >> p->c = 144; >> >> >> >> atomic_store_explicit(&gp2, p, memory_order_release); >> >> >> >> >> >> >> >> T2: p = atomic_load_explicit(&gp2, memory_order_consume); >> >> >> >> r1 = p->b; /* Guaranteed to get 143. */ >> >> >> >> q = atomic_load_explicit(&gp1, memory_order_consume); >> >> >> >> if (p == q) { >> >> >> >> /* The compiler decides that q->c is the same as p->c. */ >> >> >> >> r2 = p->c; /* Could get 44 on a weakly ordered system. */ >> >> >> >> } >> >> >> >> >> >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what >> >> >> >> you get. >> >> >> >> >> >> >> >> And publishing a structure via one RCU-protected pointer, updating it, >> >> >> >> then publishing it via another pointer seems to me to be asking for >> >> >> >> trouble anyway. If you really want to do something like that and still >> >> >> >> see consistency across all the fields in the structure, please put a lock >> >> >> >> in the structure and use it to guard updates and accesses to those fields. >> >> >> > >> >> >> > And here is a patch documenting the restrictions for the current Linux >> >> >> > kernel. The rules change a bit due to rcu_dereference() acting a bit >> >> >> > differently than atomic_load_explicit(&p, memory_order_consume). >> >> >> > >> >> >> > Thoughts? >> >> >> >> >> >> That might serve as informal documentation for linux kernel >> >> >> programmers about the bounds on the optimisations that you expect >> >> >> compilers to do for common-case RCU code - and I guess that's what you >> >> >> intend it to be for. But I don't see how one can make it precise >> >> >> enough to serve as a language definition, so that compiler people >> >> >> could confidently say "yes, we respect that", which I guess is what >> >> >> you really need.
As a useful criterion, we should aim for something >> >> >> precise enough that in a verified-compiler context you can >> >> >> mathematically prove that the compiler will satisfy it (even though >> >> >> that won't happen anytime soon for GCC), and that analysis tool >> >> >> authors can actually know what they're working with. All this stuff >> >> >> about "you should avoid cancellation", and "avoid masking with just a >> >> >> small number of bits" is just too vague. >> >> > >> >> > Understood, and yes, this is intended to document current compiler >> >> > behavior for the Linux kernel community. It would not make sense to show >> >> > it to the C11 or C++11 communities, except perhaps as an informational >> >> > piece on current practice. >> >> > >> >> >> The basic problem is that the compiler may be doing sophisticated >> >> >> reasoning with a bunch of non-local knowledge that it's deduced from >> >> >> the code, neither of which are well-understood, and here we have to >> >> >> identify some envelope, expressive enough for RCU idioms, in which >> >> >> that reasoning doesn't allow data/address dependencies to be removed >> >> >> (and hence the hardware guarantee about them will be maintained at the >> >> >> source level). >> >> >> >> >> >> The C11 syntactic notion of dependency, whatever its faults, was at >> >> >> least precise, could be reasoned about locally (just looking at the >> >> >> syntactic code in question), and did do that. The fact that current >> >> >> compilers do optimisations that remove dependencies and will likely >> >> >> have many bugs at present is beside the point - this was surely >> >> >> intended as a *new* constraint on what they are allowed to do. The >> >> >> interesting question is really whether the compiler writers think that >> >> >> they *could* implement it in a reasonable way - I'd like to hear >> >> >> Torvald and his colleagues' opinion on that.
>> >> >> >> >> >> What you're doing above seems to be basically a very cut-down version >> >> >> of that, but with a fuzzy boundary. If you want it to be precise, >> >> >> maybe it needs to be much simpler (which might force you into ruling >> >> >> out some current code idioms). >> >> > >> >> > I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806) >> >> > can be developed to serve this purpose. >> >> >> >> (I missed that mail when it first came past, sorry) >> >> >> >> That's also going to be tricky, I'm afraid. The key condition there is: >> >> >> >> "* at the time of execution of E, L [PS: I assume that L is a >> >> typo and should be E] >> > >> > No, L was intended. (But see below.) >> >> then I misunderstood... >> >> >> can possibly have returned at >> >> least two different values under the assumption that L itself >> >> could have returned any value allowed by L's type." >> >> >> >> First, the evaluation of E might be nondeterministic - e.g., for an >> >> artificial example, if it's just a nondeterministic value obtained >> >> from the result of a race on SC atomics. The above doesn't >> >> distinguish between that (which doesn't have a real dependency on L) >> >> and that XOR'd with L (which does). And it does so in the wrong >> >> direction: it'll say there the former has a dependency on L. >> > >> > I'm not quite sure I understand the examples you want to point out >> > (could you add brief code snippets, perhaps?) -- but the informal >> > definition I proposed also says that E must have used L in some way to >> > make it's computation. That's pretty vague, and the way I phrased the >> > requirement is probably not optimal. >> >> ...so this example isn't so relevant, but it's maybe interesting >> anyway. 
In free-wheeling pseudocode: >> >> x = read_consume(L) // a memory_order_consume read of a 1-bit value from L >> y = nondet() // some code that >> nondeterministically produces another 1-bit value, eg by spawning two >> threads that each do an SC-atomic write to some other location, >> returning 0 or 1 depending on which wins >> >> then evaluate either just >> >> z=y >> >> or >> >> z = x XOR y >> >> In my misinterpretation of what you wrote, your definition would say >> there's a dependency from the load of L to the evaluation of y, even >> though there isn't. > > The evaluations "z==y" or "z=y" wouldn't depend on L, assuming nondet() > doesn't depend on L (e.g., by using x). They don't use x's value. (And > I don't have a precise definition for this, unfortunately, but I hope > the intent is clear.) > > "x XOR y" would depend on L, because this does take x into account, and > x's value remains relevant irrespective of which value y has. y would > be non-vdp, and the compiler could have specialized the code into two > branches for both 1 and 0 values of y; but it would still need x to > compute z in the last evaluation. > >> >> > So let me expand a bit on the background first. What I tried to capture >> > with the rules is that an evaluation (ie, E), really uses the value >> > returned by L, and not just a, for example, constant value that can be >> > inferred from whatever was executed between L and E. This (is intended >> > to) prevent value prediction and such by programmers. The compiler in >> > turn knows what a real program is allowed to do that still wants to rely >> > on the ordering by the consume. Basing all of this on the values seemed >> > to be helpful because basing it on *purely* syntax didn't seem >> > implementable; for the latter, we'd need to disallow cases such as x-x >> > or require compilers to include artificial dependencies if a program >> > indeed does value speculation.
IOW, it didn't seem possible to draw a >> > clear distinction between a data dependency defined purely based on >> > syntax and control dependencies. >> > >> > Another way to perhaps phrase the requirement might be to construct it >> > inductively, so that "E uses the value returned by L" becomes clearer. >> > However, I currently don't really know how to do that without any holes. >> > >> > For example, let's say that an atomic mo_consume load has a value >> > dependency to itself; the dependency has an associated set of values, >> > which is equal to all values allowed by the type (but see also below). >> > Then, an evaluation A that uses B as operand has a value dependency on B >> > if the associated set of values can still have more than one element. >> > Whether that's the case depends on the operation, and the other >> > operands, including their values at the time of execution. >> > However, it's not just results of evaluations that effectively constrain >> > the set of values. Conditional execution does as well, because the >> > condition being true might establish a constraint on L, which might >> > remove any dependency when L (or a result of a computation involving L) >> > is used. So I'm not sure the inductive approach would work. >> > >> > You said you thought I might have wanted to say that: >> > "E really-depends on L if there exist two different values that (just >> > according to typing) might be read for L that give rise to two >> > different values for E". >> > >> > Which should follow from the above, or at least should not conflict with >> > it. My "definition" tried to capture that the program must not >> > establish so many constraints between E and L that the value of L is >> > clear in the sense of having exactly one value. If we dereference such >> > an E whose execution in fact constrains L to one value, there is no real >> > dependency anymore. 
>> > However, by itself, this doesn't cover that E must use L in its >> > computation -- which your formulation does, I think. >> >> unfortunately not - that's what the example above shows. > > Now I think I understand. > >> > Another way might be to try to define which constraints established by >> > an execution (based on the executed code's semantics) actually remove >> > value dependencies on L. >> > >> > Do you have any suggestions for how to define this (ie, true value >> > dependencies) in a better way? >> >> not at the moment. I need to think some more about your vdps, though. >> As I understand it, you're agreeing with the general intent of the >> original C11 design that some new mechanism to tell the compiler not >> to optimise in certain places is required, but differing in the way >> you identify those? > > Yes, we still need such a mechanism because we need to prevent the > compiler from doing value prediction or something similar that uses > control dependencies to avoid data dependencies. For example, it could > use a large switch statement to add code specialized for each possible > value returned from an mo_consume load. Thus, at least for the > standard, we need some mechanism that prevents certain compiler > optimizations. > > Where it differs from C11 is that we're not trying to use just such a > mechanism built on syntax to try to define when we actually have a real > value dependency. > >> >> >> Second, it involves reasoning about counterfactual executions. That >> >> doesn't necessarily make it wrong, per se, but probably makes it hard >> >> to work with. For example, suppose that in all the actual >> >> whole-program executions, a runtime occurrence of L only ever returns >> >> one particular value (perhaps because of some simple #define'd >> >> configuration) >> > >> > Right, or just because it's a deterministic program.
>> > >> > This is a problem for the definition of value dependencies, AFAICT, >> > because where do you put the line for what the compiler is allowed to >> > speculate about? When does the program indeed "reveal" the value of a >> > load, and when does it not? Is a compiler to do out-of-band tracking >> > for certain memory locations, and thus predict the value? >> > >> > Adding the assumption that L, at least conceptually, might return any >> > value seemed to be a simple way to remove that uncertainty regarding >> > what the compiler might know about. >> >> ...but it brings in a lot of baggage. > > Which baggage do you mean? In terms of verification, or definitions in > the standard, or effects on compilers? At least the first two. I'm not in a position to comment on the last, but I'd hate to try to do compiler verification based on a semantics that involves substantial quantification over counterfactual hypothetical executions. >> >> , and that the code used in the evaluation of E depends >> >> on some invariant which is related to that configuration. >> > >> > A related issue that I've been thinking about is whether things like >> > divide-by-zero (ie, operations that give rise to undefined behavior) >> > should be considered as constraints on the values or not. I guess they >> > should, but then this seems closer to what you mention, invariants >> > established by the whole program. >> > >> >> The >> >> hypothetical execution used above in which a different value is used >> >> is one in which the code is being run in a situation with broken invariants. >> >> Then there will be technical difficulties in using the definition: >> >> I don't see how one would persuade oneself that a compiler always >> >> satisfies it, because these hypothetical executions are far removed >> >> from what it's actually working on. >> > >> > I think the above is easy to implement for a compiler.
At the >> > mo_consume load, you simply forget any information you have about this >> > value / memory location; for operations on value_dep_preserving types, >> >> are all the subexpressions likewise forced to be of vdp types? > > I'd like to avoid that if possible. Things like "vdp-pointer + int" > should Just Work in the sense of being still vdp-typed without requiring > an explicit cast for the int operand. Operators ||, &&, ?: might > justify more fine-grained rules, but I guess it's better to just make > more stuff vdp if that's easier. vdp is not a sufficient condition for > there being a value-dependency anyway, and the few optimizations it > prevents on expressions containing some vdp-typed operands probably > don't matter much. Also, as I mentioned in the reply to Paul, I believe > that if the compiler can prove that there's no value-dependency, it can > ignore the vdp-annotation (IOW, as-if is still possible). > ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-01 0:50 ` Paul E. McKenney 2014-03-01 10:06 ` Peter Sewell @ 2014-03-03 18:55 ` Torvald Riegel 2014-03-03 19:20 ` Paul E. McKenney 1 sibling, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-03-03 18:55 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > +o Do not use the results from the boolean "&&" and "||" when > + dereferencing. For example, the following (rather improbable) > + code is buggy: > + > + int a[2]; > + int index; > + int force_zero_index = 1; > + > + ... > + > + r1 = rcu_dereference(i1); > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > + > + The reason this is buggy is that "&&" and "||" are often compiled > + using branches. While weak-memory machines such as ARM or PowerPC > + do order stores after such branches, they can speculate loads, > + which can result in misordering bugs. > + > +o Do not use the results from relational operators ("==", "!=", > + ">", ">=", "<", or "<=") when dereferencing. For example, > + the following (quite strange) code is buggy: > + > + int a[2]; > + int index; > + int flip_index = 0; > + > + ... > + > + r1 = rcu_dereference(i1); > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > + > + As before, the reason this is buggy is that relational operators > + are often compiled using branches. And as before, although > + weak-memory machines such as ARM or PowerPC do order stores > + after such branches, they can speculate loads, which can again > + result in misordering bugs. Those two would be allowed by the wording I have recently proposed, AFAICS. r1 != flip_index would result in two possible values (unless there are further constraints due to the type of r1 and the values that flip_index can have). I don't think the wording is flawed.
We could raise the requirement of having more than one value left for r1 to having more than N with N > 1 values left, but the fundamental problem remains in that a compiler could try to generate a (big) switch statement. Instead, I think that this indicates that the value_dep_preserving type modifier would be useful: It would tell the compiler that it shouldn't transform this into a branch in this case, yet allow that optimization for all other code. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-03 18:55 ` Torvald Riegel @ 2014-03-03 19:20 ` Paul E. McKenney 2014-03-03 20:46 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-03-03 19:20 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > xagsmtp2.20140303190831.9500@uk1vsc.vnet.ibm.com > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > +o Do not use the results from the boolean "&&" and "||" when > > + dereferencing. For example, the following (rather improbable) > > + code is buggy: > > + > > + int a[2]; > > + int index; > > + int force_zero_index = 1; > > + > > + ... > > + > > + r1 = rcu_dereference(i1) > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > + > > + The reason this is buggy is that "&&" and "||" are often compiled > > + using branches. While weak-memory machines such as ARM or PowerPC > > + do order stores after such branches, they can speculate loads, > > + which can result in misordering bugs. > > + > > +o Do not use the results from relational operators ("==", "!=", > > + ">", ">=", "<", or "<=") when dereferencing. For example, > > + the following (quite strange) code is buggy: > > + > > + int a[2]; > > + int index; > > + int flip_index = 0; > > + > > + ... > > + > > + r1 = rcu_dereference(i1) > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > + > > + As before, the reason this is buggy is that relational operators > > + are often compiled using branches. And as before, although > > + weak-memory machines such as ARM or PowerPC do order stores > > + after such branches, but can speculate loads, which can again > > + result in misordering bugs. > > Those two would be allowed by the wording I have recently proposed, > AFAICS. 
r1 != flip_index would result in two possible values (unless > there are further constraints due to the type of r1 and the values that > flip_index can have). And I am OK with the value_dep_preserving type providing more/better guarantees than we get by default from current compilers. One question, though. Suppose that the code did not want a value dependency to be tracked through a comparison operator. What does the developer do in that case? (The reason I ask is that I have not yet found a use case in the Linux kernel that expects a value dependency to be tracked through a comparison.) > I don't think the wording is flawed. We could raise the requirement of > having more than one value left for r1 to having more than N with N > 1 > values left, but the fundamental problem remains in that a compiler > could try to generate a (big) switch statement. > > Instead, I think that this indicates that the value_dep_preserving type > modifier would be useful: It would tell the compiler that it shouldn't > transform this into a branch in this case, yet allow that optimization > for all other code. Understood! BTW, my current task is generating examples using the value_dep_preserving type for RCU-protected array indexes. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-03 19:20 ` Paul E. McKenney @ 2014-03-03 20:46 ` Torvald Riegel 2014-03-04 19:00 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-03-03 20:46 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > xagsmtp2.20140303190831.9500@uk1vsc.vnet.ibm.com > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > +o Do not use the results from the boolean "&&" and "||" when > > > + dereferencing. For example, the following (rather improbable) > > > + code is buggy: > > > + > > > + int a[2]; > > > + int index; > > > + int force_zero_index = 1; > > > + > > > + ... > > > + > > > + r1 = rcu_dereference(i1) > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > + > > > + The reason this is buggy is that "&&" and "||" are often compiled > > > + using branches. While weak-memory machines such as ARM or PowerPC > > > + do order stores after such branches, they can speculate loads, > > > + which can result in misordering bugs. > > > + > > > +o Do not use the results from relational operators ("==", "!=", > > > + ">", ">=", "<", or "<=") when dereferencing. For example, > > > + the following (quite strange) code is buggy: > > > + > > > + int a[2]; > > > + int index; > > > + int flip_index = 0; > > > + > > > + ... > > > + > > > + r1 = rcu_dereference(i1) > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > + > > > + As before, the reason this is buggy is that relational operators > > > + are often compiled using branches. 
And as before, although > > > + weak-memory machines such as ARM or PowerPC do order stores > > > + after such branches, but can speculate loads, which can again > > > + result in misordering bugs. > > > > Those two would be allowed by the wording I have recently proposed, > > AFAICS. r1 != flip_index would result in two possible values (unless > > there are further constraints due to the type of r1 and the values that > > flip_index can have). > > And I am OK with the value_dep_preserving type providing more/better > guarantees than we get by default from current compilers. > > One question, though. Suppose that the code did not want a value > dependency to be tracked through a comparison operator. What does > the developer do in that case? (The reason I ask is that I have > not yet found a use case in the Linux kernel that expects a value > dependency to be tracked through a comparison.) Hmm. I suppose use an explicit cast to non-vdp before or after the comparison? ^ permalink raw reply [flat|nested] 285+ messages in thread
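Torvald's cast suggestion could look roughly like the sketch below. Everything here is hypothetical: value_dep_preserving is only a qualifier proposed in this thread, so it is defined away to let the sketch compile with today's compilers, and the surrounding variables are invented for illustration.

```c
/* The value_dep_preserving qualifier does not exist yet; define it
 * to nothing so this sketch compiles.  The intent is that the
 * explicit cast to the unqualified type ends dependency tracking. */
#define value_dep_preserving

static int a[2] = {5, 6};
static int flip_index;

static int lookup(int value_dep_preserving r1)
{
	int plain = (int)r1;		/* cast to non-vdp: tracking ends here */

	return a[plain != flip_index];	/* ordinary code, branches allowed */
}
```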
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-03 20:46 ` Torvald Riegel @ 2014-03-04 19:00 ` Paul E. McKenney 2014-03-04 21:35 ` Paul E. McKenney 2014-03-05 16:26 ` Torvald Riegel 0 siblings, 2 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-03-04 19:00 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > xagsmtp2.20140303204700.3556@vmsdvma.vnet.ibm.com > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > xagsmtp2.20140303190831.9500@uk1vsc.vnet.ibm.com > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > +o Do not use the results from the boolean "&&" and "||" when > > > > + dereferencing. For example, the following (rather improbable) > > > > + code is buggy: > > > > + > > > > + int a[2]; > > > > + int index; > > > > + int force_zero_index = 1; > > > > + > > > > + ... > > > > + > > > > + r1 = rcu_dereference(i1) > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > + > > > > + The reason this is buggy is that "&&" and "||" are often compiled > > > > + using branches. While weak-memory machines such as ARM or PowerPC > > > > + do order stores after such branches, they can speculate loads, > > > > + which can result in misordering bugs. > > > > + > > > > +o Do not use the results from relational operators ("==", "!=", > > > > + ">", ">=", "<", or "<=") when dereferencing. For example, > > > > + the following (quite strange) code is buggy: > > > > + > > > > + int a[2]; > > > > + int index; > > > > + int flip_index = 0; > > > > + > > > > + ... 
> > > > + > > > > + r1 = rcu_dereference(i1) > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > > + > > > > + As before, the reason this is buggy is that relational operators > > > > + are often compiled using branches. And as before, although > > > > + weak-memory machines such as ARM or PowerPC do order stores > > > > + after such branches, but can speculate loads, which can again > > > > + result in misordering bugs. > > > > > > Those two would be allowed by the wording I have recently proposed, > > > AFAICS. r1 != flip_index would result in two possible values (unless > > > there are further constraints due to the type of r1 and the values that > > > flip_index can have). > > > > And I am OK with the value_dep_preserving type providing more/better > > guarantees than we get by default from current compilers. > > > > One question, though. Suppose that the code did not want a value > > dependency to be tracked through a comparison operator. What does > > the developer do in that case? (The reason I ask is that I have > > not yet found a use case in the Linux kernel that expects a value > > dependency to be tracked through a comparison.) > > Hmm. I suppose use an explicit cast to non-vdp before or after the > comparison? That should work well assuming that things like "if", "while", and "?:" conditions are happy to take a vdp. This assumes that p->a only returns vdp if field "a" is declared vdp, otherwise we have vdps running wild through the program. ;-) The other thing that can happen is that a vdp can get handed off to another synchronization mechanism, for example, to reference counting: p = atomic_load_explicit(&gp, memory_order_consume); if (do_something_with(p->a)) { /* fast path protected by RCU. */ return 0; } if (atomic_inc_not_zero(&p->refcnt) { /* slow path protected by reference counting. */ return do_something_else_with((struct foo *)p); /* CHANGE */ } /* Needed slow path, but raced with deletion. 
*/ return -EAGAIN; I am guessing that the cast ends the vdp. Is that the case? Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
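For reference, here is a self-contained sketch of the increment-unless-zero primitive that the handoff above relies on, rebuilt with C11 atomics. This is not the kernel's implementation; atomic_inc_not_zero() is reimplemented here purely for illustration, and the RCU fast path and do_something*() calls are omitted.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Increment the reference count unless it has already dropped to
 * zero (i.e., unless we raced with the final put and the object is
 * being torn down).  Returns true if the reference was acquired. */
static bool atomic_inc_not_zero(atomic_int *v)
{
	int old = atomic_load_explicit(v, memory_order_relaxed);

	while (old != 0) {
		/* On failure the CAS reloads the current value into old,
		 * so the loop re-checks the zero condition each time. */
		if (atomic_compare_exchange_weak_explicit(v, &old, old + 1,
							  memory_order_acquire,
							  memory_order_relaxed))
			return true;	/* reference acquired */
	}
	return false;			/* raced with deletion */
}
```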
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-04 19:00 ` Paul E. McKenney @ 2014-03-04 21:35 ` Paul E. McKenney 2014-03-05 16:54 ` Torvald Riegel 2014-03-05 16:26 ` Torvald Riegel 1 sibling, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-03-04 21:35 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote: > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > xagsmtp2.20140303204700.3556@vmsdvma.vnet.ibm.com > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > xagsmtp2.20140303190831.9500@uk1vsc.vnet.ibm.com > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > +o Do not use the results from the boolean "&&" and "||" when > > > > > + dereferencing. For example, the following (rather improbable) > > > > > + code is buggy: > > > > > + > > > > > + int a[2]; > > > > > + int index; > > > > > + int force_zero_index = 1; > > > > > + > > > > > + ... > > > > > + > > > > > + r1 = rcu_dereference(i1) > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > + > > > > > + The reason this is buggy is that "&&" and "||" are often compiled > > > > > + using branches. While weak-memory machines such as ARM or PowerPC > > > > > + do order stores after such branches, they can speculate loads, > > > > > + which can result in misordering bugs. > > > > > + > > > > > +o Do not use the results from relational operators ("==", "!=", > > > > > + ">", ">=", "<", or "<=") when dereferencing. 
For example, > > > > > + the following (quite strange) code is buggy: > > > > > + > > > > > + int a[2]; > > > > > + int index; > > > > > + int flip_index = 0; > > > > > + > > > > > + ... > > > > > + > > > > > + r1 = rcu_dereference(i1) > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > > > + > > > > > + As before, the reason this is buggy is that relational operators > > > > > + are often compiled using branches. And as before, although > > > > > + weak-memory machines such as ARM or PowerPC do order stores > > > > > + after such branches, but can speculate loads, which can again > > > > > + result in misordering bugs. > > > > > > > > Those two would be allowed by the wording I have recently proposed, > > > > AFAICS. r1 != flip_index would result in two possible values (unless > > > > there are further constraints due to the type of r1 and the values that > > > > flip_index can have). > > > > > > And I am OK with the value_dep_preserving type providing more/better > > > guarantees than we get by default from current compilers. > > > > > > One question, though. Suppose that the code did not want a value > > > dependency to be tracked through a comparison operator. What does > > > the developer do in that case? (The reason I ask is that I have > > > not yet found a use case in the Linux kernel that expects a value > > > dependency to be tracked through a comparison.) > > > > Hmm. I suppose use an explicit cast to non-vdp before or after the > > comparison? > > That should work well assuming that things like "if", "while", and "?:" > conditions are happy to take a vdp. This assumes that p->a only returns > vdp if field "a" is declared vdp, otherwise we have vdps running wild > through the program. 
;-) > > The other thing that can happen is that a vdp can get handed off to > another synchronization mechanism, for example, to reference counting: > > p = atomic_load_explicit(&gp, memory_order_consume); > if (do_something_with(p->a)) { > /* fast path protected by RCU. */ > return 0; > } > if (atomic_inc_not_zero(&p->refcnt) { > /* slow path protected by reference counting. */ > return do_something_else_with((struct foo *)p); /* CHANGE */ > } > /* Needed slow path, but raced with deletion. */ > return -EAGAIN; > > I am guessing that the cast ends the vdp. Is that the case? And here is a more elaborate example from the Linux kernel: struct md_rdev value_dep_preserving *rdev; /* CHANGE */ rdev = rcu_dereference(conf->mirrors[disk].rdev); if (r1_bio->bios[disk] == IO_BLOCKED || rdev == NULL || test_bit(Unmerged, &rdev->flags) || test_bit(Faulty, &rdev->flags)) continue; The fact that the "rdev == NULL" returns vdp does not force the "||" operators to be evaluated arithmetically because the entire function is an "if" condition, correct? Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
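The short-circuit guarantee that the question above leans on can be shown in isolation. The types and flag bits below are minimal stand-ins for the kernel structures, not the real definitions, but the "||" behavior is the same: C evaluates the operands left to right with a sequence point between them, so rdev->flags is only read once rdev is known to be non-NULL.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's md_rdev and flag bits. */
struct md_rdev {
	unsigned long flags;
};

#define UNMERGED_BIT	0x1UL
#define FAULTY_BIT	0x2UL

static bool skip_disk(bool io_blocked, struct md_rdev *rdev)
{
	/* Short-circuit "||": rdev->flags is never dereferenced when
	 * an earlier operand (including rdev == NULL) is true. */
	return io_blocked
	    || rdev == NULL
	    || (rdev->flags & UNMERGED_BIT)
	    || (rdev->flags & FAULTY_BIT);
}
```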
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-04 21:35 ` Paul E. McKenney @ 2014-03-05 16:54 ` Torvald Riegel 2014-03-05 18:15 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-03-05 16:54 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote: > On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote: > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > > xagsmtp2.20140303204700.3556@vmsdvma.vnet.ibm.com > > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) > > > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > > xagsmtp2.20140303190831.9500@uk1vsc.vnet.ibm.com > > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > > +o Do not use the results from the boolean "&&" and "||" when > > > > > > + dereferencing. For example, the following (rather improbable) > > > > > > + code is buggy: > > > > > > + > > > > > > + int a[2]; > > > > > > + int index; > > > > > > + int force_zero_index = 1; > > > > > > + > > > > > > + ... > > > > > > + > > > > > > + r1 = rcu_dereference(i1) > > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > > + > > > > > > + The reason this is buggy is that "&&" and "||" are often compiled > > > > > > + using branches. While weak-memory machines such as ARM or PowerPC > > > > > > + do order stores after such branches, they can speculate loads, > > > > > > + which can result in misordering bugs. > > > > > > + > > > > > > +o Do not use the results from relational operators ("==", "!=", > > > > > > + ">", ">=", "<", or "<=") when dereferencing. 
For example, > > > > > > + the following (quite strange) code is buggy: > > > > > > + > > > > > > + int a[2]; > > > > > > + int index; > > > > > > + int flip_index = 0; > > > > > > + > > > > > > + ... > > > > > > + > > > > > > + r1 = rcu_dereference(i1) > > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > > > > + > > > > > > + As before, the reason this is buggy is that relational operators > > > > > > + are often compiled using branches. And as before, although > > > > > > + weak-memory machines such as ARM or PowerPC do order stores > > > > > > + after such branches, but can speculate loads, which can again > > > > > > + result in misordering bugs. > > > > > > > > > > Those two would be allowed by the wording I have recently proposed, > > > > > AFAICS. r1 != flip_index would result in two possible values (unless > > > > > there are further constraints due to the type of r1 and the values that > > > > > flip_index can have). > > > > > > > > And I am OK with the value_dep_preserving type providing more/better > > > > guarantees than we get by default from current compilers. > > > > > > > > One question, though. Suppose that the code did not want a value > > > > dependency to be tracked through a comparison operator. What does > > > > the developer do in that case? (The reason I ask is that I have > > > > not yet found a use case in the Linux kernel that expects a value > > > > dependency to be tracked through a comparison.) > > > > > > Hmm. I suppose use an explicit cast to non-vdp before or after the > > > comparison? > > > > That should work well assuming that things like "if", "while", and "?:" > > conditions are happy to take a vdp. This assumes that p->a only returns > > vdp if field "a" is declared vdp, otherwise we have vdps running wild > > through the program. 
;-) > > > > The other thing that can happen is that a vdp can get handed off to > > another synchronization mechanism, for example, to reference counting: > > > > p = atomic_load_explicit(&gp, memory_order_consume); > > if (do_something_with(p->a)) { > > /* fast path protected by RCU. */ > > return 0; > > } > > if (atomic_inc_not_zero(&p->refcnt) { > > /* slow path protected by reference counting. */ > > return do_something_else_with((struct foo *)p); /* CHANGE */ > > } > > /* Needed slow path, but raced with deletion. */ > > return -EAGAIN; > > > > I am guessing that the cast ends the vdp. Is that the case? > > And here is a more elaborate example from the Linux kernel: > > struct md_rdev value_dep_preserving *rdev; /* CHANGE */ > > rdev = rcu_dereference(conf->mirrors[disk].rdev); > if (r1_bio->bios[disk] == IO_BLOCKED > || rdev == NULL > || test_bit(Unmerged, &rdev->flags) > || test_bit(Faulty, &rdev->flags)) > continue; > > The fact that the "rdev == NULL" returns vdp does not force the "||" > operators to be evaluated arithmetically because the entire function > is an "if" condition, correct? That's a good question, and one that as far as I understand currently, essentially boils down to whether we want to have tight restrictions on which operations are still vdp. If we look at the different combinations, then it seems we can't decide on whether we have a value-dependency just due to a vdp type: * non-vdp || vdp: vdp iff non-vdp == false * vdp || non-vdp: vdp iff non-vdp == false? * vdp || vdp: always vdp? (and dependency on both?) I'm not sure it makes sense to try to not make all of those vdp-by-default. The first and second case show that it's dependent on the specific execution anyway, and thus is already covered by the requirement that the value must still matter. 
The vdp type is just a way to prevent inappropriate compiler optimizations; it's not critical for correctness if we make more stuff vdp, yet it may prevent some optimizations in the affected expression. If the compiler knows that some vdp-typed evaluation will not have a value-dependency anyway, then it can just optimize this evaluation like non-vdp code. I guess not much would change for the code you posted, because we already have to evaluate || operands in order, I believe (e.g., don't access rdev->flags before doing the rdev == NULL check, modulo as-if). Do I understand your question correctly? ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-05 16:54 ` Torvald Riegel @ 2014-03-05 18:15 ` Paul E. McKenney 2014-03-07 18:33 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-03-05 18:15 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote: > On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote: > > On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote: > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > > > xagsmtp2.20140303204700.3556@vmsdvma.vnet.ibm.com > > > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) > > > > > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > > > xagsmtp2.20140303190831.9500@uk1vsc.vnet.ibm.com > > > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > > > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > > > +o Do not use the results from the boolean "&&" and "||" when > > > > > > > + dereferencing. For example, the following (rather improbable) > > > > > > > + code is buggy: > > > > > > > + > > > > > > > + int a[2]; > > > > > > > + int index; > > > > > > > + int force_zero_index = 1; > > > > > > > + > > > > > > > + ... > > > > > > > + > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > > > + > > > > > > > + The reason this is buggy is that "&&" and "||" are often compiled > > > > > > > + using branches. While weak-memory machines such as ARM or PowerPC > > > > > > > + do order stores after such branches, they can speculate loads, > > > > > > > + which can result in misordering bugs. 
> > > > > > > + > > > > > > > +o Do not use the results from relational operators ("==", "!=", > > > > > > > + ">", ">=", "<", or "<=") when dereferencing. For example, > > > > > > > + the following (quite strange) code is buggy: > > > > > > > + > > > > > > > + int a[2]; > > > > > > > + int index; > > > > > > > + int flip_index = 0; > > > > > > > + > > > > > > > + ... > > > > > > > + > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > > > > > + > > > > > > > + As before, the reason this is buggy is that relational operators > > > > > > > + are often compiled using branches. And as before, although > > > > > > > + weak-memory machines such as ARM or PowerPC do order stores > > > > > > > + after such branches, but can speculate loads, which can again > > > > > > > + result in misordering bugs. > > > > > > > > > > > > Those two would be allowed by the wording I have recently proposed, > > > > > > AFAICS. r1 != flip_index would result in two possible values (unless > > > > > > there are further constraints due to the type of r1 and the values that > > > > > > flip_index can have). > > > > > > > > > > And I am OK with the value_dep_preserving type providing more/better > > > > > guarantees than we get by default from current compilers. > > > > > > > > > > One question, though. Suppose that the code did not want a value > > > > > dependency to be tracked through a comparison operator. What does > > > > > the developer do in that case? (The reason I ask is that I have > > > > > not yet found a use case in the Linux kernel that expects a value > > > > > dependency to be tracked through a comparison.) > > > > > > > > Hmm. I suppose use an explicit cast to non-vdp before or after the > > > > comparison? > > > > > > That should work well assuming that things like "if", "while", and "?:" > > > conditions are happy to take a vdp. 
This assumes that p->a only returns > > > vdp if field "a" is declared vdp, otherwise we have vdps running wild > > > through the program. ;-) > > > > > > The other thing that can happen is that a vdp can get handed off to > > > another synchronization mechanism, for example, to reference counting: > > > > > > p = atomic_load_explicit(&gp, memory_order_consume); > > > if (do_something_with(p->a)) { > > > /* fast path protected by RCU. */ > > > return 0; > > > } > > > if (atomic_inc_not_zero(&p->refcnt) { > > > /* slow path protected by reference counting. */ > > > return do_something_else_with((struct foo *)p); /* CHANGE */ > > > } > > > /* Needed slow path, but raced with deletion. */ > > > return -EAGAIN; > > > > > > I am guessing that the cast ends the vdp. Is that the case? > > > > And here is a more elaborate example from the Linux kernel: > > > > struct md_rdev value_dep_preserving *rdev; /* CHANGE */ > > > > rdev = rcu_dereference(conf->mirrors[disk].rdev); > > if (r1_bio->bios[disk] == IO_BLOCKED > > || rdev == NULL > > || test_bit(Unmerged, &rdev->flags) > > || test_bit(Faulty, &rdev->flags)) > > continue; > > > > The fact that the "rdev == NULL" returns vdp does not force the "||" > > operators to be evaluated arithmetically because the entire function > > is an "if" condition, correct? > > That's a good question, and one that as far as I understand currently, > essentially boils down to whether we want to have tight restrictions on > which operations are still vdp. > > If we look at the different combinations, then it seems we can't decide > on whether we have a value-dependency just due to a vdp type: > * non-vdp || vdp: vdp iff non-vdp == false > * vdp || non-vdp: vdp iff non-vdp == false? > * vdp || vdp: always vdp? (and dependency on both?) > > I'm not sure it makes sense to try to not make all of those > vdp-by-default. 
The first and second case show that it's dependent on > the specific execution anyway, and thus is already covered by the > requirement that the value must still matter. The vdp type is just a > way to prevent inappropriate compiler optimizations; it's not critical > for correctness is we make more stuff vdp, yet it may prevent some > optimizations in the affected expression. > > If the compiler knows that some vdp-typed evaluation will not have a > value-dependency anyway, then it can just optimize this evaluation like > non-vdp code. > > I guess not much would change for the code you posted, because we > already have to evaluate || operands in order, I believe (e.g., don't > access rdev->flags before doing the rdev == NULL check, modulo as-if). > > Do I understand your question correctly? Let me give an example for the other side: struct foo value_dep_preserving *p; struct foo value_dep_preserving *q; p = rcu_dereference(gp); q = rcu_dereference(gq); return myarray[p || q]; /* Linux kernel doesn't do this. */ If we wanted this to work (and I am not at all convinced that we do), the compiler would have to force a data dependency through the "||". But I would be just as happy to instead just say that boolean logical operators ("||" and "&&") never return vdp values. Ditto for the relational operators ("==", "!=", ">", ">=", "<", and "<="). No one seems to rely on value dependencies via these operators, after all, and preserving value dependencies through them seems to require that the compiler generate odd code. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
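A small sketch (invented names, not kernel code) of why "||" cannot preserve an address dependency even in principle: its result is only 0 or 1, so any lowering may branch to one of two constant array slots, discarding every address bit of p and q along the way.

```c
#include <stddef.h>

/* "p || q" collapses two pointers into a boolean 0 or 1, so the
 * compiler can always rewrite the array access as a branch over the
 * constant addresses &myarray[0] and &myarray[1]; no hardware-level
 * data dependency on p or q can survive that rewrite. */
static int index_from_pointers(const int myarray[2],
			       const void *p, const void *q)
{
	int idx = (p || q);	/* boolean result: 0 or 1 */

	return myarray[idx];
}
```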
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-05 18:15 ` Paul E. McKenney @ 2014-03-07 18:33 ` Torvald Riegel 2014-03-07 19:11 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-03-07 18:33 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Wed, 2014-03-05 at 10:15 -0800, Paul E. McKenney wrote: > On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote: > > On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote: > > > On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote: > > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > > > > xagsmtp2.20140303204700.3556@vmsdvma.vnet.ibm.com > > > > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) > > > > > > > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > > > > xagsmtp2.20140303190831.9500@uk1vsc.vnet.ibm.com > > > > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > > > > > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > > > > +o Do not use the results from the boolean "&&" and "||" when > > > > > > > > + dereferencing. For example, the following (rather improbable) > > > > > > > > + code is buggy: > > > > > > > > + > > > > > > > > + int a[2]; > > > > > > > > + int index; > > > > > > > > + int force_zero_index = 1; > > > > > > > > + > > > > > > > > + ... > > > > > > > > + > > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > > > > + > > > > > > > > + The reason this is buggy is that "&&" and "||" are often compiled > > > > > > > > + using branches. 
While weak-memory machines such as ARM or PowerPC > > > > > > > > + do order stores after such branches, they can speculate loads, > > > > > > > > + which can result in misordering bugs. > > > > > > > > + > > > > > > > > +o Do not use the results from relational operators ("==", "!=", > > > > > > > > + ">", ">=", "<", or "<=") when dereferencing. For example, > > > > > > > > + the following (quite strange) code is buggy: > > > > > > > > + > > > > > > > > + int a[2]; > > > > > > > > + int index; > > > > > > > > + int flip_index = 0; > > > > > > > > + > > > > > > > > + ... > > > > > > > > + > > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > > > > > > + > > > > > > > > + As before, the reason this is buggy is that relational operators > > > > > > > > + are often compiled using branches. And as before, although > > > > > > > > + weak-memory machines such as ARM or PowerPC do order stores > > > > > > > > + after such branches, but can speculate loads, which can again > > > > > > > > + result in misordering bugs. > > > > > > > > > > > > > > Those two would be allowed by the wording I have recently proposed, > > > > > > > AFAICS. r1 != flip_index would result in two possible values (unless > > > > > > > there are further constraints due to the type of r1 and the values that > > > > > > > flip_index can have). > > > > > > > > > > > > And I am OK with the value_dep_preserving type providing more/better > > > > > > guarantees than we get by default from current compilers. > > > > > > > > > > > > One question, though. Suppose that the code did not want a value > > > > > > dependency to be tracked through a comparison operator. What does > > > > > > the developer do in that case? (The reason I ask is that I have > > > > > > not yet found a use case in the Linux kernel that expects a value > > > > > > dependency to be tracked through a comparison.) > > > > > > > > > > Hmm. 
I suppose use an explicit cast to non-vdp before or after the > > > > > comparison? > > > > > > > > That should work well assuming that things like "if", "while", and "?:" > > > > conditions are happy to take a vdp. This assumes that p->a only returns > > > > vdp if field "a" is declared vdp, otherwise we have vdps running wild > > > > through the program. ;-) > > > > > > > > The other thing that can happen is that a vdp can get handed off to > > > > another synchronization mechanism, for example, to reference counting: > > > > > > > > p = atomic_load_explicit(&gp, memory_order_consume); > > > > if (do_something_with(p->a)) { > > > > /* fast path protected by RCU. */ > > > > return 0; > > > > } > > > > if (atomic_inc_not_zero(&p->refcnt) { > > > > /* slow path protected by reference counting. */ > > > > return do_something_else_with((struct foo *)p); /* CHANGE */ > > > > } > > > > /* Needed slow path, but raced with deletion. */ > > > > return -EAGAIN; > > > > > > > > I am guessing that the cast ends the vdp. Is that the case? > > > > > > And here is a more elaborate example from the Linux kernel: > > > > > > struct md_rdev value_dep_preserving *rdev; /* CHANGE */ > > > > > > rdev = rcu_dereference(conf->mirrors[disk].rdev); > > > if (r1_bio->bios[disk] == IO_BLOCKED > > > || rdev == NULL > > > || test_bit(Unmerged, &rdev->flags) > > > || test_bit(Faulty, &rdev->flags)) > > > continue; > > > > > > The fact that the "rdev == NULL" returns vdp does not force the "||" > > > operators to be evaluated arithmetically because the entire function > > > is an "if" condition, correct? > > > > That's a good question, and one that as far as I understand currently, > > essentially boils down to whether we want to have tight restrictions on > > which operations are still vdp. 
> > > > If we look at the different combinations, then it seems we can't decide > > on whether we have a value-dependency just due to a vdp type: > > * non-vdp || vdp: vdp iff non-vdp == false > > * vdp || non-vdp: vdp iff non-vdp == false? > > * vdp || vdp: always vdp? (and dependency on both?) > > > > I'm not sure it makes sense to try to not make all of those > > vdp-by-default. The first and second case show that it's dependent on > > the specific execution anyway, and thus is already covered by the > > requirement that the value must still matter. The vdp type is just a > > way to prevent inappropriate compiler optimizations; it's not critical > > for correctness is we make more stuff vdp, yet it may prevent some > > optimizations in the affected expression. > > > > If the compiler knows that some vdp-typed evaluation will not have a > > value-dependency anyway, then it can just optimize this evaluation like > > non-vdp code. > > > > I guess not much would change for the code you posted, because we > > already have to evaluate || operands in order, I believe (e.g., don't > > access rdev->flags before doing the rdev == NULL check, modulo as-if). > > Do I understand your question correctly? > > Let me give an example for the other side: > > struct foo value_dep_preserving *p; > struct foo value_dep_preserving *q; > > p = rcu_dereference(gp); > q = rcu_dereference(gq); > return myarray[p || q]]; /* Linux kernel doesn't do this. */ > > If we wanted this to work (and I am not at all convinced that we do), > the compiler would have to force a data dependency through the "||". Yes. > But I would be just as happy to instead just say that boolean logical > operators ("||" and "&&") never return vdp values. I think those aren't actually the problem (or if they were, we'd need to think about & and | on 1-bit integers or bitfields as well), but ... > Ditto for the > relational operators ("==", "!=", ">", ">=", "<", and "<="). 
No one > seems to rely on value dependencies via these operators, after all, > and preserving value dependencies through them seems to require that > the compiler generate odd code. ... that any conversion from vdp to bool requires specialized handling by the compiler. That happens on implicit conversion (as in "p || q") and in the operators you mentioned. I don't see a reason why conversion to bool (or any other operator returning bool and taking vdp as operand) should be *always* non-vdp. But it seems it would be easier to misuse than other operators. ^ permalink raw reply [flat|nested] 285+ messages in thread
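For readers mapping this discussion onto standard C11: the machinery under debate already exists in <stdatomic.h> as memory_order_consume, and value_dep_preserving would make the dependency chains explicit in the type system. A minimal sketch of the two cases at issue, a dependency-ordered dereference versus a dependency-killing comparison (names and values are illustrative, and since value_dep_preserving is only a proposal, plain types are used):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct foo { int a; };

static _Atomic(struct foo *) gp;    /* the published pointer */

/* Writer: initialize the object, then release-store the pointer. */
static void publish(struct foo *f, int val)
{
    f->a = val;
    atomic_store_explicit(&gp, f, memory_order_release);
}

/* Reader, the good case: p->a carries a value dependency on the
 * consume load, which is what orders the two loads on weak hardware. */
static int read_a(void)
{
    struct foo *p = atomic_load_explicit(&gp, memory_order_consume);
    return p->a;
}

/* Reader, the bad case from the quoted documentation: "p != NULL"
 * collapses the pointer to 0 or 1, so the indexed load below no longer
 * depends on the loaded value and may be speculated ahead of it. */
static int read_via_comparison(const int arr[2])
{
    struct foo *p = atomic_load_explicit(&gp, memory_order_consume);
    return arr[p != NULL];
}
```

In practice current compilers sidestep the tracking problem by promoting memory_order_consume to memory_order_acquire, which is correct but costs a barrier on ARM and PowerPC; the vdp proposal is an attempt to get the cheap dependency ordering reliably.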
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-07 18:33 ` Torvald Riegel @ 2014-03-07 19:11 ` Paul E. McKenney 0 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-03-07 19:11 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Fri, Mar 07, 2014 at 07:33:25PM +0100, Torvald Riegel wrote: > On Wed, 2014-03-05 at 10:15 -0800, Paul E. McKenney wrote: > > On Wed, Mar 05, 2014 at 05:54:59PM +0100, Torvald Riegel wrote: > > > On Tue, 2014-03-04 at 13:35 -0800, Paul E. McKenney wrote: > > > > On Tue, Mar 04, 2014 at 11:00:32AM -0800, Paul E. McKenney wrote: > > > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > > > > > +o Do not use the results from the boolean "&&" and "||" when > > > > > > > > > + dereferencing. For example, the following (rather improbable) > > > > > > > > > + code is buggy: > > > > > > > > > + > > > > > > > > > + int a[2]; > > > > > > > > > + int index; > > > > > > > > > + int force_zero_index = 1; > > > > > > > > > + > > > > > > > > > + ... > > > > > > > > > + > > > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > > > > > + > > > > > > > > > + The reason this is buggy is that "&&" and "||" are often compiled > > > > > > > > > + using branches. 
While weak-memory machines such as ARM or PowerPC > > > > > > > > > + do order stores after such branches, they can speculate loads, > > > > > > > > > + which can result in misordering bugs. > > > > > > > > > + > > > > > > > > > +o Do not use the results from relational operators ("==", "!=", > > > > > > > > > + ">", ">=", "<", or "<=") when dereferencing. For example, > > > > > > > > > + the following (quite strange) code is buggy: > > > > > > > > > + > > > > > > > > > + int a[2]; > > > > > > > > > + int index; > > > > > > > > > + int flip_index = 0; > > > > > > > > > + > > > > > > > > > + ... > > > > > > > > > + > > > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > > > > > > > + > > > > > > > > > + As before, the reason this is buggy is that relational operators > > > > > > > > > + are often compiled using branches. And as before, although > > > > > > > > > + weak-memory machines such as ARM or PowerPC do order stores > > > > > > > > > + after such branches, but can speculate loads, which can again > > > > > > > > > + result in misordering bugs. > > > > > > > > > > > > > > > > Those two would be allowed by the wording I have recently proposed, > > > > > > > > AFAICS. r1 != flip_index would result in two possible values (unless > > > > > > > > there are further constraints due to the type of r1 and the values that > > > > > > > > flip_index can have). > > > > > > > > > > > > > > And I am OK with the value_dep_preserving type providing more/better > > > > > > > guarantees than we get by default from current compilers. > > > > > > > > > > > > > > One question, though. Suppose that the code did not want a value > > > > > > > dependency to be tracked through a comparison operator. What does > > > > > > > the developer do in that case? (The reason I ask is that I have > > > > > > > not yet found a use case in the Linux kernel that expects a value > > > > > > > dependency to be tracked through a comparison.) 
> > > > > > > > > > > > Hmm. I suppose use an explicit cast to non-vdp before or after the > > > > > > comparison? > > > > > > > > > > That should work well assuming that things like "if", "while", and "?:" > > > > > conditions are happy to take a vdp. This assumes that p->a only returns > > > > > vdp if field "a" is declared vdp, otherwise we have vdps running wild > > > > > through the program. ;-) > > > > > > > > > > The other thing that can happen is that a vdp can get handed off to > > > > > another synchronization mechanism, for example, to reference counting: > > > > > > > > > > p = atomic_load_explicit(&gp, memory_order_consume); > > > > > if (do_something_with(p->a)) { > > > > > /* fast path protected by RCU. */ > > > > > return 0; > > > > > } > > > > > if (atomic_inc_not_zero(&p->refcnt) { > > > > > /* slow path protected by reference counting. */ > > > > > return do_something_else_with((struct foo *)p); /* CHANGE */ > > > > > } > > > > > /* Needed slow path, but raced with deletion. */ > > > > > return -EAGAIN; > > > > > > > > > > I am guessing that the cast ends the vdp. Is that the case? > > > > > > > > And here is a more elaborate example from the Linux kernel: > > > > > > > > struct md_rdev value_dep_preserving *rdev; /* CHANGE */ > > > > > > > > rdev = rcu_dereference(conf->mirrors[disk].rdev); > > > > if (r1_bio->bios[disk] == IO_BLOCKED > > > > || rdev == NULL > > > > || test_bit(Unmerged, &rdev->flags) > > > > || test_bit(Faulty, &rdev->flags)) > > > > continue; > > > > > > > > The fact that the "rdev == NULL" returns vdp does not force the "||" > > > > operators to be evaluated arithmetically because the entire function > > > > is an "if" condition, correct? > > > > > > That's a good question, and one that as far as I understand currently, > > > essentially boils down to whether we want to have tight restrictions on > > > which operations are still vdp. 
> > > > > > If we look at the different combinations, then it seems we can't decide > > > on whether we have a value-dependency just due to a vdp type: > > > * non-vdp || vdp: vdp iff non-vdp == false > > > * vdp || non-vdp: vdp iff non-vdp == false? > > > * vdp || vdp: always vdp? (and dependency on both?) > > > > > > I'm not sure it makes sense to try to not make all of those > > > vdp-by-default. The first and second case show that it's dependent on > > > the specific execution anyway, and thus is already covered by the > > > requirement that the value must still matter. The vdp type is just a > > > way to prevent inappropriate compiler optimizations; it's not critical > > > for correctness if we make more stuff vdp, yet it may prevent some > > > optimizations in the affected expression. > > > > > > If the compiler knows that some vdp-typed evaluation will not have a > > > value-dependency anyway, then it can just optimize this evaluation like > > > non-vdp code. > > > > > > I guess not much would change for the code you posted, because we > > > already have to evaluate || operands in order, I believe (e.g., don't > > > access rdev->flags before doing the rdev == NULL check, modulo as-if). > > > Do I understand your question correctly? > > > > Let me give an example for the other side: > > > > struct foo value_dep_preserving *p; > > struct foo value_dep_preserving *q; > > > > p = rcu_dereference(gp); > > q = rcu_dereference(gq); > > return myarray[p || q]; /* Linux kernel doesn't do this. */ > > > > If we wanted this to work (and I am not at all convinced that we do), > > the compiler would have to force a data dependency through the "||". > > Yes. > > > But I would be just as happy to instead just say that boolean logical > > operators ("||" and "&&") never return vdp values. > > I think those aren't actually the problem (or if they were, we'd need to > think about & and | on 1-bit integers or bitfields as well), but ... 
> > > Ditto for the > > relational operators ("==", "!=", ">", ">=", "<", and "<="). No one > > seems to rely on value dependencies via these operators, after all, > > and preserving value dependencies through them seems to require that > > the compiler generate odd code. > > ... that any conversion from vdp to bool requires specialized handling > by the compiler. That happens on implicit conversion (as in "p || q") > and in the operators you mentioned. > > I don't see a reason why conversion to bool (or any other operator > returning bool and taking vdp as operand) should be *always* non-vdp. > But it seems it would be easier to misuse than other operators. Well, we have more than 1,000 things in the Linux kernel that head up what would be vdps, and none of them need a relational operator or a boolean non-bitwise operator to produce a vdp. Admittedly some danger in extrapolating, but we are extrapolating from a reasonably large sample. That said, I don't have a problem with these operators producing a vdp as long as the quality of the code emitted by the compiler doesn't suffer in the common case where they are not needed, and you are the expert on that. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
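Paul's point about the emitted code can be seen from the types alone: the result of "||" or a relational operator is an int holding only 0 or 1, so the loaded pointer value, and with it the hardware dependency chain, is gone by the time the result is used. A small illustration with ordinary pointers (vdp itself being hypothetical):

```c
#include <stddef.h>

struct foo { int a; };

/* The result of "p || q" is an int, 0 or 1; no address survives into
 * it. Preserving a dependency through this would force the compiler
 * to fabricate an artificial data dependency or emit a fence. */
static int collapse_or(struct foo *p, struct foo *q)
{
    return p || q;
}

/* Same for relational operators: only a boolean survives. */
static int collapse_cmp(struct foo *p)
{
    return p != NULL;
}
```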
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-04 19:00 ` Paul E. McKenney 2014-03-04 21:35 ` Paul E. McKenney @ 2014-03-05 16:26 ` Torvald Riegel 2014-03-05 18:01 ` Paul E. McKenney 1 sibling, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-03-05 16:26 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote: > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > +o Do not use the results from the boolean "&&" and "||" when > > > > > + dereferencing. For example, the following (rather improbable) > > > > > + code is buggy: > > > > > + > > > > > + int a[2]; > > > > > + int index; > > > > > + int force_zero_index = 1; > > > > > + > > > > > + ... > > > > > + > > > > > + r1 = rcu_dereference(i1) > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > + > > > > > + The reason this is buggy is that "&&" and "||" are often compiled > > > > > + using branches. While weak-memory machines such as ARM or PowerPC > > > > > + do order stores after such branches, they can speculate loads, > > > > > + which can result in misordering bugs. > > > > > + > > > > > +o Do not use the results from relational operators ("==", "!=", > > > > > + ">", ">=", "<", or "<=") when dereferencing. 
For example, > > > > > + the following (quite strange) code is buggy: > > > > > + > > > > > + int a[2]; > > > > > + int index; > > > > > + int flip_index = 0; > > > > > + > > > > > + ... > > > > > + > > > > > + r1 = rcu_dereference(i1) > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > > > + > > > > > + As before, the reason this is buggy is that relational operators > > > > > + are often compiled using branches. And as before, although > > > > > + weak-memory machines such as ARM or PowerPC do order stores > > > > > + after such branches, but can speculate loads, which can again > > > > > + result in misordering bugs. > > > > > > > > Those two would be allowed by the wording I have recently proposed, > > > > AFAICS. r1 != flip_index would result in two possible values (unless > > > > there are further constraints due to the type of r1 and the values that > > > > flip_index can have). > > > > > > And I am OK with the value_dep_preserving type providing more/better > > > guarantees than we get by default from current compilers. > > > > > > One question, though. Suppose that the code did not want a value > > > dependency to be tracked through a comparison operator. What does > > > the developer do in that case? (The reason I ask is that I have > > > not yet found a use case in the Linux kernel that expects a value > > > dependency to be tracked through a comparison.) > > > > Hmm. I suppose use an explicit cast to non-vdp before or after the > > comparison? > > That should work well assuming that things like "if", "while", and "?:" > conditions are happy to take a vdp. I currently don't see a reason why that should be disallowed. If we have allowed an implicit conversion to non-vdp, I believe that should follow. ?: could be somewhat special, in that the type depends on the 2nd and 3rd operand. Thus, "vdp x = non-vdp ? vdp : vdp;" should be allowed, whereas "vdp x = non-vdp ? 
non-vdp : vdp;" probably should be disallowed if we don't provide for implicit casts from non-vdp to vdp. > This assumes that p->a only returns > vdp if field "a" is declared vdp, otherwise we have vdps running wild > through the program. ;-) That's a good question. For the scheme I had in mind, I'm not concerned about vdps running wild because one needs to assign to explicitly vdp-typed variables (or function arguments, etc.) to let vdp extend to beyond single expressions. Nonetheless, I think it's a good question how -> should behave if the field is not vdp; in particular, should vdp->non_vdp be automatically vdp? One concern might be that we know something about non-vdp -- OTOH, we shouldn't be able to do so because we (assume to) don't know anything about the vdp pointer, so we can't infer something about something it points to. > The other thing that can happen is that a vdp can get handed off to > another synchronization mechanism, for example, to reference counting: > > p = atomic_load_explicit(&gp, memory_order_consume); > if (do_something_with(p->a)) { > /* fast path protected by RCU. */ > return 0; > } > if (atomic_inc_not_zero(&p->refcnt)) { Is the argument to atomic_inc_not_zero vdp or non-vdp? > /* slow path protected by reference counting. */ > return do_something_else_with((struct foo *)p); /* CHANGE */ > } > /* Needed slow path, but raced with deletion. */ > return -EAGAIN; > > I am guessing that the cast ends the vdp. Is that the case? That would end it, yes. The other way this could happen is that the argument of do_something_else_with() would be specified to be non-vdp. ^ permalink raw reply [flat|nested] 285+ messages in thread
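The "cast ends the vdp" behavior agreed on here matches the escape hatch C11 already defines: kill_dependency() in <stdatomic.h> returns its argument's value with the dependency chain deliberately ended. A sketch of the handoff pattern (the struct and values are illustrative):

```c
#include <stdatomic.h>

struct foo { int a; };

static _Atomic(struct foo *) gq;

static void publish_q(struct foo *f)
{
    atomic_store_explicit(&gq, f, memory_order_release);
}

/* Load under consume ordering, then explicitly end the dependency
 * before handing the pointer to code that relies on some other
 * guarantee -- the C11 analogue of casting vdp away. */
static int handoff(void)
{
    struct foo *p = atomic_load_explicit(&gq, memory_order_consume);
    struct foo *q = kill_dependency(p);
    return q->a;
}
```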
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-05 16:26 ` Torvald Riegel @ 2014-03-05 18:01 ` Paul E. McKenney 2014-03-07 17:45 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-03-05 18:01 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote: > On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote: > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > > +o Do not use the results from the boolean "&&" and "||" when > > > > > > + dereferencing. For example, the following (rather improbable) > > > > > > + code is buggy: > > > > > > + > > > > > > + int a[2]; > > > > > > + int index; > > > > > > + int force_zero_index = 1; > > > > > > + > > > > > > + ... > > > > > > + > > > > > > + r1 = rcu_dereference(i1) > > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > > + > > > > > > + The reason this is buggy is that "&&" and "||" are often compiled > > > > > > + using branches. While weak-memory machines such as ARM or PowerPC > > > > > > + do order stores after such branches, they can speculate loads, > > > > > > + which can result in misordering bugs. 
> > > > > > + > > > > > > +o Do not use the results from relational operators ("==", "!=", > > > > > > + ">", ">=", "<", or "<=") when dereferencing. For example, > > > > > > + the following (quite strange) code is buggy: > > > > > > + > > > > > > + int a[2]; > > > > > > + int index; > > > > > > + int flip_index = 0; > > > > > > + > > > > > > + ... > > > > > > + > > > > > > + r1 = rcu_dereference(i1) > > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > > > > + > > > > > > + As before, the reason this is buggy is that relational operators > > > > > > + are often compiled using branches. And as before, although > > > > > > + weak-memory machines such as ARM or PowerPC do order stores > > > > > > + after such branches, but can speculate loads, which can again > > > > > > + result in misordering bugs. > > > > > > > > > > Those two would be allowed by the wording I have recently proposed, > > > > > AFAICS. r1 != flip_index would result in two possible values (unless > > > > > there are further constraints due to the type of r1 and the values that > > > > > flip_index can have). > > > > > > > > And I am OK with the value_dep_preserving type providing more/better > > > > guarantees than we get by default from current compilers. > > > > > > > > One question, though. Suppose that the code did not want a value > > > > dependency to be tracked through a comparison operator. What does > > > > the developer do in that case? (The reason I ask is that I have > > > > not yet found a use case in the Linux kernel that expects a value > > > > dependency to be tracked through a comparison.) > > > > > > Hmm. I suppose use an explicit cast to non-vdp before or after the > > > comparison? > > > > That should work well assuming that things like "if", "while", and "?:" > > conditions are happy to take a vdp. > > I currently don't see a reason why that should be disallowed. If we > have allowed an implicit conversion to non-vdp, I believe that should > follow. 
I am a bit nervous about a silent implicit conversion from vdp to non-vdp in the general case. However, when the result is being used by a conditional, the silent implicit conversion makes a lot of sense. Is that distinction something that the compiler can handle easily? On the other hand, silent implicit conversion from non-vdp to vdp is very useful for common code that can be invoked both by RCU readers and by updaters. > ?: could be somewhat special, in that the type depends on the > 2nd and 3rd operand. Thus, "vdp x = non-vdp ? vdp : vdp;" should be > allowed, whereas "vdp x = non-vdp ? non-vdp : vdp;" probably should be > disallowed if we don't provide for implicit casts from non-vdp to vdp. Actually, from the Linux-kernel code that I am seeing, we want to be able to silently convert from non-vdp to vdp in order to permit common code that is invoked from both RCU readers (vdp) and updaters (often non-vdp). This common code must be compiled conservatively to allow vdp, but should be just fine with non-vdp. Going through the combinations... 0. vdp x = vdp ? vdp : vdp; /* OK, matches. */ 1. vdp x = vdp ? vdp : non-vdp; /* Silent conversion. */ 2. vdp x = vdp ? non-vdp : vdp; /* Silent conversion. */ 3. vdp x = vdp ? non-vdp : non-vdp; /* Silent conversion. */ 4. vdp x = non-vdp ? vdp : vdp; /* OK, matches. */ 5. vdp x = non-vdp ? vdp : non-vdp; /* Silent conversion. */ 6. vdp x = non-vdp ? non-vdp : vdp; /* Silent conversion. */ 7. vdp x = non-vdp ? non-vdp : non-vdp; /* Silent conversion. */ 8. non-vdp x = vdp ? vdp : vdp; /* Warning unless condition. */ 9. non-vdp x = vdp ? vdp : non-vdp; /* Warning unless condition. */ 10. non-vdp x = vdp ? non-vdp : vdp; /* Warning unless condition. */ 11. non-vdp x = vdp ? non-vdp : non-vdp; /* OK, matches. */ 12. non-vdp x = non-vdp ? vdp : vdp; /* Warning unless condition. */ 13. non-vdp x = non-vdp ? vdp : non-vdp; /* Warning unless condition. */ 14. non-vdp x = non-vdp ? non-vdp : vdp; /* Warning unless condition. */ 15. 
non-vdp x = non-vdp ? non-vdp : non-vdp; /* OK, matches. */ 0, 4, 11, and 15 are OK because both legs of the ?: match the variable being assigned to. 1, 2, 3, 5, 6, and 7 are implicit silent conversions from non-vdp to vdp, which is always safe and is useful for common code. 8, 9, 10, 12, 13, and 14 are mismatches: A vdp quantity is being assigned to a non-vdp variable, which could potentially be passed to a vdp-oblivious function. However, 8, 9, 10, 12, 13, and 14 are OK if the result is consumed by a conditional. That said, I would not complain if something like the following kicked out a warning: struct foo value_dep_preserving *p; struct foo *q; p = rcu_dereference(gp); q = f() ? p : p + 1; if (q < THE_LIMIT) do_something(); else do_something_else(p); The warning could be avoided by marking q value_dep_preserving or by eliminating q entirely: struct foo value_dep_preserving *p; p = rcu_dereference(gp); if ((f() ? p : p + 1) < THE_LIMIT) do_something(); else do_something_else(p); Or, for that matter, by using a cast: struct foo value_dep_preserving *p; struct foo *q; p = rcu_dereference(gp); q = (struct foo *)(f() ? p : p + 1); if (q < THE_LIMIT) do_something(); else do_something_else(p); Does that make sense? > > This assumes that p->a only returns > > vdp if field "a" is declared vdp, otherwise we have vdps running wild > > through the program. ;-) > > That's a good question. For the scheme I had in mind, I'm not concerned > about vdps running wild because one needs to assign to explicitly > vdp-typed variables (or function arguments, etc.) to let vdp extend to > beyond single expressions. > > Nonetheless, I think it's a good question how -> should behave if the > field is not vdp; in particular, should vdp->non_vdp be automatically > vdp? 
One concern might be that we know something about non-vdp -- OTOH, > we shouldn't be able to do so because we (assume to) don't know anything > about the vdp pointer, so we can't infer something about something it > points to. In almost all the cases I am seeing in the Linux kernel, p->f wants to be non-vdp. A common case is that "f" is an integer that is used in later computation, but where the ordering is needed only when fetching p->f, not during later use of the resulting integer. So it is looking like p->f should be vdp only if field "f" is declared vdp. > > The other thing that can happen is that a vdp can get handed off to > > another synchronization mechanism, for example, to reference counting: > > > > p = atomic_load_explicit(&gp, memory_order_consume); > > if (do_something_with(p->a)) { > > /* fast path protected by RCU. */ > > return 0; > > } > > if (atomic_inc_not_zero(&p->refcnt)) { > > Is the argument to atomic_inc_not_zero vdp or non-vdp? The argument to atomic_inc_not_zero() is non-vdp, and because it is an atomic operation, it would not make sense to mark it vdp. This results in a bit of a dilemma: I am finding code that wants "&p->f" to be vdp if "p" is vdp, and I am finding other code (like the above) that wants "&p->f" to be non-vdp always. The approaches I can think of at the moment include: 1. If "p" is vdp, make "&p->f" be vdp, but don't complain about subsequent assignments to non-vdp variables. Sounds like quite a mess in the compiler. 2. Propagate value_dep_preserving tags throughout the kernel. Sounds like a good recipe for a Linux-kernel revolt against this proposal. 3. Require explicit casts to avoid warnings: if (atomic_inc_not_zero((struct foo *)&p->refcnt)) { This would not be as bad as #2, but would still require a fair amount of markup. 4. Use something like kill_dependency(). This has strengths and weaknesses similar to #3, but has the advantage of being useful in type-generic macros. 5. 
Either #3 or #4 above, but have a command-line flag that shuts off the warnings. That way, people who want the diagnostics can enable them in their own code, and people who don't can disable them. #5 looks like the way to go to me. So "&p->f" has the same vdp-ness as "p", so that assigning it to a non-vdp variable, passing it via a non-vdp argument, or returning it via a non-vdp return value will cause a warning. However, that warning can be easily shut off on a file-by-file basis. Seem reasonable? > > /* slow path protected by reference counting. */ > > return do_something_else_with((struct foo *)p); /* CHANGE */ > > } > > /* Needed slow path, but raced with deletion. */ > > return -EAGAIN; > > > > I am guessing that the cast ends the vdp. Is that the case? > > That would end it, yes. The other way this could happen is that the > argument of do_something_else_with() would be specified to be non-vdp. Agreed. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
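For reference, the atomic_inc_not_zero() in the refcount handoff above can be approximated with C11 atomics as a compare-exchange loop; this is a simplified stand-in for the kernel primitive, not its actual implementation:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Take a reference only if the count is still nonzero, i.e. only if
 * we did not race with the final reference drop and deletion. */
static bool inc_not_zero(atomic_int *refcnt)
{
    int old = atomic_load_explicit(refcnt, memory_order_relaxed);
    do {
        if (old == 0)
            return false;               /* raced with deletion */
        /* On failure the CAS reloads the current count into old. */
    } while (!atomic_compare_exchange_weak_explicit(
                 refcnt, &old, old + 1,
                 memory_order_acquire, memory_order_relaxed));
    return true;
}
```

The acquire ordering on the successful increment is meant to pair with a release in the final reference drop, so the slow path sees the object's prior state; the kernel's own primitive has its own, separately documented, ordering rules.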
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-05 18:01 ` Paul E. McKenney @ 2014-03-07 17:45 ` Torvald Riegel 2014-03-07 19:02 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-03-07 17:45 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Wed, 2014-03-05 at 10:01 -0800, Paul E. McKenney wrote: > On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote: > > On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote: > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > > > +o Do not use the results from the boolean "&&" and "||" when > > > > > > > + dereferencing. For example, the following (rather improbable) > > > > > > > + code is buggy: > > > > > > > + > > > > > > > + int a[2]; > > > > > > > + int index; > > > > > > > + int force_zero_index = 1; > > > > > > > + > > > > > > > + ... > > > > > > > + > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! */ > > > > > > > + > > > > > > > + The reason this is buggy is that "&&" and "||" are often compiled > > > > > > > + using branches. 
While weak-memory machines such as ARM or PowerPC > > > > > > > + do order stores after such branches, they can speculate loads, > > > > > > > + which can result in misordering bugs. > > > > > > > + > > > > > > > +o Do not use the results from relational operators ("==", "!=", > > > > > > > + ">", ">=", "<", or "<=") when dereferencing. For example, > > > > > > > + the following (quite strange) code is buggy: > > > > > > > + > > > > > > > + int a[2]; > > > > > > > + int index; > > > > > > > + int flip_index = 0; > > > > > > > + > > > > > > > + ... > > > > > > > + > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > + r2 = a[r1 != flip_index]; /* BUGGY!!! */ > > > > > > > + > > > > > > > + As before, the reason this is buggy is that relational operators > > > > > > > + are often compiled using branches. And as before, although > > > > > > > + weak-memory machines such as ARM or PowerPC do order stores > > > > > > > + after such branches, but can speculate loads, which can again > > > > > > > + result in misordering bugs. > > > > > > > > > > > > Those two would be allowed by the wording I have recently proposed, > > > > > > AFAICS. r1 != flip_index would result in two possible values (unless > > > > > > there are further constraints due to the type of r1 and the values that > > > > > > flip_index can have). > > > > > > > > > > And I am OK with the value_dep_preserving type providing more/better > > > > > guarantees than we get by default from current compilers. > > > > > > > > > > One question, though. Suppose that the code did not want a value > > > > > dependency to be tracked through a comparison operator. What does > > > > > the developer do in that case? (The reason I ask is that I have > > > > > not yet found a use case in the Linux kernel that expects a value > > > > > dependency to be tracked through a comparison.) > > > > > > > > Hmm. I suppose use an explicit cast to non-vdp before or after the > > > > comparison? 
> > > > > > That should work well assuming that things like "if", "while", and "?:" > > > conditions are happy to take a vdp. > > > > I currently don't see a reason why that should be disallowed. If we > > have allowed an implicit conversion to non-vdp, I believe that should > > follow. > > I am a bit nervous about a silent implicit conversion from vdp to > non-vdp in the general case. Why are you nervous about it? > However, when the result is being used by > a conditional, the silent implicit conversion makes a lot of sense. > Is that distinction something that the compiler can handle easily? I think so. I'm not a language lawyer, but we have other such conversions in the standard (e.g., int to boolean, between int and float) and I currently don't see a fundamental difference to those. But we'll have to ask the language folks (or SG1 or LEWG) to really verify that. > On the other hand, silent implicit conversion from non-vdp to vdp > is very useful for common code that can be invoked both by RCU > readers and by updaters. I'd be more nervous about that because then there's less obstacles to one programmer expecting a vdp to indicate a dependency vs. another programmer putting non-vdp into vdp. For this case of common code (which I agree is a valid concern), would it be a lot of programmer overhead to add explicit casts from non-vdp to vdp? Would C11 generics help with that, similarly to how C++ template functions would? Nonetheless, in the end this is just trading off convenient use against different ways to catch different but simple errors. > > ?: could be somewhat special, in that the type depends on the > > 2nd and 3rd operand. Thus, "vdp x = non-vdp ? vdp : vdp;" should be > > allowed, whereas "vdp x = non-vdp ? non-vdp : vdp;" probably should be > > disallowed if we don't provide for implicit casts from non-vdp to vdp. 
> > Actually, from the Linux-kernel code that I am seeing, we want to be able > to silently convert from non-vdp to vdp in order to permit common code > that is invoked from both RCU readers (vdp) and updaters (often non-vdp). > This common code must be compiled conservatively to allow vdp, but should > be just fine with non-vdp. > > Going through the combinations... > > 0. vdp x = vdp ? vdp : vdp; /* OK, matches. */ > 1. vdp x = vdp ? vdp : non-vdp; /* Silent conversion. */ > 2. vdp x = vdp ? non-vdp : vdp; /* Silent conversion. */ > 3. vdp x = vdp ? non-vdp : non-vdp; /* Silent conversion. */ > 4. vdp x = non-vdp ? vdp : vdp; /* OK, matches. */ > 5. vdp x = non-vdp ? vdp : non-vdp; /* Silent conversion. */ > 6. vdp x = non-vdp ? non-vdp : vdp; /* Silent conversion. */ > 7. vdp x = non-vdp ? non-vdp : non-vdp; /* Silent conversion. */ > 8. non-vdp x = vdp ? vdp : vdp; /* Warning unless condition. */ > 9. non-vdp x = vdp ? vdp : non-vdp; /* Warning unless condition. */ > 10. non-vdp x = vdp ? non-vdp : vdp; /* Warning unless condition. */ > 11. non-vdp x = vdp ? non-vdp : non-vdp; /* OK, matches. */ > 12. non-vdp x = non-vdp ? vdp : vdp; /* Warning unless condition. */ > 13. non-vdp x = non-vdp ? vdp : non-vdp; /* Warning unless condition. */ > 14. non-vdp x = non-vdp ? non-vdp : vdp; /* Warning unless condition. */ > 15. non-vdp x = non-vdp ? non-vdp : non-vdp; /* OK, matches. */ > > 0, 4, 11, and 15 are OK because both legs of the ?: match the variable > being assigned to. 1, 2, 3, 5, 6, and 7 are implicit silent conversions > from non-vdp to vdp, which is always safe and is useful for common code. Note that some of those can in fact be vdp depending on operands, but don't necessarily carry an actual value dependency. So, from a type system perspective, I would guess that those expressions would be vdp by default (except 7 and 3). 
> 8, 9, 10, 12, 13, and 14 are mismatches: A vdp quantity is being assigned
> to a non-vdp variable, which could potentially be passed to a vdp-oblivious
> function.

I agree that there is a mismatch, but I'm not sure we want to warn on
silent conversion from vdp to non-vdp instead of just doing a silent
conversion.  Otherwise, we'll have to add casts whenever we send a vdp
to something that doesn't want to make use of the value dependency
(e.g., printf, an if statement, ...).  What would be the programmer
overhead for the latter?

> However, 8, 9, 10, 12, 13, and 14 are OK if the result is
> consumed by a conditional.  That said, I would not complain if something
> like the following kicked out a warning:
>
>         struct foo value_dep_preserving *p;
>         struct foo *q;
>
>         p = rcu_dereference(gp);
>         q = f() ? p : p + 1;

You'd like to see the warning here, right?

>         if (q < THE_LIMIT)
>                 do_something();
>         else
>                 do_something_else(p);
>
> The warning could be avoided by marking q value_dep_preserving or by
> eliminating q entirely:
>
>         struct foo value_dep_preserving *p;
>
>         p = rcu_dereference(gp);
>         if ((f() ? p : p + 1) < THE_LIMIT)
>                 do_something();
>         else
>                 do_something_else(p);
>
> Or, for that matter, by using a cast:
>
>         struct foo value_dep_preserving *p;
>         struct foo *q;
>
>         p = rcu_dereference(gp);
>         q = (struct foo *)(f() ? p : p + 1);

So the cast would be like kill_dependency()?

>         if (q < THE_LIMIT)
>                 do_something();
>         else
>                 do_something_else(p);
>
> Does that make sense?

I think I understand which scheme you have in mind.  I just don't have a
strong preference between your approach (AFAIU, roughly, to expect
programmers to kill vdp and to warn on silent kills otherwise) and what
I had in mind (to allow silent transitions to non-vdp but to provide a
helper function/macro that raises an error if the access is not vdp (so
one can "request" to get a vdp for memory accesses where this matters)).

Did I understand your approach correctly?
Right now, I can't confidently say that one would be better than the
other.  I think we need to get feedback for both.

> > > This assumes that p->a only returns
> > > vdp if field "a" is declared vdp, otherwise we have vdps running wild
> > > through the program.  ;-)
> >
> > That's a good question.  For the scheme I had in mind, I'm not concerned
> > about vdps running wild because one needs to assign to explicitly
> > vdp-typed variables (or function arguments, etc.) to let vdp extend to
> > beyond single expressions.
> >
> > Nonetheless, I think it's a good question how -> should behave if the
> > field is not vdp; in particular, should vdp->non_vdp be automatically
> > vdp?  One concern might be that we know something about non-vdp -- OTOH,
> > we shouldn't be able to do so because we (assume to) don't know anything
> > about the vdp pointer, so we can't infer something about something it
> > points to.
>
> In almost all the cases I am seeing in the Linux kernel, p->f wants to
> be non-vdp.  A common case is that "f" is an integer that is used in
> later computation, but where the ordering is needed only when fetching
> p->f, not during later use of the resulting integer.

Modulo perhaps a few minor missed optimizations in this expression, if
we had implicit/silent conversion to non-vdp, this should just work fine
I believe even if we say that vdp->non_vdp is vdp by default.

> So it is looking like p->f should be vdp only if field "f" is declared vdp.

I can see that if the silent conversion to non-vdp is not allowed, then
this approach might lead to fewer casts / kill_dependency().

> > > The other thing that can happen is that a vdp can get handed off to
> > > another synchronization mechanism, for example, to reference counting:
> > >
> > >         p = atomic_load_explicit(&gp, memory_order_consume);
> > >         if (do_something_with(p->a)) {
> > >                 /* fast path protected by RCU. */
> > >                 return 0;
> > >         }
> > >         if (atomic_inc_not_zero(&p->refcnt)) {
> >
> > Is the argument to atomic_inc_not_zero vdp or non-vdp?
>
> The argument to atomic_inc_not_zero() is non-vdp, and because it is an
> atomic operation, it would not make sense to mark it vdp.

Why?  If it would be an atomic mo_relaxed load, for example, then vdp
would possibly make sense, or not?  For atomic RMW ops, I also don't see
why we'd always want non-vdp as operands.

> This results
> in a bit of a dilemma: I am finding code that wants "&p->f" to be vdp
> if "p" is vdp, and I am finding other code (like the above) that wants
> "&p->f" to be non-vdp always.
>
> The approaches I can think of at the moment include:
>
> 1.      If "p" is vdp, make "&p->f" be vdp, but don't complain about
>         subsequent assignments to non-vdp variables.  Sounds like quite
>         a mess in the compiler.

Why?  On the assignment, there will need to be an implicit conversion.
But assigning floats to integers or integer to boolean seems quite
similar, or not?

> 2.      Propagate value_dep_preserving tags throughout the kernel.
>         Sounds like a good recipe for a Linux-kernel revolt against
>         this proposal.
>
> 3.      Require explicit casts to avoid warnings:
>
>                 if (atomic_inc_not_zero((struct foo *)&p->refcnt)) {
>
>         This would not be as bad as #2, but would still require
>         a fair amount of markup.

3a.
We could also allow implicit conversion from non-vdp to vdp (but a
default to go from vdp to vdp, conservatively, so that dependency chains
in expressions aren't broken accidentally).

> 4.      Use something like kill_dependency().  This has strengths
>         and weaknesses similar to #3, but has the advantage of
>         being useful in type-generic macros.
>
> 5.      Either #3 or #4 above, but have a command-line flag that
>         shuts off the warnings.  That way, people who want the
>         diagnostics can enable them in their own code, and people
>         who don't can disable them.
If we have implicit conversions, then having warnings for when (some of)
those happen sounds like a good idea.

> #5 looks like the way to go to me.  So "&p->f" has the same vdp-ness
> as "p", so that assigning it to a non-vdp variable, passing it via a
> non-vdp argument, or returning it via a non-vdp return value will
> cause a warning.  However, that warning can be easily shut off on a
> file-by-file basis.
>
> Seem reasonable?

Yes, except that I think that we need to describe this differently (ie,
more like how the standard handles other implicit conversions).
Whether the warning should be on by default (ie, opt-out) or should be
part of -Wall can be separately discussed, I believe.

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-03-07 17:45 ` Torvald Riegel @ 2014-03-07 19:02 ` Paul E. McKenney 0 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-03-07 19:02 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Fri, Mar 07, 2014 at 06:45:57PM +0100, Torvald Riegel wrote: > xagsmtp5.20140307174618.3777@vmsdvm6.vnet.ibm.com > X-Xagent-Gateway: vmsdvm6.vnet.ibm.com (XAGSMTP5 at VMSDVM6) > > On Wed, 2014-03-05 at 10:01 -0800, Paul E. McKenney wrote: > > On Wed, Mar 05, 2014 at 05:26:36PM +0100, Torvald Riegel wrote: > > > xagsmtp3.20140305162928.8243@uk1vsc.vnet.ibm.com > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP3 at UK1VSC) > > > > > > On Tue, 2014-03-04 at 11:00 -0800, Paul E. McKenney wrote: > > > > On Mon, Mar 03, 2014 at 09:46:19PM +0100, Torvald Riegel wrote: > > > > > xagsmtp2.20140303204700.3556@vmsdvma.vnet.ibm.com > > > > > X-Xagent-Gateway: vmsdvma.vnet.ibm.com (XAGSMTP2 at VMSDVMA) > > > > > > > > > > On Mon, 2014-03-03 at 11:20 -0800, Paul E. McKenney wrote: > > > > > > On Mon, Mar 03, 2014 at 07:55:08PM +0100, Torvald Riegel wrote: > > > > > > > xagsmtp2.20140303190831.9500@uk1vsc.vnet.ibm.com > > > > > > > X-Xagent-Gateway: uk1vsc.vnet.ibm.com (XAGSMTP2 at UK1VSC) > > > > > > > > > > > > > > On Fri, 2014-02-28 at 16:50 -0800, Paul E. McKenney wrote: > > > > > > > > +o Do not use the results from the boolean "&&" and "||" when > > > > > > > > + dereferencing. For example, the following (rather improbable) > > > > > > > > + code is buggy: > > > > > > > > + > > > > > > > > + int a[2]; > > > > > > > > + int index; > > > > > > > > + int force_zero_index = 1; > > > > > > > > + > > > > > > > > + ... > > > > > > > > + > > > > > > > > + r1 = rcu_dereference(i1) > > > > > > > > + r2 = a[r1 && force_zero_index]; /* BUGGY!!! 
*/
> > > > > > > > +
> > > > > > > > +        The reason this is buggy is that "&&" and "||" are often compiled
> > > > > > > > +        using branches.  While weak-memory machines such as ARM or PowerPC
> > > > > > > > +        do order stores after such branches, they can speculate loads,
> > > > > > > > +        which can result in misordering bugs.
> > > > > > > > +
> > > > > > > > +o       Do not use the results from relational operators ("==", "!=",
> > > > > > > > +        ">", ">=", "<", or "<=") when dereferencing.  For example,
> > > > > > > > +        the following (quite strange) code is buggy:
> > > > > > > > +
> > > > > > > > +                int a[2];
> > > > > > > > +                int index;
> > > > > > > > +                int flip_index = 0;
> > > > > > > > +
> > > > > > > > +                ...
> > > > > > > > +
> > > > > > > > +                r1 = rcu_dereference(i1);
> > > > > > > > +                r2 = a[r1 != flip_index];  /* BUGGY!!! */
> > > > > > > > +
> > > > > > > > +        As before, the reason this is buggy is that relational operators
> > > > > > > > +        are often compiled using branches.  And as before, although
> > > > > > > > +        weak-memory machines such as ARM or PowerPC do order stores
> > > > > > > > +        after such branches, they can speculate loads, which can again
> > > > > > > > +        result in misordering bugs.
> > > > > > >
> > > > > > > Those two would be allowed by the wording I have recently proposed,
> > > > > > > AFAICS.  r1 != flip_index would result in two possible values (unless
> > > > > > > there are further constraints due to the type of r1 and the values that
> > > > > > > flip_index can have).
> > > > > >
> > > > > > And I am OK with the value_dep_preserving type providing more/better
> > > > > > guarantees than we get by default from current compilers.
> > > > > >
> > > > > > One question, though.  Suppose that the code did not want a value
> > > > > > dependency to be tracked through a comparison operator.  What does
> > > > > > the developer do in that case?
(The reason I ask is that I have > > > > > > not yet found a use case in the Linux kernel that expects a value > > > > > > dependency to be tracked through a comparison.) > > > > > > > > > > Hmm. I suppose use an explicit cast to non-vdp before or after the > > > > > comparison? > > > > > > > > That should work well assuming that things like "if", "while", and "?:" > > > > conditions are happy to take a vdp. > > > > > > I currently don't see a reason why that should be disallowed. If we > > > have allowed an implicit conversion to non-vdp, I believe that should > > > follow. > > > > I am a bit nervous about a silent implicit conversion from vdp to > > non-vdp in the general case. > > Why are you nervous about it? If someone expects the vdp to propagate into some function that might be compiled with aggressive optimizations that break this expectation, it would be good for that someone to know about it. Ah! I am assuming that the compiler is -not- emitting memory barriers at vdp-to-non-vdp transitions. In that case, warnings are even more important -- without the warnings, it is a real pain chasing these unnecessary memory barriers out of the code. So we are -not- in the business of emitting memory barriers on vdp-to-non-vdp transitions, right? > > However, when the result is being used by > > a conditional, the silent implicit conversion makes a lot of sense. > > Is that distinction something that the compiler can handle easily? > > I think so. I'm not a language lawyer, but we have other such > conversions in the standard (e.g., int to boolean, between int and > float) and I currently don't see a fundamental difference to those. But > we'll have to ask the language folks (or SG1 or LEWG) to really verify > that. Understood! > > On the other hand, silent implicit conversion from non-vdp to vdp > > is very useful for common code that can be invoked both by RCU > > readers and by updaters. 
>
> I'd be more nervous about that because then there are fewer obstacles to
> one programmer expecting a vdp to indicate a dependency vs. another
> programmer putting non-vdp into vdp.

Well, that is the concern either way.  But the usual reason for putting
non-vdp into vdp is because the update-side code holds the lock, so that
nothing can change.  In this case, there is no harm in passing the
non-vdp pointer to a vdp function.

In contrast, on the read side, there is nothing preventing the
underlying data from changing at any time.  So if you have a read-side
vdp, passing it to a non-vdp function could result in an ordering bug.

And given the choice, what I would want would be to be warned of a
vdp-to-non-vdp transition within an RCU read-side critical section, but
not outside of an RCU read-side critical section.  Not sure whether that
is practical from a compiler viewpoint, though.  (Would need to tell the
compiler about rcu_read_unlock(), which is easy from my perspective, but
there are interactions with vdp-marked parameters and return values.)

> For this case of common code (which I agree is a valid concern), would
> it be a lot of programmer overhead to add explicit casts from non-vdp to
> vdp?  Would C11 generics help with that, similarly to how C++ template
> functions would?

For common code, it depends on other decisions.  For example, in the
case where "p" being vdp implies that "p->f" is vdp even when "f" is
declared non-vdp, there would be an intolerable number of casts.  I am
not sure to what extent generics or templates would help.  The use of
gcc type-generic macros would be much easier with some primitive that
could strip vdp from a given type.

> Nonetheless, in the end this is just trading off convenient use against
> different ways to catch different but simple errors.

Yep.  Perhaps best just to make two separate command-line flags, one to
enable vdp-to-non-vdp warnings and the other to enable non-vdp-to-vdp
warnings.
That would allow each project to adapt the compiler to their coding
standards and expectations.

> > > ?: could be somewhat special, in that the type depends on the
> > > 2nd and 3rd operand.  Thus, "vdp x = non-vdp ? vdp : vdp;" should be
> > > allowed, whereas "vdp x = non-vdp ? non-vdp : vdp;" probably should be
> > > disallowed if we don't provide for implicit casts from non-vdp to vdp.
> >
> > Actually, from the Linux-kernel code that I am seeing, we want to be able
> > to silently convert from non-vdp to vdp in order to permit common code
> > that is invoked from both RCU readers (vdp) and updaters (often non-vdp).
> > This common code must be compiled conservatively to allow vdp, but should
> > be just fine with non-vdp.
> >
> > Going through the combinations...
> >
> > 0.  vdp x = vdp ? vdp : vdp;                  /* OK, matches. */
> > 1.  vdp x = vdp ? vdp : non-vdp;              /* Silent conversion. */
> > 2.  vdp x = vdp ? non-vdp : vdp;              /* Silent conversion. */
> > 3.  vdp x = vdp ? non-vdp : non-vdp;          /* Silent conversion. */
> > 4.  vdp x = non-vdp ? vdp : vdp;              /* OK, matches. */
> > 5.  vdp x = non-vdp ? vdp : non-vdp;          /* Silent conversion. */
> > 6.  vdp x = non-vdp ? non-vdp : vdp;          /* Silent conversion. */
> > 7.  vdp x = non-vdp ? non-vdp : non-vdp;      /* Silent conversion. */
> > 8.  non-vdp x = vdp ? vdp : vdp;              /* Warning unless condition. */
> > 9.  non-vdp x = vdp ? vdp : non-vdp;          /* Warning unless condition. */
> > 10. non-vdp x = vdp ? non-vdp : vdp;          /* Warning unless condition. */
> > 11. non-vdp x = vdp ? non-vdp : non-vdp;      /* OK, matches. */
> > 12. non-vdp x = non-vdp ? vdp : vdp;          /* Warning unless condition. */
> > 13. non-vdp x = non-vdp ? vdp : non-vdp;      /* Warning unless condition. */
> > 14. non-vdp x = non-vdp ? non-vdp : vdp;      /* Warning unless condition. */
> > 15. non-vdp x = non-vdp ? non-vdp : non-vdp;  /* OK, matches. */
> >
> > 0, 4, 11, and 15 are OK because both legs of the ?: match the variable
> > being assigned to.
1, 2, 3, 5, 6, and 7 are implicit silent conversions
> > from non-vdp to vdp, which is always safe and is useful for common code.
>
> Note that some of those can in fact be vdp depending on operands, but
> don't necessarily carry an actual value dependency.  So, from a type
> system perspective, I would guess that those expressions would be vdp by
> default (except 7 and 3).
>
> > 8, 9, 10, 12, 13, and 14 are mismatches: A vdp quantity is being assigned
> > to a non-vdp variable, which could potentially be passed to a vdp-oblivious
> > function.
>
> I agree that there is a mismatch, but I'm not sure we want to warn on
> silent conversion from vdp to non-vdp instead of just doing a silent
> conversion.  Otherwise, we'll have to add casts whenever we send a vdp
> to something that doesn't want to make use of the value dependency
> (e.g., printf, an if statement, ...).  What would be the programmer
> overhead for the latter?

You have convinced me that different projects will want to have
different types of warnings, so that there should be a pair of compiler
command-line flags, one to enable vdp-to-non-vdp warnings and another to
enable non-vdp-to-vdp warnings.

> > However, 8, 9, 10, 12, 13, and 14 are OK if the result is
> > consumed by a conditional.  That said, I would not complain if something
> > like the following kicked out a warning:
> >
> >         struct foo value_dep_preserving *p;
> >         struct foo *q;
> >
> >         p = rcu_dereference(gp);
> >         q = f() ? p : p + 1;
>
> You'd like to see the warning here, right?

Yes, if enabled and inside an RCU read-side critical section.

> >         if (q < THE_LIMIT)
> >                 do_something();
> >         else
> >                 do_something_else(p);
> >
> > The warning could be avoided by marking q value_dep_preserving or by
> > eliminating q entirely:
> >
> >         struct foo value_dep_preserving *p;
> >
> >         p = rcu_dereference(gp);
> >         if ((f() ? p : p + 1) < THE_LIMIT)
> >                 do_something();
> >         else
> >                 do_something_else(p);
> >
> > Or, for that matter, by using a cast:
> >
> >         struct foo value_dep_preserving *p;
> >         struct foo *q;
> >
> >         p = rcu_dereference(gp);
> >         q = (struct foo *)(f() ? p : p + 1);
>
> So the cast would be like kill_dependency()?

In the sense that it tells the compiler to stop worrying about the
value dependency.

> >         if (q < THE_LIMIT)
> >                 do_something();
> >         else
> >                 do_something_else(p);
> >
> > Does that make sense?
>
> I think I understand which scheme you have in mind.  I just don't have a
> strong preference between your approach (AFAIU, roughly, to expect
> programmers to kill vdp and to warn on silent kills otherwise) and what
> I had in mind (to allow silent transitions to non-vdp but to provide a
> helper function/macro that raises an error if the access is not vdp (so
> one can "request" to get a vdp for memory accesses where this matters)).

Ah, I see where you were coming from for your non-vdp-to-vdp warning, as
that would catch the case where your function/macro might get the wrong
answer.  Hmmm...

> Did I understand your approach correctly?

I believe you did.

> Right now, I can't confidently say that one would be better than the
> other.  I think we need to get feedback for both.

Makes sense.

> > > > This assumes that p->a only returns
> > > > vdp if field "a" is declared vdp, otherwise we have vdps running wild
> > > > through the program.  ;-)
> > >
> > > That's a good question.  For the scheme I had in mind, I'm not concerned
> > > about vdps running wild because one needs to assign to explicitly
> > > vdp-typed variables (or function arguments, etc.) to let vdp extend to
> > > beyond single expressions.
> > >
> > > Nonetheless, I think it's a good question how -> should behave if the
> > > field is not vdp; in particular, should vdp->non_vdp be automatically
> > > vdp?
One concern might be that we know something about non-vdp -- OTOH,
> > > we shouldn't be able to do so because we (assume to) don't know anything
> > > about the vdp pointer, so we can't infer something about something it
> > > points to.
> >
> > In almost all the cases I am seeing in the Linux kernel, p->f wants to
> > be non-vdp.  A common case is that "f" is an integer that is used in
> > later computation, but where the ordering is needed only when fetching
> > p->f, not during later use of the resulting integer.
>
> Modulo perhaps a few minor missed optimizations in this expression, if
> we had implicit/silent conversion to non-vdp, this should just work fine
> I believe even if we say that vdp->non_vdp is vdp by default.
>
> > So it is looking like p->f should be vdp only if field "f" is declared vdp.
>
> I can see that if the silent conversion to non-vdp is not allowed, then
> this approach might lead to fewer casts / kill_dependency().

Agreed in both cases.

> > > > The other thing that can happen is that a vdp can get handed off to
> > > > another synchronization mechanism, for example, to reference counting:
> > > >
> > > >         p = atomic_load_explicit(&gp, memory_order_consume);
> > > >         if (do_something_with(p->a)) {
> > > >                 /* fast path protected by RCU. */
> > > >                 return 0;
> > > >         }
> > > >         if (atomic_inc_not_zero(&p->refcnt)) {
> > >
> > > Is the argument to atomic_inc_not_zero vdp or non-vdp?
> >
> > The argument to atomic_inc_not_zero() is non-vdp, and because it is an
> > atomic operation, it would not make sense to mark it vdp.
>
> Why?  If it would be an atomic mo_relaxed load, for example, then vdp
> would possibly make sense, or not?  For atomic RMW ops, I also don't see
> why we'd always want non-vdp as operands.

I should have said that it is a value-returning read-modify-write atomic
operation, which translates to something stronger than
memory_order_seq_cst in C11.
The reason that it is stronger is that it provides more ordering
guarantees to surrounding relaxed (ACCESS_ONCE()) operations.  For
example, given x and y both initially zero:

T1:     ACCESS_ONCE(x) = 1;
        if (atomic_inc_not_zero(&p->refcnt))
                ACCESS_ONCE(y) = 1;

T2:     if (ACCESS_ONCE(y)) {
                if (atomic_inc_not_zero(q->other_refcnt))
                        BUG_ON(!ACCESS_ONCE(x));
        }

In the Linux kernel, the BUG_ON() cannot trigger.  In C11, it could.

So, to answer your question, if atomic_inc_not_zero() mapped to a
memory_order_relaxed operation, then yes, you might need vdp.  But given
that it maps to stronger-than-memory_order_seq_cst, you do not.

> > This results
> > in a bit of a dilemma: I am finding code that wants "&p->f" to be vdp
> > if "p" is vdp, and I am finding other code (like the above) that wants
> > "&p->f" to be non-vdp always.
> >
> > The approaches I can think of at the moment include:
> >
> > 1.      If "p" is vdp, make "&p->f" be vdp, but don't complain about
> >         subsequent assignments to non-vdp variables.  Sounds like quite
> >         a mess in the compiler.
>
> Why?  On the assignment, there will need to be an implicit conversion.
> But assigning floats to integers or integer to boolean seems quite
> similar, or not?

In your scheme, you would never complain about vdp-to-non-vdp
assignments, so no problem.  If it is also not a problem in my approach,
so much the better!  ;-)

> > 2.      Propagate value_dep_preserving tags throughout the kernel.
> >         Sounds like a good recipe for a Linux-kernel revolt against
> >         this proposal.
> >
> > 3.      Require explicit casts to avoid warnings:
> >
> >                 if (atomic_inc_not_zero((struct foo *)&p->refcnt)) {
> >
> >         This would not be as bad as #2, but would still require
> >         a fair amount of markup.
>
> 3a.
> We could also allow implicit conversion from non-vdp to vdp (but a
> default to go from vdp to vdp, conservatively, so that dependency chains
> in expressions aren't broken accidentally).

Right, this is your proposal.

> > 4.      Use something like kill_dependency().
This has strengths
> > and weaknesses similar to #3, but has the advantage of
> > being useful in type-generic macros.
> >
> > 5.      Either #3 or #4 above, but have a command-line flag that
> >         shuts off the warnings.  That way, people who want the
> >         diagnostics can enable them in their own code, and people
> >         who don't can disable them.
>
> If we have implicit conversions, then having warnings for when (some of)
> those happen sounds like a good idea.

Very good.

> > #5 looks like the way to go to me.  So "&p->f" has the same vdp-ness
> > as "p", so that assigning it to a non-vdp variable, passing it via a
> > non-vdp argument, or returning it via a non-vdp return value will
> > cause a warning.  However, that warning can be easily shut off on a
> > file-by-file basis.
> >
> > Seem reasonable?
>
> Yes, except that I think that we need to describe this differently (ie,
> more like how the standard handles other implicit conversions).
> Whether the warning should be on by default (ie, opt-out) or should be
> part of -Wall can be separately discussed, I believe.

Right, the standardese would be quite different.

                                                        Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-27 19:47 ` Linus Torvalds 2014-02-27 20:53 ` Paul E. McKenney @ 2014-03-03 18:59 ` Torvald Riegel 1 sibling, 0 replies; 285+ messages in thread From: Torvald Riegel @ 2014-03-03 18:59 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, 2014-02-27 at 11:47 -0800, Linus Torvalds wrote: > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > > > 3. The comparison was against another RCU-protected pointer, > > where that other pointer was properly fetched using one > > of the RCU primitives. Here it doesn't matter which pointer > > you use. At least as long as the rcu_assign_pointer() for > > that other pointer happened after the last update to the > > pointed-to structure. > > > > I am a bit nervous about #3. Any thoughts on it? > > I think that it might be worth pointing out as an example, and saying > that code like > > p = atomic_read(consume); > X; > q = atomic_read(consume); > Y; > if (p == q) > data = p->val; > > then the access of "p->val" is constrained to be data-dependent on > *either* p or q, but you can't really tell which, since the compiler > can decide that the values are interchangeable. The wording I proposed would make the p dereference have a value dependency unless X and Y would somehow restrict p and q. The reasoning is that if the atomic loads return potentially more than one value, then even if we find out that two such loads did return the same value, we still don't know what the exact value was. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-27 17:01 ` Linus Torvalds 2014-02-27 19:06 ` Paul E. McKenney @ 2014-03-03 15:36 ` Torvald Riegel 1 sibling, 0 replies; 285+ messages in thread From: Torvald Riegel @ 2014-03-03 15:36 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc, Peter Sewell On Thu, 2014-02-27 at 09:01 -0800, Linus Torvalds wrote: > On Thu, Feb 27, 2014 at 7:37 AM, Torvald Riegel <triegel@redhat.com> wrote: > > Regarding the latter, we make a fresh start at each mo_consume load (ie, > > we assume we know nothing -- L could have returned any possible value); > > I believe this is easier to reason about than other scopes like function > > granularities (what happens on inlining?), or translation units. It > > should also be simple to implement for compilers, and would hopefully > > not constrain optimization too much. > > > > [...] > > > > Paul's litmus test would work, because we guarantee to the programmer > > that it can assume that the mo_consume load would return any value > > allowed by the type; effectively, this forbids the compiler analysis > > Paul thought about: > > So realistically, since with the new wording we can ignore the silly > cases (ie "p-p") and we can ignore the trivial-to-optimize compiler > cases ("if (p == &variable) .. use p"), and you would forbid the > "global value range optimization case" that Paul bright up, what > remains would seem to be just really subtle compiler transformations > of data dependencies to control dependencies. > > And the only such thing I can think of is basically compiler-initiated > value-prediction, presumably directed by PGO (since now if the value > prediction is in the source code, it's considered to break the value > chain). The other example that comes to mind would be feedback-directed JIT compilation. 
I don't think that's widely used today, and it might never be for the
kernel -- but *in the standard*, we at least have to consider what the
future might bring.

> The good thing is that afaik, value-prediction is largely not used in
> real life, afaik. There are lots of papers on it, but I don't think
> anybody actually does it (although I can easily see some
> specint-specific optimization pattern that is built up around it).
>
> And even value prediction is actually fine, as long as the compiler
> can see the memory *source* of the value prediction (and it isn't a
> mo_consume). So it really ends up limiting your value prediction in
> very simple ways: you cannot do it to function arguments if they are
> registers. But you can still do value prediction on values you loaded
> from memory, if you can actually *see* that memory op.

I think one would need to show that the source is *not even indirectly*
a mo_consume load.  With the wording I proposed, value dependencies
don't break when storing to / loading from memory locations.  Thus, if a
compiler ends up at a memory load after walking SSA, it needs to prove
that the load cannot read a value that (1) was produced by a store
sequenced-before the load and (2) might carry a value dependency (e.g.,
by being a mo_consume load) that the value prediction in question would
break.  This, in general, requires alias analysis.

Deciding whether a prediction would break a value dependency has to
consider what later stages in a compiler would be doing, including LTO
or further rounds of inlining/optimizations.

OTOH, if the compiler can treat an mo_consume load as returning all
possible values (eg, by ignoring all knowledge about it), then it can
certainly do so with other memory loads too.  So, I think that the
constraints due to value dependencies can matter in practice.
However, the impact on optimizations of non-mo_consume-related code is
hard to estimate -- I don't see a huge amount of impact right now, but I
also wouldn't want to predict that this can't change in the future.

> Of course, on more strongly ordered CPU's, even that "register
> argument" limitation goes away.
>
> So I agree that there is basically no real optimization constraint.
> Value-prediction is of dubious value to begin with, and the actual
> constraint on its use if some compiler writer really wants to is not
> onerous.
>
> > What I have in mind is roughly the following (totally made-up syntax --
> > suggestions for how to do this properly are very welcome):
> > * Have a type modifier (eg, like restrict), that specifies that
> > operations on data of this type are preserving value dependencies:
>
> So I'm not violently opposed, but I think the upsides are not great.
> Note that my earlier suggestion to use "restrict" wasn't because I
> believed the annotation itself would be visible, but basically just as
> a legalistic promise to the compiler that *if* it found an alias, then
> it didn't need to worry about ordering.  So to me, that type modifier
> was about conceptual guarantees, not about actual value chains.
>
> Anyway, the reason I don't believe any type modifier (and
> "[[carries_dependency]]" is basically just that) is worth it is simply
> that it adds a real burden on the programmer, without actually giving
> the programmer any real upside:
>
> Within a single function, the compiler already sees that mo_consume
> source, and so doing a type-based restriction doesn't really help. The
> information is already there, without any burden on the programmer.

I think it's not just a question of whether we're talking about a single
function or across functions, but to which extent other code can detect
whether it might have to consider value dependencies.  The store/load
case above is an example that complicates the detection for a compiler.
In cases in which the mo_consume load is used directly, we don't need
to use any annotations on the type:

	int val = atomic_load_explicit(ptr, mo_consume)->value;

However, if we need to use the load's result more than once (which I
think will happen often), then we do need the type annotation:

	s value_dep_preserving *ptr = atomic_load_explicit(ptr, mo_consume);
	if (ptr != 0)
		int val = ptr->value;

If we want to avoid the annotation in this case, and still want to
avoid the store/load vs. alias analysis problem mentioned above, we'd
need to require that ptr isn't a variable that's visible to other code
not related to this mo_consume load. But I believe that such a
requirement would be awkward, and also hard to specify.

I hope that Paul's look at rcu_dereference() usage could provide some
indication of how much annotation overhead there actually would be for
a programmer.

> And across functions, the compiler has already - by definition -
> mostly lost sight of all the things it could use to reduce the value
> space.

I don't think that I agree here. Assume we have two separate functions
bar and foo, and one temporary variable t of a type int012 that holds
values 0,1,2 (excuse the somewhat artificial example):

int012 t;
int arr[20];

int bar(int a)
{
	bar_stuff(a);	// compiler knows this is noop with arguments 0 or 1
			// and this will *never* touch t nor arr
	return a;
}

int foo(int a)
{
	foo_stuff(a);	// compiler knows this is noop with arguments 1 or 2
			// and this will *never* touch t nor arr
	return a;
}

void main()
{
	t = atomic_load_explicit(&source, mo_consume);
	x = arr[bar(foo(t))];	// value-dependent?
}

If a compiler looks at foo() and bar() separately, I think it might
want to optimize bar() to the following:

int bar_opt(int a)
{
	if (a != 2) return a;
	bar_stuff(a);
	return a;
}

int foo_opt(int a)
{
	if (a != 0) return a;
	foo_stuff(a);
	return a;
}

I think that those could be valuable optimizations for general-purpose
code.
What happens if the compiler does LTO afterwards and combines the foo
and bar calls?:

int bar_opt_foo_opt(int a)
{
	if (a == 1) return a;
	if (a == 0) foo_stuff(a);
	else bar_stuff(a);
	return a;
}

This still looks like a good thing to do for general-purpose code, and
it doesn't do any value prediction.

If we inline this into main, it becomes kind of difficult for the
compiler because it cannot just weave in bar_opt_foo_opt, or it might
get:

	t = atomic_load_explicit(&source, mo_consume);
	if (t == 1) goto next;
	if (t == 0) foo_stuff(t);
	else bar_stuff(t);
next:
	x = arr[t];	// value-dependent?

Would this be still value-dependent for the hardware, or would the
branch prediction interfere?

Even if this would still be okay from the hardware POV, other compiler
transformations now need to pay attention to where the value comes
from. In particular, we can't specialize this into the following (which
doesn't predict any values):

	t = atomic_load_explicit(&source, mo_consume);
	if (t == 1)
		x = arr[1];
	else {
		if (t == 0) foo_stuff(t);
		else bar_stuff(t);
		x = arr[t];
	}

We could argue that this wouldn't be allowed because t is coming from
an mo_consume load, but then we also need to say that this can in fact
affect compiler transformations other than just value prediction.

At least in this example, introducing the value_dep_preserving type
modifier would have made this easier because it would have allowed the
compiler to avoid the initial value prediction. However, I think that
we might get a few interesting issues even with value_dep_preserving,
so this needs further investigation. Nonetheless, my gut feeling is
that having value_dep_preserving makes this all a lot easier because if
unsure, the compiler can just use a bigger hammer for
value_dep_preserving without having to worry about preventing
optimizations on unrelated code.
> Even Paul's example doesn't really work if the use of the
> "mo_consume" value has been passed to another function, because inside
> a separate function, the compiler couldn't see that the value it uses
> comes from only two possible values.
>
> And as mentioned, even *if* the compiler wants to do value prediction
> that turns a data dependency into a control dependency, the limitation
> to say "no, you can't do it unless you saw where the value got loaded"
> really isn't that onerous.
>
> I bet that if you ask actual production compiler people (as opposed to
> perhaps academia), none of them actually really believe in value
> prediction to begin with.

What about keeping whether we really need value_dep_preserving as an
open question for now, and trying to get more feedback from compiler
implementers on what the consequences of not having it would be? This
should help us assess the current implementation perspective. We would
still need to extrapolate what future compilers might want to do,
though; thus, deciding not to use it will remain somewhat risky for
future optimizations.

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-27 15:37 ` Torvald Riegel 2014-02-27 17:01 ` Linus Torvalds @ 2014-02-27 17:50 ` Paul E. McKenney 2014-02-27 19:22 ` Paul E. McKenney ` (2 more replies) 1 sibling, 3 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-27 17:50 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote: > xagsmtp2.20140227154925.3851@vmsdvm9.vnet.ibm.com > > On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote: > > On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney > > <paulmck@linux.vnet.ibm.com> wrote: > > > > > > Good points. How about the following replacements? > > > > > > 3. Adding or subtracting an integer to/from a chained pointer > > > results in another chained pointer in that same pointer chain. > > > The results of addition and subtraction operations that cancel > > > the chained pointer's value (for example, "p-(long)p" where "p" > > > is a pointer to char) are implementation defined. > > > > > > 4. Bitwise operators ("&", "|", "^", and I suppose also "~") > > > applied to a chained pointer and an integer for the purposes > > > of alignment and pointer translation results in another > > > chained pointer in that same pointer chain. Other uses > > > of bitwise operators on chained pointers (for example, > > > "p|~0") are implementation defined. > > > > Quite frankly, I think all of this language that is about the actual > > operations is irrelevant and wrong. > > > > It's not going to help compiler writers, and it sure isn't going to > > help users that read this. > > > > Why not just talk about "value chains" and that any operations that > > restrict the value range severely end up breaking the chain. There is > > no point in listing the operations individually, because every single > > operation *can* restrict things. 
Listing individual operations and
> > depdendencies is just fundamentally wrong.
>
> [...]
>
> > The *only* thing that matters for all of them is whether they are
> > "value-preserving", or whether they drop so much information that the
> > compiler might decide to use a control dependency instead. That's true
> > for every single one of them.
> >
> > Similarly, actual true control dependencies that limit the problem
> > space sufficiently that the actual pointer value no longer has
> > significant information in it (see the above example) are also things
> > that remove information to the point that only a control dependency
> > remains. Even when the value itself is not modified in any way at all.
>
> I agree that just considering syntactic properties of the program seems
> to be insufficient. Making it instead depend on whether there is a
> "semantic" dependency due to a value being "necessary" to compute a
> result seems better. However, whether a value is "necessary" might not
> be obvious, and I understand Paul's argument that he does not want to
> have to reason about all potential compiler optimizations. Thus, I
> believe we need to specify when a value is "necessary".
>
> I have a suggestion for a somewhat different formulation of the feature
> that you seem to have in mind, which I'll discuss below. Excuse the
> verbosity of the following, but I'd rather like to avoid
> misunderstandings than save a few words.

Thank you very much for putting this forward! I must confess that I was
stuck, and my earlier attempt now enshrined in the C11 and C++11
standards is quite clearly way bogus.

One possible saving grace: From discussions at the standards committee
meeting a few weeks ago, there is some chance that the committee will
be willing to do a rip-and-replace on the current memory_order_consume
wording, without provisions for backwards compatibility with the
current bogosity.
> What we'd like to capture is that a value originating from a mo_consume
> load is "necessary" for a computation (e.g., it "cannot" be replaced
> with value predictions and/or control dependencies); if that's the case
> in the program, we can reasonably assume that a compiler implementation
> will transform this into a data dependency, which will then lead to
> ordering guarantees by the HW.
>
> However, we need to specify when a value is "necessary". We could say
> that this is implementation-defined, and use a set of litmus tests
> (e.g., like those discussed in the thread) to roughly carve out what a
> programmer could expect. This may even be practical for a project like
> the Linux kernel that follows strict project-internal rules and pays a
> lot of attention to what the particular implementations of compilers
> expected to compile the kernel are doing. However, I think this
> approach would be too vague for the standard and for many other
> programs/projects.

I agree that a number of other projects would have more need for this
than might the kernel. Please understand that this is in no way
denigrating the intelligence of other projects' members. It is just
that many of them have only recently started seriously thinking about
concurrency. In contrast, the Linux kernel community has been doing
concurrency since the mid-1990s. Projects with less experience with
concurrency will probably need more help, from the compiler and from
elsewhere as well.

Your proposal looks quite promising at first glance. But rather than
try and comment on it immediately, I am going to take a number of uses
of RCU from the Linux kernel and apply your proposal to them, then
respond with the results. Fair enough?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-27 17:50 ` Paul E. McKenney @ 2014-02-27 19:22 ` Paul E. McKenney 2014-02-28 1:02 ` Paul E. McKenney 2014-03-03 19:01 ` Torvald Riegel 2 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-27 19:22 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, Feb 27, 2014 at 09:50:21AM -0800, Paul E. McKenney wrote: > On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote: > > xagsmtp2.20140227154925.3851@vmsdvm9.vnet.ibm.com > > > > On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote: > > > On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney > > > <paulmck@linux.vnet.ibm.com> wrote: > > > > > > > > Good points. How about the following replacements? > > > > > > > > 3. Adding or subtracting an integer to/from a chained pointer > > > > results in another chained pointer in that same pointer chain. > > > > The results of addition and subtraction operations that cancel > > > > the chained pointer's value (for example, "p-(long)p" where "p" > > > > is a pointer to char) are implementation defined. > > > > > > > > 4. Bitwise operators ("&", "|", "^", and I suppose also "~") > > > > applied to a chained pointer and an integer for the purposes > > > > of alignment and pointer translation results in another > > > > chained pointer in that same pointer chain. Other uses > > > > of bitwise operators on chained pointers (for example, > > > > "p|~0") are implementation defined. > > > > > > Quite frankly, I think all of this language that is about the actual > > > operations is irrelevant and wrong. > > > > > > It's not going to help compiler writers, and it sure isn't going to > > > help users that read this. > > > > > > Why not just talk about "value chains" and that any operations that > > > restrict the value range severely end up breaking the chain. 
There is > > > no point in listing the operations individually, because every single > > > operation *can* restrict things. Listing individual operations and > > > depdendencies is just fundamentally wrong. > > > > [...] > > > > > The *only* thing that matters for all of them is whether they are > > > "value-preserving", or whether they drop so much information that the > > > compiler might decide to use a control dependency instead. That's true > > > for every single one of them. > > > > > > Similarly, actual true control dependencies that limit the problem > > > space sufficiently that the actual pointer value no longer has > > > significant information in it (see the above example) are also things > > > that remove information to the point that only a control dependency > > > remains. Even when the value itself is not modified in any way at all. > > > > I agree that just considering syntactic properties of the program seems > > to be insufficient. Making it instead depend on whether there is a > > "semantic" dependency due to a value being "necessary" to compute a > > result seems better. However, whether a value is "necessary" might not > > be obvious, and I understand Paul's argument that he does not want to > > have to reason about all potential compiler optimizations. Thus, I > > believe we need to specify when a value is "necessary". > > > > I have a suggestion for a somewhat different formulation of the feature > > that you seem to have in mind, which I'll discuss below. Excuse the > > verbosity of the following, but I'd rather like to avoid > > misunderstandings than save a few words. > > Thank you very much for putting this forward! I must confess that I was > stuck, and my earlier attempt now enshrined in the C11 and C++11 standards > is quite clearly way bogus. 
> > One possible saving grace: From discussions at the standards committee > meeting a few weeks ago, there is a some chance that the committee will > be willing to do a rip-and-replace on the current memory_order_consume > wording, without provisions for backwards compatibility with the current > bogosity. > > > What we'd like to capture is that a value originating from a mo_consume > > load is "necessary" for a computation (e.g., it "cannot" be replaced > > with value predictions and/or control dependencies); if that's the case > > in the program, we can reasonably assume that a compiler implementation > > will transform this into a data dependency, which will then lead to > > ordering guarantees by the HW. > > > > However, we need to specify when a value is "necessary". We could say > > that this is implementation-defined, and use a set of litmus tests > > (e.g., like those discussed in the thread) to roughly carve out what a > > programmer could expect. This may even be practical for a project like > > the Linux kernel that follows strict project-internal rules and pays a > > lot of attention to what the particular implementations of compilers > > expected to compile the kernel are doing. However, I think this > > approach would be too vague for the standard and for many other > > programs/projects. > > I agree that a number of other projects would have more need for this than > might the kernel. Please understand that this is in no way denigrating > the intelligence of other projects' members. It is just that many of > them have only recently started seriously thinking about concurrency. > In contrast, the Linux kernel community has been doing concurrency since > the mid-1990s. Projects with less experience with concurrency will > probably need more help, from the compiler and from elsewhere as well. I should hasten to add that it is not just concurrency. 
After all, part of the reason I got into trouble with memory_order_consume is that my mid-to-late 70s experience with compilers is not so useful in 2014. ;-) Thanx, Paul > Your proposal looks quite promising at first glance. But rather than > try and comment on it immediately, I am going to take a number of uses of > RCU from the Linux kernel and apply your proposal to them, then respond > with the results > > Fair enough? > > Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-27 17:50 ` Paul E. McKenney 2014-02-27 19:22 ` Paul E. McKenney @ 2014-02-28 1:02 ` Paul E. McKenney 2014-03-03 19:01 ` Torvald Riegel 2 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-28 1:02 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, Feb 27, 2014 at 09:50:21AM -0800, Paul E. McKenney wrote: > On Thu, Feb 27, 2014 at 04:37:33PM +0100, Torvald Riegel wrote: > > xagsmtp2.20140227154925.3851@vmsdvm9.vnet.ibm.com > > > > On Mon, 2014-02-24 at 11:54 -0800, Linus Torvalds wrote: > > > On Mon, Feb 24, 2014 at 10:53 AM, Paul E. McKenney > > > <paulmck@linux.vnet.ibm.com> wrote: > > > > > > > > Good points. How about the following replacements? > > > > > > > > 3. Adding or subtracting an integer to/from a chained pointer > > > > results in another chained pointer in that same pointer chain. > > > > The results of addition and subtraction operations that cancel > > > > the chained pointer's value (for example, "p-(long)p" where "p" > > > > is a pointer to char) are implementation defined. > > > > > > > > 4. Bitwise operators ("&", "|", "^", and I suppose also "~") > > > > applied to a chained pointer and an integer for the purposes > > > > of alignment and pointer translation results in another > > > > chained pointer in that same pointer chain. Other uses > > > > of bitwise operators on chained pointers (for example, > > > > "p|~0") are implementation defined. > > > > > > Quite frankly, I think all of this language that is about the actual > > > operations is irrelevant and wrong. > > > > > > It's not going to help compiler writers, and it sure isn't going to > > > help users that read this. > > > > > > Why not just talk about "value chains" and that any operations that > > > restrict the value range severely end up breaking the chain. 
There is > > > no point in listing the operations individually, because every single > > > operation *can* restrict things. Listing individual operations and > > > depdendencies is just fundamentally wrong. > > > > [...] > > > > > The *only* thing that matters for all of them is whether they are > > > "value-preserving", or whether they drop so much information that the > > > compiler might decide to use a control dependency instead. That's true > > > for every single one of them. > > > > > > Similarly, actual true control dependencies that limit the problem > > > space sufficiently that the actual pointer value no longer has > > > significant information in it (see the above example) are also things > > > that remove information to the point that only a control dependency > > > remains. Even when the value itself is not modified in any way at all. > > > > I agree that just considering syntactic properties of the program seems > > to be insufficient. Making it instead depend on whether there is a > > "semantic" dependency due to a value being "necessary" to compute a > > result seems better. However, whether a value is "necessary" might not > > be obvious, and I understand Paul's argument that he does not want to > > have to reason about all potential compiler optimizations. Thus, I > > believe we need to specify when a value is "necessary". > > > > I have a suggestion for a somewhat different formulation of the feature > > that you seem to have in mind, which I'll discuss below. Excuse the > > verbosity of the following, but I'd rather like to avoid > > misunderstandings than save a few words. > > Thank you very much for putting this forward! I must confess that I was > stuck, and my earlier attempt now enshrined in the C11 and C++11 standards > is quite clearly way bogus. 
> > One possible saving grace: From discussions at the standards committee > meeting a few weeks ago, there is a some chance that the committee will > be willing to do a rip-and-replace on the current memory_order_consume > wording, without provisions for backwards compatibility with the current > bogosity. > > > What we'd like to capture is that a value originating from a mo_consume > > load is "necessary" for a computation (e.g., it "cannot" be replaced > > with value predictions and/or control dependencies); if that's the case > > in the program, we can reasonably assume that a compiler implementation > > will transform this into a data dependency, which will then lead to > > ordering guarantees by the HW. > > > > However, we need to specify when a value is "necessary". We could say > > that this is implementation-defined, and use a set of litmus tests > > (e.g., like those discussed in the thread) to roughly carve out what a > > programmer could expect. This may even be practical for a project like > > the Linux kernel that follows strict project-internal rules and pays a > > lot of attention to what the particular implementations of compilers > > expected to compile the kernel are doing. However, I think this > > approach would be too vague for the standard and for many other > > programs/projects. > > I agree that a number of other projects would have more need for this than > might the kernel. Please understand that this is in no way denigrating > the intelligence of other projects' members. It is just that many of > them have only recently started seriously thinking about concurrency. > In contrast, the Linux kernel community has been doing concurrency since > the mid-1990s. Projects with less experience with concurrency will > probably need more help, from the compiler and from elsewhere as well. > > Your proposal looks quite promising at first glance. 
But rather than
> try and comment on it immediately, I am going to take a number of uses of
> RCU from the Linux kernel and apply your proposal to them, then respond
> with the results

And here is an initial set of six selected randomly from the Linux
kernel (assuming you trust awk's random-number generator). This is of
course a tiny subset of what is in the kernel, but should be a good set
to start with.

Looks like a reasonable start to me, though I would not expect the
kernel to convert over wholesale any time soon. Which is OK, there are
userspace projects using RCU.

Thoughts? Am I understanding your proposal at all? ;-)

							Thanx, Paul

------------------------------------------------------------------------

value_dep_preserving usage examples.

/* The following is approximate -- need sparse and maybe other checking. */
#define rcu_dereference(x) atomic_load_explicit(&(x), memory_order_consume)

1. mm/vmalloc.c __purge_vmap_area_lazy()

This requires only two small changes. I am not sure that the second
change is necessary. My guess is that it is, on the theory that passing
a non-value_dep_preserving variable in through a value_dep_preserving
argument is guaranteed OK, give or take unnecessary suppression of
optimizations, but that passing a value_dep_preserving in through a
non-value_dep_preserving argument could be a serious bug.

That said, the Linux kernel convention is that once you leave the
outermost rcu_read_lock(), there is no more value dependency
preservation. Implementing this convention would remove the need for
the cast, for whatever that is worth.

static void __purge_vmap_area_lazy(...)
{
	static DEFINE_SPINLOCK(purge_lock);
	LIST_HEAD(valist);
	struct vmap_area value_dep_preserving *va;	/* CHANGE */
	struct vmap_area *n_va;
	int nr = 0;

	...
	rcu_read_lock();
	list_for_each_entry_rcu(va, &vmap_area_list, list) {
		if (va->flags & VM_LAZY_FREE) {
			if (va->va_start < *start)
				*start = va->va_start;
			if (va->va_end > *end)
				*end = va->va_end;
			nr += (va->va_end - va->va_start) >> PAGE_SHIFT;
			list_add_tail(&va->purge_list, &valist);
			va->flags |= VM_LAZY_FREEING;
			va->flags &= ~VM_LAZY_FREE;
		}
	}
	rcu_read_unlock();

	...

	if (nr) {
		spin_lock(&vmap_area_lock);
		list_for_each_entry_safe(va, n_va, &valist, purge_list)
			__free_vmap_area((struct vmap_area *)va);	/* CHANGE */
		spin_unlock(&vmap_area_lock);
	}

	...
}

2. net/core/sock.c sock_def_wakeup()

static void sock_def_wakeup(struct sock *sk)
{
	struct socket_wq value_dep_preserving *wq;	/* CHANGE */

	rcu_read_lock();
	wq = rcu_dereference(sk->sk_wq);
	if (wq_has_sleeper(wq))
		wake_up_interruptible_all(&((struct socket_wq *)wq)->wait);	/* CHANGE */
	rcu_read_unlock();
}

This calls wq_has_sleeper():

static inline bool wq_has_sleeper(struct socket_wq value_dep_preserving *wq)	/* CHANGE */
{
	/* We need to be sure we are in sync with the
	 * add_wait_queue modifications to the wait queue.
	 *
	 * This memory barrier is paired in the sock_poll_wait.
	 */
	smp_mb();
	return wq && waitqueue_active(&wq->wait);
}

Although wq_has_sleeper() has a full memory barrier and therefore does
not need value_dep_preserving, it seems to be called from within a lot
of RCU read-side critical sections, so it is likely a bit nicer to
decorate its argument than to apply casts to all calls.

Is "&wq->wait" also value_dep_preserving? The call to
wake_up_interruptible_all() doesn't need it to be because it acquires a
spinlock within the structure, and the ensuing barriers and atomic
instructions enforce all the ordering that is required. The call to
waitqueue_active() is preceded by a full barrier, so it too has all the
ordering it needs.

So this example is slightly nicer if, given a value_dep_preserving
pointer "p", "&p->f" is non-value_dep_preserving.
But there will likely be other examples that strongly prefer otherwise,
and the pair of casts required here is not that horrible. Plus this
approach matches our discussion earlier in this thread.

Might be nice to have an intrinsic that takes a value_dep_preserving
pointer and returns a non-value_dep_preserving pointer with the same
value.

3. net/ipv4/tcp_cong.c tcp_init_congestion_control()

This example requires only one change. I am assuming that if "p" is
value_dep_preserving, then "p->a" is -not- value_dep_preserving unless
the struct field "a" has been declared value_dep_preserving. This means
that there is no need to change the "try_module_get(ca->owner)", which
is a nice improvement over the current memory_order_consume work. I
believe that this approach will do much to trim what would otherwise be
over-large value_dep_preserving regions.

void tcp_init_congestion_control(struct sock *sk)
{
	struct inet_connection_sock *icsk = inet_csk(sk);
	struct tcp_congestion_ops value_dep_preserving *ca;	/* CHANGE */

	if (icsk->icsk_ca_ops == &tcp_init_congestion_ops) {
		rcu_read_lock();
		list_for_each_entry_rcu(ca, &tcp_cong_list, list) {
			if (try_module_get(ca->owner)) {
				icsk->icsk_ca_ops = ca;
				break;
			}
			/* fallback to next available */
		}
		rcu_read_unlock();
	}

	if (icsk->icsk_ca_ops->init)
		icsk->icsk_ca_ops->init(sk);
}

4. drivers/target/tcm_fc/tfc_sess.c ft_sess_get()

This one brings up an interesting point. I am guessing that the
value_dep_preserving should be silently stripped when the value is
passed in through __VA_ARGS__, as in pr_debug() below. Thoughts?
static struct ft_sess *ft_sess_get(struct fc_lport *lport, u32 port_id)
{
	struct ft_tport value_dep_preserving *tport;	/* CHANGE */
	struct hlist_head *head;
	struct ft_sess value_dep_preserving *sess;	/* CHANGE */

	rcu_read_lock();
	tport = rcu_dereference(lport->prov[FC_TYPE_FCP]);
	if (!tport)
		goto out;

	head = &tport->hash[ft_sess_hash(port_id)];
	hlist_for_each_entry_rcu(sess, head, hash) {
		if (sess->port_id == port_id) {
			kref_get((struct kref *)&sess->kref);	/* CHANGE */
			rcu_read_unlock();
			pr_debug("port_id %x found %p\n", port_id, sess);
			return sess;
		}
	}
out:
	rcu_read_unlock();
	pr_debug("port_id %x not found\n", port_id);
	return NULL;
}

5. net/netlabel/netlabel_unlabeled.c netlbl_unlhsh_search_iface()

This one is interesting -- you have to go up a function-call level to
find the rcu_read_lock(). netlbl_unlhsh_add() calls
netlbl_unlhsh_search_iface(). There is no need for value_dep_preserving
to flow into netlbl_unlhsh_search_iface(), but its return value appears
to need to be value_dep_preserving.

static struct netlbl_unlhsh_iface value_dep_preserving *
netlbl_unlhsh_search_iface(int ifindex)	/* CHANGE */
{
	u32 bkt;
	struct list_head *bkt_list;
	struct netlbl_unlhsh_iface value_dep_preserving *iter;	/* CHANGE */

	bkt = netlbl_unlhsh_hash(ifindex);
	bkt_list = &netlbl_unlhsh_rcu_deref(netlbl_unlhsh)->tbl[bkt];
	list_for_each_entry_rcu(iter, bkt_list, list)
		if (iter->valid && iter->ifindex == ifindex)
			return iter;

	return NULL;
}

6. drivers/md/linear.c linear_congested()

It is possible that this code relies on dependencies flowing into
bdev_get_queue(), but to keep this effort finite in duration, I am
relying on the __rcu markers -- and some of the structures in this
subsystem do have __rcu. The ->rdev and ->bdev fields don't, so this
one is pretty straightforward.
static int linear_congested(void *data, int bits)
{
	struct mddev *mddev = data;
	struct linear_conf value_dep_preserving *conf;	/* CHANGE */
	int i, ret = 0;

	if (mddev_congested(mddev, bits))
		return 1;

	rcu_read_lock();
	conf = rcu_dereference(mddev->private);

	for (i = 0; i < mddev->raid_disks && !ret ; i++) {
		struct request_queue *q = bdev_get_queue(conf->disks[i].rdev->bdev);

		ret |= bdi_congested(&q->backing_dev_info, bits);
	}

	rcu_read_unlock();
	return ret;
}

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-27 17:50 ` Paul E. McKenney 2014-02-27 19:22 ` Paul E. McKenney 2014-02-28 1:02 ` Paul E. McKenney @ 2014-03-03 19:01 ` Torvald Riegel 2 siblings, 0 replies; 285+ messages in thread From: Torvald Riegel @ 2014-03-03 19:01 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, 2014-02-27 at 09:50 -0800, Paul E. McKenney wrote: > Your proposal looks quite promising at first glance. But rather than > try and comment on it immediately, I am going to take a number of uses of > RCU from the Linux kernel and apply your proposal to them, then respond > with the results > > Fair enough? Sure. Thanks for doing the cross-check! ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 18:32 ` Linus Torvalds
  2014-02-20 18:53 ` Torvald Riegel
@ 2014-02-20 18:56 ` Paul E. McKenney
  2014-02-20 19:45 ` Linus Torvalds
  1 sibling, 1 reply; 285+ messages in thread
From: Paul E. McKenney @ 2014-02-20 18:56 UTC (permalink / raw)
To: Linus Torvalds
Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan,
    David Howells, linux-arch, linux-kernel, akpm, mingo, gcc

On Thu, Feb 20, 2014 at 10:32:51AM -0800, Linus Torvalds wrote:
> On Thu, Feb 20, 2014 at 10:11 AM, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> >
> > You really need that "consume" to be "acquire".
>
> So I think we now all agree that that is what the standard is saying.
>
> And I'm saying that that is wrong, that the standard is badly written,
> and should be fixed.
>
> Because before the standard is fixed, I claim that "consume" is
> unusable. We cannot trust it. End of story.

We get exactly those same issues with control dependencies.

The example gcc breakage was something like this:

	i = atomic_load(idx, memory_order_consume);
	x = array[0 + i - i];

Then gcc optimized this to:

	i = atomic_load(idx, memory_order_consume);
	x = array[0];

This same issue would hit control dependencies. You are free to argue
that this is the fault of ARM and PowerPC memory ordering, but the fact
remains that your suggested change has -exactly- the same vulnerability
as memory_order_consume currently has.

> The fact that apparently gcc is currently buggy because it got the
> dependency calculations *wrong* just reinforces my point.
>
> The gcc bug Torvald pointed at is exactly because the current C
> standard is illogical unreadable CRAP. I can guarantee that what
> happened is:
>
>  - the compiler saw that the result of the read was used as the left
> hand expression of the ternary "? :" operator
>
>  - as a result, the compiler decided that there's no dependency
>
>  - the compiler didn't think about the dependency that comes from the
> result of the load *also* being used as the middle part of the ternary
> expression, because it had optimized it away, despite the standard not
> talking about that at all.
>
>  - so the compiler never saw the dependency that the standard talks about

No, the dependency was in a cancelling arithmetic expression as shown
above, so that gcc optimized the dependency away. Then the ordering was
lost on AARCH64.

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448

> BECAUSE THE STANDARD LANGUAGE IS PURE AND UTTER SHIT.
>
> My suggested language never had any of these problems, because *my*
> suggested semantics are clear, logical, and don't have these kinds of
> idiotic pit-falls.
>
> Solution: Fix the f*cking C standard. No excuses, no explanations.
> Just get it fixed.

I agree that the standard needs help, but your suggested fix has the
same problems as shown in the bugzilla.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 18:56 ` Paul E. McKenney @ 2014-02-20 19:45 ` Linus Torvalds 2014-02-20 22:10 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-20 19:45 UTC (permalink / raw) To: Paul McKenney Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, Feb 20, 2014 at 10:56 AM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > > The example gcc breakage was something like this: > > i = atomic_load(idx, memory_order_consume); > x = array[0 + i - i]; > > Then gcc optimized this to: > > i = atomic_load(idx, memory_order_consume); > x = array[0]; > > This same issue would hit control dependencies. You are free to argue > that this is the fault of ARM and PowerPC memory ordering, but the fact > remains that your suggested change has -exactly- the same vulnerability > as memory_order_consume currently has. No it does not, for two reasons, first the legalistic (and bad) reason: As I actually described it, the "consume" becomes an "acquire" by default. If it's not used as an address to the dependent load, then it's an acquire. The use "going away" in no way makes the acquire go away in my simplistic model. So the compiler would actually translate that to a load-with-acquire, not be able to remove the acquire, and we have end of story. The actual code generation would be that "ld + sync + ld" on powerpc, or "ld.acq" on ARM. Now, the reason I claim that reason was "legalistic and bad" is that it's actually a cop-out, and if you had made the example be something like this: p = atomic_load(&ptr, memory_order_consume); x = array[0 + p - p]; y = p->val; then yes, I actually think that the order of loads of 'x' and 'p' are not enforced by the "consume". The only case that is clear is the order of 'y' and 'p', because that is the only one that really *USES* the value. The "use" of "+p-p" is syntactic bullshit. 
It's very obvious to even a slightly developmentally challenged hedgehog that "+p-p" doesn't have any actual *semantic* meaning, it's purely syntactic. And the syntactic meaning is meaningless and doesn't matter. Because I would just get rid of the whole "dependency chain" language ALTOGETHER. So in fact, in my world, I would consider your example to be a non-issue. In my world, there _is_ no "dependency chain" at a syntactic level. In my SANE world, none of that insane crap language exists. That language is made-up and tied to syntax exactly because it *cannot* be tied to semantics. In my sane world, "consume" has a much simpler meaning, and has no legalistic syntactic meaning: only real use matters. If the value can be optimized away, then so can the barrier, and so can the whole load. The value isn't "consumed", so it has no meaning. So if you write i = atomic_load(idx, memory_order_consume); x = array[0+i-i]; then in my world that "+i-i" is meaningless. It's semantic fluff, and while my naive explanation would have left it as an acquire (because it cannot be peep-holed away), I actually do believe that the compiler should be obviously allowed to optimize the load away entirely since it's meaningless, and if no use of 'i' remains, then it has no consumer, and so there is no dependency. Put another way: "consume" is not about getting a lock, it's about getting a *value*. Only the dependency on the *value* matters, and if the value is optimized away, there is no dependency. And the value itself does not have any semantics. There's nothing "volatile" about the use of the value that would mean that the compiler cannot re-order it or remove it entirely. There's no barrier "carried around" by the value per se. The barrier is between the load and use. That's the *point* of "consume" after all. The whole "chain of dependency" language is pointless. It's wrong. 
It's complicated, it is illogical, and it causes subtle problems exactly because it got tied to the language *syntax* rather than to any logical use. Don't try to re-introduce the whole issue. It was a mistake for the C standard to talk about dependencies in the first place, exactly because it results in these idiotic legalistic practices. You do realize that that whole "*(q+flag-flag)" example in the bugzilla comes from the fact that the programmer tried to *fight* the fact that the C standard got the control dependency wrong? In other words, the *deepest* reason for that bugzilla is that the programmer tried to force the logical dependency by rewriting it as a (fake, and easily optimizable) data dependency. In *my* world, the stupid data-vs-control dependency thing goes away, the test of the value itself is a use of it, and "*p ? *q :0" just does the right thing, there's no reason to do that "q+flag-flag" thing in the first place, and if you do, the compiler *should* just ignore your little games. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 19:45 ` Linus Torvalds @ 2014-02-20 22:10 ` Paul E. McKenney 2014-02-20 22:52 ` Linus Torvalds 0 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-20 22:10 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, Feb 20, 2014 at 11:45:29AM -0800, Linus Torvalds wrote: > On Thu, Feb 20, 2014 at 10:56 AM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > > > The example gcc breakage was something like this: > > > > i = atomic_load(idx, memory_order_consume); > > x = array[0 + i - i]; > > > > Then gcc optimized this to: > > > > i = atomic_load(idx, memory_order_consume); > > x = array[0]; > > > > This same issue would hit control dependencies. You are free to argue > > that this is the fault of ARM and PowerPC memory ordering, but the fact > > remains that your suggested change has -exactly- the same vulnerability > > as memory_order_consume currently has. > > No it does not, for two reasons, first the legalistic (and bad) reason: > > As I actually described it, the "consume" becomes an "acquire" by > default. If it's not used as an address to the dependent load, then > it's an acquire. The use "going away" in no way makes the acquire go > away in my simplistic model. > > So the compiler would actually translate that to a load-with-acquire, > not be able to remove the acquire, and we have end of story. The > actual code generation would be that "ld + sync + ld" on powerpc, or > "ld.acq" on ARM. > > Now, the reason I claim that reason was "legalistic and bad" is that > it's actually a cop-out, and if you had made the example be something > like this: > > p = atomic_load(&ptr, memory_order_consume); > x = array[0 + p - p]; > y = p->val; > > then yes, I actually think that the order of loads of 'x' and 'p' are > not enforced by the "consume". 
The only case that is clear is the > order of 'y' and 'p', because that is the only one that really *USES* > the value. > > The "use" of "+p-p" is syntactic bullshit. It's very obvious to even a > slightly developmentally challenged hedgehog that "+p-p" doesn't have > any actual *semantic* meaning, it's purely syntactic. > > And the syntactic meaning is meaningless and doesn't matter. Because I > would just get rid of the whole "dependency chain" language > ALTOGETHER. > > So in fact, in my world, I would consider your example to be a > non-issue. In my world, there _is_ no "dependency chain" at a > syntactic level. In my SANE world, none of that insane crap language > exists. That language is made-up and tied to syntax exactly because it > *cannot* be tied to semantics. > > In my sane world, "consume" has a much simpler meaning, and has no > legalistic syntactic meaning: only real use matters. If the value can > be optimized away, the so can the barrier, and so can the whole load. > The value isn't "consumed", so it has no meaning. > > So if you write > > i = atomic_load(idx, memory_order_consume); > x = array[0+i-i]; > > then in my world that "+i-i" is meaningless. It's semantic fluff, and > while my naive explanation would have left it as an acquire (because > it cannot be peep-holed away), I actually do believe that the compiler > should be obviously allowed to optimize the load away entirely since > it's meaningless, and if no use of 'i' remains, then it has no > consumer, and so there is no dependency. > > Put another way: "consume" is not about getting a lock, it's about > getting a *value*. Only the dependency on the *value* matters, and if > the value is optimized away, there is no dependency. > > And the value itself does not have any semantics. There's nothing > "volatile" about the use of the value that would mean that the > compiler cannot re-order it or remove it entirely. There's no barrier > "carried around" by the value per se. 
The barrier is between the load > and use. That's the *point* of "consume" after all. > > The whole "chain of dependency" language is pointless. It's wrong. > It's complicated, it is illogical, and it causes subtle problems > exactly because it got tied to the language *syntax* rather than to > any logical use. > > Don't try to re-introduce the whole issue. It was a mistake for the C > standard to talk about dependencies in the first place, exactly > because it results in these idiotic legalistic practices. > > You do realize that that whole "*(q+flag-flag)" example in the > bugzilla comes from the fact that the programmer tried to *fight* the > fact that the C standard got the control dependency wrong? > > In other words, the *deepest* reason for that bugzilla is that the > programmer tried to force the logical dependency by rewriting it as a > (fake, and easily optimizable) data dependency. > > In *my* world, the stupid data-vs-control dependency thing goes away, > the test of the value itself is a use of it, and "*p ? *q :0" just > does the right thing, there's no reason to do that "q+flag-flag" thing > in the first place, and if you do, the compiler *should* just ignore > your little games. Linus, given that you are calling me out for pushing "legalistic and bad" things, "syntactic bullshit", and playing "little games", I am forced to conclude that you have never attended any sort of standards-committee meeting. ;-) That said, I am fine with pushing control/data dependencies with this general approach. There will be complications, but there always are and they can be dealt with as they come up. FWIW, the last time I tried excluding things like "f-f", "x%1", "y*0" and so on, I got a lot of pushback. The reason I didn't argue too much back (2007 or some such) then was that my view at the time was that I figured the kernel code wouldn't do things like that anyway, so it didn't matter. However, that was more than five years ago, so worth another try. 
Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 22:10 ` Paul E. McKenney @ 2014-02-20 22:52 ` Linus Torvalds 2014-02-21 18:35 ` Michael Matz 0 siblings, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-20 22:52 UTC (permalink / raw) To: Paul McKenney Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, Feb 20, 2014 at 2:10 PM, Paul E. McKenney <paulmck@linux.vnet.ibm.com> wrote: > > Linus, given that you are calling me out for pushing "legalistic and bad" > things, "syntactic bullshit", and playing "little games", I am forced > to conclude that you have never attended any sort of standards-committee > meeting. ;-) Heh. I have heard people wax poetic about the pleasures of standards committee meetings. Enough that I haven't really ever had the slightest urge to participate ;) > FWIW, the last time I tried excluding things like "f-f", "x%1", "y*0" and > so on, I got a lot of pushback. The reason I didn't argue too much back > (2007 or some such) then was that my view at the time was that I figured > the kernel code wouldn't do things like that anyway, so it didn't matter. Well.. I'd really hope we would never do that. That said, we have certainly used disgusting things that we knew would disable certain optimizations in the compiler before, so I wouldn't put it *entirely* past us to do things like that, but I'd argue that we'd do so in order to confuse the compiler to do what we want, not in order to argue that it's a good thing. I mean, right now we have at least *one* active ugly work-around for a compiler bug (the magic empty inline asm that works around the "asm goto" bug in gcc). So it's not like I would claim that we don't do disgusting things when we need to. 
But I'm pretty sure that any compiler guy must *hate* that current odd dependency-generation part, and if I was a gcc person, seeing that bugzilla entry Torvald pointed at, I would personally want to dismember somebody with a rusty spoon.. So I suspect there are a number of people who would be *more* than happy with a change to those odd dependency rules. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 22:52 ` Linus Torvalds @ 2014-02-21 18:35 ` Michael Matz 2014-02-21 19:13 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Michael Matz @ 2014-02-21 18:35 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc Hi, On Thu, 20 Feb 2014, Linus Torvalds wrote: > But I'm pretty sure that any compiler guy must *hate* that current odd > dependency-generation part, and if I was a gcc person, seeing that > bugzilla entry Torvald pointed at, I would personally want to > dismember somebody with a rusty spoon.. Yes. Defining dependency chains in the way the standard currently seems to do must come from people not writing compilers. There's simply no sensible way to implement it without being really conservative, because the depchains can contain arbitrary constructs including stores, loads and function calls but must still be observed. And with conservative I mean "everything is a source of a dependency, and hence can't be removed, reordered or otherwise fiddled with", and that includes code sequences where no atomic objects are anywhere in sight [1]. In the light of that the only realistic way (meaning to not have to disable optimization everywhere) to implement consume as currently specified is to map it to acquire. At which point it becomes pointless. > So I suspect there are a number of people who would be *more* than > happy with a change to those odd dependency rules. I can't say much about your actual discussion related to semantics of atomics, not my turf. But the "carries a dependency" relation is not usefully implementable. Ciao, Michael. [1] Simple example of what type of transformations would be disallowed: int getzero (int i) { return i - i; } Should be optimizable to "return 0;", right? 
Not with carries a dependency in place: int jeez (int idx) { int i = atomic_load(idx, memory_order_consume); // A int j = getzero (i); // B return array[j]; // C } As I read "carries a dependency" there's a dependency from A to C. Now suppose we would optimize getzero in the obvious way, then inline, and boom, dependency gone. So we wouldn't be able to optimize any function when we don't control all its users, for fear that it _might_ be used in some dependency chain where it then matters that we possibly removed some chain elements due to the transformation. We would have to retain 'i-i' before inlining, and if the function then is inlined into a context where depchains don't matter, could _then_ optimize it to zero. But that's insane, especially considering that it's hard to detect if a given context doesn't care for depchains, after all the depchain relation is constructed exactly so that it bleeds into nearly everywhere. So we would most of the time have to assume that the ultimate context will be depchain-aware and therefore disable many transformations. There'd be one solution to the above, we would have to invent some special operands and markers that explicitly model "carries-a-dep", ala this: int getzero (int i) { #RETURN.dep = i.dep return 0; } int jeez (int idx) { # i.dep = idx.dep int i = atomic_load(idx, memory_order_consume); // A # j.dep = i.dep int j = getzero (i); // B # RETURN.dep = j.dep + array.dep return array[j]; // C } Then inlining getzero would merely add another "# j.dep = i.dep" relation, so depchains are still there but the value optimization can happen before inlining. Having to do something like that I'd find disgusting, and rather rewrite consume into acquire :) Or make the depchain relation somehow realistically implementable. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-21 18:35 ` Michael Matz @ 2014-02-21 19:13 ` Paul E. McKenney 2014-02-21 22:10 ` Joseph S. Myers ` (2 more replies) 0 siblings, 3 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-21 19:13 UTC (permalink / raw) To: Michael Matz Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Fri, Feb 21, 2014 at 07:35:37PM +0100, Michael Matz wrote: > Hi, > > On Thu, 20 Feb 2014, Linus Torvalds wrote: > > > But I'm pretty sure that any compiler guy must *hate* that current odd > > dependency-generation part, and if I was a gcc person, seeing that > > bugzilla entry Torvald pointed at, I would personally want to > > dismember somebody with a rusty spoon.. > > Yes. Defining dependency chains in the way the standard currently seems > to do must come from people not writing compilers. There's simply no > sensible way to implement it without being really conservative, because > the depchains can contain arbitrary constructs including stores, > loads and function calls but must still be observed. > > And with conservative I mean "everything is a source of a dependency, and > hence can't be removed, reordered or otherwise fiddled with", and that > includes code sequences where no atomic objects are anywhere in sight [1]. > In the light of that the only realistic way (meaning to not have to > disable optimization everywhere) to implement consume as currently > specified is to map it to acquire. At which point it becomes pointless. No, only memory_order_consume loads and [[carries_dependency]] function arguments are sources of dependency chains. > > So I suspect there are a number of people who would be *more* than > > happy with a change to those odd dependency rules. > > I can't say much about your actual discussion related to semantics of > atomics, not my turf. 
But the "carries a dependency" relation is not > usefully implementable. > > > Ciao, > Michael. > [1] Simple example of what type of transformations would be disallowed: > > int getzero (int i) { return i - i; } This needs to be as follows: [[carries_dependency]] int getzero(int i [[carries_dependency]]) { return i - i; } Otherwise dependencies won't get carried through it. > Should be optimizable to "return 0;", right? Not with carries a > dependency in place: > > int jeez (int idx) { > int i = atomic_load(idx, memory_order_consume); // A > int j = getzero (i); // B > return array[j]; // C > } > > As I read "carries a dependency" there's a dependency from A to C. > Now suppose we would optimize getzero in the obvious way, then inline, and > boom, dependency gone. So we wouldn't be able to optimize any function > when we don't control all its users, for fear that it _might_ be used in > some dependency chain where it then matters that we possibly removed some > chain elements due to the transformation. We would have to retain 'i-i' > before inlining, and if the function then is inlined into a context where > depchains don't matter, could _then_ optmize it to zero. But that's > insane, especially considering that it's hard to detect if a given context > doesn't care for depchains, after all the depchain relation is constructed > exactly so that it bleeds into nearly everywhere. So we would most of > the time have to assume that the ultimate context will be depchain-aware > and therefore disable many transformations. Any function that does not contain a memory_order_consume load and that doesn't have any arguments marked [[carries_dependency]] can be optimized just as before. > There'd be one solution to the above, we would have to invent some special > operands and markers that explicitely model "carries-a-dep", ala this: > > int getzero (int i) { > #RETURN.dep = i.dep > return 0; > } The above is already handled by the [[carries_dependency]] attribute, see above. 
> int jeez (int idx) { > # i.dep = idx.dep > int i = atomic_load(idx, memory_order_consume); // A > # j.dep = i.dep > int j = getzero (i); // B > # RETURN.dep = j.dep + array.dep > return array[j]; // C > } > > Then inlining getzero would merely add another "# j.dep = i.dep" relation, > so depchains are still there but the value optimization can happen before > inlining. Having to do something like that I'd find disgusting, and > rather rewrite consume into acquire :) Or make the depchain relation > somehow realistically implementable. I was actually OK with arithmetic cancellation breaking the dependency chains. Others on the committee felt otherwise, and I figured that (1) I wouldn't be writing that kind of function anyway and (2) they knew more about writing compilers than I. I would still be OK saying that things like "i-i", "i*0", "i%1", "i&0", "i|~0" and so on just break the dependency chain. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-21 19:13 ` Paul E. McKenney @ 2014-02-21 22:10 ` Joseph S. Myers 2014-02-21 22:37 ` Paul E. McKenney 2014-02-26 13:09 ` Torvald Riegel 2014-02-24 13:55 ` Michael Matz 2014-02-26 13:04 ` Torvald Riegel 2 siblings, 2 replies; 285+ messages in thread From: Joseph S. Myers @ 2014-02-21 22:10 UTC (permalink / raw) To: Paul E. McKenney Cc: Michael Matz, Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Fri, 21 Feb 2014, Paul E. McKenney wrote: > This needs to be as follows: > > [[carries_dependency]] int getzero(int i [[carries_dependency]]) > { > return i - i; > } > > Otherwise dependencies won't get carried through it. C11 doesn't have attributes at all (and no specification regarding calls and dependencies that I can see). And the way I read the C++11 specification of carries_dependency is that specifying carries_dependency is purely about increasing optimization of the caller: that if it isn't specified, then the caller doesn't know what dependencies might be carried. "Note: The carries_dependency attribute does not change the meaning of the program, but may result in generation of more efficient code. - end note". -- Joseph S. Myers joseph@codesourcery.com ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-21 22:10 ` Joseph S. Myers @ 2014-02-21 22:37 ` Paul E. McKenney 2014-02-26 13:09 ` Torvald Riegel 1 sibling, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-21 22:37 UTC (permalink / raw) To: Joseph S. Myers Cc: Michael Matz, Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Fri, Feb 21, 2014 at 10:10:54PM +0000, Joseph S. Myers wrote: > On Fri, 21 Feb 2014, Paul E. McKenney wrote: > > > This needs to be as follows: > > > > [[carries_dependency]] int getzero(int i [[carries_dependency]]) > > { > > return i - i; > > } > > > > Otherwise dependencies won't get carried through it. > > C11 doesn't have attributes at all (and no specification regarding calls > and dependencies that I can see). And the way I read the C++11 > specification of carries_dependency is that specifying carries_dependency > is purely about increasing optimization of the caller: that if it isn't > specified, then the caller doesn't know what dependencies might be > carried. "Note: The carries_dependency attribute does not change the > meaning of the program, but may result in generation of more efficient > code. - end note". Good point -- I am so used to them being in gcc that I missed that. In which case, it seems to me that straight C11 is within its rights to emit a memory barrier just before control passes into a function that either it can't see or that it chose to apply dependency-breaking optimizations to. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-21 22:10 ` Joseph S. Myers 2014-02-21 22:37 ` Paul E. McKenney @ 2014-02-26 13:09 ` Torvald Riegel 2014-02-26 18:43 ` Joseph S. Myers 1 sibling, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-26 13:09 UTC (permalink / raw) To: Joseph S. Myers Cc: Paul E. McKenney, Michael Matz, Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Fri, 2014-02-21 at 22:10 +0000, Joseph S. Myers wrote: > On Fri, 21 Feb 2014, Paul E. McKenney wrote: > > > This needs to be as follows: > > > > [[carries_dependency]] int getzero(int i [[carries_dependency]]) > > { > > return i - i; > > } > > > > Otherwise dependencies won't get carried through it. > > C11 doesn't have attributes at all (and no specification regarding calls > and dependencies that I can see). And the way I read the C++11 > specification of carries_dependency is that specifying carries_dependency > is purely about increasing optimization of the caller: that if it isn't > specified, then the caller doesn't know what dependencies might be > carried. "Note: The carries_dependency attribute does not change the > meaning of the program, but may result in generation of more efficient > code. - end note". I think that this last sentence can be kind of misleading, especially when looking at it from an implementation point of view. How dependencies are handled (ie, preserving the syntactic dependencies vs. emitting barriers) must be part of the ABI, or things like [[carries_dependency]] won't work as expected (or lead to inefficient code). Thus, in practice, all compiler vendors on a platform would have to agree to a particular handling, which might end up in selecting the easy-but-conservative implementation option (ie, always emitting mo_acquire when the source uses mo_consume). ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-26 13:09 ` Torvald Riegel @ 2014-02-26 18:43 ` Joseph S. Myers 2014-02-27 0:52 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Joseph S. Myers @ 2014-02-26 18:43 UTC (permalink / raw) To: Torvald Riegel Cc: Paul E. McKenney, Michael Matz, Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Wed, 26 Feb 2014, Torvald Riegel wrote: > On Fri, 2014-02-21 at 22:10 +0000, Joseph S. Myers wrote: > > On Fri, 21 Feb 2014, Paul E. McKenney wrote: > > > > > This needs to be as follows: > > > > > > [[carries_dependency]] int getzero(int i [[carries_dependency]]) > > > { > > > return i - i; > > > } > > > > > > Otherwise dependencies won't get carried through it. > > > > C11 doesn't have attributes at all (and no specification regarding calls > > and dependencies that I can see). And the way I read the C++11 > > specification of carries_dependency is that specifying carries_dependency > > is purely about increasing optimization of the caller: that if it isn't > > specified, then the caller doesn't know what dependencies might be > > carried. "Note: The carries_dependency attribute does not change the > > meaning of the program, but may result in generation of more efficient > > code. - end note". > > I think that this last sentence can be kind of misleading, especially > when looking at it from an implementation point of view. How > dependencies are handled (ie, preserving the syntactic dependencies vs. > emitting barriers) must be part of the ABI, or things like > [[carries_dependency]] won't work as expected (or lead to inefficient > code). Thus, in practice, all compiler vendors on a platform would have > to agree to a particular handling, which might end up in selecting the > easy-but-conservative implementation option (ie, always emitting > mo_acquire when the source uses mo_consume). 
Regardless of the ABI, my point is that if a program is valid, it is also valid when all uses of [[carries_dependency]] are removed. If a function doesn't use [[carries_dependency]], that means "dependencies may or may not be carried through this function". If a function uses [[carries_dependency]], that means that certain dependencies *are* carried through the function (and the ABI should then specify what this means the caller can rely on, in terms of the architecture's memory model). (This may or may not be useful, but it's how I understand C++11.) -- Joseph S. Myers joseph@codesourcery.com ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-26 18:43 ` Joseph S. Myers @ 2014-02-27 0:52 ` Torvald Riegel 0 siblings, 0 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-27 0:52 UTC (permalink / raw) To: Joseph S. Myers Cc: Paul E. McKenney, Michael Matz, Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Wed, 2014-02-26 at 18:43 +0000, Joseph S. Myers wrote: > On Wed, 26 Feb 2014, Torvald Riegel wrote: > > > On Fri, 2014-02-21 at 22:10 +0000, Joseph S. Myers wrote: > > > On Fri, 21 Feb 2014, Paul E. McKenney wrote: > > > > > > > This needs to be as follows: > > > > > > > > [[carries_dependency]] int getzero(int i [[carries_dependency]]) > > > > { > > > > return i - i; > > > > } > > > > > > > > Otherwise dependencies won't get carried through it. > > > > > > C11 doesn't have attributes at all (and no specification regarding calls > > > and dependencies that I can see). And the way I read the C++11 > > > specification of carries_dependency is that specifying carries_dependency > > > is purely about increasing optimization of the caller: that if it isn't > > > specified, then the caller doesn't know what dependencies might be > > > carried. "Note: The carries_dependency attribute does not change the > > > meaning of the program, but may result in generation of more efficient > > > code. - end note". > > > > I think that this last sentence can be kind of misleading, especially > > when looking at it from an implementation point of view. How > > dependencies are handled (ie, preserving the syntactic dependencies vs. > > emitting barriers) must be part of the ABI, or things like > > [[carries_dependency]] won't work as expected (or lead to inefficient > > code). 
Thus, in practice, all compiler vendors on a platform would have > > to agree to a particular handling, which might end up in selecting the > > easy-but-conservative implementation option (ie, always emitting > > mo_acquire when the source uses mo_consume). > > Regardless of the ABI, my point is that if a program is valid, it is also > valid when all uses of [[carries_dependency]] are removed. If a function > doesn't use [[carries_dependency]], that means "dependencies may or may > not be carried through this function". If a function uses > [[carries_dependency]], that means that certain dependencies *are* carried > through the function (and the ABI should then specify what this means the > caller can rely on, in terms of the architecture's memory model). (This > may or may not be useful, but it's how I understand C++11.) I agree. What I tried to point out is that it's not the case that an *implementation* can just ignore [[carries_dependency]]. So from an implementation perspective, the attribute does have semantics. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-21 19:13 ` Paul E. McKenney 2014-02-21 22:10 ` Joseph S. Myers @ 2014-02-24 13:55 ` Michael Matz 2014-02-24 17:40 ` Paul E. McKenney 2014-02-26 13:04 ` Torvald Riegel 2 siblings, 1 reply; 285+ messages in thread From: Michael Matz @ 2014-02-24 13:55 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc Hi, On Fri, 21 Feb 2014, Paul E. McKenney wrote: > > And with conservative I mean "everything is a source of a dependency, and > > hence can't be removed, reordered or otherwise fiddled with", and that > > includes code sequences where no atomic objects are anywhere in sight [1]. > > In the light of that the only realistic way (meaning to not have to > > disable optimization everywhere) to implement consume as currently > > specified is to map it to acquire. At which point it becomes pointless. > > No, only memory_order_consume loads and [[carries_dependency]] > function arguments are sources of dependency chains. I don't see [[carries_dependency]] in the C11 final draft (yeah, should get a real copy, I know, but let's assume it's the same language as the standard). Therefore, yes, only consume loads are sources of dependencies. The problem with the definition of the "carries a dependency" relation is not the sources, but rather where it stops. It's transitively closed over "value of evaluation A is used as operand in evaluation B", with very few exceptions as per 5.1.2.4#14. Evaluations can contain function calls, so if there's _any_ chance that an operand of an evaluation might even indirectly use something resulting from a consume load then that evaluation must be compiled in a way to not break dependency chains. I don't see a way to generally assume that e.g. 
the value of a function argument cannot possibly result from a consume load, therefore the compiler must assume that all function arguments _can_ result from such loads, and so must disable all depchain-breaking optimizations (of which there are many). > > [1] Simple example of what type of transformations would be disallowed: > > > > int getzero (int i) { return i - i; } > > This needs to be as follows: > > [[carries_dependency]] int getzero(int i [[carries_dependency]]) > { > return i - i; > } > > Otherwise dependencies won't get carried through it. So, with the above do you agree that in absence of any other magic (see below) the compiler is not allowed to transform my initial getzero() (without the carries_dependency markers) implementation into "return 0;" because of the C11 rules for "carries-a-dependency"? If so, do you then also agree that the specification of "carries a dependency" is somewhat, err, shall we say, overbroad? > > depchains don't matter, could _then_ optimize it to zero. But that's > > insane, especially considering that it's hard to detect if a given context > > doesn't care for depchains, after all the depchain relation is constructed > > exactly so that it bleeds into nearly everywhere. So we would most of > > the time have to assume that the ultimate context will be depchain-aware > > and therefore disable many transformations. > > Any function that does not contain a memory_order_consume load and that > doesn't have any arguments marked [[carries_dependency]] can be > optimized just as before. And as no such marker exists we must conservatively assume that it's on _all_ parameters, so I'll stand by my claim. > > Then inlining getzero would merely add another "# j.dep = i.dep" > > relation, so depchains are still there but the value optimization can > > happen before inlining. Having to do something like that I'd find > > disgusting, and rather rewrite consume into acquire :) Or make the > > depchain relation somehow realistically implementable. 
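For concreteness, here is the getzero() exchange above as a compilable C++11 translation unit (an editorial sketch, not code from the thread). Note that [[carries_dependency]] is C++11-only; compilers that don't implement it typically accept the syntax and ignore the attribute with a warning.

```cpp
#include <cassert>

// Michael's original concealed zero: under a literal reading of the
// carries-a-dependency rules, a compiler may not simply fold this to
// "return 0;" if a dependency chain might pass through it.
int getzero_plain(int i) { return i - i; }

// Paul's annotated version: the [[carries_dependency]] markers assert that
// dependencies flow in through the parameter and out through the return
// value, so no fence is needed at the call boundary (C++11 7.6.4).
[[carries_dependency]] int getzero(int i [[carries_dependency]]) {
    return i - i;
}
```

Both versions of course compute the same value; the whole dispute is about what the compiler is allowed to know about that.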
> > I was actually OK with arithmetic cancellation breaking the dependency > chains. Others on the committee felt otherwise, and I figured that (1) > I wouldn't be writing that kind of function anyway and (2) they knew > more about writing compilers than I. I would still be OK saying that > things like "i-i", "i*0", "i%1", "i&0", "i|~0" and so on just break the > dependency chain. Exactly. I can see the problem that people had with that, though. There are very many ways to write concealed zeros (or generally neutral elements of the function in question). My getzero() function is one (it could e.g. be an assembler implementation). The allowance to break dependency chains would have to apply to such cancellation as well, and so the standard can't simply itemize all cases in which cancellation is allowed. Rather it would have had to argue about something like "value dependency", ala "evaluation B depends on A, if there exist at least two different values A1 and A2 (results from A), for which evaluation B (with otherwise same operands) yields different values B1 and B2". Alas, it doesn't, except if you want to understand the term "the value of A is used as an operand of B" in that way. Even then you'd still have the second case of the depchain definition, via intermediate (not even atomic) memory stores and loads to make two evaluations be ordered per carries-a-dependency. And even that understanding of "is used" wouldn't be enough, because there are cases where the cancellation happens in steps, and where it interacts with the third clause (transitiveness): Assume this: a = something() // evaluation A b = 1 - a // evaluation B c = a - 1 + b // evaluation C Now, clearly B depends on A. Also C depends on B (because with otherwise same operands changing just B also changes C), because of transitiveness C then also depends on A. But equally clearly C was just an elaborate way to write "0", and so depends on nothing. 
The problem was of course that A and B weren't independent when determining the dependencies of C. But allowing cancellation to break dependency chains would have to allow for these cases as well. So, now, that leaves us basically with depchains forcing us to disable many useful transformations or finding some other magic. One would be to just regard all consume loads as acquire loads and be done (and effectively remove the ill-advised "carries a dependency" relation from consideration). You say downthread that it'd also be possible to just emit barriers before all function calls (I say "all" because the compiler will generally have applied some transformation that broke depchains if they existed). That seems to me to be a bigger hammer than just ignoring depchains and emitting acquires instead of consumes (because the latter changes only exactly where atomics are used, while the former seems to me to have unbounded effect). So, am I still missing something, or is my understanding of the carries-a-dependency relation correct and my conclusions merely too pessimistic? Ciao, Michael. ^ permalink raw reply [flat|nested] 285+ messages in thread
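Michael's stepwise-cancellation example above can be written out as a self-contained sketch (the function name is invented for illustration):

```cpp
#include <cassert>

// Michael's stepwise cancellation: each intermediate evaluation looks
// data-dependent on the previous one, yet c is identically zero.
int cancellation_chain(int a) {   // a plays the role of "something()"
    int b = 1 - a;                // evaluation B: depends on A
    int c = a - 1 + b;            // evaluation C: depends on B (and, by
                                  // transitiveness, on A)
    return c;                     // ...but c == a - 1 + (1 - a) == 0, always
}
```

No single step contains an obvious "i - i"-style cancellation, which is why an itemized list of allowed cancellations can't cover this case.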
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-24 13:55 ` Michael Matz @ 2014-02-24 17:40 ` Paul E. McKenney 0 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-24 17:40 UTC (permalink / raw) To: Michael Matz Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 24, 2014 at 02:55:07PM +0100, Michael Matz wrote: > Hi, > > On Fri, 21 Feb 2014, Paul E. McKenney wrote: > > > > And with conservative I mean "everything is a source of a dependency, and > > > hence can't be removed, reordered or otherwise fiddled with", and that > > > includes code sequences where no atomic objects are anywhere in sight [1]. > > > In the light of that the only realistic way (meaning to not have to > > > disable optimization everywhere) to implement consume as currently > > > specified is to map it to acquire. At which point it becomes pointless. > > > > No, only memory_order_consume loads and [[carries_dependency]] > > function arguments are sources of dependency chains. > > I don't see [[carries_dependency]] in the C11 final draft (yeah, should > get a real copy, I know, but let's assume it's the same language as the > standard). Therefore, yes, only consume loads are sources of > dependencies. The problem with the definition of the "carries a > dependency" relation is not the sources, but rather where it stops. > It's transitively closed over "value of evaluation A is used as operand in > evaluation B", with very few exceptions as per 5.1.2.4#14. Evaluations > can contain function calls, so if there's _any_ chance that an operand of > an evaluation might even indirectly use something resulting from a consume > load then that evaluation must be compiled in a way to not break > dependency chains. > > I don't see a way to generally assume that e.g. 
the value of a function > argument cannot possibly result from a consume load, therefore the compiler > must assume that all function arguments _can_ result from such loads, and > so must disable all depchain-breaking optimizations (of which there are many). > > > > [1] Simple example of what type of transformations would be disallowed: > > > > > > int getzero (int i) { return i - i; } > > > > This needs to be as follows: > > > > [[carries_dependency]] int getzero(int i [[carries_dependency]]) > > { > > return i - i; > > } > > > > Otherwise dependencies won't get carried through it. > > So, with the above do you agree that in absence of any other magic (see > below) the compiler is not allowed to transform my initial getzero() > (without the carries_dependency markers) implementation into "return 0;" > because of the C11 rules for "carries-a-dependency"? > > If so, do you then also agree that the specification of "carries a > dependency" is somewhat, err, shall we say, overbroad? From what I can see, overbroad. The problem is that the C++11 standard defines how carries-dependency interacts with function calls and returns in 7.6.4, which describes the [[carries_dependency]] attribute. For example, 7.6.4p6 says: Function g’s second parameter has a carries_dependency attribute, but its first parameter does not. Therefore, function h’s first call to g carries a dependency into g, but its second call does not. The implementation might need to insert a fence prior to the second call to g. When C11 declined to take attributes, they also left out the part saying how carries-dependency interacts with functions. :-/ Might be fixed by now, checking up on it. One could argue that the bit about emitting fence instructions at function calls and returns is implied by the as-if rule even without this wording, but... > > > depchains don't matter, could _then_ optimize it to zero. 
But that's > > > insane, especially considering that it's hard to detect if a given context > > > doesn't care for depchains, after all the depchain relation is constructed > > > exactly so that it bleeds into nearly everywhere. So we would most of > > > the time have to assume that the ultimate context will be depchain-aware > > > and therefore disable many transformations. > > > > Any function that does not contain a memory_order_consume load and that > > doesn't have any arguments marked [[carries_dependency]] can be > > optimized just as before. > > And as no such marker exists we must conservatively assume that it's > on _all_ parameters, so I'll stand by my claim. Or that you have to emit a fence instruction when a dependency chain enters or leaves a function in cases where all callers/callees are not visible to the compiler. My preference is that the ordering properties of a carries-dependency chain are implementation-defined at the point that it enters or leaves a function without the marker, but others strongly disagreed. ;-) > > > Then inlining getzero would merely add another "# j.dep = i.dep" > > > relation, so depchains are still there but the value optimization can > > > happen before inlining. Having to do something like that I'd find > > > disgusting, and rather rewrite consume into acquire :) Or make the > > > depchain relation somehow realistically implementable. > > > > I was actually OK with arithmetic cancellation breaking the dependency > > chains. Others on the committee felt otherwise, and I figured that (1) > > I wouldn't be writing that kind of function anyway and (2) they knew > > more about writing compilers than I. I would still be OK saying that > > things like "i-i", "i*0", "i%1", "i&0", "i|~0" and so on just break the > > dependency chain. > > Exactly. I can see the problem that people had with that, though. There > are very many ways to write concealed zeros (or generally neutral elements > of the function in question). 
My getzero() function is one (it could e.g. > be an assembler implementation). The allowance to break dependency chains > would have to apply to such cancellation as well, and so can't simply > itemize all cases in which cancellation is allowed. Rather it would have > had to argue about something like "value dependency", ala "evaluation B > depends on A, if there exist at least two different values A1 and A2 > (results from A), for which evaluation B (with otherwise same operands) > yields different values B1 and B2". And that was in fact one of the arguments used against me. ;-) > Alas, it doesn't, except if you want to understand the term "the value of > A is used as an operand of B" in that way. Even then you'd still have the > second case of the depchain definition, via intermediate (not even atomic) > memory stores and loads to make two evaluations be ordered per > carries-a-dependency. > > And even that understanding of "is used" wouldn't be enough, because there > are cases where the cancellation happens in steps, and where it interacts > with the third clause (transitiveness): Assume this: > > a = something() // evaluation A > b = 1 - a // evaluation B > c = a - 1 + b // evaluation C > > Now, clearly B depends on A. Also C depends on B (because with otherwise > same operands changing just B also changes C), because of transitiveness C > then also depends on A. But equally clearly C was just an elaborate way to > write "0", and so depends on nothing. The problem was of course that A > and B weren't independent when determining the dependencies of C. But > allowing cancellation to break dependency chains would have to allow for > these cases as well. > > So, now, that leaves us basically with depchains forcing us to disable > many useful transformations or finding some other magic. One would be to > just regard all consume loads as acquire loads and be done (and > effectively remove the ill-advised "carries a dependency" relation from > consideration). 
> > You say downthread that it'd also be possible to just emit barriers before > all function calls (I say "all" because the compiler will generally > have applied some transformation that broke depchains if they existed). > That seems to me to be a bigger hammer than just ignoring depchains and > emit acquires instead of consumes (because the latter changes only exactly > where atomics are used, the former seems to me to have unbounded effect). Yep, converting the consume to an acquire is a valid alternative to emitting a memory-barrier instruction prior to entering/exiting the function in question. > So, am I still missing something, or is my understanding of the > carries-a-dependency relation correct and my conclusions merely too > pessimistic? Given the definition as it is, I believe you understand it. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-21 19:13 ` Paul E. McKenney 2014-02-21 22:10 ` Joseph S. Myers 2014-02-24 13:55 ` Michael Matz @ 2014-02-26 13:04 ` Torvald Riegel 2014-02-26 18:27 ` Paul E. McKenney 2 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-26 13:04 UTC (permalink / raw) To: paulmck Cc: Michael Matz, Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Fri, 2014-02-21 at 11:13 -0800, Paul E. McKenney wrote: > On Fri, Feb 21, 2014 at 07:35:37PM +0100, Michael Matz wrote: > > Hi, > > > > On Thu, 20 Feb 2014, Linus Torvalds wrote: > > > > > But I'm pretty sure that any compiler guy must *hate* that current odd > > > dependency-generation part, and if I was a gcc person, seeing that > > > bugzilla entry Torvald pointed at, I would personally want to > > > dismember somebody with a rusty spoon.. > > > > Yes. Defining dependency chains in the way the standard currently seems > > to do must come from people not writing compilers. There's simply no > > sensible way to implement it without being really conservative, because > > the depchains can contain arbitrary constructs including stores, > > loads and function calls but must still be observed. > > > > And with conservative I mean "everything is a source of a dependency, and > > hence can't be removed, reordered or otherwise fiddled with", and that > > includes code sequences where no atomic objects are anywhere in sight [1]. > > In the light of that the only realistic way (meaning to not have to > > disable optimization everywhere) to implement consume as currently > > specified is to map it to acquire. At which point it becomes pointless. > > No, only memory_order_consume loads and [[carries_dependency]] > function arguments are sources of dependency chains. However, that is, given how the standard specifies things, just one of the possible ways for how an implementation can handle this. 
Treating [[carries_dependency]] as a "necessary" annotation to make exploiting mo_consume work in practice is possible, but it's not required by the standard. Also, dependencies are specified to flow through loads and stores (restricted to scalar objects and bitfields), so any load that might load from a dependency-carrying store can also be a source (and that doesn't seem to be restricted by [[carries_dependency]]). ^ permalink raw reply [flat|nested] 285+ messages in thread
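A sketch of the store/load case Torvald describes (an editorial illustration; the names are invented, and the demo runs single-threaded just to show the shape — the ordering question only arises with a concurrent writer):

```cpp
#include <atomic>
#include <cassert>

std::atomic<int*> shared{nullptr};  // published pointer
int payload = 0;                    // data the pointer refers to
int* slot = nullptr;                // ordinary (non-atomic) scalar object

int consume_via_slot() {
    // The consume load heads the dependency chain...
    int* p = shared.load(std::memory_order_consume);
    // ...and the chain is specified to flow through a store to, and a
    // subsequent load from, an ordinary scalar, so the dereference below
    // is still meant to be dependency-ordered after the load above --
    // which is exactly what makes this hard for a compiler to track.
    slot = p;
    int* q = slot;
    return q ? *q : -1;
}

int demo() {
    payload = 42;
    shared.store(&payload, std::memory_order_release);
    return consume_via_slot();
}
```

The point is that nothing about `slot` is annotated: any load that might read a dependency-carrying store becomes a potential dependency source.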
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-26 13:04 ` Torvald Riegel @ 2014-02-26 18:27 ` Paul E. McKenney 0 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-26 18:27 UTC (permalink / raw) To: Torvald Riegel Cc: Michael Matz, Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Wed, Feb 26, 2014 at 02:04:30PM +0100, Torvald Riegel wrote: > On Fri, 2014-02-21 at 11:13 -0800, Paul E. McKenney wrote: > > On Fri, Feb 21, 2014 at 07:35:37PM +0100, Michael Matz wrote: > > > Hi, > > > > > > On Thu, 20 Feb 2014, Linus Torvalds wrote: > > > > > > > But I'm pretty sure that any compiler guy must *hate* that current odd > > > > dependency-generation part, and if I was a gcc person, seeing that > > > > bugzilla entry Torvald pointed at, I would personally want to > > > > dismember somebody with a rusty spoon.. > > > > > > Yes. Defining dependency chains in the way the standard currently seems > > > to do must come from people not writing compilers. There's simply no > > > sensible way to implement it without being really conservative, because > > > the depchains can contain arbitrary constructs including stores, > > > loads and function calls but must still be observed. > > > > > > And with conservative I mean "everything is a source of a dependency, and > > > hence can't be removed, reordered or otherwise fiddled with", and that > > > includes code sequences where no atomic objects are anywhere in sight [1]. > > > In the light of that the only realistic way (meaning to not have to > > > disable optimization everywhere) to implement consume as currently > > > specified is to map it to acquire. At which point it becomes pointless. 
> > > > No, only memory_order_consume loads and [[carries_dependency]] > > function arguments are sources of dependency chains. > > However, that is, given how the standard specifies things, just one of > the possible ways for how an implementation can handle this. Treating > [[carries_dependency]] as a "necessary" annotation to make exploiting > mo_consume work in practice is possible, but it's not required by the > standard. > > Also, dependencies are specified to flow through loads and stores > (restricted to scalar objects and bitfields), so any load that might > load from a dependency-carrying store can also be a source (and that > doesn't seem to be restricted by [[carries_dependency]]). OK, this last is clearly totally unacceptable. :-/ Leaving aside the option of dropping the whole thing for the moment, the only thing that suggests itself is having all dependencies die at a specific point in the code, corresponding to the rcu_read_unlock(). But as far as I can see, that absolutely requires "necessary" parameter and return marking in order to correctly handle nested RCU read-side critical sections in different functions. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 18:11 ` Paul E. McKenney 2014-02-20 18:32 ` Linus Torvalds @ 2014-02-20 18:44 ` Torvald Riegel 2014-02-20 18:56 ` Paul E. McKenney 1 sibling, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-20 18:44 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, 2014-02-20 at 10:11 -0800, Paul E. McKenney wrote: > But yes, the compiler guys would be extremely happy to simply drop > memory_order_consume from the standard, as it is the memory order > that they most love to hate. > > Getting them to agree to any sort of peep-hole optimization semantics > for memory_order_consume is likely problematic. I wouldn't be so pessimistic about that. If the transformations can be shown to be always correct in terms of the semantics specified in the standard, and if the performance win is sufficiently large, why not? Of course, somebody has to volunteer to actually implement it :) ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 18:44 ` Torvald Riegel @ 2014-02-20 18:56 ` Paul E. McKenney 0 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-20 18:56 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, Feb 20, 2014 at 07:44:32PM +0100, Torvald Riegel wrote: > On Thu, 2014-02-20 at 10:11 -0800, Paul E. McKenney wrote: > > But yes, the compiler guys would be extremely happy to simply drop > > memory_order_consume from the standard, as it is the memory order > > that they most love to hate. > > > > Getting them to agree to any sort of peep-hole optimization semantics > > for memory_order_consume is likely problematic. > > I wouldn't be so pessimistic about that. If the transformations can be > shown to be always correct in terms of the semantics specified in the > standard, and if the performance win is sufficiently large, why not? Of > course, somebody has to volunteer to actually implement it :) I guess that there is only one way to find out. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 17:01 ` Linus Torvalds 2014-02-20 18:11 ` Paul E. McKenney @ 2014-02-20 18:23 ` Torvald Riegel [not found] ` <CAHWkzRQZ8+gOGMFNyTKjFNzpUv6d_J1G9KL0x_iCa=YCgvEojQ@mail.gmail.com> 2 siblings, 0 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-20 18:23 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, 2014-02-20 at 09:01 -0800, Linus Torvalds wrote: > On Thu, Feb 20, 2014 at 12:30 AM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > >> > >> So lets make this really simple: if you have a consume->cmp->read, is > >> the ordering of the two reads guaranteed? > > > > Not as far as I know. Also, as far as I know, there is no difference > > between consume and relaxed in the consume->cmp->read case. > > Ok, quite frankly, I think that means that "consume" is misdesigned. > > > The above example can have a return value of 0 if translated > > straightforwardly into either ARM or Power, right? > > Correct. And I think that is too subtle. It's dangerous, it makes code > that *looks* correct work incorrectly, and it actually happens to work > on x86 since x86 doesn't have crap-for-brains memory ordering > semantics. > > > So, if you make one of two changes to your example, then I will agree > > with you. > > No. We're not playing games here. I'm fed up with complex examples > that make no sense. Note that Paul's second suggestion for a change was to just use mo_acquire; that's a simple change, and the easiest option, so it should be just fine. > Nobody sane writes code that does that pointer comparison, and it is > entirely immaterial what the compiler can do behind our backs. The C > standard semantics need to make sense to the *user* (ie programmer), > not to a CPU and not to a compiler. The CPU and compiler are "tools". > They don't matter. 
Their only job is to make the code *work*, dammit. > > So no idiotic made-up examples that involve code that nobody will ever > write and that have subtle issues. > > So the starting point is that (same example as before, but with even > clearer naming): > > Initialization state: > initialized = 0; > value = 0; > > Consumer: > > return atomic_read(&initialized, consume) ? value : -1; > > Writer: > value = 42; > atomic_write(&initialized, 1, release); > > and because the C memory ordering standard is written in such a way > that this is subtly buggy (and can return 0, which is *not* logically > a valid value), then I think the C memory ordering standard is broken. > > That "consumer" memory ordering is dangerous as hell, and it is > dangerous FOR NO GOOD REASON. > > The trivial "fix" to the standard would be to get rid of all the > "carries a dependency" crap, and just say that *anything* that depends > on it is ordered wrt it. > > That just means that on alpha, "consume" implies an unconditional read > barrier (well, unless the value is never used and is loaded just > because it is also volatile), on x86, "consume" is the same as > "acquire" which is just a plain load with ordering guarantees, and on > ARM or power you can still avoid the extra synchronization *if* the > value is used just for computation and for following pointers, but if > the value is used for a comparison, there needs to be a > synchronization barrier. > > Notice? Getting rid of the stupid "carries-dependency" crap from the > standard actually > (a) simplifies the standard Agreed, although it's easy to ignore the parts related to mo_consume, I think. > (b) means that the above obvious example *works* > (c) does not in *any* way make for any less efficient code generation > for the cases that "consume" works correctly for in the current > mis-designed standard. > (d) is actually a hell of a lot easier to explain to a compiler > writer, and I can guarantee that it is simpler to implement too. 
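Linus's writer/consumer example above, spelled out as compilable C++11 (an editorial sketch; run single-threaded below, since the reordering he describes is only observable with a concurrent writer on a weakly ordered machine):

```cpp
#include <atomic>
#include <cassert>

std::atomic<int> initialized{0};
int value = 0;

// Writer: publish value, then set the flag with release semantics.
void writer() {
    value = 42;
    initialized.store(1, std::memory_order_release);
}

// Consumer: Linus's point is that because "initialized" is only *compared*
// (not dereferenced as an address), memory_order_consume gives no ordering
// for the subsequent read of "value" on ARM/Power, so a concurrent reader
// could see 0 -- which is why he considers consume as specified broken.
int consumer() {
    return initialized.load(std::memory_order_consume) ? value : -1;
}

int demo() {
    writer();
    return consumer();
}
```

Replacing memory_order_consume with memory_order_acquire in consumer() is Paul's second suggested change, and makes the pattern safe on all architectures.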
mo_acquire is certainly easier to implement in a compiler. > Why do I claim (d) "it is simpler to implement" - because on ARM/power > you can implement it *exactly* as a special "acquire", with just a > trivial peep-hole special case that follows the use chain of the > acquire op to the consume, and then just drop the acquire bit if the > only use is that compute-to-load chain. That's similar to the way I saw it and described in my reply to your other email (before getting to this email here). It seems that this indeed might be doable transparently in the compiler, without requiring a special mo_acquire variant visible to programmers. > In fact, realistically, the *only* thing you need to actually care > about for the intended use case of "consume" is the question "is the > consuming load immediately consumed as an address (with offset) of a > memory operation. So you don't even need to follow any complicated > computation chain in a compiler - the only case that matters for the > barrier removal optimization is the "oh, I can see that it is only > used as an address to a dereference". To make this mo_acquire optimization apply often, a compiler might have to try to filter out accesses that don't synchronize (e.g., so that an access to a non-shared temporary variable doesn't prevent the optimization). > Seriously. The current standard is broken. Please, let's be precise in such statement, so that everyone actually knows what's meant. The rest of the memory model can be perfectly fine even if you think that mo_consume isn't useful at all. I think your opinion about mo_consume is clear now (and I have concerns about it too, FWIW). If you see issues about all the other parts of the memory model (or the standard), please state these separately. ^ permalink raw reply [flat|nested] 285+ messages in thread
[parent not found: <CAHWkzRQZ8+gOGMFNyTKjFNzpUv6d_J1G9KL0x_iCa=YCgvEojQ@mail.gmail.com>]
* Re: [RFC][PATCH 0/5] arch: atomic rework [not found] ` <CAHWkzRQZ8+gOGMFNyTKjFNzpUv6d_J1G9KL0x_iCa=YCgvEojQ@mail.gmail.com> @ 2014-02-21 19:16 ` Linus Torvalds 2014-02-21 19:41 ` Linus Torvalds [not found] ` <CAHWkzRSO82jU-9dtTEjHaW2FeLcEqdZXxp5Q8cmVTTT9uhZQYw@mail.gmail.com> [not found] ` <CAHWkzRRxqhH+DnuQHu9bM4ywGBen3oqtT8W4Xqt1CFAHy2WQRg@mail.gmail.com> 1 sibling, 2 replies; 285+ messages in thread From: Linus Torvalds @ 2014-02-21 19:16 UTC (permalink / raw) To: p796231 . Cc: Paul McKenney, Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc, Mark Batty On Fri, Feb 21, 2014 at 10:25 AM, Peter Sewell <Peter.Sewell@cl.cam.ac.uk> wrote: > > If one thinks this is too fragile, then simply using memory_order_acquire > and paying the resulting barrier cost (and perhaps hoping that compilers > will eventually be able to optimise some cases of those barriers to > hardware-level dependencies) is the obvious alternative. No, the obvious alternative is to do what we do already, and just do it by hand. Using acquire is *worse* than what we have now. Maybe for some other users, the thing falls out differently. > Many concurrent things will "accidentally" work on x86 - consume is not > special in that respect. No. But if you have something that is mis-designed, easy to get wrong, and untestable, any sane programmer will go "that's bad". > There are two difficulties with this, if I understand correctly what you're > proposing. > > The first is knowing where to stop. No. Read my suggestion. Knowing where to stop is *trivial*. Either the dependency is immediate and obvious, or you treat it like an acquire. Seriously. Any compiler that doesn't turn the dependency chain into SSA or something pretty much equivalent is pretty much a joke. Agreed? So we can pretty much assume that the compiler will have some intermediate representation as part of optimization that is basically SSA. 
So what you do is:

 - build the SSA by doing all the normal parsing and possible
   tree-level optimizations you already do even before getting to the
   SSA stage

 - do all the normal optimizations/simplifications/cse/etc that you do
   normally on SSA

 - add *one* new rule to your SSA simplification that goes something
   like this:

    * when you see a load op that is marked with a "consume" barrier,
      just follow the usage chain that comes from that.

    * if you hit a normal arithmetic op, just follow the result chain
      of that

    * if you hit a memory operation address use, stop and say "looks
      good"

    * if you hit anything else (including a copy/phi/whatever), abort

    * if nothing aborted as part of the walk, you can now just remove
      the "consume" barrier.

You can fancy it up and try to follow more cases, but realistically
the only case that really matters is the "consume" being fed directly
into one or more loads, with possibly an offset calculation in
between.

There are certainly more cases where you could *try* to remove the
barrier, but the thing is, it's never incorrect to not remove it, so
any time you get bored or hit any complication at all, just do the
"abort" part.

I *guarantee* that if you describe this to a compiler writer, he will
tell you that my scheme is about a billion times simpler than the
current standard wording.  Especially after you've pointed him to that
gcc bugzilla entry and explained to him how the current standard cares
about those kinds of made-up syntactic chains that he likely removed
quite early, possibly even as he was generating the semantic tree.

Try it.  I dare you.

So if you want to talk about "difficulties", the current C standard
loses.

> The second is the proposal in later mails to use some notion of "semantic"
> dependency instead of this syntactic one.

Bah.  The C standard does that all over.  It's called "as-if".
The C standard talks about how the compiler can do pretty much
whatever it likes, as long as the end result acts the same in the
virtual C machine.  So claiming that "semantics" being meaningful is
somehow complex is bogus.  People do that all the time.

If you make it clear that the dependency chain is through the *value*,
not syntax, and that the value can be optimized all the usual ways,
it's quite clear what the end result is.  Any operation that actually
meaningfully uses the value is serialized with the load, and if there
is no meaningful use that would affect the end result in the virtual
machine, then there is no barrier.

Why would this be any different, especially since it's easy to
understand both for a human and a compiler?

            Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
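The SSA walk sketched above can be modeled in a few lines of C.  This is a toy illustration only: the IR node kinds, struct layout, and function names here are invented for the sketch, not taken from gcc or any real compiler.  It implements exactly the rule Linus describes: follow uses of the consume load; arithmetic continues the walk, a memory-operation address use terminates a chain as "looks good", and anything else aborts, leaving the barrier in place.

```c
#include <stddef.h>

/* Toy IR for illustration -- invented node kinds, not a real compiler's. */
enum node_kind { ARITH_OP, MEM_ADDRESS_USE, OTHER_OP };

struct ssa_node {
    enum node_kind kind;
    const struct ssa_node *uses[4];  /* nodes consuming this one's result */
    size_t n_uses;
};

/* Walk the usage chains from a consume load.  Returns 1 if every chain
 * ends in a memory-operation address use, passing only through plain
 * arithmetic -- i.e. the consume barrier may be removed.  Any other node
 * kind (copy/phi/call/...) aborts, which just means "keep the barrier",
 * so punting is always safe. */
static int barrier_removable(const struct ssa_node *n)
{
    for (size_t i = 0; i < n->n_uses; i++) {
        const struct ssa_node *u = n->uses[i];
        switch (u->kind) {
        case MEM_ADDRESS_USE:
            break;                       /* this chain looks good */
        case ARITH_OP:
            if (!barrier_removable(u))   /* follow the result chain */
                return 0;
            break;
        default:
            return 0;                    /* abort: keep the barrier */
        }
    }
    return 1;
}

/* Consume load feeding an offset calculation feeding a load address. */
int demo_direct_chain(void)
{
    static const struct ssa_node addr = { MEM_ADDRESS_USE, {0}, 0 };
    static const struct ssa_node offs = { ARITH_OP, { &addr }, 1 };
    static const struct ssa_node load = { OTHER_OP, { &offs }, 1 };
    return barrier_removable(&load);
}

/* Consume value flowing into a phi: the walk must abort. */
int demo_phi_abort(void)
{
    static const struct ssa_node phi  = { OTHER_OP, {0}, 0 };
    static const struct ssa_node load = { OTHER_OP, { &phi }, 1 };
    return barrier_removable(&load);
}
```

Note how "abort" is never a correctness issue, only a missed optimization, which is the asymmetry the mail relies on.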
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-21 19:16 ` Linus Torvalds
@ 2014-02-21 19:41   ` Linus Torvalds
  2014-02-21 19:48     ` Peter Sewell
  [not found]          ` <CAHWkzRSO82jU-9dtTEjHaW2FeLcEqdZXxp5Q8cmVTTT9uhZQYw@mail.gmail.com>
  1 sibling, 1 reply; 285+ messages in thread
From: Linus Torvalds @ 2014-02-21 19:41 UTC (permalink / raw)
To: p796231 .
Cc: Paul McKenney, Torvald Riegel, Will Deacon, Peter Zijlstra,
    Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
    akpm, mingo, gcc, Mark Batty

On Fri, Feb 21, 2014 at 11:16 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Why would this be any different, especially since it's easy to
> understand both for a human and a compiler?

Btw, the actual data path may actually be semantically meaningful even
at a processor level.

For example, let's look at that gcc bugzilla that got mentioned
earlier, and let's assume that gcc is fixed to follow the "arithmetic
is always meaningful, even if it is only syntactic" rule to the
letter.  So we have that gcc bugzilla use-case:

    flag ? *(q + flag - flag) : 0;

and let's say that the fixed compiler now generates the code with the
data dependency that is actually suggested in that bugzilla entry:

    and w2, w2, #0
    ldr w0, [x1, w2]

ie the CPU actually sees that address data dependency.  Now everything
is fine, right?

Wrong.

It is actually quite possible that the CPU sees the "and with zero"
and *breaks the dependencies on the incoming value*.

Modern CPUs literally do things like that.  Seriously.  Maybe not that
particular one, but you'll sometimes find that the CPU - in the
instruction decoding phase (ie very early in the pipeline) - notices
certain patterns that generate constants, and actually drops the data
dependency on the "incoming" registers.

On x86, generating zero using "xor" on the register with itself is one
such known sequence.

Can you guarantee that powerpc doesn't do the same for "and r,r,#0"?
Or what if the compiler generated the much more obvious

    sub w2,w2,w2

for that "+flag-flag"?  Are you really 100% sure that the CPU won't
notice that that is just a way to generate a zero, and doesn't depend
on the incoming values?

Because I'm not.  I know CPU designers that do exactly this.

So I would actually and seriously argue that the whole C standard
attempt to use a syntactic data dependency as a determination of
whether two things are serialized is wrong, and that you actually
*want* to have the compiler optimize away false data dependencies.

Because people playing tricks with "+flag-flag" and thinking that that
somehow generates a data dependency - that's *wrong*.  It's not just
the compiler that decides "that's obviously nonsense, I'll optimize it
away".  The CPU itself can do it.

So my "actual semantic dependency" model is seriously more likely to
be *correct*.  Not just at a compiler level.

Btw, with any tricks like that, I would also take a second look at the
assembler and the linker.  Many assemblers do some trivial
optimizations too.  Are you sure that "and w2, w2, #0" really ends up
being encoded as an "and"?  Maybe the assembler says "I can do that as
a 'mov w2,#0' instead"?  Who knows?  Even power and ARM have their
variable-sized encodings (there are some "compressed executable"
embedded power processors, and there is obviously Thumb2), and many
assemblers end up trying to use equivalent "small" instructions.

So the whole "fake data dependency" thing is just dangerous on so many
levels.

MUCH more dangerous than my "actual real dependency" model.

            Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
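The bugzilla idiom under discussion is easy to state in compilable C, and doing so makes the point concrete: the "+ flag - flag" term never changes which address is loaded, which is exactly why a compiler, assembler, or CPU is entitled to discard the dependency.  The function names here are invented for this sketch.

```c
/* A syntactic-only data dependency: "flag" feeds the load address, but
 * the address is the same for every value of flag.  A compiler or CPU
 * that drops the dependency changes nothing observable in the value. */
int read_with_fake_dependency(int *q, int flag)
{
    return flag ? *(q + flag - flag) : 0;
}

/* The semantically identical function a compiler may reduce it to. */
int read_plain(int *q, int flag)
{
    return flag ? *q : 0;
}

static int cell = 42;

/* Small driver so the equivalence can be checked directly. */
int demo_fake(int flag)
{
    return read_with_fake_dependency(&cell, flag);
}

int demo_plain(int flag)
{
    return read_plain(&cell, flag);
}
```

Both functions compute the same value for every input, which is the "false data dependency" the mail argues should be optimizable away.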
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-21 19:41 ` Linus Torvalds
@ 2014-02-21 19:48   ` Peter Sewell
  0 siblings, 0 replies; 285+ messages in thread
From: Peter Sewell @ 2014-02-21 19:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul McKenney, Torvald Riegel, Will Deacon, Peter Zijlstra,
    Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
    akpm, mingo, gcc, Mark Batty

On 21 February 2014 19:41, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Fri, Feb 21, 2014 at 11:16 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>
>> Why would this be any different, especially since it's easy to
>> understand both for a human and a compiler?
>
> Btw, the actual data path may actually be semantically meaningful even
> at a processor level.
>
> For example, let's look at that gcc bugzilla that got mentioned
> earlier, and let's assume that gcc is fixed to follow the "arithmetic
> is always meaningful, even if it is only syntactic" rule to the
> letter.  So we have that gcc bugzilla use-case:
>
>     flag ? *(q + flag - flag) : 0;
>
> and let's say that the fixed compiler now generates the code with the
> data dependency that is actually suggested in that bugzilla entry:
>
>     and w2, w2, #0
>     ldr w0, [x1, w2]
>
> ie the CPU actually sees that address data dependency.  Now everything
> is fine, right?
>
> Wrong.
>
> It is actually quite possible that the CPU sees the "and with zero"
> and *breaks the dependencies on the incoming value*.

For reference: the Power and ARM architectures explicitly guarantee
not to do this, the architects are quite clear about it, and we've
tested (some cases) rather thoroughly.  I can't speak about other
architectures.

> Modern CPUs literally do things like that.  Seriously.
> Maybe not that particular one, but you'll sometimes find that the CPU
> - in the instruction decoding phase (ie very early in the pipeline) -
> notices certain patterns that generate constants, and actually drops
> the data dependency on the "incoming" registers.
>
> On x86, generating zero using "xor" on the register with itself is
> one such known sequence.
>
> Can you guarantee that powerpc doesn't do the same for "and r,r,#0"?
> Or what if the compiler generated the much more obvious
>
>     sub w2,w2,w2
>
> for that "+flag-flag"?  Are you really 100% sure that the CPU won't
> notice that that is just a way to generate a zero, and doesn't depend
> on the incoming values?
>
> Because I'm not.  I know CPU designers that do exactly this.
>
> So I would actually and seriously argue that the whole C standard
> attempt to use a syntactic data dependency as a determination of
> whether two things are serialized is wrong, and that you actually
> *want* to have the compiler optimize away false data dependencies.
>
> Because people playing tricks with "+flag-flag" and thinking that
> that somehow generates a data dependency - that's *wrong*.  It's not
> just the compiler that decides "that's obviously nonsense, I'll
> optimize it away".  The CPU itself can do it.
>
> So my "actual semantic dependency" model is seriously more likely to
> be *correct*.  Not just at a compiler level.
>
> Btw, with any tricks like that, I would also take a second look at
> the assembler and the linker.  Many assemblers do some trivial
> optimizations too.

That's certainly something worth checking.

> Are you sure that "and w2, w2, #0" really ends up being encoded as an
> "and"?  Maybe the assembler says "I can do that as a 'mov w2,#0'
> instead"?  Who knows?  Even power and ARM have their variable-sized
> encodings (there are some "compressed executable" embedded power
> processors, and there is obviously Thumb2), and many assemblers end
> up trying to use equivalent "small" instructions.
>
> So the whole "fake data dependency" thing is just dangerous on so
> many levels.
>
> MUCH more dangerous than my "actual real dependency" model.
>
>             Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
[parent not found: <CAHWkzRSO82jU-9dtTEjHaW2FeLcEqdZXxp5Q8cmVTTT9uhZQYw@mail.gmail.com>]
* Re: [RFC][PATCH 0/5] arch: atomic rework
  [not found] ` <CAHWkzRSO82jU-9dtTEjHaW2FeLcEqdZXxp5Q8cmVTTT9uhZQYw@mail.gmail.com>
@ 2014-02-21 20:22   ` Linus Torvalds
  0 siblings, 0 replies; 285+ messages in thread
From: Linus Torvalds @ 2014-02-21 20:22 UTC (permalink / raw)
To: p796231 .
Cc: Paul McKenney, Torvald Riegel, Will Deacon, Peter Zijlstra,
    Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
    akpm, mingo, gcc, Mark Batty

On Fri, Feb 21, 2014 at 11:43 AM, Peter Sewell
<Peter.Sewell@cl.cam.ac.uk> wrote:
>
> You have to track dependencies through other assignments, e.g. simple x=y

That is all visible in the SSA form.  Variable assignment has been
converted to some use of the SSA node that generated the value.  The
use might be a phi node or a cast op, or maybe it's just a memory
store op, but the whole point of SSA is that there is one single node
that creates the data (in this case that would be the "load" op with
the associated consume barrier - that barrier might be part of the
load op itself, or it might be implemented as a separate SSA node that
consumes the result of the load and generates a new pseudo), and the
uses of the result are all visible from that.

And yes, there might be a lot of users.  But any complex case you just
punt on - and the difference here is that since "punt" means "leave
the barrier in place", it's never a correctness issue.  So yeah, it
could be somewhat expensive, although you can always bound that
expense by just punting.

But the dependencies in SSA form are no more complex than the
dependencies the C standard talks about now, and in SSA form they are
at least really easy to follow.  So if they are complex and expensive
in SSA form, I'd expect them to be *worse* in the current "depends-on"
syntax form.

            Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
[parent not found: <CAHWkzRRxqhH+DnuQHu9bM4ywGBen3oqtT8W4Xqt1CFAHy2WQRg@mail.gmail.com>]
* Re: [RFC][PATCH 0/5] arch: atomic rework
  [not found] ` <CAHWkzRRxqhH+DnuQHu9bM4ywGBen3oqtT8W4Xqt1CFAHy2WQRg@mail.gmail.com>
@ 2014-02-21 19:24   ` Paul E. McKenney
  0 siblings, 0 replies; 285+ messages in thread
From: Paul E. McKenney @ 2014-02-21 19:24 UTC (permalink / raw)
To: Peter Sewell
Cc: Linus Torvalds, Torvald Riegel, Will Deacon, Peter Zijlstra,
    Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
    akpm, mingo, gcc, Mark Batty

On Fri, Feb 21, 2014 at 06:28:05PM +0000, Peter Sewell wrote:
> On 20 February 2014 17:01, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:

[ . . . ]

> > > So, if you make one of two changes to your example, then I will
> > > agree with you.
> >
> > No.  We're not playing games here.  I'm fed up with complex examples
> > that make no sense.
> >
> > Nobody sane writes code that does that pointer comparison, and it is
> > entirely immaterial what the compiler can do behind our backs.  The
> > C standard semantics need to make sense to the *user* (ie
> > programmer), not to a CPU and not to a compiler.  The CPU and
> > compiler are "tools".  They don't matter.  Their only job is to make
> > the code *work*, dammit.
> >
> > So no idiotic made-up examples that involve code that nobody will
> > ever write and that have subtle issues.
> >
> > So the starting point is that (same example as before, but with even
> > clearer naming):
> >
> >     Initialization state:
> >         initialized = 0;
> >         value = 0;
> >
> >     Consumer:
> >         return atomic_read(&initialized, consume) ? value : -1;
> >
> >     Writer:
> >         value = 42;
> >         atomic_write(&initialized, 1, release);
> >
> > and because the C memory ordering standard is written in such a way
> > that this is subtly buggy (and can return 0, which is *not*
> > logically a valid value), then I think the C memory ordering
> > standard is broken.
> >
> > That "consume" memory ordering is dangerous as hell, and it is
> > dangerous FOR NO GOOD REASON.
> >
> > The trivial "fix" to the standard would be to get rid of all the
> > "carries a dependency" crap, and just say that *anything* that
> > depends on it is ordered wrt it.
>
> There are two difficulties with this, if I understand correctly what
> you're proposing.
>
> The first is knowing where to stop.  If one includes all data and
> control dependencies, pretty much all the continuation of execution
> would depend on the consume read, so the compiler would eventually
> have to give up and insert a gratuitous barrier.  One might imagine
> somehow annotating the return_expensive_system_value() you have below
> to say "stop dependency tracking at the return" (thereby perhaps
> enabling the compiler to optimise the barrier that you'd need in h/w
> to order the Linus-consume-read of initialised and the non-atomic
> read of calculated, replacing it by a compiler-introduced artificial
> dependency), and indeed that's roughly what the standard's
> kill_dependency does for consumes.

One way to tell the compiler where to stop would be to place markers
in the source code saying where dependencies stop.  These markers
could be provided by the definitions of the current rcu_read_unlock()
tags in the Linux kernel (and elsewhere, for that matter).  These
would be overridden by [[carries_dependency]] tags on function
arguments and return values, which is needed to handle the possibility
of nested RCU read-side critical sections.

> The second is the proposal in later mails to use some notion of
> "semantic" dependency instead of this syntactic one.  That's maybe
> attractive at first sight, but rather hard to define in a decent way
> in general.  To define whether the consume load can "really" affect
> some subsequent value, you need to know about all the set of possible
> executions of the program - which is exactly what we have to define.
>
> For syntactic dependencies, in contrast, you can at least tell
> whether they exist by examining the source code you have in front of
> you.  The fact that artificial dependencies like (&x + y-y) are
> significant is (I guess) basically incidental at the C level -
> sometimes things like this are the best idiom to enforce ordering at
> the assembly level, but in C I imagine they won't normally arise.  If
> they do, it might be nicer to have a more informative syntax, eg
> (&x + dependency(y)).

This was in fact one of the arguments put forward in favor of carrying
dependencies through things like "y-y" back in the 2007-8 timeframe.

Can't say that I am much of a fan of manually tagging all
dependencies: Machines are much better at that sort of thing than are
humans.  But just out of curiosity, did you instead mean
(&x + dependency(y-y)) or some such?

            Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
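The kill_dependency marker Peter and Paul discuss above is real C11 (7.17.3.1): it is the standard's existing "stop dependency tracking here" annotation.  A minimal sketch of its use, with the node type and the publish/consume pattern invented for illustration:

```c
#include <stdatomic.h>

/* Invented example types/names; only kill_dependency and the memory
 * orders are from the standard. */
struct node { int value; };

static struct node the_node = { 42 };
static struct node *_Atomic head = &the_node;

int consumer(void)
{
    struct node *p = atomic_load_explicit(&head, memory_order_consume);
    /* kill_dependency (C11 7.17.3.1) terminates the dependency chain:
     * the load of p->value is still dependency-ordered after the
     * consume load, but the *result* carries no dependency onward, so
     * the compiler need not track (or fence) further uses. */
    return kill_dependency(p->value);
}
```

This is the per-expression analogue of the per-function [[carries_dependency]] attribute mentioned above: one bounds tracking inside an expression, the other across call boundaries.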
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20  8:30 ` Paul E. McKenney
  2014-02-20  9:20   ` Paul E. McKenney
  2014-02-20 17:01   ` Linus Torvalds
@ 2014-02-20 17:54   ` Torvald Riegel
  2014-02-20 18:11     ` Paul E. McKenney
  2 siblings, 1 reply; 285+ messages in thread
From: Torvald Riegel @ 2014-02-20 17:54 UTC (permalink / raw)
To: paulmck
Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
    Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
    akpm, mingo, gcc

On Thu, 2014-02-20 at 00:30 -0800, Paul E. McKenney wrote:
> Well, all the compilers currently convert consume to acquire, so you
> have your wish there.  Of course, that also means that they generate
> actual unneeded memory-barrier instructions, which seems extremely
> sub-optimal to me.

GCC doesn't currently, but it also doesn't seem to track the
dependencies; that's a bug:
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 17:54 ` Torvald Riegel
@ 2014-02-20 18:11   ` Paul E. McKenney
  0 siblings, 0 replies; 285+ messages in thread
From: Paul E. McKenney @ 2014-02-20 18:11 UTC (permalink / raw)
To: Torvald Riegel
Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
    Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
    akpm, mingo, gcc

On Thu, Feb 20, 2014 at 06:54:06PM +0100, Torvald Riegel wrote:
> On Thu, 2014-02-20 at 00:30 -0800, Paul E. McKenney wrote:
> > Well, all the compilers currently convert consume to acquire, so
> > you have your wish there.  Of course, that also means that they
> > generate actual unneeded memory-barrier instructions, which seems
> > extremely sub-optimal to me.
>
> GCC doesn't currently, but it also doesn't seem to track the
> dependencies; that's a bug:
> http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448

Ah, cool!

            Thanx, Paul

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20  4:43 ` Linus Torvalds
  2014-02-20  8:30   ` Paul E. McKenney
@ 2014-02-20 17:49   ` Torvald Riegel
  2014-02-20 18:25     ` Linus Torvalds
  1 sibling, 1 reply; 285+ messages in thread
From: Torvald Riegel @ 2014-02-20 17:49 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul McKenney, Will Deacon, Peter Zijlstra,
    Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
    akpm, mingo, gcc

On Wed, 2014-02-19 at 20:43 -0800, Linus Torvalds wrote:

[Paul has already answered many of your questions, and my reply to
your previous email should also answer some.]

> If the consumer of an atomic load isn't a pointer chasing operation,
> then the consume should be defined to be the same as acquire.  None
> of this "conditionals break consumers".  No, conditionals on the
> dependency path should turn consumers into acquire, because otherwise
> the "consume" load is dangerous as hell.

Yes, mo_consume is more tricky than mo_acquire.

However, that has an advantage because you can avoid getting stronger
barriers if you don't need them (ie, you can avoid the "auto-upgrade
to acquire" you seem to have in mind).

The auto-upgrade would be a possible semantics, I agree.

Another option may be to let an implementation optimize the HW
barriers that it uses for mo_acquire.  That is, if the compiler sees
that (1) the result of an mo_acquire load is used on certain code
paths *only* for consumers that carry the dependency *and* (2) there
are no other operations on that code path that can possibly rely on
the mo_acquire ordering guarantees, then the compiler can use a weaker
HW barrier on archs such as PowerPC or ARM.  That is similar to the
rules for mo_consume you seem to have in mind, but makes it a compiler
optimization on mo_acquire.
However, the compiler has to be conservative here, so having a
mo_consume that is trickier to use but doesn't ever silently introduce
stronger HW barriers seems to be useful (modulo the difficulties
regarding how it's currently specified in the standard).

> And if the definition of acquire doesn't include the control
> dependency either, then the C atomic memory model is just completely
> and utterly broken, since the above *trivial* and clearly useful
> example is broken.

In terms of the model, if you establish a synchronizes-with using a
reads-from that has (or crosses) a release/acquire memory-order pair,
then this synchronizes-with will also order other operations that it's
sequenced-before (see the composition of synchronizes-with and
sequenced-before in the inter-thread-happens-before definition in
n3132 6.15).  So yes, mo_acquire does take the logical "control
dependencies" / sequenced-before into account.

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 17:49 ` Torvald Riegel
@ 2014-02-20 18:25   ` Linus Torvalds
  2014-02-20 19:02     ` Linus Torvalds
  0 siblings, 1 reply; 285+ messages in thread
From: Linus Torvalds @ 2014-02-20 18:25 UTC (permalink / raw)
To: Torvald Riegel
Cc: Paul McKenney, Will Deacon, Peter Zijlstra,
    Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
    akpm, mingo, gcc

On Thu, Feb 20, 2014 at 9:49 AM, Torvald Riegel <triegel@redhat.com> wrote:
>
> Yes, mo_consume is more tricky than mo_acquire.
>
> However, that has an advantage because you can avoid getting stronger
> barriers if you don't need them (ie, you can avoid the "auto-upgrade
> to acquire" you seem to have in mind).

Oh, I agree about that part - I very much understand the reason for
"consume", and I can see how it is more relaxed than "acquire" under
many circumstances.

I just think that you actually *do* want to have "consume" even for
flag values, exactly *because* it is potentially cheaper than acquire.

In fact, I'd argue that making consume reliable in the face of control
dependencies is actually a *good* thing.  It may not matter for
something like x86, where consume and acquire end up with the same
simple load, but even there it might relax instruction scheduling a
bit, since a "consume" would have a barrier just to the *users* of the
value loaded, while "acquire" would still have a scheduling barrier to
any subsequent operations.

So I claim that for a sequence like my example, where the reader
basically does something like

    load_atomic(&initialized, consume) ? value : -1;

the "consume" version can actually generate better code than
"acquire" - if "consume" is specified the way *I* specified it.

The way the C standard specifies it, the above code is *buggy*.
Agreed?  It's really really subtly buggy, and I think that bug is not
only a real danger, I think it is logically hard to understand why.
The bug only makes sense to people who understand how memory ordering
and branch prediction interact.

The way *I* suggested "consume" be implemented, the above not only
works and is sensible, it actually generates possibly better code than
forcing the programmer to use the (illogical) "acquire" operation.

Why?  Let me give you another - completely realistic, even if
obviously a bit made up - example:

    int return_expensive_system_value(void)
    {
        static atomic_t initialized;
        static int calculated;

        if (atomic_read(&initialized, mo_consume))
            return calculated;

        /* let's say that this code opens /proc/cpuinfo and counts
           the number of CPUs or whatever ... */
        calculated = read_value_from_system_files();
        atomic_write(&initialized, 1, mo_release);
        return calculated;
    }

and let's all agree that this is a somewhat realistic example, and we
can imagine why/how somebody would write code like this.  It's
basically a very common lazy initialization pattern, you'll find this
in libraries, in kernels, in application code yadda yadda.  No
argument?

Now, let's assume that it turns out that this value ends up being
really performance-critical, so the programmer makes the fast path an
inline function, tells the compiler that the "initialized" read is
likely, and generally wants the compiler to optimize it to hell and
back.  Still sounds reasonable and realistic?

In other words, the *expected* code sequence for this is (on x86,
which doesn't need any barriers):

    cmpl $0, initialized
    je unlikely_out_of_line_case
    movl calculated, eax

and on ARM/power you'd see a "sync" instruction or whatever.

So far "acquire" and "consume" have exactly the same code generation
on power or x86, so your argument can be: "Ok, so let's just use the
inconvenient and hard-to-understand 'consume' semantics that the
current standard has, and tell the programmer that he should use
'acquire' and not worry his little head about the difference because
he will never understand it anyway".
Sure, that would be an inconvenience for programmers, but hey, they're
programming in C or C++, so they are *expected* to be manly men or
womanly women, and a little illogical inconvenience never hurt
anybody.  After all, compared to the aliasing rules, that "use
acquire, not consume" rule is positively *simple*, no?

Are we all in agreement so far?

But no, the "consume" thing can actually generate better code.
Trivial example:

    int my_thread_value;
    extern int magic_system_multiplier;

    my_thread_value = return_expensive_system_value();
    my_thread_value *= magic_system_multiplier;

and in the "acquire" model, the "acquire" itself means that the load
from magic_system_multiplier is now constrained by the acquire memory
ordering on "initialized".

While in my *sane* model, where you can consume things even if they
then result in control dependencies, there will still eventually be a
"sync" instruction on powerpc (because you really need one between the
load of 'initialized' and the load of 'calculated'), but the compiler
would be free to schedule the load of 'magic_system_multiplier'
earlier.

So as far as I can tell, we want the "consume" memory ordering to
honor *all* dependencies, because

 - it's simpler
 - it's more logical
 - it's less error-prone
 - and it allows better code generation

Hmm?

            Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
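The lazy-initialization pattern above is written in kernel-style pseudo-atomics; a portable C11 rendering looks like this.  It uses memory_order_acquire on the fast path because, as noted elsewhere in the thread, current compilers implement consume as acquire anyway; read_value_from_system_files() is a stand-in returning a fixed number rather than parsing /proc/cpuinfo.

```c
#include <stdatomic.h>

static atomic_int initialized;   /* static atomics start at zero */
static int calculated;

/* Stand-in for the expensive computation in the mail above. */
static int read_value_from_system_files(void)
{
    return 8;
}

/* Lazy initialization: fast path is a single acquire load.  If two
 * threads race on the first call, both may compute the value; that is
 * harmless here because the computation is idempotent, as in the
 * original example. */
int return_expensive_system_value(void)
{
    if (atomic_load_explicit(&initialized, memory_order_acquire))
        return calculated;

    calculated = read_value_from_system_files();
    atomic_store_explicit(&initialized, 1, memory_order_release);
    return calculated;
}
```

The release store publishes "calculated"; the acquire load on the fast path is what guarantees a reader that sees initialized == 1 also sees the computed value, which is exactly the ordering the whole argument is about.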
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 18:25 ` Linus Torvalds
@ 2014-02-20 19:02   ` Linus Torvalds
  2014-02-20 19:06     ` Linus Torvalds
  0 siblings, 1 reply; 285+ messages in thread
From: Linus Torvalds @ 2014-02-20 19:02 UTC (permalink / raw)
To: Torvald Riegel
Cc: Paul McKenney, Will Deacon, Peter Zijlstra,
    Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
    akpm, mingo, gcc

On Thu, Feb 20, 2014 at 10:25 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> While in my *sane* model, where you can consume things even if they
> then result in control dependencies, there will still eventually be a
> "sync" instruction on powerpc (because you really need one between
> the load of 'initialized' and the load of 'calculated'), but the
> compiler would be free to schedule the load of
> 'magic_system_multiplier' earlier.

Actually, "consume" is more interesting than that.  Looking at the
bugzilla entry Torvald pointed at, it has the trick to always turn any
"consume" dependency into an address data dependency.

So another reason why you *want* to allow "consume" + "control
dependency" is that it opens up the window for many more interesting
and relevant optimizations than "acquire" does.

Again, let's take that "trivial" expression:

    return atomic_read(&initialized, consume) ? value : -1;

and the compiler can actually turn this into an interesting address
data dependency and optimize it to basically use address arithmetic
and turn it into something like

    return *(&value + (&((int)-1)-value)*!atomic_read(&initialized, consume));

Of course, the *programmer* could have done that himself, but the
above is actually a *pessimization* on x86 or other strongly ordered
machines, so doing it at a source code level is actually a bad idea
(not to mention that it's horribly unreadable).
I could easily have gotten the address generation trick above wrong
(see my comment about "horribly unreadable" and no sane person doing
this at a source level), but that "complex" expression is not
necessarily at all complex for a compiler.

If "value" is a static variable, the compiler could create another
read-only static variable that contains that "-1" value, and the
difference in addresses would be a link-time constant, so it would not
necessarily be all that ugly from a code generation standpoint.

There are other ways to turn it into an address dependency, so the
whole "consume as an input to conditionals" really does seem to have
several optimization advantages (over "acquire").

Again, the way I'd expect a compiler writer to actually *do* this is
to just default to "ac

            Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
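The pointer-arithmetic expression above is, as the mail itself says, easy to get wrong at the source level.  The same idea - make the flag select *which address* is loaded, so the final load carries an address dependency rather than a control dependency - can be restated portably; the names here are invented for this sketch, not from the thread.

```c
static int value = 42;
static int minus_one = -1;

/* Branch-free select: the flag picks an address out of a table, so the
 * final load has an address data dependency on the flag value instead
 * of sitting behind a (predictable, speculatable) branch.  This is a
 * readable restatement of the "&value + (...) * !flag" trick above. */
int select_by_address(int flag)
{
    int *candidates[2] = { &minus_one, &value };
    return *candidates[flag != 0];
}
```

On a strongly ordered machine this is a pessimization versus a plain branch, which is the mail's point about why the compiler, not the programmer, should perform such a transformation where it pays off.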
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20 19:02 ` Linus Torvalds
@ 2014-02-20 19:06   ` Linus Torvalds
  0 siblings, 0 replies; 285+ messages in thread
From: Linus Torvalds @ 2014-02-20 19:06 UTC (permalink / raw)
To: Torvald Riegel
Cc: Paul McKenney, Will Deacon, Peter Zijlstra,
    Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
    akpm, mingo, gcc

On Thu, Feb 20, 2014 at 11:02 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Again, the way I'd expect a compiler writer to actually *do* this is
> to just default to "ac

Oops, pressed send by mistake too early.  I was almost done:

I'd expect a compiler to just default to "acquire" semantics, but then
have a few "obvious peephole" optimizations for cases that it
encounters and where it is easy to replace the synchronization point
with just an address dependency.

            Linus

^ permalink raw reply	[flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-20  4:01 ` Paul E. McKenney
  2014-02-20  4:43   ` Linus Torvalds
@ 2014-02-20 17:26   ` Torvald Riegel
  2014-02-20 18:18     ` Paul E. McKenney
  1 sibling, 1 reply; 285+ messages in thread
From: Torvald Riegel @ 2014-02-20 17:26 UTC (permalink / raw)
To: paulmck
Cc: Linus Torvalds, Will Deacon, Peter Zijlstra,
    Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel,
    akpm, mingo, gcc

On Wed, 2014-02-19 at 20:01 -0800, Paul E. McKenney wrote:
> On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote:
> > On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel
> > <triegel@redhat.com> wrote:
> > > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote:
> > >>
> > >> Can you point to it?  Because I can find a draft standard, and it
> > >> sure as hell does *not* contain any clarity of the model.  It has
> > >> a *lot* of verbiage, but it's pretty much impossible to actually
> > >> understand, even for somebody who really understands memory
> > >> ordering.
> > >
> > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> > > This has an explanation of the model up front, and then the
> > > detailed formulae in Section 6.  This is from 2010, and there
> > > might have been smaller changes since then, but I'm not aware of
> > > any bigger ones.
> >
> > Ahh, this is different from what others pointed at.  Same people,
> > similar name, but not the same paper.
> >
> > I will read this version too, but from reading the other one and
> > the standard in parallel and trying to make sense of it, it seems
> > that I may have originally misunderstood part of the whole control
> > dependency chain.
> >
> > The fact that the left side of "? :", "&&" and "||" breaks data
> > dependencies made me originally think that the standard tried very
> > hard to break any control dependencies.  Which I felt was insane,
> > when then some of the examples literally were about the testing of
> > the value of an atomic read.  The data dependency matters quite a
> > bit.
The > > fact that the other "Mathematical" paper then very much talked about > > consume only in the sense of following a pointer made me think so even > > more. > > > > But reading it some more, I now think that the whole "data dependency" > > logic (which is where the special left-hand side rule of the ternary > > and logical operators come in) are basically an exception to the rule > > that sequence points end up being also meaningful for ordering (ok, so > > C11 seems to have renamed "sequence points" to "sequenced before"). > > > > So while an expression like > > > > atomic_read(p, consume) ? a : b; > > > > doesn't have a data dependency from the atomic read that forces > > serialization, writing > > > > if (atomic_read(p, consume)) > > a; > > else > > b; > > > > the standard *does* imply that the atomic read is "happens-before" wrt > > "a", and I'm hoping that there is no question that the control > > dependency still acts as an ordering point. > > The control dependency should order subsequent stores, at least assuming > that "a" and "b" don't start off with identical stores that the compiler > could pull out of the "if" and merge. The same might also be true for ?: > for all I know. (But see below) I don't think this is quite true. I agree that a conditional store will not be executed speculatively (note that if it would happen in both the then and the else branch, it's not conditional); so, the store in "a;" (assuming it would be a store) won't happen unless the thread can really observe a true value for p. However, this is *this thread's* view of the world, but not guaranteed to constrain how any other thread sees the state. mo_consume does not contribute to inter-thread-happens-before in the same way that mo_acquire does (which *does* put a constraint on i-t-h-b, and thus enforces a global constraint that all threads have to respect). Is it clear which distinction I'm trying to show here? 
> That said, in this case, you could substitute relaxed for consume and get > the same effect. The return value from atomic_read() gets absorbed into > the "if" condition, so there is no dependency-ordered-before relationship, > so nothing for consume to do. > > One caution... The happens-before relationship requires you to trace a > full path between the two operations of interest. This is illustrated > by the following example, with both x and y initially zero: > > T1: atomic_store_explicit(&x, 1, memory_order_relaxed); > r1 = atomic_load_explicit(&y, memory_order_relaxed); > > T2: atomic_store_explicit(&y, 1, memory_order_relaxed); > r2 = atomic_load_explicit(&x, memory_order_relaxed); > > There is a happens-before relationship between T1's load and store, > and another happens-before relationship between T2's load and store, > but there is no happens-before relationship from T1 to T2, and none > in the other direction, either. And you don't get to assume any > ordering based on reasoning about these two disjoint happens-before > relationships. > > So it is quite possible for r1==1&&r2==1 after both threads complete. > > Which should be no surprise: This misordering can happen even on x86, > which would need a full smp_mb() to prevent it. > > > THAT was one of my big confusions, the discussion about control > > dependencies and the fact that the logical ops broke the data > > dependency made me believe that the standard tried to actively avoid > > the whole issue with "control dependencies can break ordering > > dependencies on some CPU's due to branch prediction and memory > > re-ordering by the CPU". > > > > But after all the reading, I'm starting to think that that was never > > actually the implication at all, and the "logical ops breaks the data > > dependency rule" is simply an exception to the sequence point rule. > > All other sequence points still do exist, and do imply an ordering > > that matters for "consume" > > > > Am I now reading it right? 
> > As long as there is an unbroken chain of -data- dependencies from the > consume to the later access in question, and as long as that chain > doesn't go through the excluded operations, yes. > > > So the clarification is basically to the statement that the "if > > (consume(p)) a" version *would* have an ordering guarantee between the > > read of "p" and "a", but the "consume(p) ? a : b" would *not* have > > such an ordering guarantee. Yes? > > Neither has a data-dependency guarantee, because there is no data > dependency from the load to either "a" or "b". After all, the value > loaded got absorbed into the "if" condition. Agreed. > However, according to > discussions earlier in this thread, the "if" variant would have a > control-dependency ordering guarantee for any stores in "a" and "b" > (but not loads!). The ?: form might also have a control-dependency > guarantee for any stores in "a" and "b" (again, not loads). Don't quite agree; see above for my opinion on this. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 17:26 ` Torvald Riegel @ 2014-02-20 18:18 ` Paul E. McKenney 2014-02-22 18:30 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-20 18:18 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, Feb 20, 2014 at 06:26:08PM +0100, Torvald Riegel wrote: > xagsmtp2.20140220172700.0416@vmsdvm4.vnet.ibm.com > X-Xagent-Gateway: vmsdvm4.vnet.ibm.com (XAGSMTP2 at VMSDVM4) > > On Wed, 2014-02-19 at 20:01 -0800, Paul E. McKenney wrote: > > On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote: > > > On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel <triegel@redhat.com> wrote: > > > > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote: > > > >> > > > >> Can you point to it? Because I can find a draft standard, and it sure > > > >> as hell does *not* contain any clarity of the model. It has a *lot* of > > > >> verbiage, but it's pretty much impossible to actually understand, even > > > >> for somebody who really understands memory ordering. > > > > > > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf > > > > This has an explanation of the model up front, and then the detailed > > > > formulae in Section 6. This is from 2010, and there might have been > > > > smaller changes since then, but I'm not aware of any bigger ones. > > > > > > Ahh, this is different from what others pointed at. Same people, > > > similar name, but not the same paper. > > > > > > I will read this version too, but from reading the other one and the > > > standard in parallel and trying to make sense of it, it seems that I > > > may have originally misunderstood part of the whole control dependency > > > chain. > > > > > > The fact that the left side of "? 
:", "&&" and "||" breaks data > > > dependencies made me originally think that the standard tried very > > > hard to break any control dependencies. Which I felt was insane, when > > > then some of the examples literally were about the testing of the > > > value of an atomic read. The data dependency matters quite a bit. The > > > fact that the other "Mathematical" paper then very much talked about > > > consume only in the sense of following a pointer made me think so even > > > more. > > > > > > But reading it some more, I now think that the whole "data dependency" > > > logic (which is where the special left-hand side rule of the ternary > > > and logical operators come in) are basically an exception to the rule > > > that sequence points end up being also meaningful for ordering (ok, so > > > C11 seems to have renamed "sequence points" to "sequenced before"). > > > > > > So while an expression like > > > > > > atomic_read(p, consume) ? a : b; > > > > > > doesn't have a data dependency from the atomic read that forces > > > serialization, writing > > > > > > if (atomic_read(p, consume)) > > > a; > > > else > > > b; > > > > > > the standard *does* imply that the atomic read is "happens-before" wrt > > > "a", and I'm hoping that there is no question that the control > > > dependency still acts as an ordering point. > > > > The control dependency should order subsequent stores, at least assuming > > that "a" and "b" don't start off with identical stores that the compiler > > could pull out of the "if" and merge. The same might also be true for ?: > > for all I know. (But see below) > > I don't think this is quite true. I agree that a conditional store will > not be executed speculatively (note that if it would happen in both the > then and the else branch, it's not conditional); so, the store in > "a;" (assuming it would be a store) won't happen unless the thread can > really observe a true value for p. 
However, this is *this thread's* > view of the world, but not guaranteed to constrain how any other thread > sees the state. mo_consume does not contribute to > inter-thread-happens-before in the same way that mo_acquire does (which > *does* put a constraint on i-t-h-b, and thus enforces a global > constraint that all threads have to respect). > > Is it clear which distinction I'm trying to show here? If you are saying that the control dependencies are a result of a combination of the standard and the properties of the hardware that Linux runs on, I am with you. (As opposed to control dependencies being a result solely of the standard.) This was a deliberate decision in 2007 or so. At that time, the documentation on CPU memory orderings were pretty crude, and it was not clear that all relevant hardware respected control dependencies. Back then, if you wanted an authoritative answer even to a fairly simple memory-ordering question, you had to find a hardware architect, and you probably waited weeks or even months for the answer. Thanks to lots of work from the Cambridge guys at about the time that the standard was finalized, we have a much better picture of what the hardware does. > > That said, in this case, you could substitute relaxed for consume and get > > the same effect. The return value from atomic_read() gets absorbed into > > the "if" condition, so there is no dependency-ordered-before relationship, > > so nothing for consume to do. > > > > One caution... The happens-before relationship requires you to trace a > > full path between the two operations of interest. 
This is illustrated > > by the following example, with both x and y initially zero: > > > > T1: atomic_store_explicit(&x, 1, memory_order_relaxed); > > r1 = atomic_load_explicit(&y, memory_order_relaxed); > > > > T2: atomic_store_explicit(&y, 1, memory_order_relaxed); > > r2 = atomic_load_explicit(&x, memory_order_relaxed); > > > > There is a happens-before relationship between T1's load and store, > > and another happens-before relationship between T2's load and store, > > but there is no happens-before relationship from T1 to T2, and none > > in the other direction, either. And you don't get to assume any > > ordering based on reasoning about these two disjoint happens-before > > relationships. > > > > So it is quite possible for r1==1&&r2==1 after both threads complete. > > > > Which should be no surprise: This misordering can happen even on x86, > > which would need a full smp_mb() to prevent it. > > > > > THAT was one of my big confusions, the discussion about control > > > dependencies and the fact that the logical ops broke the data > > > dependency made me believe that the standard tried to actively avoid > > > the whole issue with "control dependencies can break ordering > > > dependencies on some CPU's due to branch prediction and memory > > > re-ordering by the CPU". > > > > > > But after all the reading, I'm starting to think that that was never > > > actually the implication at all, and the "logical ops breaks the data > > > dependency rule" is simply an exception to the sequence point rule. > > > All other sequence points still do exist, and do imply an ordering > > > that matters for "consume" > > > > > > Am I now reading it right? > > > > As long as there is an unbroken chain of -data- dependencies from the > > consume to the later access in question, and as long as that chain > > doesn't go through the excluded operations, yes. 
> > > > > So the clarification is basically to the statement that the "if > > > (consume(p)) a" version *would* have an ordering guarantee between the > > > read of "p" and "a", but the "consume(p) ? a : b" would *not* have > > > such an ordering guarantee. Yes? > > > > Neither has a data-dependency guarantee, because there is no data > > dependency from the load to either "a" or "b". After all, the value > > loaded got absorbed into the "if" condition. > > Agreed. > > > However, according to > > discussions earlier in this thread, the "if" variant would have a > > control-dependency ordering guarantee for any stores in "a" and "b" > > (but not loads!). The ?: form might also have a control-dependency > > guarantee for any stores in "a" and "b" (again, not loads). > > Don't quite agree; see above for my opinion on this. And see above for my best guess at what your opinion is based on. If my guess is correct, we might even be in agreement. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 18:18 ` Paul E. McKenney @ 2014-02-22 18:30 ` Torvald Riegel 2014-02-22 20:17 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-22 18:30 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Thu, 2014-02-20 at 10:18 -0800, Paul E. McKenney wrote: > On Thu, Feb 20, 2014 at 06:26:08PM +0100, Torvald Riegel wrote: > > xagsmtp2.20140220172700.0416@vmsdvm4.vnet.ibm.com > > X-Xagent-Gateway: vmsdvm4.vnet.ibm.com (XAGSMTP2 at VMSDVM4) > > > > On Wed, 2014-02-19 at 20:01 -0800, Paul E. McKenney wrote: > > > On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote: > > > > On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel <triegel@redhat.com> wrote: > > > > > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote: > > > > >> > > > > >> Can you point to it? Because I can find a draft standard, and it sure > > > > >> as hell does *not* contain any clarity of the model. It has a *lot* of > > > > >> verbiage, but it's pretty much impossible to actually understand, even > > > > >> for somebody who really understands memory ordering. > > > > > > > > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf > > > > > This has an explanation of the model up front, and then the detailed > > > > > formulae in Section 6. This is from 2010, and there might have been > > > > > smaller changes since then, but I'm not aware of any bigger ones. > > > > > > > > Ahh, this is different from what others pointed at. Same people, > > > > similar name, but not the same paper. > > > > > > > > I will read this version too, but from reading the other one and the > > > > standard in parallel and trying to make sense of it, it seems that I > > > > may have originally misunderstood part of the whole control dependency > > > > chain. > > > > > > > > The fact that the left side of "? 
:", "&&" and "||" breaks data > > > > dependencies made me originally think that the standard tried very > > > > hard to break any control dependencies. Which I felt was insane, when > > > > then some of the examples literally were about the testing of the > > > > value of an atomic read. The data dependency matters quite a bit. The > > > > fact that the other "Mathematical" paper then very much talked about > > > > consume only in the sense of following a pointer made me think so even > > > > more. > > > > > > > > But reading it some more, I now think that the whole "data dependency" > > > > logic (which is where the special left-hand side rule of the ternary > > > > and logical operators come in) are basically an exception to the rule > > > > that sequence points end up being also meaningful for ordering (ok, so > > > > C11 seems to have renamed "sequence points" to "sequenced before"). > > > > > > > > So while an expression like > > > > > > > > atomic_read(p, consume) ? a : b; > > > > > > > > doesn't have a data dependency from the atomic read that forces > > > > serialization, writing > > > > > > > > if (atomic_read(p, consume)) > > > > a; > > > > else > > > > b; > > > > > > > > the standard *does* imply that the atomic read is "happens-before" wrt > > > > "a", and I'm hoping that there is no question that the control > > > > dependency still acts as an ordering point. > > > > > > The control dependency should order subsequent stores, at least assuming > > > that "a" and "b" don't start off with identical stores that the compiler > > > could pull out of the "if" and merge. The same might also be true for ?: > > > for all I know. (But see below) > > > > I don't think this is quite true. 
I agree that a conditional store will > > not be executed speculatively (note that if it would happen in both the > > then and the else branch, it's not conditional); so, the store in > > "a;" (assuming it would be a store) won't happen unless the thread can > > really observe a true value for p. However, this is *this thread's* > > view of the world, but not guaranteed to constrain how any other thread > > sees the state. mo_consume does not contribute to > > inter-thread-happens-before in the same way that mo_acquire does (which > > *does* put a constraint on i-t-h-b, and thus enforces a global > > constraint that all threads have to respect). > > > > Is it clear which distinction I'm trying to show here? > > If you are saying that the control dependencies are a result of a > combination of the standard and the properties of the hardware that > Linux runs on, I am with you. (As opposed to control dependencies being > a result solely of the standard.) I'm not quite sure I understand what you mean :) Do you mean the control dependencies in the binary code, or the logical "control dependencies" in source programs? > This was a deliberate decision in 2007 or so. At that time, the > documentation on CPU memory orderings were pretty crude, and it was > not clear that all relevant hardware respected control dependencies. > Back then, if you wanted an authoritative answer even to a fairly simple > memory-ordering question, you had to find a hardware architect, and you > probably waited weeks or even months for the answer. Thanks to lots > of work from the Cambridge guys at about the time that the standard was > finalized, we have a much better picture of what the hardware does. But this part I understand. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-22 18:30 ` Torvald Riegel @ 2014-02-22 20:17 ` Paul E. McKenney 0 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-22 20:17 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sat, Feb 22, 2014 at 07:30:37PM +0100, Torvald Riegel wrote: > xagsmtp2.20140222183231.5343@emeavsc.vnet.ibm.com > X-Xagent-Gateway: emeavsc.vnet.ibm.com (XAGSMTP2 at EMEAVSC) > > On Thu, 2014-02-20 at 10:18 -0800, Paul E. McKenney wrote: > > On Thu, Feb 20, 2014 at 06:26:08PM +0100, Torvald Riegel wrote: > > > xagsmtp2.20140220172700.0416@vmsdvm4.vnet.ibm.com > > > X-Xagent-Gateway: vmsdvm4.vnet.ibm.com (XAGSMTP2 at VMSDVM4) > > > > > > On Wed, 2014-02-19 at 20:01 -0800, Paul E. McKenney wrote: > > > > On Wed, Feb 19, 2014 at 04:53:49PM -0800, Linus Torvalds wrote: > > > > > On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel <triegel@redhat.com> wrote: > > > > > > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote: > > > > > >> > > > > > >> Can you point to it? Because I can find a draft standard, and it sure > > > > > >> as hell does *not* contain any clarity of the model. It has a *lot* of > > > > > >> verbiage, but it's pretty much impossible to actually understand, even > > > > > >> for somebody who really understands memory ordering. > > > > > > > > > > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf > > > > > > This has an explanation of the model up front, and then the detailed > > > > > > formulae in Section 6. This is from 2010, and there might have been > > > > > > smaller changes since then, but I'm not aware of any bigger ones. > > > > > > > > > > Ahh, this is different from what others pointed at. Same people, > > > > > similar name, but not the same paper. 
> > > > > > > > > > I will read this version too, but from reading the other one and the > > > > > standard in parallel and trying to make sense of it, it seems that I > > > > > may have originally misunderstood part of the whole control dependency > > > > > chain. > > > > > > > > > > The fact that the left side of "? :", "&&" and "||" breaks data > > > > > dependencies made me originally think that the standard tried very > > > > > hard to break any control dependencies. Which I felt was insane, when > > > > > then some of the examples literally were about the testing of the > > > > > value of an atomic read. The data dependency matters quite a bit. The > > > > > fact that the other "Mathematical" paper then very much talked about > > > > > consume only in the sense of following a pointer made me think so even > > > > > more. > > > > > > > > > > But reading it some more, I now think that the whole "data dependency" > > > > > logic (which is where the special left-hand side rule of the ternary > > > > > and logical operators come in) are basically an exception to the rule > > > > > that sequence points end up being also meaningful for ordering (ok, so > > > > > C11 seems to have renamed "sequence points" to "sequenced before"). > > > > > > > > > > So while an expression like > > > > > > > > > > atomic_read(p, consume) ? a : b; > > > > > > > > > > doesn't have a data dependency from the atomic read that forces > > > > > serialization, writing > > > > > > > > > > if (atomic_read(p, consume)) > > > > > a; > > > > > else > > > > > b; > > > > > > > > > > the standard *does* imply that the atomic read is "happens-before" wrt > > > > > "a", and I'm hoping that there is no question that the control > > > > > dependency still acts as an ordering point. > > > > > > > > The control dependency should order subsequent stores, at least assuming > > > > that "a" and "b" don't start off with identical stores that the compiler > > > > could pull out of the "if" and merge. 
The same might also be true for ?: > > > > for all I know. (But see below) > > > > > > I don't think this is quite true. I agree that a conditional store will > > > not be executed speculatively (note that if it would happen in both the > > > then and the else branch, it's not conditional); so, the store in > > > "a;" (assuming it would be a store) won't happen unless the thread can > > > really observe a true value for p. However, this is *this thread's* > > > view of the world, but not guaranteed to constrain how any other thread > > > sees the state. mo_consume does not contribute to > > > inter-thread-happens-before in the same way that mo_acquire does (which > > > *does* put a constraint on i-t-h-b, and thus enforces a global > > > constraint that all threads have to respect). > > > > > > Is it clear which distinction I'm trying to show here? > > > > If you are saying that the control dependencies are a result of a > > combination of the standard and the properties of the hardware that > > Linux runs on, I am with you. (As opposed to control dependencies being > > a result solely of the standard.) > > I'm not quite sure I understand what you mean :) Do you mean the > control dependencies in the binary code, or the logical "control > dependencies" in source programs? At present, the intersection of those two sets, but only including those control dependencies beginning with with a memory_order_consume load or a [[carries_dependency]] function argument or return value. Or something like that. ;-) > > This was a deliberate decision in 2007 or so. At that time, the > > documentation on CPU memory orderings were pretty crude, and it was > > not clear that all relevant hardware respected control dependencies. > > Back then, if you wanted an authoritative answer even to a fairly simple > > memory-ordering question, you had to find a hardware architect, and you > > probably waited weeks or even months for the answer. 
Thanks to lots > > of work from the Cambridge guys at about the time that the standard was > > finalized, we have a much better picture of what the hardware does. > > But this part I understand. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 0:53 ` Linus Torvalds 2014-02-20 4:01 ` Paul E. McKenney @ 2014-02-20 17:14 ` Torvald Riegel 2014-02-20 17:34 ` Linus Torvalds 1 sibling, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-20 17:14 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc, Mark Batty, Peter Sewell On Wed, 2014-02-19 at 16:53 -0800, Linus Torvalds wrote: > On Tue, Feb 18, 2014 at 11:47 AM, Torvald Riegel <triegel@redhat.com> wrote: > > On Tue, 2014-02-18 at 09:44 -0800, Linus Torvalds wrote: > >> > >> Can you point to it? Because I can find a draft standard, and it sure > >> as hell does *not* contain any clarity of the model. It has a *lot* of > >> verbiage, but it's pretty much impossible to actually understand, even > >> for somebody who really understands memory ordering. > > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf > > This has an explanation of the model up front, and then the detailed > > formulae in Section 6. This is from 2010, and there might have been > > smaller changes since then, but I'm not aware of any bigger ones. > > Ahh, this is different from what others pointed at. Same people, > similar name, but not the same paper. > > I will read this version too, but from reading the other one and the > standard in parallel and trying to make sense of it, it seems that I > may have originally misunderstood part of the whole control dependency > chain. > > The fact that the left side of "? :", "&&" and "||" breaks data > dependencies made me originally think that the standard tried very > hard to break any control dependencies. Hmm... (I'm not quite sure how to express this properly, so bear with me.) The control dependencies as you seem to consider aren't there *as is*, and removing "? 
:", "&&", and "||" from what's called Carries-a-dependency-to in n3132 is basically removing control dependences in expressions from that relation (ie, so it's just data dependences). However, control dependences in the program are not ignored, because they control what steps the abstract machine would do. If we have candidate executions with a particular choice for the reads-from relation, they must make sense in terms of program logic (which does include conditionals and such), or a candidate execution will not be consistent. The difference is in how reads-from is constrainted by different memory orders. I hope what I mean gets clearer below when looking at the example... Mark Batty / Peter Sewell: Do you have a better explanation? (And please correct anything wrong in what I wrote ;) > Which I felt was insane, when > then some of the examples literally were about the testing of the > value of an atomic read. The data dependency matters quite a bit. The > fact that the other "Mathematical" paper then very much talked about > consume only in the sense of following a pointer made me think so even > more. > > But reading it some more, I now think that the whole "data dependency" > logic (which is where the special left-hand side rule of the ternary > and logical operators come in) are basically an exception to the rule > that sequence points end up being also meaningful for ordering (ok, so > C11 seems to have renamed "sequence points" to "sequenced before"). I'm not quite sure what you mean, but I think that this isn't quite the case. mo_consume and mo_acquire combine with sequenced-before differently. Have a look at the definition of inter-thread-happens-before (6.15 in n3132): for synchronizes-with, the composition with sequenced-before is used (ie, what you'd get with mo_acquire), whereas for mo_consume, we just have dependency_ordered_before (so, data dependencies only). > So while an expression like > > atomic_read(p, consume) ? 
a : b; > > doesn't have a data dependency from the atomic read that forces > serialization, writing > > if (atomic_read(p, consume)) > a; > else > b; > > the standard *does* imply that the atomic read is "happens-before" wrt > "a", and I'm hoping that there is no question that the control > dependency still acts as an ordering point. I don't think it does for mo_consume (but does for mo_acquire). Both examples do basically the same in the abstract machine, and I think it would be surprising if rewriting could the code with something seemingly equivalent would make a semantic difference. AFAICT, "? :" etc. is removed to allow the standard defining data dependencies as being based on expressions that take the mo_consume value as input, without also pulling in control dependencies. To look at this example in the cppmem tool, we can transform your example above into: int main() { atomic_int x = 0; int y = 0; {{{ { y = 1; x.store(1, memory_order_release); } ||| { if (x.load(memory_order_consume)) r2 = y; } }}}; return 0; } This has 2 threads, one is doing the store to y (the data) and the release operation, the other is using consume to try to read the data. This yields 2 consistent executions (we can read 2 values of x), but the execution in which we read y has a data race. I'm not quite sure how the tool parses/models data dependencies in expressions, but the program (ie, your example) doesn't carry a dependency from the mo_consume load to the load from y (as defined in C ++ 1.10p9 and n3132 6.13). Thus, the mo_release store is dependency-ordered-before the mo_consume load ("dob" in the tool's output graph), but that doesn't extend to the load from y. Therefore, the load is unordered wrt. to the first thread's "y = 1", and because they are nonatomic, we get a data race (labeled "dr"). (The labels on the edges are easier to see in the "dot" Display Layout, even though one has to then figure out which vertex belongs to which thread.) 
If you enable it with the display options, the tool will show a control dependency, but that is currently unused it seems (see 6.20 in n3132). If we change mo_consume to mo_acquire in the example, the race goes away. (This is visible in the inter-thread-happens-before relation, which can be shown by "removing suppress_edge ithb." under "edit display options" (but this makes the graph kind of messy)). BTW, a pretty similar way to expressing this cppmem code above is using readsvalue() to constrain executions: int main() { atomic_int x = 0; int y = 0; {{{ { y = 1; x.store(1, memory_order_release); } ||| { r1 = x.load(memory_order_consume).readsvalue(1); r2 = y; } }}}; return 0; } > THAT was one of my big confusions, the discussion about control > dependencies and the fact that the logical ops broke the data > dependency made me believe that the standard tried to actively avoid > the whole issue with "control dependencies can break ordering > dependencies on some CPU's due to branch prediction and memory > re-ordering by the CPU". > > But after all the reading, I'm starting to think that that was never > actually the implication at all, and the "logical ops breaks the data > dependency rule" is simply an exception to the sequence point rule. > All other sequence points still do exist, and do imply an ordering > that matters for "consume" > > Am I now reading it right? > > So the clarification is basically to the statement that the "if > (consume(p)) a" version *would* have an ordering guarantee between the > read of "p" and "a", but the "consume(p) ? a : b" would *not* have > such an ordering guarantee. Yes? Not as I understand it. If my reply above wasn't clear, let me know and I'll try to rephrase it into something that is. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 17:14 ` Torvald Riegel @ 2014-02-20 17:34 ` Linus Torvalds 2014-02-20 18:12 ` Torvald Riegel 2014-02-20 18:26 ` Paul E. McKenney 0 siblings, 2 replies; 285+ messages in thread From: Linus Torvalds @ 2014-02-20 17:34 UTC (permalink / raw) To: Torvald Riegel Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc, Mark Batty, Peter Sewell On Thu, Feb 20, 2014 at 9:14 AM, Torvald Riegel <triegel@redhat.com> wrote: >> >> So the clarification is basically to the statement that the "if >> (consume(p)) a" version *would* have an ordering guarantee between the >> read of "p" and "a", but the "consume(p) ? a : b" would *not* have >> such an ordering guarantee. Yes? > > Not as I understand it. If my reply above wasn't clear, let me know and > I'll try to rephrase it into something that is. Yeah, so you and Paul agree. And as I mentioned in the email that crossed with yours, I think that means that the standard is overly complex, hard to understand, fragile, and all *pointlessly* so. Btw, there are many ways that "use a consume as an input to a conditional" can happen. In particular, even if the result is actually *used* like a pointer as far as the programmer is concerned, tricks like pointer compression etc can well mean that the "pointer" is actually at least partly implemented using conditionals, so that some paths end up being only dependent through a comparison of the pointer value. So I very much did *not* want to complicate the "litmus test" code snippet when Paul tried to make it more complex, but I do think that there are cases where code that "looks" like pure pointer chasing actually is not for some cases, and then can become basically that litmus test for some path. 
Just to give you an example: in simple list handling it is not at all unusual to have a special node that is discovered by comparing the address, not by just loading from the pointer and following the list itself. Examples of that would be a HEAD node in a doubly linked list (Linux uses this concept quite widely, it's our default list implementation), or it could be a place-marker ("cursor" entry) in the list for safe traversal in the presence of concurrent deletes etc. And obviously there is the already much earlier mentioned compiler-induced compare, due to value speculation, that can basically create such sequences even where they did not originally exist in the source code itself. So even if you work with "pointer dereferences", and thus match that particular consume pattern, I really don't see why anybody would think that "hey, we can ignore any control dependencies" is a good idea. It's a *TERRIBLE* idea. And as mentioned, it's a terrible idea with no upsides. It doesn't help compiler optimizations for the case it's *intended* to help, since those optimizations can still be done without the horribly broken semantics. It doesn't help compiler writers, it just confuses them. And it sure as hell doesn't help users. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
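A minimal sketch of the HEAD-node pattern described above: the traversal terminates by comparing the loaded pointer against the head's *address*, so the exit path depends on the pointer only through a comparison, never a dereference. The names here are illustrative, not the kernel's actual list API:

```c
#include <stddef.h>

/* A doubly linked list in the style of the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* The loop's exit test is a pointer comparison against head itself: the
   path that leaves the loop is only control-dependent on the loaded
   pointer, not data-dependent through a dereference. */
static int count_entries(const struct list_head *head)
{
    int n = 0;
    for (const struct list_head *pos = head->next; pos != head; pos = pos->next)
        n++;
    return n;
}

int list_demo(void)
{
    struct list_head head, a, b;              /* head -> a -> b -> head */
    head.next = &a; a.next = &b; b.next = &head;
    head.prev = &b; a.prev = &head; b.prev = &a;
    return count_entries(&head);
}
```

In a concurrent (RCU-style) traversal, the iteration that follows `pos->next` carries a data dependency, but the iteration that *stops* at `pos == head` is exactly the comparison-only path under discussion.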
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 17:34 ` Linus Torvalds @ 2014-02-20 18:12 ` Torvald Riegel 2014-02-20 18:26 ` Paul E. McKenney 1 sibling, 0 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-20 18:12 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc, Mark Batty, Peter Sewell On Thu, 2014-02-20 at 09:34 -0800, Linus Torvalds wrote: > On Thu, Feb 20, 2014 at 9:14 AM, Torvald Riegel <triegel@redhat.com> wrote: > >> > >> So the clarification is basically to the statement that the "if > >> (consume(p)) a" version *would* have an ordering guarantee between the > >> read of "p" and "a", but the "consume(p) ? a : b" would *not* have > >> such an ordering guarantee. Yes? > > > > Not as I understand it. If my reply above wasn't clear, let me know and > > I'll try to rephrase it into something that is. > > Yeah, so you and Paul agree. And as I mentioned in the email that > crossed with yours, I think that means that the standard is overly > complex, hard to understand, fragile, and all *pointlessly* so. Let's step back a little here and distinguish different things: 1) AFAICT, mo_acquire provides all the ordering guarantees you want. Thus, I suggest focusing on mo_acquire for now, especially when reading the model. Are you fine with the mo_acquire semantics? 2) mo_acquire is independent of carries-a-dependency and similar definitions/relations, so you can ignore all those to reason about mo_acquire. 3) Do you have concerns over the runtime costs of barriers with mo_acquire semantics? If so, that's a valid discussion point, and we can certainly dig deeper into this topic to see how we can possibly use weaker HW barriers by exploiting things the compiler sees and can potentially preserve (e.g., control dependencies). There might be some stuff the compiler can do without needing further input from the programmer. 
4) mo_consume is kind of the difficult special-case variant of mo_acquire. We should discuss it (including whether it's helpful) separately from the memory model, because it's not essential. > Btw, there are many ways that "use a consume as an input to a > conditional" can happen. In particular, even if the result is actually > *used* like a pointer as far as the programmer is concerned, tricks > like pointer compression etc can well mean that the "pointer" is > actually at least partly implemented using conditionals, so that some > paths end up being only dependent through a comparison of the pointer > value. AFAIU, this is similar to my concerns about how a compiler can reasonably implement the ordering guarantees: The mo_consume value may be used like a pointer in source code, but how this looks in the generated code may be different after reasonable transformations (including transformations to control dependencies, so the compiler would have to avoid those). Or did I misunderstand? > So I very much did *not* want to complicate the "litmus test" code > snippet when Paul tried to make it more complex, but I do think that > there are cases where code that "looks" like pure pointer chasing > actually is not for some cases, and then can become basically that > litmus test for some path. > > Just to give you an example: in simple list handling it is not at all > unusual to have a special node that is discovered by comparing the > address, not by just loading from the pointer and following the list > itself. Examples of that would be a HEAD node in a doubly linked list > (Linux uses this concept quite widely, it's our default list > implementation), or it could be a place-marker ("cursor" entry) in the > list for safe traversal in the presence of concurrent deletes etc. 
> > And obviously there is the already much earlier mentioned > compiler-induced compare, due to value speculation, that can basically > create such sequences even where they did not originally exist in the > source code itself. > > So even if you work with "pointer dereferences", and thus match that > particular consume pattern, I really don't see why anybody would think > that "hey, we can ignore any control dependencies" is a good idea. > It's a *TERRIBLE* idea. > > And as mentioned, it's a terrible idea with no upsides. It doesn't > help compiler optimizations for the case it's *intended* to help, > since those optimizations can still be done without the horribly > broken semantics. It doesn't help compiler writers, it just confuses > them. I'm worried about how compilers can implement mo_consume without prohibiting lots of optimizations on the code. Your thoughts seem to point in a similar direction. I think we should continue by discussing mo_acquire first. It has easier semantics and allows a relatively simple implementation in compilers (although there might be not-quite-so-simple optimizations). It's unfortunate we started the discussion with the tricky special case first; maybe that's what contributed to the confusion. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-20 17:34 ` Linus Torvalds 2014-02-20 18:12 ` Torvald Riegel @ 2014-02-20 18:26 ` Paul E. McKenney 1 sibling, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-20 18:26 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc, Mark Batty, Peter Sewell On Thu, Feb 20, 2014 at 09:34:57AM -0800, Linus Torvalds wrote: > On Thu, Feb 20, 2014 at 9:14 AM, Torvald Riegel <triegel@redhat.com> wrote: > >> > >> So the clarification is basically to the statement that the "if > >> (consume(p)) a" version *would* have an ordering guarantee between the > >> read of "p" and "a", but the "consume(p) ? a : b" would *not* have > >> such an ordering guarantee. Yes? > > > > Not as I understand it. If my reply above wasn't clear, let me know and > > I'll try to rephrase it into something that is. > > Yeah, so you and Paul agree. And as I mentioned in the email that > crossed with yours, I think that means that the standard is overly > complex, hard to understand, fragile, and all *pointlessly* so. > > Btw, there are many ways that "use a consume as an input to a > conditional" can happen. In particular, even if the result is actually > *used* like a pointer as far as the programmer is concerned, tricks > like pointer compression etc can well mean that the "pointer" is > actually at least partly implemented using conditionals, so that some > paths end up being only dependent through a comparison of the pointer > value. > > So I very much did *not* want to complicate the "litmus test" code > snippet when Paul tried to make it more complex, but I do think that > there are cases where code that "looks" like pure pointer chasing > actually is not for some cases, and then can become basically that > litmus test for some path. 
> > Just to give you an example: in simple list handling it is not at all > unusual to have a special node that is discovered by comparing the > address, not by just loading from the pointer and following the list > itself. Examples of that would be a HEAD node in a doubly linked list > (Linux uses this concept quite widely, it's our default list > implementation), or it could be a place-marker ("cursor" entry) in the > list for safe traversal in the presence of concurrent deletes etc. Yep, good example. And also an example where the comparison does not need to create any particular ordering. > And obviously there is the already much earlier mentioned > compiler-induced compare, due to value speculation, that can basically > create such sequences even where they did not originally exist in the > source code itself. > > So even if you work with "pointer dereferences", and thus match that > particular consume pattern, I really don't see why anybody would think > that "hey, we can ignore any control dependencies" is a good idea. > It's a *TERRIBLE* idea. > > And as mentioned, it's a terrible idea with no upsides. It doesn't > help compiler optimizations for the case it's *intended* to help, > since those optimizations can still be done without the horribly > broken semantics. It doesn't help compiler writers, it just confuses > them. > > And it sure as hell doesn't help users. Yep. And in 2014 we know a lot more about the hardware, so could make a reasonable proposal. In contrast, back in 2007 and 2008 memory ordering was much more of a dark art, and proposals for control dependencies therefore didn't get very far. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 3:24 ` Linus Torvalds 2014-02-18 3:42 ` Linus Torvalds @ 2014-02-18 5:01 ` Paul E. McKenney 1 sibling, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-18 5:01 UTC (permalink / raw) To: Linus Torvalds Cc: Torvald Riegel, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 17, 2014 at 07:24:56PM -0800, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 7:00 PM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > > > One example that I learned about last week uses the branch-prediction > > hardware to validate value speculation. And no, I am not at all a fan > > of value speculation, in case you were curious. > > Heh. See the example I used in my reply to Alec Teal. It basically > broke the same dependency the same way. ;-) > Yes, value speculation of reads is simply wrong, the same way > speculative writes are simply wrong. The dependency chain matters, and > is meaningful, and breaking it is actively bad. > > As far as I can tell, the intent is that you can't do value > speculation (except perhaps for the "relaxed", which quite frankly > sounds largely useless). But then I do get very very nervous when > people talk about "proving" certain values. That was certainly my intent, but as you might have noticed in the discussion earlier in this thread, the intent can get lost pretty quickly. ;-) The HPC guys appear to be the most interested in breaking dependencies. Their software doesn't rely on dependencies, and from their viewpoint anything that has any chance of leaving an FP unit of any type idle is a very bad thing. But there are probably other benchmarks for which breaking dependencies gives a few percent performance boost. Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 3:00 ` Paul E. McKenney 2014-02-18 3:24 ` Linus Torvalds @ 2014-02-18 15:56 ` Torvald Riegel 2014-02-18 16:51 ` Paul E. McKenney 1 sibling, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-18 15:56 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, 2014-02-17 at 19:00 -0800, Paul E. McKenney wrote: > On Mon, Feb 17, 2014 at 12:18:21PM -0800, Linus Torvalds wrote: > > On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel <triegel@redhat.com> wrote: > > > > > > Which example do you have in mind here? Haven't we resolved all the > > > debated examples, or did I miss any? > > > > Well, Paul seems to still think that the standard possibly allows > > speculative writes or possibly value speculation in ways that break > > the hardware-guaranteed orderings. > > It is not that I know of any specific problems, but rather that I > know I haven't looked under all the rocks. Plus my impression from > my few years on the committee is that the standard will be pushed to > the limit when it comes time to add optimizations. > > One example that I learned about last week uses the branch-prediction > hardware to validate value speculation. And no, I am not at all a fan > of value speculation, in case you were curious. However, it is still > an educational example. > > This is where you start: > > p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */ > do_something(p->a, p->b, p->c); > p->d = 1; I assume that's the source code. > Then you leverage branch-prediction hardware as follows: > > p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */ > if (p == GUESS) { > do_something(GUESS->a, GUESS->b, GUESS->c); > GUESS->d = 1; > } else { > do_something(p->a, p->b, p->c); > p->d = 1; > } I assume that this is a potential transformation by a compiler. 
> The CPU's branch-prediction hardware squashes speculation in the case where > the guess was wrong, and this prevents the speculative store to ->d from > ever being visible. However, the then-clause breaks dependencies, which > means that the loads -could- be speculated, so that do_something() gets > passed pre-initialization values. > > Now, I hope and expect that the wording in the standard about dependency > ordering prohibits this sort of thing. But I do not yet know for certain. The transformation would be incorrect. p->a in the source code carries a dependency, and as you say, the transformed code wouldn't have that dependency any more. So the transformed code would lose ordering constraints that it has in the virtual machine, so in the absence of other proofs of correctness based on properties not shown in the example, the transformed code would not result in the same behavior as allowed by the abstract machine. If the transformation would actually be by a programmer, then this wouldn't do the same as the first example because mo_consume doesn't work through the if statement. Are there other specific concerns that you have regarding this example? ^ permalink raw reply [flat|nested] 285+ messages in thread
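For comparison, Paul's two forms can be written out as compilable C11. This is only an illustration of the dependency structure under discussion: GUESS is the hypothetical predicted value, do_something() is a stand-in, and in a single-threaded run both forms of course behave identically — the point is that on the guessed path, memory is addressed through GUESS rather than through the mo_consume-loaded p:

```c
#include <stdatomic.h>

struct node { int a, b, c, d; };

int sum;                                 /* stand-in for do_something()'s effect */
static void do_something(int a, int b, int c) { sum = a + b + c; }

/* Source form: all accesses go through p, so each one carries a
   dependency from the mo_consume load of gp. */
void source_form(struct node *_Atomic *gp)
{
    struct node *p = atomic_load_explicit(gp, memory_order_consume);
    do_something(p->a, p->b, p->c);
    p->d = 1;
}

/* Transformed form (the questionable one): on the guessed path the
   accesses address GUESS directly, breaking the dependency from the
   consume load, even though branch prediction squashes wrong guesses. */
void transformed_form(struct node *_Atomic *gp, struct node *GUESS)
{
    struct node *p = atomic_load_explicit(gp, memory_order_consume);
    if (p == GUESS) {
        do_something(GUESS->a, GUESS->b, GUESS->c);
        GUESS->d = 1;
    } else {
        do_something(p->a, p->b, p->c);
        p->d = 1;
    }
}
```

As Torvald notes above, the transformation loses the carries-a-dependency ordering that the source form has, which is exactly why it would be incorrect for a compiler to produce it.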
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 15:56 ` Torvald Riegel @ 2014-02-18 16:51 ` Paul E. McKenney 0 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-18 16:51 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Tue, Feb 18, 2014 at 04:56:40PM +0100, Torvald Riegel wrote: > On Mon, 2014-02-17 at 19:00 -0800, Paul E. McKenney wrote: > > On Mon, Feb 17, 2014 at 12:18:21PM -0800, Linus Torvalds wrote: > > > On Mon, Feb 17, 2014 at 11:55 AM, Torvald Riegel <triegel@redhat.com> wrote: > > > > > > > > Which example do you have in mind here? Haven't we resolved all the > > > > debated examples, or did I miss any? > > > > > > Well, Paul seems to still think that the standard possibly allows > > > speculative writes or possibly value speculation in ways that break > > > the hardware-guaranteed orderings. > > > > It is not that I know of any specific problems, but rather that I > > know I haven't looked under all the rocks. Plus my impression from > > my few years on the committee is that the standard will be pushed to > > the limit when it comes time to add optimizations. > > > > One example that I learned about last week uses the branch-prediction > > hardware to validate value speculation. And no, I am not at all a fan > > of value speculation, in case you were curious. However, it is still > > an educational example. > > > > This is where you start: > > > > p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */ > > do_something(p->a, p->b, p->c); > > p->d = 1; > > I assume that's the source code. Yep! 
> > Then you leverage branch-prediction hardware as follows: > > > > p = gp.load_explicit(memory_order_consume); /* AKA rcu_dereference() */ > > if (p == GUESS) { > > do_something(GUESS->a, GUESS->b, GUESS->c); > > GUESS->d = 1; > > } else { > > do_something(p->a, p->b, p->c); > > p->d = 1; > > } > > I assume that this is a potential transformation by a compiler. Again, yep! > > The CPU's branch-prediction hardware squashes speculation in the case where > > the guess was wrong, and this prevents the speculative store to ->d from > > ever being visible. However, the then-clause breaks dependencies, which > > means that the loads -could- be speculated, so that do_something() gets > > passed pre-initialization values. > > > > Now, I hope and expect that the wording in the standard about dependency > > ordering prohibits this sort of thing. But I do not yet know for certain. > > The transformation would be incorrect. p->a in the source code carries > a dependency, and as you say, the transformed code wouldn't have that > dependency any more. So the transformed code would lose ordering > constraints that it has in the virtual machine, so in the absence of > other proofs of correctness based on properties not shown in the > example, the transformed code would not result in the same behavior as > allowed by the abstract machine. Glad that you agree! ;-) > If the transformation would actually be by a programmer, then this > wouldn't do the same as the first example because mo_consume doesn't > work through the if statement. Agreed. > Are there other specific concerns that you have regarding this example? Nope. Just generalized paranoia. (But just because I am paranoid doesn't mean that there isn't a bug lurking somewhere in the standard, the compiler, the kernel, or my own head!) I will likely have more once I start mapping Linux kernel atomics to the C11 standard. One more paper past N3934 comes first, though. 
Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-17 19:55 ` Torvald Riegel 2014-02-17 20:18 ` Linus Torvalds @ 2014-02-17 20:23 ` Paul E. McKenney 2014-02-17 21:05 ` Torvald Riegel 1 sibling, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-17 20:23 UTC (permalink / raw) To: Torvald Riegel Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 17, 2014 at 08:55:47PM +0100, Torvald Riegel wrote: > On Sat, 2014-02-15 at 10:49 -0800, Linus Torvalds wrote: > > On Sat, Feb 15, 2014 at 9:45 AM, Torvald Riegel <triegel@redhat.com> wrote: > > > > > > I think a major benefit of C11's memory model is that it gives a > > > *precise* specification for how a compiler is allowed to optimize. > > > > Clearly it does *not*. This whole discussion is proof of that. It's > > not at all clear, > > It might not be an easy-to-understand specification, but as far as I'm > aware it is precise. The Cambridge group's formalization certainly is > precise. From that, one can derive (together with the usual rules for > as-if etc.) what a compiler is allowed to do (assuming that the standard > is indeed precise). My replies in this discussion have been based on > reasoning about the standard, and not secret knowledge (with the > exception of no-out-of-thin-air, which is required in the standard's > prose but not yet formalized). > > I agree that I'm using the formalization as a kind of placeholder for > the standard's prose (which isn't all that easy to follow for me > either), but I guess there's no way around an ISO standard using prose. > > If you see a case in which the standard isn't precise, please bring it > up or open a C++ CWG issue for it. I suggest that I go through the Linux kernel's requirements for atomics and memory barriers and see how they map to C11 atomics. With that done, we would have very specific examples to go over. 
Without that done, the discussion won't converge very well. Seem reasonable? Thanx, Paul > > and the standard apparently is at least debatably > > allowing things that shouldn't be allowed. > > Which example do you have in mind here? Haven't we resolved all the > debated examples, or did I miss any? > > > It's also a whole lot more > > complicated than "volatile", so the likelihood of a compiler writer > > actually getting it right - even if the standard does - is lower. > > It's not easy, that's for sure, but none of the high-performance > alternatives are easy either. There are testing tools out there based > on the formalization of the model, and we've found bugs with them. > > And the alternative of using something not specified by the standard is > even worse, I think, because then you have to guess what a compiler > might do, without having any constraints; IOW, one is resorting to "no > sane compiler would do that", and that doesn't seem to very robust > either. > > ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-17 20:23 ` Paul E. McKenney @ 2014-02-17 21:05 ` Torvald Riegel 0 siblings, 0 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-17 21:05 UTC (permalink / raw) To: paulmck Cc: Linus Torvalds, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, 2014-02-17 at 12:23 -0800, Paul E. McKenney wrote: > On Mon, Feb 17, 2014 at 08:55:47PM +0100, Torvald Riegel wrote: > > On Sat, 2014-02-15 at 10:49 -0800, Linus Torvalds wrote: > > > On Sat, Feb 15, 2014 at 9:45 AM, Torvald Riegel <triegel@redhat.com> wrote: > > > > > > > > I think a major benefit of C11's memory model is that it gives a > > > > *precise* specification for how a compiler is allowed to optimize. > > > > > > Clearly it does *not*. This whole discussion is proof of that. It's > > > not at all clear, > > > > It might not be an easy-to-understand specification, but as far as I'm > > aware it is precise. The Cambridge group's formalization certainly is > > precise. From that, one can derive (together with the usual rules for > > as-if etc.) what a compiler is allowed to do (assuming that the standard > > is indeed precise). My replies in this discussion have been based on > > reasoning about the standard, and not secret knowledge (with the > > exception of no-out-of-thin-air, which is required in the standard's > > prose but not yet formalized). > > > > I agree that I'm using the formalization as a kind of placeholder for > > the standard's prose (which isn't all that easy to follow for me > > either), but I guess there's no way around an ISO standard using prose. > > > > If you see a case in which the standard isn't precise, please bring it > > up or open a C++ CWG issue for it. > > I suggest that I go through the Linux kernel's requirements for atomics > and memory barriers and see how they map to C11 atomics. With that done, > we would have very specific examples to go over. 
Without that done, the > discussion won't converge very well. > > Seem reasonable? Sounds good! ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-14 19:50 ` Linus Torvalds 2014-02-14 20:02 ` Linus Torvalds @ 2014-02-15 17:30 ` Torvald Riegel 2014-02-15 19:15 ` Linus Torvalds 1 sibling, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-15 17:30 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Fri, 2014-02-14 at 11:50 -0800, Linus Torvalds wrote: > On Fri, Feb 14, 2014 at 9:29 AM, Paul E. McKenney > <paulmck@linux.vnet.ibm.com> wrote: > > > > Linus, Peter, any objections to marking places where we are relying on > > ordering from control dependencies against later stores? This approach > > seems to me to have significant documentation benefits. > > Quite frankly, I think it's stupid, and the "documentation" is not a > benefit, it's just wrong. I think the example is easy to misunderstand, because the context isn't clear. Therefore, let me first try to clarify the background. (1) The abstract machine does not write speculatively. (2) Emitting a branch instruction and executing a branch at runtime is not part of the specified behavior of the abstract machine. Of course, the abstract machine performs conditional execution, but that just specifies the output / side effects that it must produce (e.g., volatile stores) -- not with which hardware instructions it is producing this. (3) A compiled program must produce the same output as if executed by the abstract machine. Thus, we need to be careful what "speculative store" is meant to refer to. A few examples: if (atomic_load(&x, mo_relaxed) == 1) atomic_store(&y, 3, mo_relaxed); Here, the requirement is that in terms of program logic, y is assigned 3 if x equals 1. It's not specified how an implementation does that. * If the compiler can prove that x is always 1, then it can remove the branch. This is because of (2). Because of the proof, (1) is not violated. 
* If the compiler can prove that the store to y is never observed or does not change the program's output, the store can be removed. if (atomic_load(&x, mo_relaxed) == 1) { atomic_store(&y, 3, mo_relaxed); other_a(); } else { atomic_store(&y, 3, mo_relaxed); other_b(); } Here, y will be assigned to regardless of the value of x. * The compiler can hoist the store out of the two branches. This is because the store and the branch instruction aren't observable outcomes of the abstract machine. * The compiler can even move the store to y before the load from x (unless this affects logical program order of this thread in some way.) This is because the load/store are ordered by sequenced-before (intra-thread), but mo_relaxed allows the hardware to reorder, so the compiler can do it as well (IOW, other threads can't expect a particular order). if (atomic_load(&x, mo_acquire) == 1) atomic_store(&y, 3, mo_relaxed); This is similar to the first case, but with stronger memory order. * If the compiler proves that x is always 1, then it does so by showing that the load will always be able to read from a particular store (or several of them) that (all) assigned 1 to x -- as specified by the abstract machine and taking the forward progress guarantees into account. In general, it still has to establish the synchronized-with edge if any of those stores used release_mo (or other fences resulting in the same situation), so it can't just get rid of the acquire "fence" in this case. (There are probably situations in which this can be done, but I can't characterize them easily at the moment.) These examples all rely on the abstract machine as specified in the current standard. In contrast, the example that Paul (and Peter, I assume) were looking at is not currently modeled by the standard. AFAIU, they want to exploit that control dependencies, when encountered in binary code, can result in the hardware giving certain ordering guarantees. 
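The store-hoisting case in the second example above can be written out as two compilable forms. This is a sketch of the transformation being described, with counter increments standing in for other_a()/other_b():

```c
#include <stdatomic.h>

/* Source form: y is stored on both arms of the branch. */
void store_in_both_arms(atomic_int *x, atomic_int *y, int *a, int *b)
{
    if (atomic_load_explicit(x, memory_order_relaxed) == 1) {
        atomic_store_explicit(y, 3, memory_order_relaxed);
        (*a)++;                               /* stand-in for other_a() */
    } else {
        atomic_store_explicit(y, 3, memory_order_relaxed);
        (*b)++;                               /* stand-in for other_b() */
    }
}

/* A transformation the compiler may perform: the store happens on both
   paths, and mo_relaxed gives other threads no ordering to rely on, so
   the store can be hoisted above the branch -- and even above the load. */
void store_hoisted(atomic_int *x, atomic_int *y, int *a, int *b)
{
    atomic_store_explicit(y, 3, memory_order_relaxed);
    if (atomic_load_explicit(x, memory_order_relaxed) == 1)
        (*a)++;
    else
        (*b)++;
}
```

Neither the branch nor the position of the store is an observable outcome of the abstract machine here, which is why the hoisted form is a valid implementation of the source form.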
This is vaguely similar to mo_consume which is about data dependencies. mo_consume is, partially due to how it's specified, pretty hard to implement for compilers in a way that actually exploits and preserves data dependencies and not just substitutes mo_consume for a stronger memory order. Part of this problem is that the standard takes an opt-out approach regarding the code that should track dependencies (e.g., certain operators are specified as not preserving them), instead of cleanly carving out meaningful operators where one can track dependencies without obstructing generally useful compiler optimizations (i.e., "opt-in"). This leads to cases such as "*(p + f - f)", where the compiler either has to keep f - f or emit a stronger fence if f is originating from a mo_consume load. Furthermore, dependencies are supposed to be tracked across any load and store, so the compiler needs to do points-to analysis if it wants to optimize this as much as possible. Paul and I have been thinking about alternatives, and one of them was doing the opt-in by demarcating code that needs explicit dependency tracking because it wants to exploit mo_consume. Back to HW control dependencies, this led to the idea of marking the "control dependencies" in the source code (ie, on the abstract machine level), that need to be preserved in the generated binary code, even if they have no semantic meaning on the abstract machine level. So, this is something extra that isn't modeled in the standard currently, because of (1) and (2) above. (Note that it's clearly possible that I misunderstand the goals of Paul/Peter. But then this would just indicate that working on precise specifications does help :) > How would you figure out whether your added "documentation" holds true > for particular branches but not others? > > How could you *ever* trust a compiler that makes the dependency meaningful? Does the above clarify the situation? If not, can you perhaps rephrase any remaining questions? 
> Again, let's keep this simple and sane: > > - if a compiler ever generates code where an atomic store movement is > "visible" in any way, then that compiler is broken shit. Unless volatile, the store is not part of the "visible" output of the abstract machine, and thus an implementation "detail". In turn, any correct store movement must not affect the output of the program, so the implementation detail remains invisible. > I don't understand why you even argue this. Seriously, Paul, you seem > to *want* to think that "broken shit" is acceptable, and that we > should then add magic markers to say "now you need to *not* be broken > shit". > > Here's a magic marker for you: DON'T USE THAT BROKEN COMPILER. > > And if a compiler can *prove* that whatever code movement it does > cannot make a difference, then let it do so. No amount of > "documentation" should matter. Enabling that is certainly a goal of how the standard specifies all this. I'll let you sort out whether you want to exploit the control dependency thing :) > Seriously, this whole discussion has been completely moronic. I don't > understand why you even bring shit like this up: > > > > r1 = atomic_load(x, memory_order_control); > > > if (control_dependency(r1)) > > > atomic_store(y, memory_order_relaxed); > > I mean, really? Anybody who writes code like that, or any compiler > where that "control_dependency()" marker makes any difference > what-so-ever for code generation should just be retroactively aborted. It doesn't make a difference in the standard as specified (well, there's no control_dependency :). I hope the background above clarifies the discussed extension idea this originated from. > There is absolutely *zero* reason for that "control_dependency()" > crap. If you ever find a reason for it, it is either because the > compiler is buggy, or because the standard is so shit that we should > never *ever* use the atomics. > > Seriously. 
This thread has devolved into some kind of "just what kind > of idiotic compiler cesspool crap could we accept". Get away from that > f*cking mindset. We don't accept *any* crap. > > Why are we still discussing this idiocy? It's irrelevant. If the > standard really allows random store speculation, the standard doesn't > matter, and sane people shouldn't waste their time arguing about it. It disallows it if this changes program semantics as specified by the abstract machine. Does that answer your concerns? (Or, IOW, do you still wonder whether it's crap? ;) ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-15 17:30 ` Torvald Riegel @ 2014-02-15 19:15 ` Linus Torvalds 2014-02-17 22:09 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-15 19:15 UTC (permalink / raw) To: Torvald Riegel Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sat, Feb 15, 2014 at 9:30 AM, Torvald Riegel <triegel@redhat.com> wrote: > > I think the example is easy to misunderstand, because the context isn't > clear. Therefore, let me first try to clarify the background. > > (1) The abstract machine does not write speculatively. > (2) Emitting a branch instruction and executing a branch at runtime is > not part of the specified behavior of the abstract machine. Of course, > the abstract machine performs conditional execution, but that just > specifies the output / side effects that it must produce (e.g., volatile > stores) -- not with which hardware instructions it is producing this. > (3) A compiled program must produce the same output as if executed by > the abstract machine. Ok, I'm fine with that. > Thus, we need to be careful what "speculative store" is meant to refer > to. A few examples: > > if (atomic_load(&x, mo_relaxed) == 1) > atomic_store(&y, 3, mo_relaxed); No, please don't use this idiotic example. It is wrong. The fact is, if a compiler generates anything but the obvious sequence (read/cmp/branch/store - where branch/store might obviously be done with some other machine conditional like a predicate), the compiler is wrong. Anybody who argues anything else is wrong, or confused, or confusing. Instead, argue about *other* sequences where the compiler can do something. For example, this sequence: atomic_store(&x, a, mo_relaxed); b = atomic_load(&x, mo_relaxed); can validly be transformed to atomic_store(&x, a, mo_relaxed); b = (typeof(x)) a; and I think everybody agrees about that. 
In fact, that optimization can be done even for mo_strict. But even that "obvious" optimization has subtle cases. What if the store is relaxed, but the load is strict? You can't do the optimization without a lot of thought, because dropping the strict load would drop an ordering point. So even the "store followed by exact same load" case has subtle issues. With similar caveats, it is perfectly valid to merge two consecutive loads, and to merge two consecutive stores. Now that means that the sequence atomic_store(&x, 1, mo_relaxed); if (atomic_load(&x, mo_relaxed) == 1) atomic_store(&y, 3, mo_relaxed); can first be optimized to atomic_store(&x, 1, mo_relaxed); if (1 == 1) atomic_store(&y, 3, mo_relaxed); and then you get the end result that you wanted in the first place (including the ability to re-order the two stores due to the relaxed ordering, assuming they can be proven to not alias - and please don't use the idiotic type-based aliasing rules). Bringing up your first example is pure and utter confusion. Don't do it. Instead, show what are obvious and valid transformations, and then you can bring up these kinds of combinations as "look, this is obviously also correct". Now, the reason I want to make it clear that the code example you point to is a crap example is that because it doesn't talk about the *real* issue, it also misses a lot of really important details. For example, when you say "if the compiler can prove that the conditional is always true" then YOU ARE TOTALLY WRONG. So why do I say you are wrong, after I just gave you an example of how it happens? Because my example went back to the *real* issue, and there are actual real semantically meaningful details with doing things like load merging. To give an example, let's rewrite things a bit more to use an extra variable: atomic_store(&x, 1, mo_relaxed); a = atomic_load(&x, mo_relaxed); if (a == 1) atomic_store(&y, 3, mo_relaxed); which looks exactly the same. 
If you now say "if you can prove the conditional is always true, you can make the store unconditional", YOU ARE WRONG. Why? This sequence: atomic_store(&x, 1, mo_relaxed); a = atomic_load(&x, mo_relaxed); atomic_store(&y, 3, mo_relaxed); is actually - and very seriously - buggy. Why? Because you have effectively split the atomic_load into two loads - one for the value of 'a', and one for your 'proof' that the store is unconditional. Maybe you go "Nobody sane would do that", and you'd be right. But compilers aren't "sane" (and compiler _writers_ I have some serious doubts about), and how you describe the problem really does affect the solution. My description ("you can combine two subsequent atomic accesses, if you are careful about still having the same ordering points") doesn't have the confusion. It makes it clear that no, you can't speculate writes, but yes, obviously you can combine certain things (think "write buffers" and "memory caches"), and that may make you able to remove the conditional. Which is very different from speculating writes. But my description also makes it clear that the transformation above was buggy, but that rewriting it as a = 1; atomic_store(&y, 3, mo_relaxed); atomic_store(&x, a, mo_relaxed); is fine (doing the re-ordering here just to make a point). So I suspect we actually agree, I just think that the "example" that has been floating around, and the explanation for it, has been actively bad and misleading. Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
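The transformations discussed in this message can be written out in C11 `<stdatomic.h>` syntax as a sketch (the thread's mo_* names map to memory_order_*; the function names are mine):

```c
#include <stdatomic.h>

_Atomic int x, y;

/* Store followed by a load of the same location: the load may be
 * replaced by the stored value. This is valid because a relaxed load
 * observing an intervening store from another thread was also a legal
 * outcome of the original code. */
int store_then_load(int a) {
    atomic_store_explicit(&x, a, memory_order_relaxed);
    return atomic_load_explicit(&x, memory_order_relaxed); /* may become: return a; */
}

/* The store/load/branch sequence from the message, written out.
 * Any rewrite must keep 'a' and the branch condition fed by the SAME
 * single observation of x. */
void original(void) {
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    int a = atomic_load_explicit(&x, memory_order_relaxed);
    if (a == 1)
        atomic_store_explicit(&y, 3, memory_order_relaxed);
}

/* The rewrite called valid above: exactly one value source feeds both
 * the (now folded) condition and any later use of 'a', so the atomic
 * load has not been split into two observations. */
void rewritten(void) {
    int a = 1;
    atomic_store_explicit(&y, 3, memory_order_relaxed);
    atomic_store_explicit(&x, a, memory_order_relaxed);
}
```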
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-15 19:15 ` Linus Torvalds @ 2014-02-17 22:09 ` Torvald Riegel 2014-02-17 22:32 ` Linus Torvalds 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-17 22:09 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Sat, 2014-02-15 at 11:15 -0800, Linus Torvalds wrote: > On Sat, Feb 15, 2014 at 9:30 AM, Torvald Riegel <triegel@redhat.com> wrote: > > > > I think the example is easy to misunderstand, because the context isn't > > clear. Therefore, let me first try to clarify the background. > > > > (1) The abstract machine does not write speculatively. > > (2) Emitting a branch instruction and executing a branch at runtime is > > not part of the specified behavior of the abstract machine. Of course, > > the abstract machine performs conditional execution, but that just > > specifies the output / side effects that it must produce (e.g., volatile > > stores) -- not with which hardware instructions it is producing this. > > (3) A compiled program must produce the same output as if executed by > > the abstract machine. > > Ok, I'm fine with that. > > > Thus, we need to be careful what "speculative store" is meant to refer > > to. A few examples: > > > > if (atomic_load(&x, mo_relaxed) == 1) > > atomic_store(&y, 3, mo_relaxed)); > > No, please don't use this idiotic example. It is wrong. It won't be useful in practice in a lot of cases, but that doesn't mean it's wrong. It's clearly not illegal code. It also serves a purpose: a simple example to reason about a few aspects of the memory model. > The fact is, if a compiler generates anything but the obvious sequence > (read/cmp/branch/store - where branch/store might obviously be done > with some other machine conditional like a predicate), the compiler is > wrong. Why? 
I've reasoned why (1) to (3) above allow in certain cases (i.e., the first load always returning 1) for the branch (or other machine conditional) to not be emitted. So please either poke holes into this reasoning, or clarify that you don't in fact, contrary to what you wrote above, agree with (1) to (3). > Anybody who argues anything else is wrong, or confused, or confusing. I appreciate your opinion, and maybe I'm just one of the three things above (my vote is on "confusing"). But without you saying why, that doesn't help me see what the misunderstanding is here. > Instead, argue about *other* sequences where the compiler can do something. I'd prefer if we could clarify the misunderstanding for the simple case first that doesn't involve stronger ordering requirements in the form of non-relaxed MOs. > For example, this sequence: > > atomic_store(&x, a, mo_relaxed); > b = atomic_load(&x, mo_relaxed); > > can validly be transformed to > > atomic_store(&x, a, mo_relaxed); > b = (typeof(x)) a; > > and I think everybody agrees about that. In fact, that optimization > can be done even for mo_strict. Yes. > But even that "obvious" optimization has subtle cases. What if the > store is relaxed, but the load is strict? You can't do the > optimization without a lot of thought, because dropping the strict load > would drop an ordering point. So even the "store followed by exact > same load" case has subtle issues. Yes, if a compiler wants to optimize that, it has to give it more thought. My gut feeling is that either the store should get the stronger ordering, or the accesses should be merged. But I'd have to think more about that one (which I can do on request). > With similar caveats, it is perfectly valid to merge two consecutive > loads, and to merge two consecutive stores. 
> > Now that means that the sequence > > atomic_store(&x, 1, mo_relaxed); > if (atomic_load(&x, mo_relaxed) == 1) > atomic_store(&y, 3, mo_relaxed); > > can first be optimized to > > atomic_store(&x, 1, mo_relaxed); > if (1 == 1) > atomic_store(&y, 3, mo_relaxed); > > and then you get the end result that you wanted in the first place > (including the ability to re-order the two stores due to the relaxed > ordering, assuming they can be proven to not alias - and please don't > use the idiotic type-based aliasing rules). > > Bringing up your first example is pure and utter confusion. Sorry if it was confusing. But then maybe we need to talk about it more, because it shouldn't be confusing if we agree on what the memory model allows and what not. I had originally picked the example because it was related to the example Paul/Peter brought up. > Don't do > it. Instead, show what are obvious and valid transformations, and then > you can bring up these kinds of combinations as "look, this is > obviously also correct". I have my doubts whether the best way to reason about the memory model is by thinking about specific compiler transformations. YMMV, obviously. The -- kind of vague -- reason is that the allowed transformations will be more complicated to reason about than the allowed output of a concurrent program when understanding the memory model (ie, ordering and interleaving of memory accesses, etc.). However, I can see that when trying to optimize with a hardware memory model in mind, this might look appealing. What the compiler will do is exploiting knowledge about all possible executions. For example, if it knows that x is always 1, it will do the transform. The user would need to consider that case anyway, but he/she can do without thinking about all potentially allowed compiler optimizations. 
> Now, the reason I want to make it clear that the code example you > point to is a crap example is that because it doesn't talk about the > *real* issue, it also misses a lot of really important details. > > For example, when you say "if the compiler can prove that the > conditional is always true" then YOU ARE TOTALLY WRONG. What are you trying to say? That a compiler can't prove that this is the case in any program? Or that it won't be likely it can? Or what else? > So why do I say you are wrong, after I just gave you an example of how > it happens? Because my example went back to the *real* issue, and > there are actual real semantically meaningful details with doing > things like load merging. > > To give an example, let's rewrite things a bit more to use an extra variable: > > atomic_store(&x, 1, mo_relaxed); > a = atomic_load(&x, mo_relaxed); > if (a == 1) > atomic_store(&y, 3, mo_relaxed); > > which looks exactly the same. I'm confused. Is this a new example? Or is it supposed to be a compiler's rewrite of the following? atomic_store(&x, 1, mo_relaxed); if (atomic_load(&x, mo_relaxed) == 1) atomic_store(&y, 3, mo_relaxed); If that's supposed to be the case, this is the same. If it is supposed to be a rewrite of the original example with *just* two stores, this is obviously an incorrect transformation (unless it can be proven that no other thread writes to x). The compiler can't make stuff conditional that wasn't conditional before. That would change the allowed behavior of the abstract machine. > If you now say "if you can prove the > conditional is always true, you can make the store unconditional", YOU > ARE WRONG. > > Why? > > This sequence: > > atomic_store(&x, 1, mo_relaxed); > a = atomic_load(&x, mo_relaxed); > atomic_store(&y, 3, mo_relaxed); > > is actually - and very seriously - buggy. > > Why? Because you have effectively split the atomic_load into two loads > - one for the value of 'a', and one for your 'proof' that the store is > unconditional. 
I can't follow that, because it isn't clear to me which code sequences are meant to belong together, and which transformations the compiler is supposed to make. If you would clarify that, then I can reply to this part. > Maybe you go "Nobody sane would do that", and you'd be right. But > compilers aren't "sane" (and compiler _writers_ I have some serious > doubts about), and how you describe the problem really does affect the > solution. To me, that sounds like a strong argument for a strong and precise specification, and formal methods. If we have that, stuff can be resolved without having to rely on "sane" people, as long as they can follow logic. Because this removes the problem from "hopefully all of us want the same" (which is sometimes called "sane", from both sides of a disagreement) to "just stick to the specification". > My description ("you can combine two subsequent atomic accesses, if > you are careful about still having the same ordering points") doesn't > have the confusion. I'd much rather take the memory model's specification. One example is that it separates safety/correctness guarantees from liveness/progress guarantees. "the same ordering points" can be understood as not being able to merge subsequent loads to the same memory location. Both progress guarantees and correctness guarantees can affect this, but they have different consequences. For example, if it may be perfectly fine to assume that two subsequent loads to the same memory location may always execute atomically (and thus join them), you don't want to do this to all such loads in a spin-wait loop (or you'll likely prevent forward progress). In the end, we need to define "are careful about" and similar clauses. Eventually, after defining everything, we'll end up with a memory model that is specified in as much detail as C11's is. 
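The spin-wait caveat can be made concrete with a small sketch (assuming C11 atomics; in real use another thread sets the flag, and the names here are mine):

```c
#include <stdatomic.h>

_Atomic int ready;

/* Each iteration must perform a fresh load. Merging a *finite* number
 * of adjacent relaxed loads is a correctness-preserving transformation,
 * but merging them unboundedly -- i.e., hoisting the load out of the
 * loop -- would turn a terminating program into a non-terminating one.
 * That is the forward-progress side of "are careful about". */
void wait_for_ready(void) {
    while (atomic_load_explicit(&ready, memory_order_relaxed) == 0)
        ; /* spin */
}
```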
> It makes it clear that no, you can't speculate > writes, but yes, obviously you can combine certain things (think > "write buffers" and "memory caches"), and that may make you able to > remove the conditional. Which is very different from speculating > writes. It's equally fine to reason about executions and allowed interleavings on the level of the source program (ie, the memory model and abstract machine). For example, combining certain things is essentially assuming the atomic execution of those things; it will be clear from the program, without thinking about write buffers and caches, whether that's possible or not in the source program. If it is, the compiler can do it too. The compiler must not lose anything (eg, memory orders) when doing that; if it wants to remove memory orders, it must show that this doesn't result in executions not possible before. Also, it must not prevent executions that ensure forward progress (e.g., a spin-wait loop should better not be atomic wrt. everything else, or you'll block forever). > But my description also makes it clear that the transformation above > was buggy, but that rewriting it as > > a = 1; > atomic_store(&y, 3, mo_relaxed); > atomic_store(&x, a, mo_relaxed); > > is fine (doing the re-ordering here just to make a point). Oh, so you were assuming that a is actually used afterwards? > So I suspect we actually agree, I just think that the "example" that > has been floating around, and the explanation for it, has been > actively bad and misleading. I do hope we agree. I'm still confused why you think the branch needs to be emitted in the first example I brought up, but other than that, we seem to be on the same page. I'm hopeful we can clarify the other points if we keep talking about different examples. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-17 22:09 ` Torvald Riegel @ 2014-02-17 22:32 ` Linus Torvalds 2014-02-17 23:17 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-17 22:32 UTC (permalink / raw) To: Torvald Riegel Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 17, 2014 at 2:09 PM, Torvald Riegel <triegel@redhat.com> wrote: > On Sat, 2014-02-15 at 11:15 -0800, Linus Torvalds wrote: >> > >> > if (atomic_load(&x, mo_relaxed) == 1) >> > atomic_store(&y, 3, mo_relaxed)); >> >> No, please don't use this idiotic example. It is wrong. > > It won't be useful in practice in a lot of cases, but that doesn't mean > it's wrong. It's clearly not illegal code. It also serves a purpose: a > simple example to reason about a few aspects of the memory model. It's not illegal code, but if you claim that you can make that store unconditional, it's a pointless and wrong example. >> The fact is, if a compiler generates anything but the obvious sequence >> (read/cmp/branch/store - where branch/store might obviously be done >> with some other machine conditional like a predicate), the compiler is >> wrong. > > Why? I've reasoned why (1) to (3) above allow in certain cases (i.e., > the first load always returning 1) for the branch (or other machine > conditional) to not be emitted. So please either poke holes into this > reasoning, or clarify that you don't in fact, contrary to what you wrote > above, agree with (1) to (3). The thing is, the first load DOES NOT RETURN 1. It returns whatever that memory location contains. End of story. Stop claiming it "can return 1".. It *never* returns 1 unless you do the load and *verify* it, or unless the load itself can be made to go away. And with the code sequence given, that just doesn't happen. END OF STORY. So your argument is *shit*. Why do you continue to argue it? 
I told you how that load can go away, and you agreed. But IT CANNOT GO AWAY any other way. You cannot claim "the compiler knows". The compiler doesn't know. It's that simple. >> So why do I say you are wrong, after I just gave you an example of how >> it happens? Because my example went back to the *real* issue, and >> there are actual real semantically meaningful details with doing >> things like load merging. >> >> To give an example, let's rewrite things a bit more to use an extra variable: >> >> atomic_store(&x, 1, mo_relaxed); >> a = atomic_load(&x, mo_relaxed); >> if (a == 1) >> atomic_store(&y, 3, mo_relaxed); >> >> which looks exactly the same. > > I'm confused. Is this a new example? That is a new example. The important part is that it has left a "trace" for the programmer: because 'a' contains the value, the programmer can now look at the value later and say "oh, we know we did a store iff a was 1" >> This sequence: >> >> atomic_store(&x, 1, mo_relaxed); >> a = atomic_load(&x, mo_relaxed); >> atomic_store(&y, 3, mo_relaxed); >> >> is actually - and very seriously - buggy. >> >> Why? Because you have effectively split the atomic_load into two loads >> - one for the value of 'a', and one for your 'proof' that the store is >> unconditional. > > I can't follow that, because it isn't clear to me which code sequences > are meant to belong together, and which transformations the compiler is > supposed to make. If you would clarify that, then I can reply to this > part. Basically, if the compiler allows the condition of "I wrote 3 to the y, but the programmer sees 'a' has another value than 1 later" then the compiler is one buggy pile of shit. It fundamentally broke the whole concept of atomic accesses. Basically the "atomic" access to 'x' turned into two different accesses: the one that "proved" that x had the value 1 (and caused the value 3 to be written), and the other load that then writes that other value into 'a'. It's really not that complicated. 
And this is why descriptions like this should ABSOLUTELY NOT BE WRITTEN as "if the compiler can prove that 'x' had the value 1, it can remove the branch". Because that IS NOT SUFFICIENT. That was not a valid transformation of the atomic load. The only valid transformation was the one I stated, namely to remove the load entirely and replace it with the value written earlier in the same execution context. Really, why is this so hard to understand? Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-17 22:32 ` Linus Torvalds @ 2014-02-17 23:17 ` Torvald Riegel 2014-02-18 0:09 ` Linus Torvalds 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-17 23:17 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, 2014-02-17 at 14:32 -0800, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 2:09 PM, Torvald Riegel <triegel@redhat.com> wrote: > > On Sat, 2014-02-15 at 11:15 -0800, Linus Torvalds wrote: > >> > > >> > if (atomic_load(&x, mo_relaxed) == 1) > >> > atomic_store(&y, 3, mo_relaxed)); > >> > >> No, please don't use this idiotic example. It is wrong. > > > > It won't be useful in practice in a lot of cases, but that doesn't mean > > it's wrong. It's clearly not illegal code. It also serves a purpose: a > > simple example to reason about a few aspects of the memory model. > > It's not illegal code, but i you claim that you can make that store > unconditional, it's a pointless and wrong example. > > >> The fact is, if a compiler generates anything but the obvious sequence > >> (read/cmp/branch/store - where branch/store might obviously be done > >> with some other machine conditional like a predicate), the compiler is > >> wrong. > > > > Why? I've reasoned why (1) to (3) above allow in certain cases (i.e., > > the first load always returning 1) for the branch (or other machine > > conditional) to not be emitted. So please either poke holes into this > > reasoning, or clarify that you don't in fact, contrary to what you wrote > > above, agree with (1) to (3). > > The thing is, the first load DOES NOT RETURN 1. It returns whatever > that memory location contains. End of story. The memory location is just an abstraction for state, if it's not volatile. > Stop claiming it "can return 1".. 
It *never* returns 1 unless you do > the load and *verify* it, or unless the load itself can be made to go > away. And with the code sequence given, that just doesn't happen. END > OF STORY. void foo() { atomic<int> x = 1; if (atomic_load(&x, mo_relaxed) == 1) atomic_store(&y, 3, mo_relaxed); } This is a counterexample to your claim, and yes, the compiler has proof that x is 1. It's deliberately simple, but I can replace this with other more advanced situations. For example, if x comes out of malloc (or, on the kernel side, something else that returns non-aliasing memory) and hasn't provably escaped to other threads yet. I haven't posted this full example, but I've *clearly* said that *if* the compiler can prove that the load would always return 1, it can remove it. And it's simple to see why that's the case: If this holds, then in all allowed executions it would load from a known store, mo_relaxed gives no further ordering guarantees, so we can just take the value, and we're good. > So your argument is *shit*. Why do you continue to argue it? Maybe because it isn't? Maybe you should try to at least trust that my intentions are good, even if distrusting my ability to reason. > I told you how that load can go away, and you agreed. But IT CANNOT GO > AWAY any other way. You cannot claim "the compiler knows". The > compiler doesn't know. It's that simple. Oh yes it can. Because of the same rules that allow you to perform the other transformations. Please try to see the similarities here. You previously said you don't want to mix volatile semantics and atomics. This is something that's being applied in this example. > >> So why do I say you are wrong, after I just gave you an example of how > >> it happens? Because my example went back to the *real* issue, and > >> there are actual real semantically meaningful details with doing > >> things like load merging. 
> >> > >> To give an example, let's rewrite things a bit more to use an extra variable: > >> > >> atomic_store(&x, 1, mo_relaxed); > >> a = atomic_load(&x, mo_relaxed); > >> if (a == 1) > >> atomic_store(&y, 3, mo_relaxed); > >> > >> which looks exactly the same. > > > > I'm confused. Is this a new example? > > That is a new example. The important part is that it has left a > "trace" for the programmer: because 'a' contains the value, the > programmer can now look at the value later and say "oh, we know we did > a store iff a was 1" > > >> This sequence: > >> > >> atomic_store(&x, 1, mo_relaxed); > >> a = atomic_load(&x, mo_relaxed); > >> atomic_store(&y, 3, mo_relaxed); > >> > >> is actually - and very seriously - buggy. > >> > >> Why? Because you have effectively split the atomic_load into two loads > >> - one for the value of 'a', and one for your 'proof' that the store is > >> unconditional. > > > > I can't follow that, because it isn't clear to me which code sequences > > are meant to belong together, and which transformations the compiler is > > supposed to make. If you would clarify that, then I can reply to this > > part. > > Basically, if the compiler allows the condition of "I wrote 3 to the > y, but the programmer sees 'a' has another value than 1 later" then > the compiler is one buggy pile of shit. It fundamentally broke the > whole concept of atomic accesses. Basically the "atomic" access to 'x' > turned into two different accesses: the one that "proved" that x had > the value 1 (and caused the value 3 to be written), and the other load > that then writes that other value into 'a'. > > It's really not that complicated. Yes, that's not complicated, but I assumed this to be obvious and wasn't aware that this is contentious. > And this is why descriptions like this should ABSOLUTELY NOT BE > WRITTEN as "if the compiler can prove that 'x' had the value 1, it can > remove the branch". Because that IS NOT SUFFICIENT. 
That was not a > valid transformation of the atomic load. Now I see where the confusion was. Sorry if I didn't point this out explicitly, but if it proves that x has the value 1, the first thing a compiler would naturally do is to replace *the load* by 1, and *afterwards* remove the branch because it sees 1 == 1. Nonetheless, if being picky about it, keeping the load is correct if the proof that x will always have the value 1 is correct (it might prevent some optimizations though; in foo() above, keeping the load would also prevent removing the variable on the stack). In a correct compiler, this will of course lead to the memory location actually existing and having the value 1 in a compiled program. > The only valid transformation was the one I stated, namely to remove > the load entirely and replace it with the value written earlier in the > same execution context. No, your transformation is similar but has a different reasoning behind it. What the compiler (easily) proves in your example is that *this thread* is always allowed to observe its prior store to x. That's a different assumption than that x will always be of value 1 when this code sequence is executed. Therefore, the results for the compilation are also slightly different. > Really, why is this so hard to understand? It's not hard to understand, we've just been talking past each other. (But that's something we both participated in, I'd like to point out.) I think it also shows that reasoning about executions starting with what the compiler and HW can do to the code is more complex than reasoning about allowed executions of the abstract machine. If using the latter, and you would have formulated the proof the compiler does about the executions, we might have been able to see the misunderstanding earlier. ^ permalink raw reply [flat|nested] 285+ messages in thread
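For completeness, the "comes out of malloc and hasn't escaped" variant mentioned in this exchange might look like the following sketch (`bar` is my name, not code from the thread):

```c
#include <stdatomic.h>
#include <stdlib.h>

/* x comes from malloc and is never published to another thread, so the
 * compiler can prove that the relaxed load returns 1 in every allowed
 * execution of the abstract machine -- and may therefore replace the
 * load with 1 and then drop the branch, per the reasoning above. */
int bar(void) {
    _Atomic int *x = malloc(sizeof *x);
    if (!x)
        return -1;
    atomic_init(x, 1);
    int r = 0;
    if (atomic_load_explicit(x, memory_order_relaxed) == 1)
        r = 3;
    free(x);
    return r;
}
```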
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-17 23:17 ` Torvald Riegel @ 2014-02-18 0:09 ` Linus Torvalds 2014-02-18 15:46 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Linus Torvalds @ 2014-02-18 0:09 UTC (permalink / raw) To: Torvald Riegel Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, Feb 17, 2014 at 3:17 PM, Torvald Riegel <triegel@redhat.com> wrote: > On Mon, 2014-02-17 at 14:32 -0800, > >> Stop claiming it "can return 1".. It *never* returns 1 unless you do >> the load and *verify* it, or unless the load itself can be made to go >> away. And with the code sequence given, that just doesn't happen. END >> OF STORY. > > void foo(); > { > atomic<int> x = 1; > if (atomic_load(&x, mo_relaxed) == 1) > atomic_store(&y, 3, mo_relaxed)); > } This is the very example I gave, where the real issue is not that "you prove that load returns 1", you instead say "store followed by a load can be combined". I (in another email I just wrote) tried to show why the "prove something is true" is a very dangerous model. Seriously, it's pure crap. It's broken. If the C standard defines atomics in terms of "provable equivalence", it's broken. Exactly because on a *virtual* machine you can prove things that are not actually true in a *real* machine. I have the example of value speculation changing the memory ordering model of the actual machine. See? Linus ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-18 0:09 ` Linus Torvalds @ 2014-02-18 15:46 ` Torvald Riegel 0 siblings, 0 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-18 15:46 UTC (permalink / raw) To: Linus Torvalds Cc: Paul McKenney, Will Deacon, Peter Zijlstra, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, akpm, mingo, gcc On Mon, 2014-02-17 at 16:09 -0800, Linus Torvalds wrote: > On Mon, Feb 17, 2014 at 3:17 PM, Torvald Riegel <triegel@redhat.com> wrote: > > On Mon, 2014-02-17 at 14:32 -0800, > > > >> Stop claiming it "can return 1".. It *never* returns 1 unless you do > >> the load and *verify* it, or unless the load itself can be made to go > >> away. And with the code sequence given, that just doesn't happen. END > >> OF STORY. > > > > void foo(); > > { > > atomic<int> x = 1; > > if (atomic_load(&x, mo_relaxed) == 1) > > atomic_store(&y, 3, mo_relaxed)); > > } > > This is the very example I gave, where the real issue is not that "you > prove that load returns 1", you instead say "store followed by a load > can be combined". > > I (in another email I just wrote) tried to show why the "prove > something is true" is a very dangerous model. Seriously, it's pure > crap. It's broken. I don't see anything dangerous in the example above with the language semantics as specified: It's a well-defined situation, given the rules of the language. I replied to the other email you wrote with my viewpoint on why the above is useful, how it compares to what you seem to want, and where I think we need to start to bridge the gap. > If the C standard defines atomics in terms of "provable equivalence", > it's broken. Exactly because on a *virtual* machine you can prove > things that are not actually true in a *real* machine. For the control dependencies you have in mind, it's actually the other way around. You expect the real machine's properties in a program whose semantics only give you the virtual machine's properties. 
Anything you prove on the virtual machine will be true on the real machine (in a correct implementation) -- but you can't expect to have real-machine properties on a language that's based on the virtual machine. > I have the > example of value speculation changing the memory ordering model of the > actual machine. This example is not true for the language as specified. It is true for a modified language that you have in mind, but for this one I've just seen pretty rough rules so far. Please see my other reply. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-07 18:02 ` Paul E. McKenney 2014-02-10 0:27 ` Torvald Riegel @ 2014-02-10 11:48 ` Peter Zijlstra 2014-02-10 11:49 ` Will Deacon 1 sibling, 1 reply; 285+ messages in thread From: Peter Zijlstra @ 2014-02-10 11:48 UTC (permalink / raw) To: Paul E. McKenney Cc: Will Deacon, Torvald Riegel, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote: > As near as I can tell, compiler writers hate the idea of prohibiting > speculative-store optimizations because it requires them to introduce > both control and data dependency tracking into their compilers. Many of > them seem to hate dependency tracking with a purple passion. At least, > such a hatred would go a long way towards explaining the incomplete > and high-overhead implementations of memory_order_consume, the long > and successful use of idioms based on the memory_order_consume pattern > notwithstanding [*]. ;-) Just tell them that because the hardware provides control dependencies we actually use and rely on them. Not that I expect they care too much what we do, given the current state of things. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-10 11:48 ` Peter Zijlstra @ 2014-02-10 11:49 ` Will Deacon 2014-02-10 12:05 ` Peter Zijlstra 2014-02-10 15:04 ` Paul E. McKenney 0 siblings, 2 replies; 285+ messages in thread From: Will Deacon @ 2014-02-10 11:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Paul E. McKenney, Torvald Riegel, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Mon, Feb 10, 2014 at 11:48:13AM +0000, Peter Zijlstra wrote: > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote: > > As near as I can tell, compiler writers hate the idea of prohibiting > > speculative-store optimizations because it requires them to introduce > > both control and data dependency tracking into their compilers. Many of > > them seem to hate dependency tracking with a purple passion. At least, > > such a hatred would go a long way towards explaining the incomplete > > and high-overhead implementations of memory_order_consume, the long > > and successful use of idioms based on the memory_order_consume pattern > > notwithstanding [*]. ;-) > > Just tell them that because the hardware provides control dependencies > we actually use and rely on them. s/control/address/ ? Will ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-10 11:49 ` Will Deacon @ 2014-02-10 12:05 ` Peter Zijlstra 2014-02-10 15:04 ` Paul E. McKenney 1 sibling, 0 replies; 285+ messages in thread From: Peter Zijlstra @ 2014-02-10 12:05 UTC (permalink / raw) To: Will Deacon Cc: Paul E. McKenney, Torvald Riegel, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Mon, Feb 10, 2014 at 11:49:29AM +0000, Will Deacon wrote: > On Mon, Feb 10, 2014 at 11:48:13AM +0000, Peter Zijlstra wrote: > > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote: > > > As near as I can tell, compiler writers hate the idea of prohibiting > > > speculative-store optimizations because it requires them to introduce > > > both control and data dependency tracking into their compilers. Many of > > > them seem to hate dependency tracking with a purple passion. At least, > > > such a hatred would go a long way towards explaining the incomplete > > > and high-overhead implementations of memory_order_consume, the long > > > and successful use of idioms based on the memory_order_consume pattern > > > notwithstanding [*]. ;-) > > > > Just tell them that because the hardware provides control dependencies > > we actually use and rely on them. > > s/control/address/ ? Nope, control. Since stores cannot be speculated and thus require linear control flow history we can use it to order LOAD -> STORE when the LOAD is required for the control flow decision and the STORE depends on the control flow path. Also see commit 18c03c61444a211237f3d4782353cb38dba795df to Documentation/memory-barriers.txt --- commit c7f2e3cd6c1f4932ccc4135d050eae3f7c7aef63 Author: Peter Zijlstra <peterz@infradead.org> Date: Mon Nov 25 11:49:10 2013 +0100 perf: Optimize ring-buffer write by depending on control dependencies Remove a full barrier from the ring-buffer write path by relying on a control dependency to order a LOAD -> STORE scenario. Cc: "Paul E. 
McKenney" <paulmck@us.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-8alv40z6ikk57jzbaobnxrjl@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org> diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c index e8b168af135b..146a5792b1d2 100644 --- a/kernel/events/ring_buffer.c +++ b/kernel/events/ring_buffer.c @@ -61,19 +61,20 @@ static void perf_output_put_handle(struct perf_output_handle *handle) * * kernel user * - * READ ->data_tail READ ->data_head - * smp_mb() (A) smp_rmb() (C) - * WRITE $data READ $data - * smp_wmb() (B) smp_mb() (D) - * STORE ->data_head WRITE ->data_tail + * if (LOAD ->data_tail) { LOAD ->data_head + * (A) smp_rmb() (C) + * STORE $data LOAD $data + * smp_wmb() (B) smp_mb() (D) + * STORE ->data_head STORE ->data_tail + * } * * Where A pairs with D, and B pairs with C. * - * I don't think A needs to be a full barrier because we won't in fact - * write data until we see the store from userspace. So we simply don't - * issue the data WRITE until we observe it. Be conservative for now. + * In our case (A) is a control dependency that separates the load of + * the ->data_tail and the stores of $data. In case ->data_tail + * indicates there is no room in the buffer to store $data we do not. * - * OTOH, D needs to be a full barrier since it separates the data READ + * D needs to be a full barrier since it separates the data READ * from the tail WRITE. * * For B a WMB is sufficient since it separates two WRITEs, and for C @@ -81,7 +82,7 @@ static void perf_output_put_handle(struct perf_output_handle *handle) * * See perf_output_begin(). 
*/ - smp_wmb(); + smp_wmb(); /* B, matches C */ rb->user_page->data_head = head; /* @@ -144,17 +145,26 @@ int perf_output_begin(struct perf_output_handle *handle, if (!rb->overwrite && unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size)) goto fail; + + /* + * The above forms a control dependency barrier separating the + * @tail load above from the data stores below. Since the @tail + * load is required to compute the branch to fail below. + * + * A, matches D; the full memory barrier userspace SHOULD issue + * after reading the data and before storing the new tail + * position. + * + * See perf_output_put_handle(). + */ + head += size; } while (local_cmpxchg(&rb->head, offset, head) != offset); /* - * Separate the userpage->tail read from the data stores below. - * Matches the MB userspace SHOULD issue after reading the data - * and before storing the new tail position. - * - * See perf_output_put_handle(). + * We rely on the implied barrier() by local_cmpxchg() to ensure + * none of the data stores below can be lifted up by the compiler. */ - smp_mb(); if (unlikely(head - local_read(&rb->wakeup) > rb->watermark)) local_add(rb->watermark, &rb->wakeup); ^ permalink raw reply related [flat|nested] 285+ messages in thread
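The producer side of Peter's patch can be boiled down to a small sketch (names and sizes hypothetical; the kernel's READ_ONCE/WRITE_ONCE are approximated here with relaxed atomic accesses). The data store sits on a control path decided by the load of the tail, which is what gives the LOAD -> STORE ordering without an explicit barrier:

```c
#include <stdatomic.h>
#include <stdbool.h>

#define BUF_SIZE 16

static int buf[BUF_SIZE];
static unsigned int head;        /* written only by the producer */
static atomic_uint tail;         /* advanced by the consumer     */

bool produce(int v)
{
    unsigned int t = atomic_load_explicit(&tail, memory_order_relaxed);

    if (head - t >= BUF_SIZE)
        return false;            /* full: the failing path stores nothing */

    /* This store is on the taken branch of a test that needs the load
     * of 'tail', so the CPU cannot make it visible before that load
     * resolves -- stores are not speculated.  That is the LOAD -> STORE
     * control dependency the patch above relies on. */
    buf[head % BUF_SIZE] = v;
    head++;
    return true;
}
```

A real ring buffer would also need release semantics when publishing `head` to the consumer; this sketch only shows the control-dependency half under discussion.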
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-10 11:49 ` Will Deacon 2014-02-10 12:05 ` Peter Zijlstra @ 2014-02-10 15:04 ` Paul E. McKenney 2014-02-10 16:22 ` Will Deacon 1 sibling, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-10 15:04 UTC (permalink / raw) To: Will Deacon Cc: Peter Zijlstra, Torvald Riegel, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Mon, Feb 10, 2014 at 11:49:29AM +0000, Will Deacon wrote: > On Mon, Feb 10, 2014 at 11:48:13AM +0000, Peter Zijlstra wrote: > > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote: > > > As near as I can tell, compiler writers hate the idea of prohibiting > > > speculative-store optimizations because it requires them to introduce > > > both control and data dependency tracking into their compilers. Many of > > > them seem to hate dependency tracking with a purple passion. At least, > > > such a hatred would go a long way towards explaining the incomplete > > > and high-overhead implementations of memory_order_consume, the long > > > and successful use of idioms based on the memory_order_consume pattern > > > notwithstanding [*]. ;-) > > > > Just tell them that because the hardware provides control dependencies > > we actually use and rely on them. > > s/control/address/ ? Both are important, but as Peter's reply noted, it was control dependencies under discussion. Data dependencies (which include the ARM/PowerPC notion of address dependencies) are called out by the standard already, but control dependencies are not. I am not all that satisfied by current implementations of data dependencies, admittedly. Should be an interesting discussion. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-10 15:04 ` Paul E. McKenney @ 2014-02-10 16:22 ` Will Deacon 0 siblings, 0 replies; 285+ messages in thread From: Will Deacon @ 2014-02-10 16:22 UTC (permalink / raw) To: Paul E. McKenney Cc: Peter Zijlstra, Torvald Riegel, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Mon, Feb 10, 2014 at 03:04:43PM +0000, Paul E. McKenney wrote: > On Mon, Feb 10, 2014 at 11:49:29AM +0000, Will Deacon wrote: > > On Mon, Feb 10, 2014 at 11:48:13AM +0000, Peter Zijlstra wrote: > > > On Fri, Feb 07, 2014 at 10:02:16AM -0800, Paul E. McKenney wrote: > > > > As near as I can tell, compiler writers hate the idea of prohibiting > > > > speculative-store optimizations because it requires them to introduce > > > > both control and data dependency tracking into their compilers. Many of > > > > them seem to hate dependency tracking with a purple passion. At least, > > > > such a hatred would go a long way towards explaining the incomplete > > > > and high-overhead implementations of memory_order_consume, the long > > > > and successful use of idioms based on the memory_order_consume pattern > > > > notwithstanding [*]. ;-) > > > > > > Just tell them that because the hardware provides control dependencies > > > we actually use and rely on them. > > > > s/control/address/ ? > > Both are important, but as Peter's reply noted, it was control > dependencies under discussion. Data dependencies (which include the > ARM/PowerPC notion of address dependencies) are called out by the standard > already, but control dependencies are not. I am not all that satisified > by current implementations of data dependencies, admittedly. Should > be an interesting discussion. ;-) Ok, but since you can't use control dependencies to order LOAD -> LOAD, it's a pretty big ask of the compiler to make use of them for things like consume, where a data dependency will suffice for any combination of accesses. 
Will ^ permalink raw reply [flat|nested] 285+ messages in thread
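Will's distinction can be illustrated with the classic publish/subscribe pattern, sketched here in C11 (names hypothetical). A control dependency orders a dependent *store*, but a dependent *load* needs an address dependency, which is what consume semantics (and the kernel's rcu_dereference()) are meant to preserve:

```c
#include <stdatomic.h>
#include <stddef.h>

struct node { int data; };

static struct node n = { 42 };
static _Atomic(struct node *) published;   /* NULL until publication */

void publish(void)
{
    atomic_store_explicit(&published, &n, memory_order_release);
}

int subscribe(void)
{
    /* The load of p->data carries an *address* dependency on the load
     * of 'published': the CPU cannot issue it before it knows p.  A
     * mere control dependency (branching on p) would not order a
     * dependent LOAD on a weakly ordered machine, which is the
     * LOAD -> LOAD limitation Will points out above. */
    struct node *p = atomic_load_explicit(&published, memory_order_consume);

    return p ? p->data : -1;
}
```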
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-07 16:50 ` Paul E. McKenney 2014-02-07 16:55 ` Will Deacon @ 2014-02-07 18:44 ` Torvald Riegel 1 sibling, 0 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-07 18:44 UTC (permalink / raw) To: paulmck Cc: Peter Zijlstra, Will Deacon, Ramana Radhakrishnan, David Howells, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Fri, 2014-02-07 at 08:50 -0800, Paul E. McKenney wrote: > On Fri, Feb 07, 2014 at 08:44:05AM +0100, Peter Zijlstra wrote: > > On Thu, Feb 06, 2014 at 08:20:51PM -0800, Paul E. McKenney wrote: > > > Hopefully some discussion of out-of-thin-air values as well. > > > > Yes, absolutely shoot store speculation in the head already. Then drive > > a wooden stake through its heart. > > > > C11/C++11 should not be allowed to claim itself a memory model until that > > is sorted. > There actually is a proposal being put forward, but it might not make ARM > and Power people happy because it involves adding a compare, a branch, > and an ISB/isync after every relaxed load... Me, I agree with you, > much preferring the no-store-speculation approach. My vague recollection is that everyone agrees that out-of-thin-air values shouldn't be allowed, but that it's surprisingly complex to actually specify this properly. However, the example that Peter posted further down in the thread seems to be unrelated to out-of-thin-air. ^ permalink raw reply [flat|nested] 285+ messages in thread
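For readers unfamiliar with the term, the canonical out-of-thin-air litmus test looks like this: with both atomics initially zero and the two functions run on separate threads, an implementation that speculated the stores could "justify" the outcome r1 == r2 == 42, a value appearing from nowhere. Real hardware does not produce this, but the C11/C++11 relaxed-atomics rules as written have a hard time formally forbidding it:

```c
#include <stdatomic.h>

atomic_int x, y;   /* both start at 0 */
int r1, r2;

void thread1(void)
{
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    atomic_store_explicit(&y, r1, memory_order_relaxed);
}

void thread2(void)
{
    r2 = atomic_load_explicit(&y, memory_order_relaxed);
    atomic_store_explicit(&x, r2, memory_order_relaxed);
}

/* Outcome everyone wants forbidden: r1 == r2 == 42 ("out of thin air"),
 * which would require each store to be speculated ahead of the load
 * whose value it forwards. */
```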
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-07 4:20 ` Paul E. McKenney 2014-02-07 7:44 ` Peter Zijlstra @ 2014-02-10 0:06 ` Torvald Riegel 2014-02-10 3:51 ` Paul E. McKenney 1 sibling, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-10 0:06 UTC (permalink / raw) To: paulmck Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote: > On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote: > > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote: > > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote: > > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote: > > > > > On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote: > > > > > > There are also so many ways to blow your head off it's untrue. For example, > > > > > > cmpxchg takes a separate memory model parameter for failure and success, but > > > > > > then there are restrictions on the sets you can use for each. It's not hard > > > > > > to find well-known memory-ordering experts shouting "Just use > > > > > > memory_model_seq_cst for everything, it's too hard otherwise". Then there's > > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume > > > > > > atm and optimises all of the data dependencies away) as well as the definition > > > > > > of "data races", which seem to be used as an excuse to miscompile a program > > > > > > at the earliest opportunity. > > > > > > > > > > Trust me, rcu_dereference() is not going to be defined in terms of > > > > > memory_order_consume until the compilers implement it both correctly and > > > > > efficiently. They are not there yet, and there is currently no shortage > > > > > of compiler writers who would prefer to ignore memory_order_consume. > > > > > > > > Do you have any input on > > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448? 
In particular, the > > > > language standard's definition of dependencies? > > > > > > Let's see... 1.10p9 says that a dependency must be carried unless: > > > > > > — B is an invocation of any specialization of std::kill_dependency (29.3), or > > > — A is the left operand of a built-in logical AND (&&, see 5.14) or logical OR (||, see 5.15) operator, > > > or > > > — A is the left operand of a conditional (?:, see 5.16) operator, or > > > — A is the left operand of the built-in comma (,) operator (5.18); > > > > > > So the use of "flag" before the "?" is ignored. But the "flag - flag" > > > after the "?" will carry a dependency, so the code fragment in 59448 > > > needs to do the ordering rather than just optimizing "flag - flag" out > > > of existence. One way to do that on both ARM and Power is to actually > > > emit code for "flag - flag", but there are a number of other ways to > > > make that work. > > > > And that's what would concern me, considering that these requirements > > seem to be able to creep out easily. Also, whereas the other atomics > > just constrain compilers wrt. reordering across atomic accesses or > > changes to the atomic accesses themselves, the dependencies are new > > requirements on pieces of otherwise non-synchronizing code. The latter > > seems far more involved to me. > > Well, the wording of 1.10p9 is pretty explicit on this point. > There are only a few exceptions to the rule that dependencies from > memory_order_consume loads must be tracked. And to your point about > requirements being placed on pieces of otherwise non-synchronizing code, > we already have that with plain old load acquire and store release -- > both of these put ordering constraints that affect the surrounding > non-synchronizing code. I think there's a significant difference. With acquire/release or more general memory orders, it's true that we can't order _across_ the atomic access. 
However, we can reorder and optimize without additional constraints if we do not reorder. This is not the case with consume memory order, as the (p + flag - flag) example shows. > This issue got a lot of discussion, and the compromise is that > dependencies cannot leak into or out of functions unless the relevant > parameters or return values are annotated with [[carries_dependency]]. > This means that the compiler can see all the places where dependencies > must be tracked. This is described in 7.6.4. I wasn't aware of 7.6.4 (but it isn't referred to as an additional constraint--what it is--in 1.10, so I guess at least that should be fixed). Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic difference, at least if one is assuming that normal optimization of sequential code is the default, and that maintaining things such as (flag-flag) is not; if optimizing away (flag-flag) would require the insertion of fences unless there is the carries_dependency attribute, then this would be bad I think. IMHO, the dependencies construct (carries_dependency, kill_dependency) seem to be backwards to me. They assume that the compiler preserves all those dependencies including (flag-flag) by default, which prohibits meaningful optimizations. Instead, I guess there should be a construct for explicitly exploiting the dependencies (e.g., a preserve_dependencies call, whose argument will not be optimized fully). Exploiting dependencies will be special code and isn't the common case, so it can be require additional annotations. > If a dependency chain > headed by a memory_order_consume load goes into or out of a function > without the aid of the [[carries_dependency]] attribute, the compiler > needs to do something else to enforce ordering, e.g., emit a memory > barrier. I agree that this is a way to see it. But I can't see how this will motivate compiler implementers to not just emit a stronger barrier right away. 
> From a Linux-kernel viewpoint, this is a bit ugly, as it requires > annotations and use of kill_dependency, but it was the best I could do > at the time. If things go as they usually do, there will be some other > reason why those are needed... Did you consider something along the "preserve_dependencies" call? If so, why did you go for kill_dependency? ^ permalink raw reply [flat|nested] 285+ messages in thread
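The contested "(p + flag - flag)" idiom from GCC bug 59448, sketched in C11 for concreteness (variable names assumed): under 1.10p9 the value loaded with memory_order_consume carries its dependency through `f - f`, so the subscript below is formally dependency-ordered after the load even though it is arithmetically always zero.

```c
#include <stdatomic.h>

static int data[2] = { 7, 9 };
static atomic_int flag;          /* set by a hypothetical writer thread */

int reader(void)
{
    int f = atomic_load_explicit(&flag, memory_order_consume);

    /* 'f - f' is zero, but per 1.10p9 it still carries the dependency
     * from the consume load, so this indexed load is dependency-ordered
     * after it.  Folding the expression down to plain data[0] would
     * oblige the compiler to enforce the ordering some other way
     * (e.g. a fence) -- exactly the tension debated above. */
    return data[0 + f - f];
}
```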
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-10 0:06 ` Torvald Riegel @ 2014-02-10 3:51 ` Paul E. McKenney 2014-02-12 5:13 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-10 3:51 UTC (permalink / raw) To: Torvald Riegel Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Mon, Feb 10, 2014 at 01:06:48AM +0100, Torvald Riegel wrote: > On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote: > > On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote: > > > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote: > > > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote: > > > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote: > > > > > > On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote: > > > > > > > There are also so many ways to blow your head off it's untrue. For example, > > > > > > > cmpxchg takes a separate memory model parameter for failure and success, but > > > > > > > then there are restrictions on the sets you can use for each. It's not hard > > > > > > > to find well-known memory-ordering experts shouting "Just use > > > > > > > memory_model_seq_cst for everything, it's too hard otherwise". Then there's > > > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume > > > > > > > atm and optimises all of the data dependencies away) as well as the definition > > > > > > > of "data races", which seem to be used as an excuse to miscompile a program > > > > > > > at the earliest opportunity. > > > > > > > > > > > > Trust me, rcu_dereference() is not going to be defined in terms of > > > > > > memory_order_consume until the compilers implement it both correctly and > > > > > > efficiently. They are not there yet, and there is currently no shortage > > > > > > of compiler writers who would prefer to ignore memory_order_consume. 
> > > > > > > > > > Do you have any input on > > > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448? In particular, the > > > > > language standard's definition of dependencies? > > > > > > > > Let's see... 1.10p9 says that a dependency must be carried unless: > > > > > > > > — B is an invocation of any specialization of std::kill_dependency (29.3), or > > > > — A is the left operand of a built-in logical AND (&&, see 5.14) or logical OR (||, see 5.15) operator, > > > > or > > > > — A is the left operand of a conditional (?:, see 5.16) operator, or > > > > — A is the left operand of the built-in comma (,) operator (5.18); > > > > > > > > So the use of "flag" before the "?" is ignored. But the "flag - flag" > > > > after the "?" will carry a dependency, so the code fragment in 59448 > > > > needs to do the ordering rather than just optimizing "flag - flag" out > > > > of existence. One way to do that on both ARM and Power is to actually > > > > emit code for "flag - flag", but there are a number of other ways to > > > > make that work. > > > > > > And that's what would concern me, considering that these requirements > > > seem to be able to creep out easily. Also, whereas the other atomics > > > just constrain compilers wrt. reordering across atomic accesses or > > > changes to the atomic accesses themselves, the dependencies are new > > > requirements on pieces of otherwise non-synchronizing code. The latter > > > seems far more involved to me. > > > > Well, the wording of 1.10p9 is pretty explicit on this point. > > There are only a few exceptions to the rule that dependencies from > > memory_order_consume loads must be tracked. And to your point about > > requirements being placed on pieces of otherwise non-synchronizing code, > > we already have that with plain old load acquire and store release -- > > both of these put ordering constraints that affect the surrounding > > non-synchronizing code. > > I think there's a significant difference. 
With acquire/release or more > general memory orders, it's true that we can't order _across_ the atomic > access. However, we can reorder and optimize without additional > constraints if we do not reorder. This is not the case with consume > memory order, as the (p + flag - flag) example shows. Agreed, memory_order_consume does introduce additional restrictions. > > This issue got a lot of discussion, and the compromise is that > > dependencies cannot leak into or out of functions unless the relevant > > parameters or return values are annotated with [[carries_dependency]]. > > This means that the compiler can see all the places where dependencies > > must be tracked. This is described in 7.6.4. > > I wasn't aware of 7.6.4 (but it isn't referred to as an additional > constraint--what it is--in 1.10, so I guess at least that should be > fixed). > Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic > difference, at least if one is assuming that normal optimization of > sequential code is the default, and that maintaining things such as > (flag-flag) is not; if optimizing away (flag-flag) would require the > insertion of fences unless there is the carries_dependency attribute, > then this would be bad I think. No, the attribute does not make a semantic difference. If a dependency flows into a function without [[carries_dependency]], the implementation is within its right to emit an acquire barrier or similar. > IMHO, the dependencies construct (carries_dependency, kill_dependency) > seem to be backwards to me. They assume that the compiler preserves all > those dependencies including (flag-flag) by default, which prohibits > meaningful optimizations. Instead, I guess there should be a construct > for explicitly exploiting the dependencies (e.g., a > preserve_dependencies call, whose argument will not be optimized fully). > Exploiting dependencies will be special code and isn't the common case, > so it can be require additional annotations. 
If you are compiling a function that has no [[carries_dependency]] attributes on its arguments and return value, and none on any of the functions that it calls, and contains no memory_order_consume loads, then you can break dependencies all you like within that function. That said, I am of course open to discussing alternative approaches. Especially those that ease the migration of the existing code in the Linux kernel that relies on dependencies. ;-) > > If a dependency chain > > headed by a memory_order_consume load goes into or out of a function > > without the aid of the [[carries_dependency]] attribute, the compiler > > needs to do something else to enforce ordering, e.g., emit a memory > > barrier. > > I agree that this is a way to see it. But I can't see how this will > motivate compiler implementers to not just emit a stronger barrier right > away. That certainly has been the most common approach. > > > From a Linux-kernel viewpoint, this is a bit ugly, as it requires > > annotations and use of kill_dependency, but it was the best I could do > > at the time. If things go as they usually do, there will be some other > > reason why those are needed... > > Did you consider something along the "preserve_dependencies" call? If > so, why did you go for kill_dependency? Could you please give more detail on what a "preserve_dependencies" call would do and where it would be used? Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
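For concreteness, C11's <stdatomic.h> does provide the kill_dependency() macro under discussion (the [[carries_dependency]] attribute is C++-only). A small sketch of how it marks the point past which the implementation no longer needs to track the dependency chain:

```c
#include <stdatomic.h>

static int table[4] = { 10, 20, 30, 40 };
static atomic_int idx;           /* published index; 0 in this sketch */

int lookup(void)
{
    int i = atomic_load_explicit(&idx, memory_order_consume);

    /* 'i' carries a dependency from the consume load, so this access
     * is dependency-ordered after it. */
    int first = table[i];

    /* kill_dependency() ends the chain: uses of 'j' below impose no
     * further dependency-tracking obligations on the implementation. */
    int j = kill_dependency(i);

    return first + table[j];
}
```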
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-10 3:51 ` Paul E. McKenney @ 2014-02-12 5:13 ` Torvald Riegel 2014-02-12 18:26 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-12 5:13 UTC (permalink / raw) To: paulmck Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Sun, 2014-02-09 at 19:51 -0800, Paul E. McKenney wrote: > On Mon, Feb 10, 2014 at 01:06:48AM +0100, Torvald Riegel wrote: > > On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote: > > > On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote: > > > > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote: > > > > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote: > > > > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote: > > > > > > > On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote: > > > > > > > > There are also so many ways to blow your head off it's untrue. For example, > > > > > > > > cmpxchg takes a separate memory model parameter for failure and success, but > > > > > > > > then there are restrictions on the sets you can use for each. It's not hard > > > > > > > > to find well-known memory-ordering experts shouting "Just use > > > > > > > > memory_model_seq_cst for everything, it's too hard otherwise". Then there's > > > > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume > > > > > > > > atm and optimises all of the data dependencies away) as well as the definition > > > > > > > > of "data races", which seem to be used as an excuse to miscompile a program > > > > > > > > at the earliest opportunity. > > > > > > > > > > > > > > Trust me, rcu_dereference() is not going to be defined in terms of > > > > > > > memory_order_consume until the compilers implement it both correctly and > > > > > > > efficiently. 
They are not there yet, and there is currently no shortage > > > > > > > of compiler writers who would prefer to ignore memory_order_consume. > > > > > > > > > > > > Do you have any input on > > > > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448? In particular, the > > > > > > language standard's definition of dependencies? > > > > > > > > > > Let's see... 1.10p9 says that a dependency must be carried unless: > > > > > > > > > > — B is an invocation of any specialization of std::kill_dependency (29.3), or > > > > > — A is the left operand of a built-in logical AND (&&, see 5.14) or logical OR (||, see 5.15) operator, > > > > > or > > > > > — A is the left operand of a conditional (?:, see 5.16) operator, or > > > > > — A is the left operand of the built-in comma (,) operator (5.18); > > > > > > > > > > So the use of "flag" before the "?" is ignored. But the "flag - flag" > > > > > after the "?" will carry a dependency, so the code fragment in 59448 > > > > > needs to do the ordering rather than just optimizing "flag - flag" out > > > > > of existence. One way to do that on both ARM and Power is to actually > > > > > emit code for "flag - flag", but there are a number of other ways to > > > > > make that work. > > > > > > > > And that's what would concern me, considering that these requirements > > > > seem to be able to creep out easily. Also, whereas the other atomics > > > > just constrain compilers wrt. reordering across atomic accesses or > > > > changes to the atomic accesses themselves, the dependencies are new > > > > requirements on pieces of otherwise non-synchronizing code. The latter > > > > seems far more involved to me. > > > > > > Well, the wording of 1.10p9 is pretty explicit on this point. > > > There are only a few exceptions to the rule that dependencies from > > > memory_order_consume loads must be tracked. 
And to your point about > > > requirements being placed on pieces of otherwise non-synchronizing code, > > > we already have that with plain old load acquire and store release -- > > > both of these put ordering constraints that affect the surrounding > > > non-synchronizing code. > > > > I think there's a significant difference. With acquire/release or more > > general memory orders, it's true that we can't order _across_ the atomic > > access. However, we can reorder and optimize without additional > > constraints if we do not reorder. This is not the case with consume > > memory order, as the (p + flag - flag) example shows. > > Agreed, memory_order_consume does introduce additional restrictions. > > > > This issue got a lot of discussion, and the compromise is that > > > dependencies cannot leak into or out of functions unless the relevant > > > parameters or return values are annotated with [[carries_dependency]]. > > > This means that the compiler can see all the places where dependencies > > > must be tracked. This is described in 7.6.4. > > > > I wasn't aware of 7.6.4 (but it isn't referred to as an additional > > constraint--what it is--in 1.10, so I guess at least that should be > > fixed). > > Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic > > difference, at least if one is assuming that normal optimization of > > sequential code is the default, and that maintaining things such as > > (flag-flag) is not; if optimizing away (flag-flag) would require the > > insertion of fences unless there is the carries_dependency attribute, > > then this would be bad I think. > > No, the attribute does not make a semantic difference. If a dependency > flows into a function without [[carries_dependency]], the implementation > is within its right to emit an acquire barrier or similar. 
So you can't just ignore the attribute when generating code -- you would at the very least have to do this consistently across all the compilers compiling parts of your code. Which tells me that it does make a semantic difference. But I know that what should or should not be an attribute is controversial. > > IMHO, the dependencies construct (carries_dependency, kill_dependency) > > seem to be backwards to me. They assume that the compiler preserves all > > those dependencies including (flag-flag) by default, which prohibits > > meaningful optimizations. Instead, I guess there should be a construct > > for explicitly exploiting the dependencies (e.g., a > > preserve_dependencies call, whose argument will not be optimized fully). > > Exploiting dependencies will be special code and isn't the common case, > > so it can require additional annotations. > > If you are compiling a function that has no [[carries_dependency]] > attributes on its arguments and return value, and none on any of the > functions that it calls, and contains no memory_order_consume loads, > then you can break dependencies all you like within that function. > > That said, I am of course open to discussing alternative approaches. > Especially those that ease the migration of the existing code in the > Linux kernel that relies on dependencies. ;-) > > > If a dependency chain > > > headed by a memory_order_consume load goes into or out of a function > > > without the aid of the [[carries_dependency]] attribute, the compiler > > > needs to do something else to enforce ordering, e.g., emit a memory > > > barrier. > > > > I agree that this is a way to see it. But I can't see how this will > > motivate compiler implementers to not just emit a stronger barrier right > > away. > > That certainly has been the most common approach. > > > > From a Linux-kernel viewpoint, this is a bit ugly, as it requires > > > annotations and use of kill_dependency, but it was the best I could do > > > at the time.
If things go as they usually do, there will be some other > > > reason why those are needed... > > > > Did you consider something along the lines of the "preserve_dependencies" call? If > > so, why did you go for kill_dependency? > > Could you please give more detail on what a "preserve_dependencies" call > would do and where it would be used? Demarcating regions of code (or particular expressions) that require the preservation of source-code-level dependencies (e.g., "p + flag - flag"), and which thus constrains optimizations allowed on normal code. What we have right now is a "blacklist" of things that kill dependencies, which still requires compilers to not touch things like "flag - flag". Doing so isn't useful in most code, so it would be better to have a "whitelist" of things in which dependencies are strictly preserved. The list of relevant expressions found in the current RCU uses might be one such list. Another would be to require programmers to explicitly annotate the expressions where those dependencies should be specially preserved, with something like a "preserve_dependencies(p + flag - flag)" call. I agree that this might be harder when dependency tracking is scattered throughout a larger region of code, as you pointed out today. But I'm just looking for a setting that is more likely to see good support by compilers. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-12 5:13 ` Torvald Riegel @ 2014-02-12 18:26 ` Paul E. McKenney 0 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-12 18:26 UTC (permalink / raw) To: Torvald Riegel Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Tue, Feb 11, 2014 at 09:13:34PM -0800, Torvald Riegel wrote: > On Sun, 2014-02-09 at 19:51 -0800, Paul E. McKenney wrote: > > On Mon, Feb 10, 2014 at 01:06:48AM +0100, Torvald Riegel wrote: > > > On Thu, 2014-02-06 at 20:20 -0800, Paul E. McKenney wrote: > > > > On Fri, Feb 07, 2014 at 12:44:48AM +0100, Torvald Riegel wrote: > > > > > On Thu, 2014-02-06 at 14:11 -0800, Paul E. McKenney wrote: > > > > > > On Thu, Feb 06, 2014 at 10:17:03PM +0100, Torvald Riegel wrote: > > > > > > > On Thu, 2014-02-06 at 11:27 -0800, Paul E. McKenney wrote: > > > > > > > > On Thu, Feb 06, 2014 at 06:59:10PM +0000, Will Deacon wrote: > > > > > > > > > There are also so many ways to blow your head off it's untrue. For example, > > > > > > > > > cmpxchg takes a separate memory model parameter for failure and success, but > > > > > > > > > then there are restrictions on the sets you can use for each. It's not hard > > > > > > > > > to find well-known memory-ordering experts shouting "Just use > > > > > > > > > memory_model_seq_cst for everything, it's too hard otherwise". Then there's > > > > > > > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume > > > > > > > > > atm and optimises all of the data dependencies away) as well as the definition > > > > > > > > > of "data races", which seem to be used as an excuse to miscompile a program > > > > > > > > > at the earliest opportunity. > > > > > > > > > > > > > > > > Trust me, rcu_dereference() is not going to be defined in terms of > > > > > > > > memory_order_consume until the compilers implement it both correctly and > > > > > > > > efficiently. 
They are not there yet, and there is currently no shortage > > > > > > > > of compiler writers who would prefer to ignore memory_order_consume. > > > > > > > > > > > > > > Do you have any input on > > > > > > > http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448? In particular, the > > > > > > > language standard's definition of dependencies? > > > > > > > > > > > > Let's see... 1.10p9 says that a dependency must be carried unless: > > > > > > > > > > > > — B is an invocation of any specialization of std::kill_dependency (29.3), or > > > > > > — A is the left operand of a built-in logical AND (&&, see 5.14) or logical OR (||, see 5.15) operator, > > > > > > or > > > > > > — A is the left operand of a conditional (?:, see 5.16) operator, or > > > > > > — A is the left operand of the built-in comma (,) operator (5.18); > > > > > > > > > > > > So the use of "flag" before the "?" is ignored. But the "flag - flag" > > > > > > after the "?" will carry a dependency, so the code fragment in 59448 > > > > > > needs to do the ordering rather than just optimizing "flag - flag" out > > > > > > of existence. One way to do that on both ARM and Power is to actually > > > > > > emit code for "flag - flag", but there are a number of other ways to > > > > > > make that work. > > > > > > > > > > And that's what would concern me, considering that these requirements > > > > > seem to be able to creep out easily. Also, whereas the other atomics > > > > > just constrain compilers wrt. reordering across atomic accesses or > > > > > changes to the atomic accesses themselves, the dependencies are new > > > > > requirements on pieces of otherwise non-synchronizing code. The latter > > > > > seems far more involved to me. > > > > > > > > Well, the wording of 1.10p9 is pretty explicit on this point. > > > > There are only a few exceptions to the rule that dependencies from > > > > memory_order_consume loads must be tracked. 
And to your point about > > > > requirements being placed on pieces of otherwise non-synchronizing code, > > > > we already have that with plain old load acquire and store release -- > > > > both of these put ordering constraints that affect the surrounding > > > > non-synchronizing code. > > > > > > I think there's a significant difference. With acquire/release or more > > > general memory orders, it's true that we can't order _across_ the atomic > > > access. However, we can reorder and optimize without additional > > > constraints if we do not reorder. This is not the case with consume > > > memory order, as the (p + flag - flag) example shows. > > > > Agreed, memory_order_consume does introduce additional restrictions. > > > > > > This issue got a lot of discussion, and the compromise is that > > > > dependencies cannot leak into or out of functions unless the relevant > > > > parameters or return values are annotated with [[carries_dependency]]. > > > > This means that the compiler can see all the places where dependencies > > > > must be tracked. This is described in 7.6.4. > > > > > > I wasn't aware of 7.6.4 (but it isn't referred to as an additional > > > constraint--what it is--in 1.10, so I guess at least that should be > > > fixed). > > > Also, AFAIU, 7.6.4p3 is wrong in that the attribute does make a semantic > > > difference, at least if one is assuming that normal optimization of > > > sequential code is the default, and that maintaining things such as > > > (flag-flag) is not; if optimizing away (flag-flag) would require the > > > insertion of fences unless there is the carries_dependency attribute, > > > then this would be bad I think. > > > > No, the attribute does not make a semantic difference. If a dependency > > flows into a function without [[carries_dependency]], the implementation > > is within its right to emit an acquire barrier or similar. 
> > So you can't just ignore the attribute when generating code -- you would > at the very least have to do this consistently across all the compilers > compiling parts of your code. Which tells me that it does make a > semantic difference. But I know that what should or should not be an > attribute is controversial. Actually, I believe that you -can- ignore the attribute when generating code. In that case, you do the same thing as you would if there was no attribute, namely emit whatever barrier was required to render irrelevant any dependency breaking that might occur in the called function. > > > IMHO, the dependencies construct (carries_dependency, kill_dependency) > > > seem to be backwards to me. They assume that the compiler preserves all > > > those dependencies including (flag-flag) by default, which prohibits > > > meaningful optimizations. Instead, I guess there should be a construct > > > for explicitly exploiting the dependencies (e.g., a > > > preserve_dependencies call, whose argument will not be optimized fully). > > > Exploiting dependencies will be special code and isn't the common case, > > > so it can require additional annotations. > > > > If you are compiling a function that has no [[carries_dependency]] > > attributes on its arguments and return value, and none on any of the > > functions that it calls, and contains no memory_order_consume loads, > > then you can break dependencies all you like within that function. > > > > That said, I am of course open to discussing alternative approaches. > > Especially those that ease the migration of the existing code in the > > Linux kernel that relies on dependencies. ;-) > > > > > > If a dependency chain > > > > headed by a memory_order_consume load goes into or out of a function > > > > without the aid of the [[carries_dependency]] attribute, the compiler > > > > needs to do something else to enforce ordering, e.g., emit a memory > > > > barrier.
> > > > > > I agree that this is a way to see it. But I can't see how this will > > > motivate compiler implementers to not just emit a stronger barrier right > > > away. > > > > That certainly has been the most common approach. > > > > > > From a Linux-kernel viewpoint, this is a bit ugly, as it requires > > > > annotations and use of kill_dependency, but it was the best I could do > > > > at the time. If things go as they usually do, there will be some other > > > > reason why those are needed... > > > > > > Did you consider something along the "preserve_dependencies" call? If > > > so, why did you go for kill_dependency? > > > > Could you please give more detail on what a "preserve_dependencies" call > > would do and where it would be used? > > Demarcating regions of code (or particular expressions) that require the > preservation of source-code-level dependencies (eg, "p + flag - flag"), > and which thus constraints optimizations allowed on normal code. > > What we have right now is a "blacklist" of things that kill > dependencies, which still requires compilers to not touch things like > "flag - flag". Doing so isn't useful in most code, so it would be > better to have a "whitelist" of things in which dependencies are > strictly preserved. The list of relevant expressions found in the > current RCU uses might be one such list. Another would be to require > programmers to explicitly annotate the expressions where those > dependencies should be specially preserved, with something like a > "preserve_dependencies(p + flag - flag)" call. > > I agree that this might be harder when dependency tracking is scattered > throughout a larger region of code, as you pointed out today. But I'm > just looking for a setting that is more likely to see good support by > compilers. The 3.13 version of the Linux kernel contains almost 1300 instances of the rcu_dereference() family of macros, including wrappers. 
I have thus far gone through about 300 of them by hand and another 200 via scripts. Here are the preliminary results. Common operations on pointers returned from rcu_dereference():

	->       To dereference, can be chained, assuming entire linked structure published with single rcu_assign_pointer(). The & prefix operator is applied to dereferenced pointer, as are many other things, including rcu_dereference() itself.
	[]       To index array referenced by RCU-protected index, and to specify array indexed by something else.
	& infix  To strip low-order bits, but clearly mask must be non-zero. Should include infix "|" as well, but only if mask has some zero bits.
	+ infix  To index RCU-protected array. (Should include infix "-" too.)
	=        To assign to a temporary variable.
	()       To invoke an RCU-protected pointer to a function.
	cast     To emulate subtypes.
	?:       To substitute defaults, however, dependencies need to be carried only through middle and right-hand arguments.

For completeness, unary "*" and "&" should also be included, as these are sometimes used in macros that apply casts. One could argue that symmetry would require that "|" be treated the same as infix "&", but excluding the case where the other operand is all one bits, but I don't feel strongly about "|". Please note that I am not aware of any reasonable compiler optimization that would break dependency chains in these cases, at least not in the case where the original memory_order_consume load had volatile semantics, as rcu_dereference() and friends in fact do have. Many dependency chains are short and contained, but there are quite a few large ones, particularly in the networking code. Here is one moderately ornate example dependency chain: o The arp_process() function calls __in_dev_get_rcu(), which returns an RCU-protected pointer.
Then arp_process() invokes the following macros and functions:

	IN_DEV_ROUTE_LOCALNET() -> ipv4_devconf_get()
	arp_ignore() -- which calls:
		IN_DEV_ARP_IGNORE() -> ipv4_devconf_get()
		inet_confirm_addr() -- which calls:
			dev_net() -- which calls:
				read_pnet()
	IN_DEV_ARPFILTER() -> ipv4_devconf_get()
	IN_DEV_CONF_GET() -> ipv4_devconf_get()
	arp_fwd_proxy() -- which calls:
		IN_DEV_PROXY_ARP() -> ipv4_devconf_get()
		IN_DEV_MEDIUM_ID() -> ipv4_devconf_get()
	arp_fwd_pvlan() -- which calls IN_DEV_PROXY_ARP_PVLAN(), which eventually maps to ipv4_devconf_get()
	pneigh_enqueue()

This sort of example is one reason why I would not look fondly on any suggestion that required decorating the operators called out above. That said, I would be OK with dropping the non-volatile memory_order_consume load from C11 and C++11 -- I don't see how memory_order_consume is useful unless it includes volatile semantics. Thanx, Paul
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-06 18:59 ` Will Deacon 2014-02-06 19:27 ` Paul E. McKenney @ 2014-02-06 21:09 ` Torvald Riegel 2014-02-06 21:55 ` Paul E. McKenney ` (2 more replies) 1 sibling, 3 replies; 285+ messages in thread From: Torvald Riegel @ 2014-02-06 21:09 UTC (permalink / raw) To: Will Deacon Cc: Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, paulmck, gcc On Thu, 2014-02-06 at 18:59 +0000, Will Deacon wrote: > On Thu, Feb 06, 2014 at 06:55:01PM +0000, Ramana Radhakrishnan wrote: > > On 02/06/14 18:25, David Howells wrote: > > > > > > Is it worth considering a move towards using C11 atomics and barriers and > > > compiler intrinsics inside the kernel? The compiler _ought_ to be able to do > > > these. > > > > > > It sounds interesting to me, if we can make it work properly and > > reliably. + gcc@gcc.gnu.org for others in the GCC community to chip in. > > Given my (albeit limited) experience playing with the C11 spec and GCC, I > really think this is a bad idea for the kernel. I'm not going to comment on what's best for the kernel (simply because I don't work on it), but I disagree with several of your statements. > It seems that nobody really > agrees on exactly how the C11 atomics map to real architectural > instructions on anything but the trivial architectures. There's certainly different ways to implement the memory model and those have to be specified elsewhere, but I don't see how this differs much from other things specified in the ABI(s) for each architecture. > For example, should > the following code fire the assert? I don't see how your example (which is about what the language requires or not) relates to the statement about the mapping above? 
> > extern atomic<int> foo, bar, baz; > > void thread1(void) > { > foo.store(42, memory_order_relaxed); > bar.fetch_add(1, memory_order_seq_cst); > baz.store(42, memory_order_relaxed); > } > > void thread2(void) > { > while (baz.load(memory_order_seq_cst) != 42) { > /* do nothing */ > } > > assert(foo.load(memory_order_seq_cst) == 42); > } > It's a good example. My first gut feeling was that the assertion should never fire, but that was wrong because (as I seem to usually forget) the seq-cst total order is just a constraint but doesn't itself contribute to synchronizes-with -- but this is different for seq-cst fences. > To answer that question, you need to go and look at the definitions of > synchronises-with, happens-before, dependency_ordered_before and a whole > pile of vaguely written waffle to realise that you don't know. Are you familiar with the formalization of the C11/C++11 model by Batty et al.? http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf http://www.cl.cam.ac.uk/~mjb220/n3132.pdf They also have a nice tool that can run condensed examples and show you all allowed (and forbidden) executions (it runs in the browser, so is slow for larger examples), including nice annotated graphs for those: http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/ It requires somewhat special syntax, but the following, which should be equivalent to your example above, runs just fine: int main() { atomic_int foo = 0; atomic_int bar = 0; atomic_int baz = 0; {{{ { foo.store(42, memory_order_relaxed); bar.store(1, memory_order_seq_cst); baz.store(42, memory_order_relaxed); } ||| { r1=baz.load(memory_order_seq_cst).readsvalue(42); r2=foo.load(memory_order_seq_cst).readsvalue(0); } }}}; return 0; } That yields 3 consistent executions for me, and likewise if the last readsvalue() is using 42 as argument. 
If you add a "fence(memory_order_seq_cst);" after the store to foo, the program can't observe != 42 for foo anymore, because the seq-cst fence is adding a synchronizes-with edge via the baz reads-from. I think this is a really neat tool, and very helpful to answer such questions as in your example. > Certainly, > the code that arm64 GCC currently spits out would allow the assertion to fire > on some microarchitectures. > > There are also so many ways to blow your head off it's untrue. For example, > cmpxchg takes a separate memory model parameter for failure and success, but > then there are restrictions on the sets you can use for each. That's in there for the architectures without a single-instruction CAS/cmpxchg, I believe. > It's not hard > to find well-known memory-ordering experts shouting "Just use > memory_model_seq_cst for everything, it's too hard otherwise". Everyone I've heard saying this meant this as advice to people new to synchronization or just dealing infrequently with it. The advice is the simple and safe fallback, and I don't think it's meant as an acknowledgment that the model itself would be too hard. If the language's memory model is supposed to represent weak HW memory models to at least some extent, there's only so much you can do in terms of keeping it simple. If all architectures had x86-like models, the language's model would certainly be simpler... :) > Then there's > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume > atm and optimises all of the data dependencies away) AFAIK consume memory order was added to model Power/ARM-specific behavior. I agree that the way the standard specifies how dependencies are to be preserved is kind of vague (as far as I understand it). See GCC PR 59448. > as well as the definition > of "data races", which seem to be used as an excuse to miscompile a program > at the earliest opportunity. No. 
The purpose of this is to *not disallow* every optimization on non-synchronizing code. Due to the assumption of data-race-free programs, the compiler can assume a sequential code sequence when no atomics are involved (and thus, keep applying optimizations for sequential code). Or is there something particular that you dislike about the specification of data races? > Trying to introduce system concepts (writes to devices, interrupts, > non-coherent agents) into this mess is going to be an uphill battle IMHO. That might very well be true. OTOH, if you would need to model this uniformly across different architectures (i.e., so that there is an intra-kernel-portable abstraction for those system concepts), you might as well try doing this by extending the C11/C++11 model. Maybe that will not be successful or not really a good fit, though, but at least then it's clear why that's the case. > I'd > just rather stick to the semantics we have and the asm volatile barriers. > > That's not to say that there's no room for improvement in what we have > in the kernel. Certainly, I'd welcome allowing more relaxed operations on > architectures that support them, but it needs to be something that at least > the different architecture maintainers can understand how to implement > efficiently behind an uncomplicated interface. I don't think that interface is > C11. IMHO, one thing worth considering is that for C/C++, the C11/C++11 model is the only memory model that has widespread support. So, even though it's a fairly weak memory model (unless you go for the "only seq-cst" beginners' advice) and thus comes with a higher complexity, this model is what likely most people will be familiar with over time. Deviating from the "standard" model can have valid reasons, but it also has a cost in that new contributors are more likely to be familiar with the "standard" model.
Note that I won't claim that the C11/C++11 model is perfect -- there are a few rough edges there (e.g., the forward progress guarantees are (or used to be) a little coarse for my taste), and consume vs. dependencies worries me as well. But, IMHO, overall it's the best C/C++ language model we have. Torvald
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-06 21:09 ` Torvald Riegel @ 2014-02-06 21:55 ` Paul E. McKenney 2014-02-06 22:58 ` Torvald Riegel 2014-02-06 22:13 ` Joseph S. Myers 2014-02-07 12:01 ` Will Deacon 2 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-06 21:55 UTC (permalink / raw) To: Torvald Riegel Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Thu, Feb 06, 2014 at 10:09:25PM +0100, Torvald Riegel wrote: > On Thu, 2014-02-06 at 18:59 +0000, Will Deacon wrote: > > On Thu, Feb 06, 2014 at 06:55:01PM +0000, Ramana Radhakrishnan wrote: > > > On 02/06/14 18:25, David Howells wrote: > > > > > > > > Is it worth considering a move towards using C11 atomics and barriers and > > > > compiler intrinsics inside the kernel? The compiler _ought_ to be able to do > > > > these. > > > > > > > > > It sounds interesting to me, if we can make it work properly and > > > reliably. + gcc@gcc.gnu.org for others in the GCC community to chip in. > > > > Given my (albeit limited) experience playing with the C11 spec and GCC, I > > really think this is a bad idea for the kernel. > > I'm not going to comment on what's best for the kernel (simply because I > don't work on it), but I disagree with several of your statements. > > > It seems that nobody really > > agrees on exactly how the C11 atomics map to real architectural > > instructions on anything but the trivial architectures. > > There's certainly different ways to implement the memory model and those > have to be specified elsewhere, but I don't see how this differs much > from other things specified in the ABI(s) for each architecture. > > > For example, should > > the following code fire the assert? > > I don't see how your example (which is about what the language requires > or not) relates to the statement about the mapping above? 
> > > > > extern atomic<int> foo, bar, baz; > > > > void thread1(void) > > { > > foo.store(42, memory_order_relaxed); > > bar.fetch_add(1, memory_order_seq_cst); > > baz.store(42, memory_order_relaxed); > > } > > > > void thread2(void) > > { > > while (baz.load(memory_order_seq_cst) != 42) { > > /* do nothing */ > > } > > > > assert(foo.load(memory_order_seq_cst) == 42); > > } > > > > It's a good example. My first gut feeling was that the assertion should > > never fire, but that was wrong because (as I seem to usually forget) the > > seq-cst total order is just a constraint but doesn't itself contribute > > to synchronizes-with -- but this is different for seq-cst fences. From what I can see, Will's point is that mapping the Linux kernel's atomic_add_return() primitive into fetch_add() does not work because atomic_add_return()'s ordering properties require that the assert() never fire. Augmenting the fetch_add() with a seq_cst fence would work on many architectures, but not for all similar examples. The reason is that the C11 seq_cst fence is deliberately weak compared to ARM's dmb or Power's sync. To your point, I believe that it would make the above example work, but there are some IRIW-like examples that would fail according to the standard (though a number of specific implementations would in fact work correctly). > > To answer that question, you need to go and look at the definitions of > > synchronises-with, happens-before, dependency_ordered_before and a whole > > pile of vaguely written waffle to realise that you don't know. > > Are you familiar with the formalization of the C11/C++11 model by Batty > et al.?
> http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf > > They also have a nice tool that can run condensed examples and show you > all allowed (and forbidden) executions (it runs in the browser, so is > slow for larger examples), including nice annotated graphs for those: > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/ > > It requires somewhat special syntax, but the following, which should be > equivalent to your example above, runs just fine: > > int main() { > atomic_int foo = 0; > atomic_int bar = 0; > atomic_int baz = 0; > {{{ { > foo.store(42, memory_order_relaxed); > bar.store(1, memory_order_seq_cst); > baz.store(42, memory_order_relaxed); > } > ||| { > r1=baz.load(memory_order_seq_cst).readsvalue(42); > r2=foo.load(memory_order_seq_cst).readsvalue(0); > } > }}}; > return 0; } > > That yields 3 consistent executions for me, and likewise if the last > readsvalue() is using 42 as argument. > > If you add a "fence(memory_order_seq_cst);" after the store to foo, the > program can't observe != 42 for foo anymore, because the seq-cst fence > is adding a synchronizes-with edge via the baz reads-from. > > I think this is a really neat tool, and very helpful to answer such > questions as in your example. Hmmm... The tool doesn't seem to like fetch_add(). But let's assume that your substitution of store() for fetch_add() is correct. Then this shows that we cannot substitute fetch_add() for atomic_add_return(). Adding atomic_thread_fence(memory_order_seq_cst) after the bar.store gives me "192 executions; no consistent", so perhaps there is hope for augmenting the fetch_add() with a fence. 
Except, as noted above, for any number of IRIW-like examples such as the following:

int main() {
  atomic_int x = 0;
  atomic_int y = 0;
  {{{ x.store(1, memory_order_release);
  ||| y.store(1, memory_order_release);
  ||| { r1=x.load(memory_order_relaxed).readsvalue(1);
        atomic_thread_fence(memory_order_seq_cst);
        r2=y.load(memory_order_relaxed).readsvalue(0); }
  ||| { r3=y.load(memory_order_relaxed).readsvalue(1);
        atomic_thread_fence(memory_order_seq_cst);
        r4=x.load(memory_order_relaxed).readsvalue(0); }
  }}};
  return 0; }

Adding a seq_cst store to a new variable z between each pair of reads seems to choke cppmem:

int main() {
  atomic_int x = 0;
  atomic_int y = 0;
  atomic_int z = 0;
  {{{ x.store(1, memory_order_release);
  ||| y.store(1, memory_order_release);
  ||| { r1=x.load(memory_order_relaxed).readsvalue(1);
        z.store(1, memory_order_seq_cst);
        atomic_thread_fence(memory_order_seq_cst);
        r2=y.load(memory_order_relaxed).readsvalue(0); }
  ||| { r3=y.load(memory_order_relaxed).readsvalue(1);
        z.store(1, memory_order_seq_cst);
        atomic_thread_fence(memory_order_seq_cst);
        r4=x.load(memory_order_relaxed).readsvalue(0); }
  }}};
  return 0; }

Ah, it did eventually finish with "576 executions; 6 consistent, all race free". So this is an example where C11 has a hard time modeling the Linux kernel's atomic_add_return(). Therefore, use of C11 atomics to implement Linux kernel atomic operations requires knowledge of the underlying architecture and the compiler's implementation, as was noted earlier in this thread.
The Linux kernel currently requires the rough equivalent of memory_order_seq_cst for both paths, but there is some chance that the failure-path requirement might be weakened. > > It's not hard > > to find well-known memory-ordering experts shouting "Just use > > memory_model_seq_cst for everything, it's too hard otherwise". > > Everyone I've heard saying this meant this as advice to people new to > synchronization or just dealing infrequently with it. The advice is the > simple and safe fallback, and I don't think it's meant as an > acknowledgment that the model itself would be too hard. If the > language's memory model is supposed to represent weak HW memory models > to at least some extent, there's only so much you can do in terms of > keeping it simple. If all architectures had x86-like models, the > language's model would certainly be simpler... :) That is said a lot, but there was a recent Linux-kernel example that turned out to be quite hard to prove for x86. ;-) > > Then there's > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume > > atm and optimises all of the data dependencies away) > > AFAIK consume memory order was added to model Power/ARM-specific > behavior. I agree that the way the standard specifies how dependencies > are to be preserved is kind of vague (as far as I understand it). See > GCC PR 59448. This one? http://gcc.gnu.org/ml/gcc-bugs/2013-12/msg01083.html That does indeed look to match what Will was calling out as a problem. > > as well as the definition > > of "data races", which seem to be used as an excuse to miscompile a program > > at the earliest opportunity. > > No. The purpose of this is to *not disallow* every optimization on > non-synchronizing code. Due to the assumption of data-race-free > programs, the compiler can assume a sequential code sequence when no > atomics are involved (and thus, keep applying optimizations for > sequential code). 
> > Or is there something particular that you dislike about the > specification of data races? Cut Will a break, Torvald! ;-) > > Trying to introduce system concepts (writes to devices, interrupts, > > non-coherent agents) into this mess is going to be an uphill battle IMHO. > > That might very well be true. > > OTOH, if you would need to model this uniformly across different > architectures (ie, so that there is an intra-kernel-portable abstraction > for those system concepts), you might as well try doing this by > extending the C11/C++11 model. Maybe that will not be successful or not > really a good fit, though, but at least then it's clear why that's the > case. I would guess that Linux-kernel use of C11 atomics will be selected or not on an architecture-specific basis for the foreseeable future. > > I'd > > just rather stick to the semantics we have and the asm volatile barriers. > > > > That's not to say I don't think there's no room for improvement in what we have > > in the kernel. Certainly, I'd welcome allowing more relaxed operations on > > architectures that support them, but it needs to be something that at least > > the different architecture maintainers can understand how to implement > > efficiently behind an uncomplicated interface. I don't think that interface is > > C11. > > IMHO, one thing worth considering is that for C/C++, the C11/C++11 is > the only memory model that has widespread support. So, even though it's > a fairly weak memory model (unless you go for the "only seq-cst" > beginners advice) and thus comes with a higher complexity, this model is > what likely most people will be familiar with over time. Deviating from > the "standard" model can have valid reasons, but it also has a cost in > that new contributors are more likely to be familiar with the "standard" > model. 
> > Note that I won't claim that the C11/C++11 model is perfect -- there are > a few rough edges there (e.g., the forward progress guarantees are (or > used to be) a little coarse for my taste), and consume vs. dependencies > worries me as well. But, IMHO, overall it's the best C/C++ language > model we have. I could be wrong, but I strongly suspect that in the near term, any memory-model migration of the 15M+ LoC Linux-kernel code base will be incremental in nature. Especially if the C/C++ committee insists on strengthening memory_order_relaxed. :-/ Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-06 21:55 ` Paul E. McKenney @ 2014-02-06 22:58 ` Torvald Riegel 2014-02-07 4:06 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-06 22:58 UTC (permalink / raw) To: paulmck Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Thu, 2014-02-06 at 13:55 -0800, Paul E. McKenney wrote: > On Thu, Feb 06, 2014 at 10:09:25PM +0100, Torvald Riegel wrote: > > On Thu, 2014-02-06 at 18:59 +0000, Will Deacon wrote: > > > On Thu, Feb 06, 2014 at 06:55:01PM +0000, Ramana Radhakrishnan wrote: > > > > On 02/06/14 18:25, David Howells wrote: > > > > > > > > > > Is it worth considering a move towards using C11 atomics and barriers and > > > > > compiler intrinsics inside the kernel? The compiler _ought_ to be able to do > > > > > these. > > > > > > > > > > > > It sounds interesting to me, if we can make it work properly and > > > > reliably. + gcc@gcc.gnu.org for others in the GCC community to chip in. > > > > > > Given my (albeit limited) experience playing with the C11 spec and GCC, I > > > really think this is a bad idea for the kernel. > > > > I'm not going to comment on what's best for the kernel (simply because I > > don't work on it), but I disagree with several of your statements. > > > > > It seems that nobody really > > > agrees on exactly how the C11 atomics map to real architectural > > > instructions on anything but the trivial architectures. > > > > There's certainly different ways to implement the memory model and those > > have to be specified elsewhere, but I don't see how this differs much > > from other things specified in the ABI(s) for each architecture. > > > > > For example, should > > > the following code fire the assert? > > > > I don't see how your example (which is about what the language requires > > or not) relates to the statement about the mapping above? 
> >
> > > extern atomic<int> foo, bar, baz;
> > >
> > > void thread1(void)
> > > {
> > >     foo.store(42, memory_order_relaxed);
> > >     bar.fetch_add(1, memory_order_seq_cst);
> > >     baz.store(42, memory_order_relaxed);
> > > }
> > >
> > > void thread2(void)
> > > {
> > >     while (baz.load(memory_order_seq_cst) != 42) {
> > >         /* do nothing */
> > >     }
> > >
> > >     assert(foo.load(memory_order_seq_cst) == 42);
> > > }
> >
> > It's a good example. My first gut feeling was that the assertion should
> > never fire, but that was wrong because (as I seem to usually forget) the
> > seq-cst total order is just a constraint but doesn't itself contribute
> > to synchronizes-with -- but this is different for seq-cst fences.
>
> From what I can see, Will's point is that mapping the Linux kernel's
> atomic_add_return() primitive into fetch_add() does not work because
> atomic_add_return()'s ordering properties require that the assert()
> never fire.
>
> Augmenting the fetch_add() with a seq_cst fence would work on many
> architectures, but not for all similar examples. The reason is that
> the C11 seq_cst fence is deliberately weak compared to ARM's dmb or
> Power's sync. To your point, I believe that it would make the above
> example work, but there are some IRIW-like examples that would fail
> according to the standard (though a number of specific implementations
> would in fact work correctly).

Thanks for the background. I don't read LKML, and it wasn't obvious from reading just the part of the thread posted to gcc@ that achieving these semantics is the goal.

> > > To answer that question, you need to go and look at the definitions of
> > > synchronises-with, happens-before, dependency_ordered_before and a whole
> > > pile of vaguely written waffle to realise that you don't know.
> >
> > Are you familiar with the formalization of the C11/C++11 model by Batty
> > et al.?
> > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf > > > > They also have a nice tool that can run condensed examples and show you > > all allowed (and forbidden) executions (it runs in the browser, so is > > slow for larger examples), including nice annotated graphs for those: > > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/ > > > > It requires somewhat special syntax, but the following, which should be > > equivalent to your example above, runs just fine: > > > > int main() { > > atomic_int foo = 0; > > atomic_int bar = 0; > > atomic_int baz = 0; > > {{{ { > > foo.store(42, memory_order_relaxed); > > bar.store(1, memory_order_seq_cst); > > baz.store(42, memory_order_relaxed); > > } > > ||| { > > r1=baz.load(memory_order_seq_cst).readsvalue(42); > > r2=foo.load(memory_order_seq_cst).readsvalue(0); > > } > > }}}; > > return 0; } > > > > That yields 3 consistent executions for me, and likewise if the last > > readsvalue() is using 42 as argument. > > > > If you add a "fence(memory_order_seq_cst);" after the store to foo, the > > program can't observe != 42 for foo anymore, because the seq-cst fence > > is adding a synchronizes-with edge via the baz reads-from. > > > > I think this is a really neat tool, and very helpful to answer such > > questions as in your example. > > Hmmm... The tool doesn't seem to like fetch_add(). But let's assume that > your substitution of store() for fetch_add() is correct. Then this shows > that we cannot substitute fetch_add() for atomic_add_return(). It should be in this example, I believe. The tool also supports CAS: cas_strong_explicit(bar,0,1,memory_order_seq_cst,memory_order_seq_cst); cas_strong(bar,0,1); cas_weak likewise... With that change, I get a few more executions but still the 3 consistent ones for either outcome. > > > It's not hard > > > to find well-known memory-ordering experts shouting "Just use > > > memory_model_seq_cst for everything, it's too hard otherwise". 
> > > > Everyone I've heard saying this meant this as advice to people new to > > synchronization or just dealing infrequently with it. The advice is the > > simple and safe fallback, and I don't think it's meant as an > > acknowledgment that the model itself would be too hard. If the > > language's memory model is supposed to represent weak HW memory models > > to at least some extent, there's only so much you can do in terms of > > keeping it simple. If all architectures had x86-like models, the > > language's model would certainly be simpler... :) > > That is said a lot, but there was a recent Linux-kernel example that > turned out to be quite hard to prove for x86. ;-) > > > > Then there's > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume > > > atm and optimises all of the data dependencies away) > > > > AFAIK consume memory order was added to model Power/ARM-specific > > behavior. I agree that the way the standard specifies how dependencies > > are to be preserved is kind of vague (as far as I understand it). See > > GCC PR 59448. > > This one? http://gcc.gnu.org/ml/gcc-bugs/2013-12/msg01083.html Yes, this bug, and I'd like to get feedback on the implementability of this: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448#c10 > That does indeed look to match what Will was calling out as a problem. > > > > as well as the definition > > > of "data races", which seem to be used as an excuse to miscompile a program > > > at the earliest opportunity. > > > > No. The purpose of this is to *not disallow* every optimization on > > non-synchronizing code. Due to the assumption of data-race-free > > programs, the compiler can assume a sequential code sequence when no > > atomics are involved (and thus, keep applying optimizations for > > sequential code). > > > > Or is there something particular that you dislike about the > > specification of data races? > > Cut Will a break, Torvald! ;-) Sorry if my comment came across too harsh. 
I've observed a couple of times that people forget about the compiler's role in this, and it might not be obvious, so I wanted to point out just that. ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-06 22:58 ` Torvald Riegel @ 2014-02-07 4:06 ` Paul E. McKenney 2014-02-07 9:13 ` Torvald Riegel 0 siblings, 1 reply; 285+ messages in thread From: Paul E. McKenney @ 2014-02-07 4:06 UTC (permalink / raw) To: Torvald Riegel Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Thu, Feb 06, 2014 at 11:58:22PM +0100, Torvald Riegel wrote: > On Thu, 2014-02-06 at 13:55 -0800, Paul E. McKenney wrote: > > On Thu, Feb 06, 2014 at 10:09:25PM +0100, Torvald Riegel wrote: > > > On Thu, 2014-02-06 at 18:59 +0000, Will Deacon wrote: > > > > On Thu, Feb 06, 2014 at 06:55:01PM +0000, Ramana Radhakrishnan wrote: > > > > > On 02/06/14 18:25, David Howells wrote: > > > > > > > > > > > > Is it worth considering a move towards using C11 atomics and barriers and > > > > > > compiler intrinsics inside the kernel? The compiler _ought_ to be able to do > > > > > > these. > > > > > > > > > > > > > > > It sounds interesting to me, if we can make it work properly and > > > > > reliably. + gcc@gcc.gnu.org for others in the GCC community to chip in. > > > > > > > > Given my (albeit limited) experience playing with the C11 spec and GCC, I > > > > really think this is a bad idea for the kernel. > > > > > > I'm not going to comment on what's best for the kernel (simply because I > > > don't work on it), but I disagree with several of your statements. > > > > > > > It seems that nobody really > > > > agrees on exactly how the C11 atomics map to real architectural > > > > instructions on anything but the trivial architectures. > > > > > > There's certainly different ways to implement the memory model and those > > > have to be specified elsewhere, but I don't see how this differs much > > > from other things specified in the ABI(s) for each architecture. > > > > > > > For example, should > > > > the following code fire the assert? 
> > > > > > I don't see how your example (which is about what the language requires > > > or not) relates to the statement about the mapping above? > > > > > > > > > > > extern atomic<int> foo, bar, baz; > > > > > > > > void thread1(void) > > > > { > > > > foo.store(42, memory_order_relaxed); > > > > bar.fetch_add(1, memory_order_seq_cst); > > > > baz.store(42, memory_order_relaxed); > > > > } > > > > > > > > void thread2(void) > > > > { > > > > while (baz.load(memory_order_seq_cst) != 42) { > > > > /* do nothing */ > > > > } > > > > > > > > assert(foo.load(memory_order_seq_cst) == 42); > > > > } > > > > > > > > > > It's a good example. My first gut feeling was that the assertion should > > > never fire, but that was wrong because (as I seem to usually forget) the > > > seq-cst total order is just a constraint but doesn't itself contribute > > > to synchronizes-with -- but this is different for seq-cst fences. > > > > From what I can see, Will's point is that mapping the Linux kernel's > > atomic_add_return() primitive into fetch_add() does not work because > > atomic_add_return()'s ordering properties require that the assert() > > never fire. > > > > Augmenting the fetch_add() with a seq_cst fence would work on many > > architectures, but not for all similar examples. The reason is that > > the C11 seq_cst fence is deliberately weak compared to ARM's dmb or > > Power's sync. To your point, I believe that it would make the above > > example work, but there are some IRIW-like examples that would fail > > according to the standard (though a number of specific implementations > > would in fact work correctly). > > Thanks for the background. I don't read LKML, and it wasn't obvious > from reading just the part of the thread posted to gcc@ that achieving > these semantics is the goal. Lots of history on both sides of this one, I am afraid! 
;-) > > > > To answer that question, you need to go and look at the definitions of > > > > synchronises-with, happens-before, dependency_ordered_before and a whole > > > > pile of vaguely written waffle to realise that you don't know. > > > > > > Are you familiar with the formalization of the C11/C++11 model by Batty > > > et al.? > > > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf > > > > > > They also have a nice tool that can run condensed examples and show you > > > all allowed (and forbidden) executions (it runs in the browser, so is > > > slow for larger examples), including nice annotated graphs for those: > > > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/ > > > > > > It requires somewhat special syntax, but the following, which should be > > > equivalent to your example above, runs just fine: > > > > > > int main() { > > > atomic_int foo = 0; > > > atomic_int bar = 0; > > > atomic_int baz = 0; > > > {{{ { > > > foo.store(42, memory_order_relaxed); > > > bar.store(1, memory_order_seq_cst); > > > baz.store(42, memory_order_relaxed); > > > } > > > ||| { > > > r1=baz.load(memory_order_seq_cst).readsvalue(42); > > > r2=foo.load(memory_order_seq_cst).readsvalue(0); > > > } > > > }}}; > > > return 0; } > > > > > > That yields 3 consistent executions for me, and likewise if the last > > > readsvalue() is using 42 as argument. > > > > > > If you add a "fence(memory_order_seq_cst);" after the store to foo, the > > > program can't observe != 42 for foo anymore, because the seq-cst fence > > > is adding a synchronizes-with edge via the baz reads-from. > > > > > > I think this is a really neat tool, and very helpful to answer such > > > questions as in your example. > > > > Hmmm... The tool doesn't seem to like fetch_add(). But let's assume that > > your substitution of store() for fetch_add() is correct. Then this shows > > that we cannot substitute fetch_add() for atomic_add_return(). 
> > It should be in this example, I believe. You lost me on this one. > The tool also supports CAS: > cas_strong_explicit(bar,0,1,memory_order_seq_cst,memory_order_seq_cst); > cas_strong(bar,0,1); > cas_weak likewise... > > With that change, I get a few more executions but still the 3 consistent > ones for either outcome. Good point, thank you for the tip! > > > > It's not hard > > > > to find well-known memory-ordering experts shouting "Just use > > > > memory_model_seq_cst for everything, it's too hard otherwise". > > > > > > Everyone I've heard saying this meant this as advice to people new to > > > synchronization or just dealing infrequently with it. The advice is the > > > simple and safe fallback, and I don't think it's meant as an > > > acknowledgment that the model itself would be too hard. If the > > > language's memory model is supposed to represent weak HW memory models > > > to at least some extent, there's only so much you can do in terms of > > > keeping it simple. If all architectures had x86-like models, the > > > language's model would certainly be simpler... :) > > > > That is said a lot, but there was a recent Linux-kernel example that > > turned out to be quite hard to prove for x86. ;-) > > > > > > Then there's > > > > the fun of load-consume vs load-acquire (arm64 GCC completely ignores consume > > > > atm and optimises all of the data dependencies away) > > > > > > AFAIK consume memory order was added to model Power/ARM-specific > > > behavior. I agree that the way the standard specifies how dependencies > > > are to be preserved is kind of vague (as far as I understand it). See > > > GCC PR 59448. > > > > This one? http://gcc.gnu.org/ml/gcc-bugs/2013-12/msg01083.html > > Yes, this bug, and I'd like to get feedback on the implementability of > this: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448#c10 Agreed that the load and store need to be atomic. 
From a Linux-kernel perspective, perhaps some day ACCESS_ONCE() can be a volatile atomic operation. This will require a bit of churn because things like: ACCESS_ONCE(x)++; have no obvious C11 counterpart that I know of, at least not one that can be reasonably implemented within the confines of a C-preprocessor macro. But this can probably be overcome at some point. (And yes, the usage of ACCESS_ONCE() has grown way beyond what I originally had in mind, but such is life.) > > That does indeed look to match what Will was calling out as a problem. > > > > > > as well as the definition > > > > of "data races", which seem to be used as an excuse to miscompile a program > > > > at the earliest opportunity. > > > > > > No. The purpose of this is to *not disallow* every optimization on > > > non-synchronizing code. Due to the assumption of data-race-free > > > programs, the compiler can assume a sequential code sequence when no > > > atomics are involved (and thus, keep applying optimizations for > > > sequential code). > > > > > > Or is there something particular that you dislike about the > > > specification of data races? > > > > Cut Will a break, Torvald! ;-) > > Sorry if my comment came across too harsh. I've observed a couple of > times that people forget about the compiler's role in this, and it might > not be obvious, so I wanted to point out just that. As far as I know, no need to apologize. The problem we are having is that Linux-kernel hackers need to make their code work with whatever is available from the compiler. The compiler is written to a standard that, even with C11, is insufficient for the needs of kernel hackers: MMIO, subtle differences in memory-ordering semantics, memory fences that are too weak to implement smp_mb() in the general case, and so on. So kernel hackers need to use non-standard extensions, inline assembly, and sometimes even compiler implementation details. 
So it is only natural to expect the occasional bout of frustration on the part of the kernel hacker. For the compiler writer's part, I am sure that having to deal with odd cases that are outside of the standard is also sometimes frustrating. Almost like both the kernel hackers and the compiler writers were living in the real world or something. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-07 4:06 ` Paul E. McKenney @ 2014-02-07 9:13 ` Torvald Riegel 2014-02-07 16:44 ` Paul E. McKenney 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-07 9:13 UTC (permalink / raw) To: paulmck Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Thu, 2014-02-06 at 20:06 -0800, Paul E. McKenney wrote: > On Thu, Feb 06, 2014 at 11:58:22PM +0100, Torvald Riegel wrote: > > On Thu, 2014-02-06 at 13:55 -0800, Paul E. McKenney wrote: > > > On Thu, Feb 06, 2014 at 10:09:25PM +0100, Torvald Riegel wrote: > > > > On Thu, 2014-02-06 at 18:59 +0000, Will Deacon wrote: > > > > > To answer that question, you need to go and look at the definitions of > > > > > synchronises-with, happens-before, dependency_ordered_before and a whole > > > > > pile of vaguely written waffle to realise that you don't know. > > > > > > > > Are you familiar with the formalization of the C11/C++11 model by Batty > > > > et al.? 
> > > > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf > > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf > > > > > > > > They also have a nice tool that can run condensed examples and show you > > > > all allowed (and forbidden) executions (it runs in the browser, so is > > > > slow for larger examples), including nice annotated graphs for those: > > > > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/ > > > > > > > > It requires somewhat special syntax, but the following, which should be > > > > equivalent to your example above, runs just fine: > > > > > > > > int main() { > > > > atomic_int foo = 0; > > > > atomic_int bar = 0; > > > > atomic_int baz = 0; > > > > {{{ { > > > > foo.store(42, memory_order_relaxed); > > > > bar.store(1, memory_order_seq_cst); > > > > baz.store(42, memory_order_relaxed); > > > > } > > > > ||| { > > > > r1=baz.load(memory_order_seq_cst).readsvalue(42); > > > > r2=foo.load(memory_order_seq_cst).readsvalue(0); > > > > } > > > > }}}; > > > > return 0; } > > > > > > > > That yields 3 consistent executions for me, and likewise if the last > > > > readsvalue() is using 42 as argument. > > > > > > > > If you add a "fence(memory_order_seq_cst);" after the store to foo, the > > > > program can't observe != 42 for foo anymore, because the seq-cst fence > > > > is adding a synchronizes-with edge via the baz reads-from. > > > > > > > > I think this is a really neat tool, and very helpful to answer such > > > > questions as in your example. > > > > > > Hmmm... The tool doesn't seem to like fetch_add(). But let's assume that > > > your substitution of store() for fetch_add() is correct. Then this shows > > > that we cannot substitute fetch_add() for atomic_add_return(). > > > > It should be in this example, I believe. > > You lost me on this one. I mean that in this example, substituting fetch_add() with store() should not change meaning, given that what the fetch_add reads-from seems irrelevant. 
^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-07 9:13 ` Torvald Riegel @ 2014-02-07 16:44 ` Paul E. McKenney 0 siblings, 0 replies; 285+ messages in thread From: Paul E. McKenney @ 2014-02-07 16:44 UTC (permalink / raw) To: Torvald Riegel Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc On Fri, Feb 07, 2014 at 10:13:40AM +0100, Torvald Riegel wrote: > On Thu, 2014-02-06 at 20:06 -0800, Paul E. McKenney wrote: > > On Thu, Feb 06, 2014 at 11:58:22PM +0100, Torvald Riegel wrote: > > > On Thu, 2014-02-06 at 13:55 -0800, Paul E. McKenney wrote: > > > > On Thu, Feb 06, 2014 at 10:09:25PM +0100, Torvald Riegel wrote: > > > > > On Thu, 2014-02-06 at 18:59 +0000, Will Deacon wrote: > > > > > > To answer that question, you need to go and look at the definitions of > > > > > > synchronises-with, happens-before, dependency_ordered_before and a whole > > > > > > pile of vaguely written waffle to realise that you don't know. > > > > > > > > > > Are you familiar with the formalization of the C11/C++11 model by Batty > > > > > et al.? 
> > > > > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf > > > > > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf > > > > > > > > > > They also have a nice tool that can run condensed examples and show you > > > > > all allowed (and forbidden) executions (it runs in the browser, so is > > > > > slow for larger examples), including nice annotated graphs for those: > > > > > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/ > > > > > > > > > > It requires somewhat special syntax, but the following, which should be > > > > > equivalent to your example above, runs just fine: > > > > > > > > > > int main() { > > > > > atomic_int foo = 0; > > > > > atomic_int bar = 0; > > > > > atomic_int baz = 0; > > > > > {{{ { > > > > > foo.store(42, memory_order_relaxed); > > > > > bar.store(1, memory_order_seq_cst); > > > > > baz.store(42, memory_order_relaxed); > > > > > } > > > > > ||| { > > > > > r1=baz.load(memory_order_seq_cst).readsvalue(42); > > > > > r2=foo.load(memory_order_seq_cst).readsvalue(0); > > > > > } > > > > > }}}; > > > > > return 0; } > > > > > > > > > > That yields 3 consistent executions for me, and likewise if the last > > > > > readsvalue() is using 42 as argument. > > > > > > > > > > If you add a "fence(memory_order_seq_cst);" after the store to foo, the > > > > > program can't observe != 42 for foo anymore, because the seq-cst fence > > > > > is adding a synchronizes-with edge via the baz reads-from. > > > > > > > > > > I think this is a really neat tool, and very helpful to answer such > > > > > questions as in your example. > > > > > > > > Hmmm... The tool doesn't seem to like fetch_add(). But let's assume that > > > > your substitution of store() for fetch_add() is correct. Then this shows > > > > that we cannot substitute fetch_add() for atomic_add_return(). > > > > > > It should be in this example, I believe. > > > > You lost me on this one. 
> > I mean that in this example, substituting fetch_add() with store() > should not change meaning, given that what the fetch_add reads-from > seems irrelevant. Got it. Agreed, though your other suggestion of substituting CAS is more convincing. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-06 21:09 ` Torvald Riegel 2014-02-06 21:55 ` Paul E. McKenney @ 2014-02-06 22:13 ` Joseph S. Myers 2014-02-06 23:25 ` Torvald Riegel 2014-02-07 12:01 ` Will Deacon 2 siblings, 1 reply; 285+ messages in thread From: Joseph S. Myers @ 2014-02-06 22:13 UTC (permalink / raw) To: Torvald Riegel Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, paulmck, gcc On Thu, 6 Feb 2014, Torvald Riegel wrote: > > It seems that nobody really > > agrees on exactly how the C11 atomics map to real architectural > > instructions on anything but the trivial architectures. > > There's certainly different ways to implement the memory model and those > have to be specified elsewhere, but I don't see how this differs much > from other things specified in the ABI(s) for each architecture. It is not clear to me that there is any good consensus understanding of how to specify this in an ABI, or how, given an ABI, to determine whether an implementation is valid. For ABIs not considering atomics / concurrency, it's well understood, for example, that the ABI specifies observable conditions at function call boundaries ("these arguments are in these registers", "the stack pointer has this alignment", "on return from the function, the values of these registers are unchanged from the values they had on entry"). It may sometimes specify things at other times (e.g. presence or absence of a red zone - whether memory beyond the stack pointer may be overwritten on an interrupt). But if it gives a code sequence, it's clear this is just an example rather than a requirement for particular code to be generated - any code sequence suffices if it meets the observable conditions at the points where code generated by one implementation may pass control to code generated by another implementation. 
When atomics are involved, you no longer have a limited set of well-defined points where control may pass from code generated by one implementation to code generated by another - the code generated by the two may be running concurrently. We know of certain cases <http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html> where there are choices of the mapping of atomic operations to particular instructions. But I'm not sure there's much evidence that these are the only ABI issues arising from concurrency - that there aren't any other ways in which an implementation may transform code, consistent with the as-if rule of ISO C, that may run into incompatibilities of different choices. And even if those are the only issues, it's not clear there are well-understood ways to define the mapping from the C11 memory model to the architecture's model, which provide a good way to reason about whether a particular choice of instructions is valid according to the mapping. > Are you familiar with the formalization of the C11/C++11 model by Batty > et al.? > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf These discuss, as well as the model itself, proving the validity of a particular choice of x86 instructions. I imagine that might be a starting point towards an understanding of how to describe the relevant issues in an ABI, and how to determine whether a choice of instructions is consistent with what an ABI says. But I don't get the impression this is yet at the level where people not doing research in the area can routinely read and write such ABIs and work out whether a code sequence is consistent with them. (If an ABI says "use instruction X", then you can't use a more efficient X' added by a new version of the instruction set. 
But it can't necessarily be as loose as saying "use X and Y, or other instructions that achieve semantics when the other thread is using X or Y", because it might be the case that Y' interoperates with X, X' interoperates with Y, but X' and Y' don't interoperate with each other. I'd envisage something more like mapping not to instructions, but to concepts within the architecture's own memory model - but that requires the architecture's memory model to be as well defined, and probably formalized, as the C11 one.) -- Joseph S. Myers joseph@codesourcery.com ^ permalink raw reply [flat|nested] 285+ messages in thread
* Re: [RFC][PATCH 0/5] arch: atomic rework 2014-02-06 22:13 ` Joseph S. Myers @ 2014-02-06 23:25 ` Torvald Riegel 2014-02-06 23:33 ` Joseph S. Myers 0 siblings, 1 reply; 285+ messages in thread From: Torvald Riegel @ 2014-02-06 23:25 UTC (permalink / raw) To: Joseph S. Myers Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, paulmck, gcc On Thu, 2014-02-06 at 22:13 +0000, Joseph S. Myers wrote: > On Thu, 6 Feb 2014, Torvald Riegel wrote: > > > > It seems that nobody really > > > agrees on exactly how the C11 atomics map to real architectural > > > instructions on anything but the trivial architectures. > > > > There's certainly different ways to implement the memory model and those > > have to be specified elsewhere, but I don't see how this differs much > > from other things specified in the ABI(s) for each architecture. > > It is not clear to me that there is any good consensus understanding of > how to specify this in an ABI, or how, given an ABI, to determine whether > an implementation is valid. > > For ABIs not considering atomics / concurrency, it's well understood, for > example, that the ABI specifies observable conditions at function call > boundaries ("these arguments are in these registers", "the stack pointer > has this alignment", "on return from the function, the values of these > registers are unchanged from the values they had on entry"). It may > sometimes specify things at other times (e.g. presence or absence of a red > zone - whether memory beyond the stack pointer may be overwritten on an > interrupt). But if it gives a code sequence, it's clear this is just an > example rather than a requirement for particular code to be generated - > any code sequence suffices if it meets the observable conditions at the > points where code generated by one implementation may pass control to code > generated by another implementation. 
> When atomics are involved, you no longer have a limited set of
> well-defined points where control may pass from code generated by one
> implementation to code generated by another - the code generated by the
> two may be running concurrently.

Agreed.

> We know of certain cases
> <http://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html> where there are
> choices of the mapping of atomic operations to particular instructions.
> But I'm not sure there's much evidence that these are the only ABI
> issues arising from concurrency - that there aren't any other ways in
> which an implementation may transform code, consistent with the as-if
> rule of ISO C, that may run into incompatibilities of different choices.

I can't think of other incompatibilities with high likelihood -- provided
we ignore consume memory order and the handling of dependencies (see
below).  But I would doubt there is a high risk of such, because (1) the
data race definition should hopefully not cause subtle incompatibilities
and (2) there is a clear "hand-off point" from the compiler to a specific
instruction representing an atomic access.

For example, if we have a release store, and we agree on the instructions
used for that, then compilers will have to ensure happens-before for
anything before the release store; for example, as long as stores
sequenced-before the release store are performed, it doesn't matter in
which order that happens.  Subsequently, an acquire-load somewhere else
can pick this sequence of events up just by using the agreed-upon
acquire-load; like with the stores, it can order subsequent loads in any
way it sees fit, including different optimizations.  That's obviously not
a formal proof, though.  But it seems likely to me, at least :)

I'm more concerned about consume and dependencies because, as far as I
understand the standard's requirements, dependencies need to be tracked
across function calls.
Thus, we might have several compilers involved in that, and we can't just
"condense" things to happens-before; instead, it matters how (and that) we
keep the dependencies themselves intact.  Because of that, I'm wondering
whether this is actually implementable practically.
(See http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59448#c10)

> And even if those are the only issues, it's not clear there are
> well-understood ways to define the mapping from the C11 memory model to
> the architecture's model, which provide a good way to reason about
> whether a particular choice of instructions is valid according to the
> mapping.

I think that if we have different options, there needs to be agreement on
which to choose across the compilers, at the very least.  I don't quite
know how this looks for GCC and LLVM, for example.

> > Are you familiar with the formalization of the C11/C++11 model by
> > Batty et al.?
> > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
>
> These discuss, as well as the model itself, proving the validity of a
> particular choice of x86 instructions.  I imagine that might be a
> starting point towards an understanding of how to describe the relevant
> issues in an ABI, and how to determine whether a choice of instructions
> is consistent with what an ABI says.  But I don't get the impression
> this is yet at the level where people not doing research in the area
> can routinely read and write such ABIs and work out whether a code
> sequence is consistent with them.

It's certainly complicated stuff (IMHO).  On the positive side, it's just
a few compilers, the kernel, etc. that have to deal with this, if most
programmers use the languages' memory model.

> (If an ABI says "use instruction X", then you can't use a more efficient
> X' added by a new version of the instruction set.
> But it can't necessarily be as loose as saying "use X and Y, or other
> instructions that achieve the same semantics when the other thread is
> using X or Y", because it might be the case that Y' interoperates with
> X, X' interoperates with Y, but X' and Y' don't interoperate with each
> other.  I'd envisage something more like mapping not to instructions,
> but to concepts within the architecture's own memory model - but that
> requires the architecture's memory model to be as well defined, and
> probably formalized, as the C11 one.)

Yes.  The same group of researchers has also worked on formalizing the
Power model, and uses this as the base for a proof of the correctness of
the proposed mappings:
http://www.cl.cam.ac.uk/~pes20/cppppc/

The formal approach to all this might be a complex task, but it is more
confidence-inspiring than making guesses about one standard's prose vs.
another specification's prose.
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 23:25       ` Torvald Riegel
@ 2014-02-06 23:33         ` Joseph S. Myers
  0 siblings, 0 replies; 285+ messages in thread
From: Joseph S. Myers @ 2014-02-06 23:33 UTC (permalink / raw)
To: Torvald Riegel
Cc: Will Deacon, Ramana Radhakrishnan, David Howells, Peter Zijlstra,
	linux-arch, linux-kernel, torvalds, akpm, mingo, paulmck, gcc

On Fri, 7 Feb 2014, Torvald Riegel wrote:
> I think that if we have different options, there needs to be agreement
> on which to choose across the compilers, at the very least.  I don't
> quite know how this looks for GCC and LLVM, for example.

I'm not sure we even necessarily get compatibility for the alignment of
_Atomic types yet (and no ABI document I've seen discusses that issue).

-- 
Joseph S. Myers
joseph@codesourcery.com
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 21:09   ` Torvald Riegel
  2014-02-06 21:55     ` Paul E. McKenney
  2014-02-06 22:13     ` Joseph S. Myers
@ 2014-02-07 12:01     ` Will Deacon
  2014-02-07 16:47       ` Paul E. McKenney
  2 siblings, 1 reply; 285+ messages in thread
From: Will Deacon @ 2014-02-07 12:01 UTC (permalink / raw)
To: Torvald Riegel
Cc: Ramana Radhakrishnan, David Howells, Peter Zijlstra, linux-arch,
	linux-kernel, torvalds, akpm, mingo, paulmck, gcc

Hello Torvald,

It looks like Paul clarified most of the points I was trying to make
(thanks Paul!), so I won't go back over them here.

On Thu, Feb 06, 2014 at 09:09:25PM +0000, Torvald Riegel wrote:
> Are you familiar with the formalization of the C11/C++11 model by Batty
> et al.?
> http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
>
> They also have a nice tool that can run condensed examples and show you
> all allowed (and forbidden) executions (it runs in the browser, so is
> slow for larger examples), including nice annotated graphs for those:
> http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/

Thanks for the link, that's incredibly helpful.  I've used ppcmem and
armmem in the past, but I didn't realise they have a version for C++11
too.  Actually, the armmem backend doesn't implement our atomic
instructions or the acquire/release accessors, so it's not been as
useful as it could be.  I should probably try to learn OCaml...

> IMHO, one thing worth considering is that for C/C++, C11/C++11 is the
> only memory model that has widespread support.  So, even though it's a
> fairly weak memory model (unless you go for the "only seq-cst"
> beginners' advice) and thus comes with a higher complexity, this model
> is what most people will likely be familiar with over time.  Deviating
> from the "standard" model can have valid reasons, but it also has a
> cost in that new contributors are more likely to be familiar with the
> "standard" model.
Indeed, I wasn't trying to write off the C11 memory model as something we
can never use in the kernel.  I just don't think the current situation is
anywhere close to usable for a project such as Linux.  If a greater
understanding of the memory model does eventually manifest amongst C/C++
developers (by which I mean, the beginners' advice is really treated as
such and there is a widespread intuition about ordering guarantees, as
opposed to the need to use formal tools), then surely the tools and
libraries will stabilise and provide uniform semantics across the 25+
architectures that Linux currently supports.  If *that* happens, this
discussion is certainly worth having again.

Will
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-07 12:01     ` Will Deacon
@ 2014-02-07 16:47       ` Paul E. McKenney
  0 siblings, 0 replies; 285+ messages in thread
From: Paul E. McKenney @ 2014-02-07 16:47 UTC (permalink / raw)
To: Will Deacon
Cc: Torvald Riegel, Ramana Radhakrishnan, David Howells,
	Peter Zijlstra, linux-arch, linux-kernel, torvalds, akpm, mingo, gcc

On Fri, Feb 07, 2014 at 12:01:25PM +0000, Will Deacon wrote:
> Hello Torvald,
>
> It looks like Paul clarified most of the points I was trying to make
> (thanks Paul!), so I won't go back over them here.
>
> On Thu, Feb 06, 2014 at 09:09:25PM +0000, Torvald Riegel wrote:
> > Are you familiar with the formalization of the C11/C++11 model by
> > Batty et al.?
> > http://www.cl.cam.ac.uk/~mjb220/popl085ap-sewell.pdf
> > http://www.cl.cam.ac.uk/~mjb220/n3132.pdf
> >
> > They also have a nice tool that can run condensed examples and show
> > you all allowed (and forbidden) executions (it runs in the browser,
> > so is slow for larger examples), including nice annotated graphs for
> > those:
> > http://svr-pes20-cppmem.cl.cam.ac.uk/cppmem/
>
> Thanks for the link, that's incredibly helpful.  I've used ppcmem and
> armmem in the past, but I didn't realise they have a version for C++11
> too.  Actually, the armmem backend doesn't implement our atomic
> instructions or the acquire/release accessors, so it's not been as
> useful as it could be.  I should probably try to learn OCaml...

That would be very cool!  Another option would be to recruit a grad
student to take on that project for Peter Sewell.  He might already have
one, for all I know.

> > IMHO, one thing worth considering is that for C/C++, C11/C++11 is the
> > only memory model that has widespread support.  So, even though it's
> > a fairly weak memory model (unless you go for the "only seq-cst"
> > beginners' advice) and thus comes with a higher complexity, this
> > model is what most people will likely be familiar with over time.
> > Deviating from the "standard" model can have valid reasons, but it
> > also has a cost in that new contributors are more likely to be
> > familiar with the "standard" model.
>
> Indeed, I wasn't trying to write off the C11 memory model as something
> we can never use in the kernel.  I just don't think the current
> situation is anywhere close to usable for a project such as Linux.  If
> a greater understanding of the memory model does eventually manifest
> amongst C/C++ developers (by which I mean, the beginners' advice is
> really treated as such and there is a widespread intuition about
> ordering guarantees, as opposed to the need to use formal tools), then
> surely the tools and libraries will stabilise and provide uniform
> semantics across the 25+ architectures that Linux currently supports.
> If *that* happens, this discussion is certainly worth having again.

And it is likely to be worthwhile even before then on a piecemeal basis,
where architecture maintainers pick and choose which primitive is in
inline assembly and which the compiler can deal with properly.  For
example, I bet that atomic_inc() can be implemented just fine by C11 in
the very near future.  However, atomic_add_return() is another story.

							Thanx, Paul
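[Editor's note: a hedged C11 sketch of Paul's distinction, not from the
thread.  The kernel's atomic_inc() implies no memory ordering at all, so
a relaxed fetch-add matches it directly; atomic_add_return() must imply a
full memory barrier before and after the operation, which on some weakly
ordered architectures is stronger than what a single seq_cst RMW
provides.  The `my_` names are hypothetical.]

```c
#include <stdatomic.h>

/* atomic_inc() has no ordering requirements: a relaxed RMW suffices. */
static inline void my_atomic_inc(atomic_int *v)
{
    atomic_fetch_add_explicit(v, 1, memory_order_relaxed);
}

/*
 * atomic_add_return() must behave as a full barrier.  A seq_cst
 * fetch-add is the closest single C11 operation, but the kernel's
 * "full barrier before and after" requirement can be stronger than
 * seq_cst alone on some architectures -- which is presumably why Paul
 * calls atomic_add_return() "another story".
 */
static inline int my_atomic_add_return(int i, atomic_int *v)
{
    return atomic_fetch_add_explicit(v, i, memory_order_seq_cst) + i;
}
```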
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-06 18:25 ` David Howells
    ` (2 preceding siblings ...)
  2014-02-06 18:55 ` Ramana Radhakrishnan
@ 2014-02-06 19:21 ` Linus Torvalds
  3 siblings, 0 replies; 285+ messages in thread
From: Linus Torvalds @ 2014-02-06 19:21 UTC (permalink / raw)
To: David Howells
Cc: Peter Zijlstra, linux-arch, Linux Kernel Mailing List,
	Andrew Morton, Ingo Molnar, Will Deacon, Paul McKenney,
	ramana.radhakrishnan

On Thu, Feb 6, 2014 at 10:25 AM, David Howells <dhowells@redhat.com> wrote:
>
> Is it worth considering a move towards using C11 atomics and barriers
> and compiler intrinsics inside the kernel?  The compiler _ought_ to be
> able to do these.

I think that's a bad idea as a generic implementation, but it's quite
possibly worth doing on an architecture-by-architecture basis.

The thing is, gcc builtins sometimes suck.  Sometimes it's because
certain architectures want particular gcc versions that don't do a good
job, sometimes it's because the architecture doesn't lend itself to
doing what we want done and we end up better off with some tweaked
thing, sometimes it's because the language feature is just badly
designed.  I'm not convinced that the C11 people got things right.

But in specific cases, maybe the architecture wants to use the builtin.
And the gcc version checks are likely to be architecture-specific for a
longish while.

So some particular architectures might want to use them, but I really
doubt it makes sense to make it the default with arch overrides.

                Linus
* Re: [RFC][PATCH 0/5] arch: atomic rework
  [not found] ` <20140210205719.GY5002@laptop.programming.kicks-ass.net>
@ 2014-02-10 21:08   ` Chris Metcalf
  2014-02-10 21:14     ` Peter Zijlstra
  0 siblings, 1 reply; 285+ messages in thread
From: Chris Metcalf @ 2014-02-10 21:08 UTC (permalink / raw)
To: Peter Zijlstra, Linux Kernel Mailing List

(+LKML again)

On 2/10/2014 3:57 PM, Peter Zijlstra wrote:
> On Mon, Feb 10, 2014 at 03:50:04PM -0500, Chris Metcalf wrote:
>> On 2/6/2014 8:52 AM, Peter Zijlstra wrote:
>>> It's been compiled on everything I have a compiler for, however frv
>>> and tile are missing because they're special and I was tired.
>> So what's the specialness on tile?
> It's not doing the atomic work in ASM but uses magic builtins or such.
>
> I got the list of magic funcs for tilegx, but didn't look into the
> 32-bit chips.

Oh, I see.  The <asm/atomic.h> files on tile are already reasonably
well-factored.  It's possible you could do better, but I think not by too
much, other than possibly by using <asm-generic/atomic.h> for some of the
common idioms like "subtraction is addition with a negative second
argument", etc., which hasn't been done elsewhere.

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com
* Re: [RFC][PATCH 0/5] arch: atomic rework
  2014-02-10 21:08   ` Chris Metcalf
@ 2014-02-10 21:14     ` Peter Zijlstra
  0 siblings, 0 replies; 285+ messages in thread
From: Peter Zijlstra @ 2014-02-10 21:14 UTC (permalink / raw)
To: Chris Metcalf; +Cc: Linux Kernel Mailing List

On Mon, Feb 10, 2014 at 04:08:11PM -0500, Chris Metcalf wrote:
> (+LKML again)
>
> On 2/10/2014 3:57 PM, Peter Zijlstra wrote:
> > On Mon, Feb 10, 2014 at 03:50:04PM -0500, Chris Metcalf wrote:
> >> On 2/6/2014 8:52 AM, Peter Zijlstra wrote:
> >>> It's been compiled on everything I have a compiler for, however frv
> >>> and tile are missing because they're special and I was tired.
> >> So what's the specialness on tile?
> > It's not doing the atomic work in ASM but uses magic builtins or such.
> >
> > I got the list of magic funcs for tilegx, but didn't look into the
> > 32-bit chips.
>
> Oh, I see.  The <asm/atomic.h> files on tile are already reasonably
> well-factored.
>
> It's possible you could do better, but I think not by too much, other
> than possibly by using <asm-generic/atomic.h> for some of the common
> idioms like "subtraction is addition with a negative second argument",
> etc., which hasn't been done elsewhere.

The last patch 5/5 adds a few atomic ops; I could of course use
cmpxchg() loops for everything, but I found tilegx actually has fetch_or
and fetch_and to implement atomic_or() / atomic_and().  It doesn't have
fetch_xor() from what I've been told, so atomic_xor() will have to
become a cmpxchg() loop.
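[Editor's note: the cmpxchg() loop Peter mentions for atomic_xor()
follows the classic compare-and-swap retry pattern.  Below is a hedged
sketch in C11 terms, not the actual tile or kernel code; the kernel's
version would use its own cmpxchg() on the architecture's atomic type,
and the `my_` name is hypothetical.]

```c
#include <stdatomic.h>

/*
 * Atomic XOR built from a compare-and-swap loop: the fallback for
 * machines that lack a native fetch_xor primitive.
 */
static void my_atomic_xor(atomic_int *v, int mask)
{
    int old = atomic_load_explicit(v, memory_order_relaxed);

    /* Retry until no other CPU modified *v between our load and the
     * CAS; on failure, 'old' is refreshed with the current value. */
    while (!atomic_compare_exchange_weak_explicit(
                   v, &old, old ^ mask,
                   memory_order_relaxed, memory_order_relaxed))
        ;
}
```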