From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1756608AbaCESh3 (ORCPT ); Wed, 5 Mar 2014 13:37:29 -0500 Received: from mail-ig0-f174.google.com ([209.85.213.174]:45016 "EHLO mail-ig0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753882AbaCEShX (ORCPT ); Wed, 5 Mar 2014 13:37:23 -0500 MIME-Version: 1.0 Reply-To: Peter.Sewell@cl.cam.ac.uk In-Reply-To: <1394039718.28840.15041.camel@triegel.csb> References: <20140224185341.GU8264@linux.vnet.ibm.com> <1393515453.28840.9961.camel@triegel.csb> <20140227190611.GU8264@linux.vnet.ibm.com> <20140227205312.GX8264@linux.vnet.ibm.com> <20140301005047.GA14777@linux.vnet.ibm.com> <20140301140351.GO11910@linux.vnet.ibm.com> <1393879459.28840.11943.camel@triegel.csb> <1394039718.28840.15041.camel@triegel.csb> Date: Wed, 5 Mar 2014 18:37:22 +0000 X-Google-Sender-Auth: 9sibvMYqOiJgHuzy4cH6_yxIFJc Message-ID: Subject: Re: [RFC][PATCH 0/5] arch: atomic rework From: Peter Sewell To: Torvald Riegel Cc: Paul McKenney , Linus Torvalds , Will Deacon , Peter Zijlstra , Ramana Radhakrishnan , David Howells , "linux-arch@vger.kernel.org" , "linux-kernel@vger.kernel.org" , "akpm@linux-foundation.org" , "mingo@kernel.org" , "gcc@gcc.gnu.org" Content-Type: text/plain; charset=ISO-8859-1 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 5 March 2014 17:15, Torvald Riegel wrote: > On Tue, 2014-03-04 at 22:11 +0000, Peter Sewell wrote: >> On 3 March 2014 20:44, Torvald Riegel wrote: >> > On Sun, 2014-03-02 at 04:05 -0600, Peter Sewell wrote: >> >> On 1 March 2014 08:03, Paul E. McKenney wrote: >> >> > On Sat, Mar 01, 2014 at 04:06:34AM -0600, Peter Sewell wrote: >> >> >> Hi Paul, >> >> >> >> >> >> On 28 February 2014 18:50, Paul E. McKenney wrote: >> >> >> > On Thu, Feb 27, 2014 at 12:53:12PM -0800, Paul E. McKenney wrote: >> >> >> >> On Thu, Feb 27, 2014 at 11:47:08AM -0800, Linus Torvalds wrote: >> >> >> >> > On Thu, Feb 27, 2014 at 11:06 AM, Paul E. McKenney >> >> >> >> > wrote: >> >> >> >> > > >> >> >> >> > > 3. The comparison was against another RCU-protected pointer, >> >> >> >> > > where that other pointer was properly fetched using one >> >> >> >> > > of the RCU primitives. Here it doesn't matter which pointer >> >> >> >> > > you use. At least as long as the rcu_assign_pointer() for >> >> >> >> > > that other pointer happened after the last update to the >> >> >> >> > > pointed-to structure. >> >> >> >> > > >> >> >> >> > > I am a bit nervous about #3. Any thoughts on it? >> >> >> >> > >> >> >> >> > I think that it might be worth pointing out as an example, and saying >> >> >> >> > that code like >> >> >> >> > >> >> >> >> > p = atomic_read(consume); >> >> >> >> > X; >> >> >> >> > q = atomic_read(consume); >> >> >> >> > Y; >> >> >> >> > if (p == q) >> >> >> >> > data = p->val; >> >> >> >> > >> >> >> >> > then the access of "p->val" is constrained to be data-dependent on >> >> >> >> > *either* p or q, but you can't really tell which, since the compiler >> >> >> >> > can decide that the values are interchangeable. >> >> >> >> > >> >> >> >> > I cannot for the life of me come up with a situation where this would >> >> >> >> > matter, though. If "X" contains a fence, then that fence will be a >> >> >> >> > stronger ordering than anything the consume through "p" would >> >> >> >> > guarantee anyway. And if "X" does *not* contain a fence, then the >> >> >> >> > atomic reads of p and q are unordered *anyway*, so then whether the >> >> >> >> > ordering to the access through "p" is through p or q is kind of >> >> >> >> > irrelevant. No? >> >> >> >> >> >> >> >> I can make a contrived litmus test for it, but you are right, the only >> >> >> >> time you can see it happen is when X has no barriers, in which case >> >> >> >> you don't have any ordering anyway -- both the compiler and the CPU can >> >> >> >> reorder the loads into p and q, and the read from p->val can, as you say, >> >> >> >> come from either pointer. >> >> >> >> >> >> >> >> For whatever it is worth, hear is the litmus test: >> >> >> >> >> >> >> >> T1: p = kmalloc(...); >> >> >> >> if (p == NULL) >> >> >> >> deal_with_it(); >> >> >> >> p->a = 42; /* Each field in its own cache line. */ >> >> >> >> p->b = 43; >> >> >> >> p->c = 44; >> >> >> >> atomic_store_explicit(&gp1, p, memory_order_release); >> >> >> >> p->b = 143; >> >> >> >> p->c = 144; >> >> >> >> atomic_store_explicit(&gp2, p, memory_order_release); >> >> >> >> >> >> >> >> T2: p = atomic_load_explicit(&gp2, memory_order_consume); >> >> >> >> r1 = p->b; /* Guaranteed to get 143. */ >> >> >> >> q = atomic_load_explicit(&gp1, memory_order_consume); >> >> >> >> if (p == q) { >> >> >> >> /* The compiler decides that q->c is same as p->c. */ >> >> >> >> r2 = p->c; /* Could get 44 on weakly order system. */ >> >> >> >> } >> >> >> >> >> >> >> >> The loads from gp1 and gp2 are, as you say, unordered, so you get what >> >> >> >> you get. >> >> >> >> >> >> >> >> And publishing a structure via one RCU-protected pointer, updating it, >> >> >> >> then publishing it via another pointer seems to me to be asking for >> >> >> >> trouble anyway. If you really want to do something like that and still >> >> >> >> see consistency across all the fields in the structure, please put a lock >> >> >> >> in the structure and use it to guard updates and accesses to those fields. >> >> >> > >> >> >> > And here is a patch documenting the restrictions for the current Linux >> >> >> > kernel. The rules change a bit due to rcu_dereference() acting a bit >> >> >> > differently than atomic_load_explicit(&p, memory_order_consume). >> >> >> > >> >> >> > Thoughts? >> >> >> >> >> >> That might serve as informal documentation for linux kernel >> >> >> programmers about the bounds on the optimisations that you expect >> >> >> compilers to do for common-case RCU code - and I guess that's what you >> >> >> intend it to be for. But I don't see how one can make it precise >> >> >> enough to serve as a language definition, so that compiler people >> >> >> could confidently say "yes, we respect that", which I guess is what >> >> >> you really need. As a useful criterion, we should aim for something >> >> >> precise enough that in a verified-compiler context you can >> >> >> mathematically prove that the compiler will satisfy it (even though >> >> >> that won't happen anytime soon for GCC), and that analysis tool >> >> >> authors can actually know what they're working with. All this stuff >> >> >> about "you should avoid cancellation", and "avoid masking with just a >> >> >> small number of bits" is just too vague. >> >> > >> >> > Understood, and yes, this is intended to document current compiler >> >> > behavior for the Linux kernel community. It would not make sense to show >> >> > it to the C11 or C++11 communities, except perhaps as an informational >> >> > piece on current practice. >> >> > >> >> >> The basic problem is that the compiler may be doing sophisticated >> >> >> reasoning with a bunch of non-local knowledge that it's deduced from >> >> >> the code, neither of which are well-understood, and here we have to >> >> >> identify some envelope, expressive enough for RCU idioms, in which >> >> >> that reasoning doesn't allow data/address dependencies to be removed >> >> >> (and hence the hardware guarantee about them will be maintained at the >> >> >> source level). >> >> >> >> >> >> The C11 syntactic notion of dependency, whatever its faults, was at >> >> >> least precise, could be reasoned about locally (just looking at the >> >> >> syntactic code in question), and did do that. The fact that current >> >> >> compilers do optimisations that remove dependencies and will likely >> >> >> have many bugs at present is besides the point - this was surely >> >> >> intended as a *new* constraint on what they are allowed to do. The >> >> >> interesting question is really whether the compiler writers think that >> >> >> they *could* implement it in a reasonable way - I'd like to hear >> >> >> Torvald and his colleagues' opinion on that. >> >> >> >> >> >> What you're doing above seems to be basically a very cut-down version >> >> >> of that, but with a fuzzy boundary. If you want it to be precise, >> >> >> maybe it needs to be much simpler (which might force you into ruling >> >> >> out some current code idioms). >> >> > >> >> > I hope that Torvald Riegel's proposal (https://lkml.org/lkml/2014/2/27/806) >> >> > can be developed to serve this purpose. >> >> >> >> (I missed that mail when it first came past, sorry) >> >> >> >> That's also going to be tricky, I'm afraid. The key condition there is: >> >> >> >> "* at the time of execution of E, L [PS: I assume that L is a >> >> typo and should be E] >> > >> > No, L was intended. (But see below.) >> >> then I misunderstood... >> >> >> can possibly have returned at >> >> least two different values under the assumption that L itself >> >> could have returned any value allowed by L's type." >> >> >> >> First, the evaluation of E might be nondeterministic - e.g., for an >> >> artificial example, if it's just a nondeterministic value obtained >> >> from the result of a race on SC atomics. The above doesn't >> >> distinguish between that (which doesn't have a real dependency on L) >> >> and that XOR'd with L (which does). And it does so in the wrong >> >> direction: it'll say there the former has a dependency on L. >> > >> > I'm not quite sure I understand the examples you want to point out >> > (could you add brief code snippets, perhaps?) -- but the informal >> > definition I proposed also says that E must have used L in some way to >> > make it's computation. That's pretty vague, and the way I phrased the >> > requirement is probably not optimal. >> >> ...so this example isn't so relevant, but it's maybe interesting >> anyway. In free-wheeling psuedocode: >> >> x = read_consume(L) // a memory_order_consume read of a 1-bit value from L >> y = nondet() // some code that >> nondeterministically produces another 1-bit value, eg by spawning two >> threads that each do an SC-atomic write to some other location, >> returning 0 or 1 depending on which wins >> >> then evaluate either just >> >> z=y >> >> or >> >> z = x XOR y >> >> In my misinterpretation of what you wrote, your definition would say >> there's a dependency from the load of L to the evalution of y, even >> though there isn't. > > The evaluations "z==y" or "z=y" wouldn't depend on L, assuming nondet() > doesn't depend on L (e.g., by using x). They don't use x's value. (And > I don't have a precise definition for this, unfortunately, but I hope > the intent is clear.) > > "x XOR y" would depend on L, because this does take x into account, and > x's value remains relevant irrespective of which value y has. y would > be non-vdp, and the compiler could have specialized the code into two > branches for both 1 and 0 values of y; but it would still need x to > compute z in the last evaluation. > >> >> > So let me expand a bit on the background first. What I tried to capture >> > with the rules is that an evaluation (ie, E), really uses the value >> > returned by L, and not just a, for example, constant value that can be >> > inferred from whatever was executed between L and E. This (is intended >> > to) prevent value prediction and such by programmers. The compiler in >> > turn knows what a real program is allowed to do that still wants to rely >> > on the ordering by the consume. Basing all of this on the values seemed >> > to be helpful because basing it on *purely* syntax didn't seem >> > implementable; for the latter, we'd need to disallow cases such as x-x >> > or require compilers to include artificial dependencies if a program >> > indeed does value speculation. IOW, it didn't seem possible to draw a >> > clear distinction between a data dependency defined purely based on >> > syntax and control dependencies. >> > >> > Another way to perhaps phrase the requirement might be to construct it >> > inductively, so that "E uses the value returned by L" becomes clearer. >> > However, I currently don't really know how to do that without any holes. >> > >> > For example, let's say that an atomic mo_consume load has a value >> > dependency to itself; the dependency has an associated set of values, >> > which is equal to all values allowed by the type (but see also below). >> > Then, an evaluation A that uses B as operand has a value dependency on B >> > if the associated set of values can still have more than one element. >> > Whether that's the case depends on the operation, and the other >> > operands, including their values at the time of execution. >> > However, it's not just results of evaluations that effectively constrain >> > the set of values. Conditional execution does as well, because the >> > condition being true might establish a constraint on L, which might >> > remove any dependency when L (or a result of a computation involving L) >> > is used. So I'm not sure the inductive approach would work. >> > >> > You said you thought I might have wanted to say that: >> > "E really-depends on L if there exist two different values that (just >> > according to typing) might be read for L that give rise to two >> > different values for E". >> > >> > Which should follow from the above, or at least should not conflict with >> > it. My "definition" tried to capture that the program must not >> > establish so many constraints between E and L that the value of L is >> > clear in the sense of having exactly one value. If we dereference such >> > an E whose execution in fact constrains L to one value, there is no real >> > dependency anymore. >> > However, by itself, this doesn't cover that E must use L in it's >> > computation -- which your formulation does, I think. >> >> unfortunately not - that's what the example above shows. > > Now I think I understand. > >> > Another way might be to try to define which constraints established by >> > an execution (based on the executed code's semantics) actually remove >> > value dependencies on L. >> > >> > Do you have any suggestions for how to define this (ie, true value >> > dependencies) in a better way? >> >> not at the moment. I need to think some more about your vdps, though. >> As I understand it, you're agreeing with the general intent of the >> original C11 design that some new mechanism to tell the compiler not >> to optimise in certain places in required, but differing in the way >> you identify those? > > Yes, we still need such a mechanism because we need to prevent the > compiler from doing value prediction or something similar that uses > control dependencies to avoid data dependencies. For example, it could > use a large switch statement to add code specialized for each possible > value returned from an mo_consume load. Thus, at least for the > standard, we need some mechanism that prevents certain compiler > optimizations. > > Where it differs to C11 is that we're not trying to use just such a > mechanism built on syntax to try to define when we actually have a real > value dependency. > >> >> >> Second, it involves reasoning about counterfactual executions. That >> >> doesn't necessarily make it wrong, per se, but probably makes it hard >> >> to work with. For example, suppose that in all the actual >> >> whole-program executions, a runtime occurrence of L only ever returns >> >> one particular value (perhaps because of some simple #define'd >> >> configuration) >> > >> > Right, or just because it's a deterministic program. >> > >> > This is a problem for the definition of value dependencies, AFAICT, >> > because where do you put the line for what the compiler is allowed to >> > speculate about? When does the program indeed "reveal" the value of a >> > load, and when does it not? Is a compiler to do out-of-band tracking >> > for certain memory locations, and thus predict the value? >> > >> > Adding the assumption that L, at least conceptually, might return any >> > value seemed to be a simple way to remove that uncertainty regarding >> > what the compiler might know about. >> >> ...but it brings in a lot of baggage. > > Which baggage do you mean? In terms of verification, or definitions in > the standard, or effects on compilers? At least the first two. I'm not in a position to comment on the last, but I'd hate to try to do compiler verification based on a semantics that involves substantial quantification over counterfactual hypothetical executions. >> >> , and that the code used in the evaluation of E depends >> >> on some invariant which is related to that configuration. >> > >> > A related issue that I've been thinking about is whether things like >> > divide-by-zero (ie, operations that give rise to undefined behavior) >> > should be considered as constraints on the values or not. I guess they >> > should, but then this seems closer what you mention, invariants >> > established by the whole program. >> > >> >> The >> >> hypothetical execution used above in which a different value is used >> >> is one in the code is being run in a situation with broken invariants. >> >> Then there will be technical difficulties in using the definition: >> >> I don't see how one would persuade oneself that a compiler always >> >> satisfies it, because these hypothetical executions are far removed >> >> from what it's actually working on. >> > >> > I think the above is easy to implement for a compiler. At the >> > mo_consume load, you simply forget any information you have about this >> > value / memory location; for operations on value_dep_preserving types, >> >> are all the subexpressions likewise forced to be of vdp types? > > I'd like to avoid that if possible. Things like "vdp-pointer + int" > should Just Work in the sense of being still vdp-typed without requiring > an explicit cast for the int operand. Operators ||, &&, ?: might > justify more fine-grained rules, but I guess it's better to just make > more stuff vdp if that's easier. vdp is not a sufficient condition for > there being a value-dependency anyway, and the few optimizations it > prevents on expressions containing some vdp-typed operands probably > don't matter much. Also, as I mentioned in the reply to Paul, I believe > that if the compiler can prove that there's no value-dependency, it can > ignore the vdp-annotation (IOW, as-if is still possible). >