* rcu_read_lock lost its compiler barrier [not found] ` <20150921204327.GH4029@linux.vnet.ibm.com> @ 2019-06-02 5:56 ` Herbert Xu 2019-06-02 20:54 ` Linus Torvalds 2019-06-03 0:06 ` Paul E. McKenney 0 siblings, 2 replies; 54+ messages in thread From: Herbert Xu @ 2019-06-02 5:56 UTC (permalink / raw) To: Paul E. McKenney Cc: Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, netdev, David S. Miller, Linus Torvalds Digging up an old email because I was not aware of this previously but Paul pointed me to it during another discussion. On Mon, Sep 21, 2015 at 01:43:27PM -0700, Paul E. McKenney wrote: > On Mon, Sep 21, 2015 at 09:30:49PM +0200, Frederic Weisbecker wrote: > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > > index d63bb77..6c3cece 100644 > > > --- a/include/linux/rcupdate.h > > > +++ b/include/linux/rcupdate.h > > > @@ -297,12 +297,14 @@ void synchronize_rcu(void); > > > > > > static inline void __rcu_read_lock(void) > > > { > > > - preempt_disable(); > > > + if (IS_ENABLED(CONFIG_PREEMPT_COUNT)) > > > + preempt_disable(); > > > > preempt_disable() is a no-op when !CONFIG_PREEMPT_COUNT, right? > > Or rather it's a barrier(), which is anyway implied by rcu_read_lock(). > > > > So perhaps we can get rid of the IS_ENABLED() check? > > Actually, barrier() is not intended to be implied by rcu_read_lock(). > In a non-preemptible RCU implementation, it doesn't help anything > to have the compiler flush its temporaries upon rcu_read_lock() > and rcu_read_unlock(). This is seriously broken. RCU has been around for years and is used throughout the kernel while the compiler barrier existed. You can't then go and decide to remove the compiler barrier! To do that you'd need to audit every single use of rcu_read_lock in the kernel to ensure that they're not depending on the compiler barrier. This is also contrary to the definition of almost every other *_lock primitive in the kernel where the compiler barrier is included. 
So please revert this patch. Thanks, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-02 5:56 ` rcu_read_lock lost its compiler barrier Herbert Xu @ 2019-06-02 20:54 ` Linus Torvalds 2019-06-03 2:46 ` Herbert Xu 2019-06-03 0:06 ` Paul E. McKenney 1 sibling, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2019-06-02 20:54 UTC (permalink / raw) To: Herbert Xu Cc: Paul E. McKenney, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller On Sat, Jun 1, 2019 at 10:56 PM Herbert Xu <herbert@gondor.apana.org.au> wrote: > > You can't then go and decide to remove the compiler barrier! To do > that you'd need to audit every single use of rcu_read_lock in the > kernel to ensure that they're not depending on the compiler barrier. What's the possible case where it would matter when there is no preemption? Linus
* Re: rcu_read_lock lost its compiler barrier 2019-06-02 20:54 ` Linus Torvalds @ 2019-06-03 2:46 ` Herbert Xu 2019-06-03 3:47 ` Paul E. McKenney 2019-06-06 8:38 ` Andrea Parri 0 siblings, 2 replies; 54+ messages in thread From: Herbert Xu @ 2019-06-03 2:46 UTC (permalink / raw) To: Linus Torvalds Cc: Paul E. McKenney, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller

On Sun, Jun 02, 2019 at 01:54:12PM -0700, Linus Torvalds wrote:
> On Sat, Jun 1, 2019 at 10:56 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
> >
> > You can't then go and decide to remove the compiler barrier!  To do
> > that you'd need to audit every single use of rcu_read_lock in the
> > kernel to ensure that they're not depending on the compiler barrier.
>
> What's the possible case where it would matter when there is no preemption?

The case we were discussing is from net/ipv4/inet_fragment.c from the net-next tree:

	void fqdir_exit(struct fqdir *fqdir)
	{
		...
		fqdir->dead = true;

		/* call_rcu is supposed to provide memory barrier semantics,
		 * separating the setting of fqdir->dead with the destruction
		 * work.  This implicit barrier is paired with inet_frag_kill().
		 */
		INIT_RCU_WORK(&fqdir->destroy_rwork, fqdir_rwork_fn);
		queue_rcu_work(system_wq, &fqdir->destroy_rwork);
	}

and

	void inet_frag_kill(struct inet_frag_queue *fq)
	{
		...
		rcu_read_lock();
		/* The RCU read lock provides a memory barrier
		 * guaranteeing that if fqdir->dead is false then
		 * the hash table destruction will not start until
		 * after we unlock.  Paired with inet_frags_exit_net().
		 */
		if (!fqdir->dead) {
			rhashtable_remove_fast(&fqdir->rhashtable, &fq->node,
					       fqdir->f->rhash_params);
			...
		}
		...
		rcu_read_unlock();
		...
	}

I simplified this to

	Initial values:

	a = 0
	b = 0

	CPU1				CPU2
	----				----
	a = 1				rcu_read_lock
	synchronize_rcu			if (a == 0)
	b = 2				  b = 1
					rcu_read_unlock

	On exit we want this to be true:
	b == 2

Now what Paul was telling me is that unless every memory operation is done with READ_ONCE/WRITE_ONCE then his memory model shows that the exit constraint won't hold.  IOW, we need

	CPU1				CPU2
	----				----
	WRITE_ONCE(a, 1)		rcu_read_lock
	synchronize_rcu			if (READ_ONCE(a) == 0)
	WRITE_ONCE(b, 2)		  WRITE_ONCE(b, 1)
					rcu_read_unlock

Now I think this is bullshit because if we really needed these compiler barriers then we surely would need real memory barriers to go with them.  In fact, the sole purpose of the RCU mechanism is to provide those memory barriers.  Quoting from Documentation/RCU/Design/Requirements/Requirements.html:

	<li>	Each CPU that has an RCU read-side critical section that
		begins before <tt>synchronize_rcu()</tt> starts is
		guaranteed to execute a full memory barrier between the time
		that the RCU read-side critical section ends and the time that
		<tt>synchronize_rcu()</tt> returns.
		Without this guarantee, a pre-existing RCU read-side critical section
		might hold a reference to the newly removed <tt>struct foo</tt>
		after the <tt>kfree()</tt> on line 14 of
		<tt>remove_gp_synchronous()</tt>.
	<li>	Each CPU that has an RCU read-side critical section that ends
		after <tt>synchronize_rcu()</tt> returns is guaranteed
		to execute a full memory barrier between the time that
		<tt>synchronize_rcu()</tt> begins and the time that the RCU
		read-side critical section begins.
		Without this guarantee, a later RCU read-side critical section
		running after the <tt>kfree()</tt> on line 14 of
		<tt>remove_gp_synchronous()</tt> might
		later run <tt>do_something_gp()</tt> and find the
		newly deleted <tt>struct foo</tt>.

My review of the RCU code shows that these memory barriers are indeed present (at least when we're not in tiny mode where all this discussion would be moot anyway).
For example, in call_rcu we eventually get down to rcu_segcblist_enqueue which has an smp_mb.  On the reader side (correct me if I'm wrong Paul) the memory barrier is implicitly coming from the scheduler.

My point is that within our kernel whenever we have a CPU memory barrier we always have a compiler barrier too.  Therefore my code example above does not need any extra compiler barriers such as the ones provided by READ_ONCE/WRITE_ONCE.

I think Paul was perhaps thinking that I'm expecting rcu_read_lock/rcu_read_unlock themselves to provide the memory or compiler barriers.  That would indeed be wrong but this is not what I need.  All I need is the RCU semantics as documented for there to be memory and compiler barriers around the whole grace period.

Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: rcu_read_lock lost its compiler barrier 2019-06-03 2:46 ` Herbert Xu @ 2019-06-03 3:47 ` Paul E. McKenney 2019-06-03 4:01 ` Herbert Xu 2019-06-03 5:26 ` Herbert Xu 1 sibling, 2 replies; 54+ messages in thread From: Paul E. McKenney @ 2019-06-03 3:47 UTC (permalink / raw) To: Herbert Xu Cc: Linus Torvalds, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller

On Mon, Jun 03, 2019 at 10:46:40AM +0800, Herbert Xu wrote:
> On Sun, Jun 02, 2019 at 01:54:12PM -0700, Linus Torvalds wrote:
> > On Sat, Jun 1, 2019 at 10:56 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > >
> > > You can't then go and decide to remove the compiler barrier!  To do
> > > that you'd need to audit every single use of rcu_read_lock in the
> > > kernel to ensure that they're not depending on the compiler barrier.
> >
> > What's the possible case where it would matter when there is no preemption?
>
> The case we were discussing is from net/ipv4/inet_fragment.c from
> the net-next tree:
>
> 	void fqdir_exit(struct fqdir *fqdir)
> 	{
> 		...
> 		fqdir->dead = true;
>
> 		/* call_rcu is supposed to provide memory barrier semantics,
> 		 * separating the setting of fqdir->dead with the destruction
> 		 * work.  This implicit barrier is paired with inet_frag_kill().
> 		 */
> 		INIT_RCU_WORK(&fqdir->destroy_rwork, fqdir_rwork_fn);
> 		queue_rcu_work(system_wq, &fqdir->destroy_rwork);
> 	}
>
> and
>
> 	void inet_frag_kill(struct inet_frag_queue *fq)
> 	{
> 		...
> 		rcu_read_lock();
> 		/* The RCU read lock provides a memory barrier
> 		 * guaranteeing that if fqdir->dead is false then
> 		 * the hash table destruction will not start until
> 		 * after we unlock.  Paired with inet_frags_exit_net().
> 		 */
> 		if (!fqdir->dead) {
> 			rhashtable_remove_fast(&fqdir->rhashtable, &fq->node,
> 					       fqdir->f->rhash_params);
> 			...
> 		}
> 		...
> 		rcu_read_unlock();
> 		...
> 	}
>
> I simplified this to
>
> 	Initial values:
>
> 	a = 0
> 	b = 0
>
> 	CPU1				CPU2
> 	----				----
> 	a = 1				rcu_read_lock
> 	synchronize_rcu			if (a == 0)
> 	b = 2				  b = 1
> 					rcu_read_unlock
>
> 	On exit we want this to be true:
> 	b == 2
>
> Now what Paul was telling me is that unless every memory operation
> is done with READ_ONCE/WRITE_ONCE then his memory model shows that
> the exit constraint won't hold.

But please note that the plain-variable portion of the memory model is very new and likely still has a bug or two.  In fact, see below.

> IOW, we need
>
> 	CPU1				CPU2
> 	----				----
> 	WRITE_ONCE(a, 1)		rcu_read_lock
> 	synchronize_rcu			if (READ_ONCE(a) == 0)
> 	WRITE_ONCE(b, 2)		  WRITE_ONCE(b, 1)
> 					rcu_read_unlock
>
> Now I think this is bullshit because if we really needed these compiler
> barriers then we surely would need real memory barriers to go with
> them.

On the one hand, you have no code before your rcu_read_lock() and also no code after your rcu_read_unlock().  So in this particular example, adding compiler barriers to these guys won't help you.

On the other hand, on CPU 1's write to "b", I agree with you and disagree with the model, though perhaps my partners in LKMM crime will show me the error of my ways on this point.  On CPU 2's write to "b", I can see the memory model's point, but getting there requires some gymnastics on the part of both the compiler and the CPU.

The WRITE_ONCE() and READ_ONCE() for "a" is the normal requirement for variables that are concurrently loaded and stored.  Please note that garden-variety uses of RCU have similar requirements, namely the rcu_assign_pointer() on the one side and the rcu_dereference() on the other.  Your use case allows rcu_assign_pointer() to be weakened to WRITE_ONCE() and rcu_dereference() to be weakened to READ_ONCE() (not that this last is all that much of a weakening these days).

> In fact, the sole purpose of the RCU mechanism is to provide those
> memory barriers.  Quoting from
> Documentation/RCU/Design/Requirements/Requirements.html:
>
> <li>	Each CPU that has an RCU read-side critical section that
> 	begins before <tt>synchronize_rcu()</tt> starts is
> 	guaranteed to execute a full memory barrier between the time
> 	that the RCU read-side critical section ends and the time that
> 	<tt>synchronize_rcu()</tt> returns.
> 	Without this guarantee, a pre-existing RCU read-side critical section
> 	might hold a reference to the newly removed <tt>struct foo</tt>
> 	after the <tt>kfree()</tt> on line 14 of
> 	<tt>remove_gp_synchronous()</tt>.
> <li>	Each CPU that has an RCU read-side critical section that ends
> 	after <tt>synchronize_rcu()</tt> returns is guaranteed
> 	to execute a full memory barrier between the time that
> 	<tt>synchronize_rcu()</tt> begins and the time that the RCU
> 	read-side critical section begins.
> 	Without this guarantee, a later RCU read-side critical section
> 	running after the <tt>kfree()</tt> on line 14 of
> 	<tt>remove_gp_synchronous()</tt> might
> 	later run <tt>do_something_gp()</tt> and find the
> 	newly deleted <tt>struct foo</tt>.

Ahem.

1.	These guarantees are of full memory barriers, -not- compiler barriers.

2.	These rules don't say exactly where these full memory barriers go.  SRCU is at one extreme, placing those full barriers in srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU at the other, placing these barriers entirely within the callback queueing/invocation, grace-period computation, and the scheduler.  Preemptible Tree RCU is in the middle, with rcu_read_unlock() sometimes including a full memory barrier, but other times with the full memory barrier being confined as it is with !PREEMPT Tree RCU.

So let's place those memory barriers in your example, interleaving:

	CPU2: rcu_read_lock();
	CPU1: WRITE_ONCE(a, 1)
	CPU2: if (READ_ONCE(a) == 0)
	CPU2: WRITE_ONCE(b, 1)
	CPU2: rcu_read_unlock
	/* Could put a full memory barrier here, but it wouldn't help. */
	CPU1: synchronize_rcu
	CPU1: b = 2;

Here the accesses to "a" are concurrent, and there are legal placements for the required full memory barrier that don't change that.  In fact, I cannot see how any memory-barrier placement can help here.  So the WRITE_ONCE() and READ_ONCE() should be used for "a".  And again, in garden-variety RCU use cases, these would be rcu_assign_pointer() and rcu_dereference(), respectively.  So again, I advise using READ_ONCE() and WRITE_ONCE() for the accesses to "a", for documentation purposes, even ignoring the future proofing.

Now let's do the (admittedly quite crazy, at least here in 2019) profiling-directed transformation of the "b = 1" on a weakly ordered system:

	CPU1: WRITE_ONCE(a, 1)
	CPU1: synchronize_rcu
	CPU1: b = 2;

	CPU2: rcu_read_lock();
	CPU2: if (READ_ONCE(a) == 0)
	CPU2:   if (b != 1)
	CPU2:     b = 1;
	CPU2: rcu_read_unlock

Interleaving and inserting full memory barriers as per the rules above:

	CPU1: WRITE_ONCE(a, 1)
	CPU1: synchronize_rcu
	/* Could put a full memory barrier here, but it wouldn't help. */
	CPU2: rcu_read_lock();
	CPU1: b = 2;
	CPU2: if (READ_ONCE(a) == 0)
	CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */
	CPU2: b = 1;
	CPU2: rcu_read_unlock

In fact, CPU2's load from b might be moved up to race with CPU1's store, which (I believe) is why the model complains in this case.  I still cannot imagine why CPU1's "b = 2" needs to be WRITE_ONCE(), perhaps due to a failure of imagination on my part.

> My review of the RCU code shows that these memory barriers are
> indeed present (at least when we're not in tiny mode where all
> this discussion would be moot anyway).  For example, in call_rcu
> we eventually get down to rcu_segcblist_enqueue which has an smp_mb.
> On the reader side (correct me if I'm wrong Paul) the memory
> barrier is implicitly coming from the scheduler.
>
> My point is that within our kernel whenever we have a CPU memory
> barrier we always have a compiler barrier too.  Therefore my code
> example above does not need any extra compiler barriers such as
> the ones provided by READ_ONCE/WRITE_ONCE.

For the garden-variety RCU use cases, this is true.  Instead, they require things that are as strong or stronger than READ_ONCE/WRITE_ONCE.  For example:

	CPU 0:
		p = kzalloc(sizeof(*p), GFP_KERNEL);
		BUG_ON(!p);
		p->a = 42;
		r1 = p->b;
		rcu_assign_pointer(gp, p);

	CPU 1:
		rcu_read_lock()
		q = rcu_dereference(gp);
		r2 = q->a;
		q->b = 202;
		rcu_read_unlock()

In this case, the plain C-language loads and stores cannot possibly execute concurrently, and the model agrees with this.  Again, we didn't use WRITE_ONCE() and READ_ONCE() -- we instead used rcu_assign_pointer() and rcu_dereference().

Similar examples involving (say) list_del_rcu() and synchronize_rcu() are also safe for plain C-language loads and stores.  Plain C-language accesses by readers to the item being deleted cannot race with plain C-language accesses after the synchronize_rcu() returns.  But here we are using list_del_rcu() in the update-side code along with the synchronize_rcu() and the readers are again using rcu_dereference().

> I think Paul was perhaps thinking that I'm expecting
> rcu_read_lock/rcu_read_unlock themselves to provide the memory
> or compiler barriers.  That would indeed be wrong but this is
> not what I need.  All I need is the RCU semantics as documented
> for there to be memory and compiler barriers around the whole
> grace period.

Whew!!!  That was -exactly- what I was thinking, and I am glad that you are not advocating rcu_read_lock() and rcu_read_unlock() supplying those barriers.  ;-)

And again, it is early days for plain accesses for the Linux-kernel memory model.  I am confident that READ_ONCE() and WRITE_ONCE() for "a" is a very good thing to do, but less confident for "b", most especially for CPU1's store to "b".  Either way, your example is an excellent test case for the plain C-language access capability of the Linux-kernel memory model, so thank you for that!
Thanx, Paul
* Re: rcu_read_lock lost its compiler barrier 2019-06-03 3:47 ` Paul E. McKenney @ 2019-06-03 4:01 ` Herbert Xu 2019-06-03 4:17 ` Herbert Xu 2019-06-03 7:23 ` Paul E. McKenney 2019-06-03 5:26 ` Herbert Xu 1 sibling, 2 replies; 54+ messages in thread From: Herbert Xu @ 2019-06-03 4:01 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller

On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote:
>
> CPU2: if (b != 1)
> CPU2: b = 1;

Stop right there.  The kernel is full of code that assumes that assignment to an int/long is atomic.  If your compiler breaks this assumption then we can kiss the kernel good-bye.

Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: rcu_read_lock lost its compiler barrier 2019-06-03 4:01 ` Herbert Xu @ 2019-06-03 4:17 ` Herbert Xu 0 siblings, 0 replies; 54+ messages in thread From: Herbert Xu @ 2019-06-03 4:17 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller

On Mon, Jun 03, 2019 at 12:01:14PM +0800, Herbert Xu wrote:
> On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote:
> >
> > CPU2: if (b != 1)
> > CPU2: b = 1;
>
> Stop right there.  The kernel is full of code that assumes that
> assignment to an int/long is atomic.  If your compiler breaks this
> assumption then we can kiss the kernel good-bye.

The slippery slope apparently started here:

: commit ea435467500612636f8f4fb639ff6e76b2496e4b
: Author: Matthew Wilcox <matthew@wil.cx>
: Date:   Tue Jan 6 14:40:39 2009 -0800
:
:     atomic_t: unify all arch definitions
:
: diff --git a/arch/x86/include/asm/atomic_32.h b/arch/x86/include/asm/atomic_32.h
: index ad5b9f6ecddf..85b46fba4229 100644
: --- a/arch/x86/include/asm/atomic_32.h
: +++ b/arch/x86/include/asm/atomic_32.h
: ...
: @@ -10,15 +11,6 @@
:  * resource counting etc..
:  */
:
: -/*
: - * Make sure gcc doesn't try to be clever and move things around
: - * on us. We need to use _exactly_ the address the user gave us,
: - * not some alias that contains the same information.
: - */
: -typedef struct {
: -	int counter;
: -} atomic_t;
:
: diff --git a/include/linux/types.h b/include/linux/types.h
: index 121f349cb7ec..3b864f2d9560 100644
: --- a/include/linux/types.h
: +++ b/include/linux/types.h
: @@ -195,6 +195,16 @@ typedef u32 phys_addr_t;
:
:  typedef phys_addr_t resource_size_t;
:
: +typedef struct {
: +	volatile int counter;
: +} atomic_t;
: +

Before evolving into the READ_ONCE/WRITE_ONCE that we have now.
Linus, are we now really supporting a compiler where an assignment (or a read) from an int/long/pointer can be non-atomic without the volatile marker? Because if that's the case then we have a lot of code to audit. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: rcu_read_lock lost its compiler barrier 2019-06-03 4:01 ` Herbert Xu 2019-06-03 4:17 ` Herbert Xu @ 2019-06-03 7:23 ` Paul E. McKenney 2019-06-03 8:42 ` Paul E. McKenney 1 sibling, 1 reply; 54+ messages in thread From: Paul E. McKenney @ 2019-06-03 7:23 UTC (permalink / raw) To: Herbert Xu Cc: Linus Torvalds, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller

On Mon, Jun 03, 2019 at 12:01:14PM +0800, Herbert Xu wrote:
> On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote:
> >
> > CPU2: if (b != 1)
> > CPU2: b = 1;
>
> Stop right there.  The kernel is full of code that assumes that
> assignment to an int/long is atomic.  If your compiler breaks this
> assumption then we can kiss the kernel good-bye.

Here you go:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55981
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56028

TL;DR: On x86, if you are doing a plain store of a 32-bit constant that has bits set only in the lower few bits of each 16-bit half of that constant, the compiler is plenty happy to use a pair of 16-bit store-immediate instructions to carry out that store.  This is also known as "store tearing".

The two bugs were filed (and after some back and forth, fixed) because someone forgot to exclude C11 atomics and volatile accesses from this store tearing.

Thanx, Paul
* Re: rcu_read_lock lost its compiler barrier 2019-06-03 7:23 ` Paul E. McKenney @ 2019-06-03 8:42 ` Paul E. McKenney 2019-06-03 15:26 ` David Laight 0 siblings, 1 reply; 54+ messages in thread From: Paul E. McKenney @ 2019-06-03 8:42 UTC (permalink / raw) To: Herbert Xu Cc: Linus Torvalds, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller

On Mon, Jun 03, 2019 at 12:23:39AM -0700, Paul E. McKenney wrote:
> On Mon, Jun 03, 2019 at 12:01:14PM +0800, Herbert Xu wrote:
> > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote:
> > >
> > > CPU2: if (b != 1)
> > > CPU2: b = 1;
> >
> > Stop right there.  The kernel is full of code that assumes that
> > assignment to an int/long is atomic.  If your compiler breaks this
> > assumption then we can kiss the kernel good-bye.
>
> Here you go:
>
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=55981
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56028
>
> TL;DR: On x86, if you are doing a plain store of a 32-bit constant
> that has bits set only in the lower few bits of each 16-bit half of
> that constant, the compiler is plenty happy to use a pair of 16-bit
> store-immediate instructions to carry out that store.  This is also
> known as "store tearing".
>
> The two bugs were filed (and after some back and forth, fixed) because
> someone forgot to exclude C11 atomics and volatile accesses from this
> store tearing.

I should hasten to add that I have not seen load tearing, nor have I seen store tearing when storing a value unknown to the compiler.  However, plain C-language loads and stores can be invented, fused, and a few other "interesting" optimizations can be applied.

On kissing the kernel goodbye, a reasonable strategy might be to identify the transformations that are actually occurring (like the stores of certain constants called out above) and fix those.
We do occasionally use READ_ONCE() to prevent load-fusing optimizations that would otherwise cause the compiler to turn while-loops into if-statements guarding infinite loops. There is also the possibility of having the compiler guys give us more command-line arguments. Thanx, Paul
* RE: rcu_read_lock lost its compiler barrier 2019-06-03 8:42 ` Paul E. McKenney @ 2019-06-03 15:26 ` David Laight 2019-06-03 15:40 ` Linus Torvalds 0 siblings, 1 reply; 54+ messages in thread From: David Laight @ 2019-06-03 15:26 UTC (permalink / raw) To: 'paulmck@linux.ibm.com', Herbert Xu Cc: Linus Torvalds, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller

From: Paul E. McKenney
> Sent: 03 June 2019 09:42
...
> On kissing the kernel goodbye, a reasonable strategy might be to
> identify the transformations that are actually occurring (like the
> stores of certain constants called out above) and fix those.  We do
> occasionally use READ_ONCE() to prevent load-fusing optimizations that
> would otherwise cause the compiler to turn while-loops into if-statements
> guarding infinite loops.

In that case the variable ought to be volatile...

> There is also the possibility of having the
> compiler guys give us more command-line arguments.

I wonder how much the code size (of anything) would increase if the compiler:
1) Never read a value into a local more than once.
2) Never write a value that wasn't requested by the code.
3) Never used multiple memory accesses for 'machine word' (and smaller) items.

(1) makes all reads READ_ONCE() except that the actual read can be delayed until further down the code.  If I have a naive #define bswap32() I'd expect:

	v = bswap32(foo->bar)

to possibly read foo->bar multiple times, but not:

	int foo_bar = foo->bar;
	v = bswap32(foo_bar);

(2) makes all writes WRITE_ONCE() except that if there are multiple writes to the same location, only the last need be done.  In particular it stops speculative writes and the use of locations that are going to be written to as temporaries.  It also stops

	foo->bar = ~0;

being implemented as a clear then decrement.

(3) I'd never imagined the compiler would write the two halves of a word separately!
If the compiler behaved like that (as one might expect it would) then READ_ONCE() would be a READ_NOW() for when the sequencing mattered.

I was also recently surprised by the code I got from this loop:

	for (i = 0; i < limit; i++)
		sum64 += array32[i];

(as in the IP checksum sum without add carry support).  The compiler unrolled it to use horrid sequences of sse3/avx instructions.  This might be a gain for large enough buffers and 'hot cache' but for small buffers and likely cold cache it is horrid.

I guess such optimisations are valid, but I wonder how often they are an actual win for real programs.

David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
* Re: rcu_read_lock lost its compiler barrier 2019-06-03 15:26 ` David Laight @ 2019-06-03 15:40 ` Linus Torvalds 0 siblings, 0 replies; 54+ messages in thread From: Linus Torvalds @ 2019-06-03 15:40 UTC (permalink / raw) To: David Laight Cc: paulmck, Herbert Xu, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller On Mon, Jun 3, 2019 at 8:26 AM David Laight <David.Laight@aculab.com> wrote: > > From: Paul E. McKenney > > > We do > > occasionally use READ_ONCE() to prevent load-fusing optimizations that > > would otherwise cause the compiler to turn while-loops into if-statements > > guarding infinite loops. > > In that case the variable ought to be volatile... No. We do not use volatile on variables. The C people got the semantics wrong, probably because 'volatile' was historically for IO, not for access atomicity without locking. It's not the memory location that is volatile, it is really the _access_ that is volatile. The same memory location might be completely stable in other access situations (ie when done under a lock). In other words, we should *never* use volatile in the kernel. It's fundamentally mis-designed for modern use. (Of course, we then can use volatile in a cast in code, which drives some compiler people crazy, but that's because said compiler people don't care about reality, they care about some paperwork). Linus
* Re: rcu_read_lock lost its compiler barrier 2019-06-03 3:47 ` Paul E. McKenney 2019-06-03 4:01 ` Herbert Xu @ 2019-06-03 5:26 ` Herbert Xu 2019-06-03 6:42 ` Boqun Feng 2019-06-03 9:35 ` Paul E. McKenney 1 sibling, 2 replies; 54+ messages in thread From: Herbert Xu @ 2019-06-03 5:26 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller

On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote:
>
> 1. These guarantees are of full memory barriers, -not- compiler
> barriers.

What I'm saying is that wherever they are, they must come with compiler barriers.  I'm not aware of any synchronisation mechanism in the kernel that gives a memory barrier without a compiler barrier.

> 2. These rules don't say exactly where these full memory barriers
> go.  SRCU is at one extreme, placing those full barriers in
> srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU
> at the other, placing these barriers entirely within the callback
> queueing/invocation, grace-period computation, and the scheduler.
> Preemptible Tree RCU is in the middle, with rcu_read_unlock()
> sometimes including a full memory barrier, but other times with
> the full memory barrier being confined as it is with !PREEMPT
> Tree RCU.

The rules do say that the (full) memory barrier must precede any RCU read-side critical section that occurs after the synchronize_rcu, and must follow the end of any RCU read-side critical section that occurs before the synchronize_rcu.

All I'm arguing is that wherever that full mb is, as long as it also carries with it a barrier() (which it must do if it's done using an existing kernel mb/locking primitive), then we're fine.

> Interleaving and inserting full memory barriers as per the rules above:
>
> CPU1: WRITE_ONCE(a, 1)
> CPU1: synchronize_rcu
> /* Could put a full memory barrier here, but it wouldn't help. */

	CPU1: smp_mb();
	CPU2: smp_mb();

Let's put them in because I think they are critical.
smp_mb() also carries with it a barrier().

> CPU2: rcu_read_lock();
> CPU1: b = 2;
> CPU2: if (READ_ONCE(a) == 0)
> CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */
> CPU2: b = 1;
> CPU2: rcu_read_unlock
>
> In fact, CPU2's load from b might be moved up to race with CPU1's store,
> which (I believe) is why the model complains in this case.

Let's put aside my doubt over how we're even allowing a compiler to turn

	b = 1

into

	if (b != 1)
		b = 1

Since you seem to be assuming that (a == 0) is true in this case (as the assignment b = 1 is carried out), then because of the presence of the full memory barrier, the RCU read-side section must have started prior to the synchronize_rcu.  This means that synchronize_rcu is not allowed to return until at least the end of the grace period, or at least until the end of rcu_read_unlock.

So it actually should be:

	CPU1: WRITE_ONCE(a, 1)
	CPU1: synchronize_rcu called
	/* Could put a full memory barrier here, but it wouldn't help. */
	CPU1: smp_mb();
	CPU2: smp_mb();
	CPU2: grace period starts
	...time passes...
	CPU2: rcu_read_lock();
	CPU2: if (READ_ONCE(a) == 0)
	CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */
	CPU2: b = 1;
	CPU2: rcu_read_unlock
	...time passes...
	CPU2: grace period ends
	/* This full memory barrier is also guaranteed by RCU. */
	CPU2: smp_mb();
	CPU1: synchronize_rcu returns
	CPU1: b = 2;

Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: rcu_read_lock lost its compiler barrier 2019-06-03 5:26 ` Herbert Xu @ 2019-06-03 6:42 ` Boqun Feng 2019-06-03 20:03 ` Paul E. McKenney 2019-06-03 9:35 ` Paul E. McKenney 1 sibling, 1 reply; 54+ messages in thread From: Boqun Feng @ 2019-06-03 6:42 UTC (permalink / raw) To: Herbert Xu Cc: Paul E. McKenney, Linus Torvalds, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller [-- Attachment #1: Type: text/plain, Size: 4707 bytes --] On Mon, Jun 03, 2019 at 01:26:26PM +0800, Herbert Xu wrote: > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > > > 1. These guarantees are of full memory barriers, -not- compiler > > barriers. > > What I'm saying is that wherever they are, they must come with > compiler barriers. I'm not aware of any synchronisation mechanism > in the kernel that gives a memory barrier without a compiler barrier. > > > 2. These rules don't say exactly where these full memory barriers > > go. SRCU is at one extreme, placing those full barriers in > > srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU > > at the other, placing these barriers entirely within the callback > > queueing/invocation, grace-period computation, and the scheduler. > > Preemptible Tree RCU is in the middle, with rcu_read_unlock() > > sometimes including a full memory barrier, but other times with > > the full memory barrier being confined as it is with !PREEMPT > > Tree RCU. > > The rules do say that the (full) memory barrier must precede any > RCU read-side that occur after the synchronize_rcu and after the > end of any RCU read-side that occur before the synchronize_rcu. > > All I'm arguing is that wherever that full mb is, as long as it > also carries with it a barrier() (which it must do if it's done > using an existing kernel mb/locking primitive), then we're fine. 
> > > Interleaving and inserting full memory barriers as per the rules above: > > > > CPU1: WRITE_ONCE(a, 1) > > CPU1: synchronize_rcu > > /* Could put a full memory barrier here, but it wouldn't help. */ > > CPU1: smp_mb(); > CPU2: smp_mb(); > > Let's put them in because I think they are critical. smp_mb() also > carries with it a barrier(). > > > CPU2: rcu_read_lock(); > > CPU1: b = 2; > > CPU2: if (READ_ONCE(a) == 0) > > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */ > > CPU2: b = 1; > > CPU2: rcu_read_unlock > > > > In fact, CPU2's load from b might be moved up to race with CPU1's store, > > which (I believe) is why the model complains in this case. > > Let's put aside my doubt over how we're even allowing a compiler > to turn > > b = 1 > > into > > if (b != 1) > b = 1 > > Since you seem to be assuming that (a == 0) is true in this case I think Paul's example is assuming (a == 0) is false, and maybe speculative writes (by compilers) need to be added into consideration? Please consider the following case (I add a few smp_mb()s); the case may be a little bit crazy, you have been warned ;-) CPU1: WRITE_ONCE(a, 1) CPU1: synchronize_rcu called CPU1: smp_mb(); /* let's assume there is one here */ CPU2: rcu_read_lock(); CPU2: smp_mb(); /* let's assume there is one here */ /* "if (b != 1) b = 1" reordered */ CPU2: r0 = b; /* if (b != 1) reordered here, r0 == 0 */ CPU2: if (r0 != 1) /* true */ CPU2: b = 1; /* b == 1 now, this is a speculative write by the compiler */ CPU1: b = 2; /* b == 2 */ CPU2: if (READ_ONCE(a) == 0) /* false */ CPU2: ... CPU2: else /* undo the speculative write */ CPU2: b = r0; /* b == 0 */ CPU2: smp_mb(); CPU2: rcu_read_unlock(); I know it is too crazy for us to think a compiler would do this, but it might be the reason why the model complains about this. Paul, did I get this right? Or did you mean something else? 
Regards, Boqun > (as the assignment b = 1 is carried out), then because of the > presence of the full memory barrier, the RCU read-side section > must have started prior to the synchronize_rcu. This means that > synchronize_rcu is not allowed to return until at least the end > of the grace period, or at least until the end of rcu_read_unlock. > > So it actually should be: > > CPU1: WRITE_ONCE(a, 1) > CPU1: synchronize_rcu called > /* Could put a full memory barrier here, but it wouldn't help. */ > > CPU1: smp_mb(); > CPU2: smp_mb(); > > CPU2: grace period starts > ...time passes... > CPU2: rcu_read_lock(); > CPU2: if (READ_ONCE(a) == 0) > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */ > CPU2: b = 1; > CPU2: rcu_read_unlock > ...time passes... > CPU2: grace period ends > > /* This full memory barrier is also guaranteed by RCU. */ > CPU2: smp_mb(); > > CPU1 synchronize_rcu returns > CPU1: b = 2; > > Cheers, > -- > Email: Herbert Xu <herbert@gondor.apana.org.au> > Home Page: http://gondor.apana.org.au/~herbert/ > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-03 6:42 ` Boqun Feng @ 2019-06-03 20:03 ` Paul E. McKenney 2019-06-04 14:44 ` Alan Stern 0 siblings, 1 reply; 54+ messages in thread From: Paul E. McKenney @ 2019-06-03 20:03 UTC (permalink / raw) To: Boqun Feng Cc: Herbert Xu, Linus Torvalds, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, stern On Mon, Jun 03, 2019 at 02:42:00PM +0800, Boqun Feng wrote: > On Mon, Jun 03, 2019 at 01:26:26PM +0800, Herbert Xu wrote: > > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > > > > > 1. These guarantees are of full memory barriers, -not- compiler > > > barriers. > > > > What I'm saying is that wherever they are, they must come with > > compiler barriers. I'm not aware of any synchronisation mechanism > > in the kernel that gives a memory barrier without a compiler barrier. > > > > > 2. These rules don't say exactly where these full memory barriers > > > go. SRCU is at one extreme, placing those full barriers in > > > srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU > > > at the other, placing these barriers entirely within the callback > > > queueing/invocation, grace-period computation, and the scheduler. > > > Preemptible Tree RCU is in the middle, with rcu_read_unlock() > > > sometimes including a full memory barrier, but other times with > > > the full memory barrier being confined as it is with !PREEMPT > > > Tree RCU. > > > > The rules do say that the (full) memory barrier must precede any > > RCU read-side that occur after the synchronize_rcu and after the > > end of any RCU read-side that occur before the synchronize_rcu. > > > > All I'm arguing is that wherever that full mb is, as long as it > > also carries with it a barrier() (which it must do if it's done > > using an existing kernel mb/locking primitive), then we're fine. 
> > > > > Interleaving and inserting full memory barriers as per the rules above: > > > > > > CPU1: WRITE_ONCE(a, 1) > > > CPU1: synchronize_rcu > > > /* Could put a full memory barrier here, but it wouldn't help. */ > > > > CPU1: smp_mb(); > > CPU2: smp_mb(); > > > > Let's put them in because I think they are critical. smp_mb() also > > carries with it a barrier(). > > > > > CPU2: rcu_read_lock(); > > > CPU1: b = 2; > > > CPU2: if (READ_ONCE(a) == 0) > > > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */ > > > CPU2: b = 1; > > > CPU2: rcu_read_unlock > > > > > > In fact, CPU2's load from b might be moved up to race with CPU1's store, > > > which (I believe) is why the model complains in this case. > > > > Let's put aside my doubt over how we're even allowing a compiler > > to turn > > > > b = 1 > > > > into > > > > if (b != 1) > > b = 1 > > > > Since you seem to be assuming that (a == 0) is true in this case > > I think Paul's example assuming (a == 0) is false, and maybe Yes, otherwise, P0()'s write to "b" cannot have happened. > speculative writes (by compilers) needs to added into consideration? I would instead call it the compiler eliminating needless writes by inventing reads -- if the variable already has the correct value, no write happens. So no compiler speculation. However, it is difficult to create a solid defensible example. Yes, from LKMM's viewpoint, the weakly reordered invented read from "b" can be concurrent with P0()'s write to "b", but in that case the value loaded would have to manage to be equal to 1 for anything bad to happen. This does feel wrong to me, but again, it is difficult to create a solid defensible example. 
> Please consider the following case (I add a few smp_mb()s), the case may > be a little bit crasy, you have been warned ;-) > > CPU1: WRITE_ONCE(a, 1) > CPU1: synchronize_rcu called > > CPU1: smp_mb(); /* let assume there is one here */ > > CPU2: rcu_read_lock(); > CPU2: smp_mb(); /* let assume there is one here */ > > /* "if (b != 1) b = 1" reordered */ > CPU2: r0 = b; /* if (b != 1) reordered here, r0 == 0 */ > CPU2: if (r0 != 1) /* true */ > CPU2: b = 1; /* b == 1 now, this is a speculative write > by compiler > */ > > CPU1: b = 2; /* b == 2 */ > > CPU2: if (READ_ONCE(a) == 0) /* false */ > CPU2: ... > CPU2 else /* undo the speculative write */ > CPU2: b = r0; /* b == 0 */ > > CPU2: smp_mb(); > CPU2: read_read_unlock(); > > I know this is too crasy for us to think a compiler like this, but this > might be the reason why the model complain about this. > > Paul, did I get this right? Or you mean something else? Mostly there, except that I am not yet desperate enough to appeal to compilers speculating stores. ;-) Thanx, Paul > Regards, > Boqun > > > > > (as the assignment b = 1 is carried out), then because of the > > presence of the full memory barrier, the RCU read-side section > > must have started prior to the synchronize_rcu. This means that > > synchronize_rcu is not allowed to return until at least the end > > of the grace period, or at least until the end of rcu_read_unlock. > > > > So it actually should be: > > > > CPU1: WRITE_ONCE(a, 1) > > CPU1: synchronize_rcu called > > /* Could put a full memory barrier here, but it wouldn't help. */ > > > > CPU1: smp_mb(); > > CPU2: smp_mb(); > > > > CPU2: grace period starts > > ...time passes... > > CPU2: rcu_read_lock(); > > CPU2: if (READ_ONCE(a) == 0) > > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */ > > CPU2: b = 1; > > CPU2: rcu_read_unlock > > ...time passes... > > CPU2: grace period ends > > > > /* This full memory barrier is also guaranteed by RCU. 
*/ > > CPU2: smp_mb(); > > > > CPU1 synchronize_rcu returns > > CPU1: b = 2; > > > > Cheers, > > -- > > Email: Herbert Xu <herbert@gondor.apana.org.au> > > Home Page: http://gondor.apana.org.au/~herbert/ > > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-03 20:03 ` Paul E. McKenney @ 2019-06-04 14:44 ` Alan Stern 2019-06-04 16:04 ` Linus Torvalds ` (2 more replies) 0 siblings, 3 replies; 54+ messages in thread From: Alan Stern @ 2019-06-04 14:44 UTC (permalink / raw) To: Paul E. McKenney Cc: Boqun Feng, Herbert Xu, Linus Torvalds, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Mon, 3 Jun 2019, Paul E. McKenney wrote: > On Mon, Jun 03, 2019 at 02:42:00PM +0800, Boqun Feng wrote: > > On Mon, Jun 03, 2019 at 01:26:26PM +0800, Herbert Xu wrote: > > > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > > > > > > > 1. These guarantees are of full memory barriers, -not- compiler > > > > barriers. > > > > > > What I'm saying is that wherever they are, they must come with > > > compiler barriers. I'm not aware of any synchronisation mechanism > > > in the kernel that gives a memory barrier without a compiler barrier. > > > > > > > 2. These rules don't say exactly where these full memory barriers > > > > go. SRCU is at one extreme, placing those full barriers in > > > > srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU > > > > at the other, placing these barriers entirely within the callback > > > > queueing/invocation, grace-period computation, and the scheduler. > > > > Preemptible Tree RCU is in the middle, with rcu_read_unlock() > > > > sometimes including a full memory barrier, but other times with > > > > the full memory barrier being confined as it is with !PREEMPT > > > > Tree RCU. > > > > > > The rules do say that the (full) memory barrier must precede any > > > RCU read-side that occur after the synchronize_rcu and after the > > > end of any RCU read-side that occur before the synchronize_rcu. 
> > > > > > All I'm arguing is that wherever that full mb is, as long as it > > > also carries with it a barrier() (which it must do if it's done > > > using an existing kernel mb/locking primitive), then we're fine. > > > > > > > Interleaving and inserting full memory barriers as per the rules above: > > > > > > > > CPU1: WRITE_ONCE(a, 1) > > > > CPU1: synchronize_rcu > > > > /* Could put a full memory barrier here, but it wouldn't help. */ > > > > > > CPU1: smp_mb(); > > > CPU2: smp_mb(); > > > > > > Let's put them in because I think they are critical. smp_mb() also > > > carries with it a barrier(). > > > > > > > CPU2: rcu_read_lock(); > > > > CPU1: b = 2; > > > > CPU2: if (READ_ONCE(a) == 0) > > > > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */ > > > > CPU2: b = 1; > > > > CPU2: rcu_read_unlock > > > > > > > > In fact, CPU2's load from b might be moved up to race with CPU1's store, > > > > which (I believe) is why the model complains in this case. > > > > > > Let's put aside my doubt over how we're even allowing a compiler > > > to turn > > > > > > b = 1 > > > > > > into > > > > > > if (b != 1) > > > b = 1 Even if you don't think the compiler will ever do this, the C standard gives compilers the right to invent read accesses if a plain (i.e., non-atomic and non-volatile) write is present. The Linux Kernel Memory Model has to assume that compilers will sometimes do this, even if it doesn't take the exact form of checking a variable's value before writing to it. (Incidentally, regardless of whether the compiler will ever do this, I have seen examples in the kernel where people did exactly this manually, in order to avoid dirtying a cache line unnecessarily.) > > > Since you seem to be assuming that (a == 0) is true in this case > > > > I think Paul's example assuming (a == 0) is false, and maybe > > Yes, otherwise, P0()'s write to "b" cannot have happened. > > > speculative writes (by compilers) needs to added into consideration? 
On the other hand, the C standard does not allow compilers to add speculative writes. The LKMM assumes they will never occur. > I would instead call it the compiler eliminating needless writes > by inventing reads -- if the variable already has the correct value, > no write happens. So no compiler speculation. > > However, it is difficult to create a solid defensible example. Yes, > from LKMM's viewpoint, the weakly reordered invented read from "b" > can be concurrent with P0()'s write to "b", but in that case the value > loaded would have to manage to be equal to 1 for anything bad to happen. > This does feel wrong to me, but again, it is difficult to create a solid > defensible example. > > > Please consider the following case (I add a few smp_mb()s), the case may > > be a little bit crasy, you have been warned ;-) > > > > CPU1: WRITE_ONCE(a, 1) > > CPU1: synchronize_rcu called > > > > CPU1: smp_mb(); /* let assume there is one here */ > > > > CPU2: rcu_read_lock(); > > CPU2: smp_mb(); /* let assume there is one here */ > > > > /* "if (b != 1) b = 1" reordered */ > > CPU2: r0 = b; /* if (b != 1) reordered here, r0 == 0 */ > > CPU2: if (r0 != 1) /* true */ > > CPU2: b = 1; /* b == 1 now, this is a speculative write > > by compiler > > */ > > > > CPU1: b = 2; /* b == 2 */ > > > > CPU2: if (READ_ONCE(a) == 0) /* false */ > > CPU2: ... > > CPU2 else /* undo the speculative write */ > > CPU2: b = r0; /* b == 0 */ > > > > CPU2: smp_mb(); > > CPU2: read_read_unlock(); > > > > I know this is too crasy for us to think a compiler like this, but this > > might be the reason why the model complain about this. > > > > Paul, did I get this right? Or you mean something else? > > Mostly there, except that I am not yet desperate enough to appeal to > compilers speculating stores. ;-) This example really does point out a weakness in the LKMM's handling of data races. 
Herbert's litmus test is a great starting point: C xu {} P0(int *a, int *b) { WRITE_ONCE(*a, 1); synchronize_rcu(); *b = 2; } P1(int *a, int *b) { rcu_read_lock(); if (READ_ONCE(*a) == 0) *b = 1; rcu_read_unlock(); } exists (~b=2) Currently the LKMM says the test is allowed and there is a data race, but this answer clearly is wrong since it would violate the RCU guarantee. The problem is that the LKMM currently requires all ordering/visibility of plain accesses to be mediated by marked accesses. But in this case, the visibility is mediated by RCU. Technically, we need to add a relation like ([M] ; po ; rcu-fence ; po ; [M]) into the definitions of ww-vis, wr-vis, and rw-xbstar. Doing so changes the litmus test's result to "not allowed" and no data race. However, I'm not certain that this single change is the entire fix; more thought is needed. Alan ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-04 14:44 ` Alan Stern @ 2019-06-04 16:04 ` Linus Torvalds 2019-06-04 17:00 ` Alan Stern 2019-06-07 14:09 ` inet: frags: Turn fqdir->dead into an int for old Alphas Herbert Xu 2019-06-06 4:51 ` rcu_read_lock lost its compiler barrier Herbert Xu 2019-06-06 8:16 ` Andrea Parri 2 siblings, 2 replies; 54+ messages in thread From: Linus Torvalds @ 2019-06-04 16:04 UTC (permalink / raw) To: Alan Stern Cc: Paul E. McKenney, Boqun Feng, Herbert Xu, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Tue, Jun 4, 2019 at 7:44 AM Alan Stern <stern@rowland.harvard.edu> wrote: > > Even if you don't think the compiler will ever do this, the C standard > gives compilers the right to invent read accesses if a plain (i.e., > non-atomic and non-volatile) write is present. Note that for the kernel, it's not like we go strictly by what the standard says anyway. The reason for the "read for write" thing is that obviously you sometimes have broken architectures (*cough*alpha*cough*) that need to do some common writes as read-mask-modify-write accesses, and for bitfields you obviously *always* have that issue. In general, a sane compiler (as opposed to "we just read the standard") had better not add random reads, because people might have some mappings be write-only when the hardware supports it. And we do generally expect sanity from our compilers, and will add compiler flags to disable bad behavior if required - even if said behavior would be "technically allowed by the standard". > (Incidentally, regardless of whether the compiler will ever do this, I > have seen examples in the kernel where people did exactly this > manually, in order to avoid dirtying a cache line unnecessarily.) I really hope and expect that this is not something that the compiler ever does. 
If a compiler turns "x = y" into "if (x != y) x = y" (like we do manually at times, as you say), that compiler is garbage that we cannot use for the kernel. It would break things like "smp_wmb()" ordering guarantees, I'm pretty sure. And as always, if we're doing actively stupid things, and the compiler then turns our stupid code into something we don't expect, the corollary is that then it's on _us_. IOW, if we do if (x != 1) { ... } x = 1; and the compiler goes "oh, we already checked that 'x == 1'" and moves that "unconditional" 'x = 1' into the conditional section like if (x != 1) { .. x = 1; } then that's not something we can then complain about. So our expectation is that the compiler is _sane_, not that it's some "C as assembler". Adding spurious reads is not ok. But if we already had reads in the code and the compiler combines them with other ops, that's on us. End result: in general, we do expect that the compiler turns a regular assignment into a single plain write when that's what the code does, and does not add extra logic over and beyond that. In fact, the alpha port was always subtly buggy exactly because of the "byte write turns into a read-and-masked-write", even if I don't think anybody ever noticed (we did fix cases where people _did_ notice, though, and we might still have some cases where we use 'int' for booleans because of alpha issues.). So I don't technically disagree with anything you say, I just wanted to point out that as far as the kernel is concerned, we do have higher quality expectations from the compiler than just "technically valid according to the C standard". Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-04 16:04 ` Linus Torvalds @ 2019-06-04 17:00 ` Alan Stern 2019-06-04 17:29 ` Linus Torvalds 2019-06-07 14:09 ` inet: frags: Turn fqdir->dead into an int for old Alphas Herbert Xu 1 sibling, 1 reply; 54+ messages in thread From: Alan Stern @ 2019-06-04 17:00 UTC (permalink / raw) To: Linus Torvalds Cc: Paul E. McKenney, Boqun Feng, Herbert Xu, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Tue, 4 Jun 2019, Linus Torvalds wrote: > So I don't technically disagree with anything you say, That's good to know! > I just wanted > to point out that as far as the kernel is concerned, we do have higher > quality expectations from the compiler than just "technically valid > according to the C standard". Which suggests asking whether these higher expectations should be reflected in the Linux Kernel Memory Model. So far we have largely avoided doing that sort of thing, although there are a few exceptions. (For example, we assume the compiler does not destroy address dependencies from volatile reads -- but we also warn that this assumption may fail if the programmer does not follow some rules described in one of Paul's documentation files.) Alan ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-04 17:00 ` Alan Stern @ 2019-06-04 17:29 ` Linus Torvalds 0 siblings, 0 replies; 54+ messages in thread From: Linus Torvalds @ 2019-06-04 17:29 UTC (permalink / raw) To: Alan Stern Cc: Paul E. McKenney, Boqun Feng, Herbert Xu, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Tue, Jun 4, 2019 at 10:00 AM Alan Stern <stern@rowland.harvard.edu> wrote: > > Which suggests asking whether these higher expectations should be > reflected in the Linux Kernel Memory Model. So far we have largely > avoided doing that sort of thing, although there are a few exceptions. I think they might be hard to describe - which in turn may be why the standard leaves it open and only describes the simple cases. Exactly that "we expect an assignment to be done as a single write" is probably a good example. Yes, we do have that expectation for the simple cases. But it's not an absolute rule, obviously, because it's clearly violated by bitfield writes, and similarly it's obviously violated for data that is bigger than the word size (ie a "u64" assignment is obviously not a single write when you're on a 32-bit target). And while those cases are static and could be described separately, the "oh, and we have _other_ accesses to the same variable nearby, and we expect that the compiler might take those into account unless we explicitly use WRITE_ONCE()" things make for much harder to describe issues. Doing writes with speculative values is clearly bogus and garbage (although compiler writers _have_ tried to do that too: "we then overwrite it with the right value later, so it's ok"), and thankfully even user space people complain about that due to threading and signals. But eliding writes entirely by combining them with a later one is clearly ok - and eliding them when there was an explicit earlier read and value comparison (like in my example) sounds reasonable to me too. 
Yet silently adding the elision that wasn't visible in the source due to other accesses would be bad. How do you say "sane and reasonable compiler" in a spec? You usually don't - or you make the rules so odd and complex that nobody really understands them any more, and make it all moot ;) Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* inet: frags: Turn fqdir->dead into an int for old Alphas 2019-06-04 16:04 ` Linus Torvalds 2019-06-04 17:00 ` Alan Stern @ 2019-06-07 14:09 ` Herbert Xu 2019-06-07 15:26 ` Eric Dumazet 1 sibling, 1 reply; 54+ messages in thread From: Herbert Xu @ 2019-06-07 14:09 UTC (permalink / raw) To: Linus Torvalds Cc: Alan Stern, Paul E. McKenney, Boqun Feng, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Tue, Jun 04, 2019 at 09:04:55AM -0700, Linus Torvalds wrote: > > In fact, the alpha port was always subtly buggy exactly because of the > "byte write turns into a read-and-masked-write", even if I don't think > anybody ever noticed (we did fix cases where people _did_ notice, > though, and we might still have some cases where we use 'int' for > booleans because of alpha issues.). This is in fact a real bug in the code in question that no amount of READ_ONCE/WRITE_ONCE would have caught. The field fqdir->dead is declared as boolean so writing to it is not atomic (on old Alphas). I don't think it currently matters because padding would ensure that it is in fact 64 bits long. However, should someone add another char/bool/bitfield in this struct in future it could become an issue. So let's fix it. ---8<-- The field fqdir->dead is meant to be written (and read) atomically. As old Alpha CPUs can't write a single byte atomically, we need at least an int for it to work. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h index e91b79ad4e4a..8c458fba74ad 100644 --- a/include/net/inet_frag.h +++ b/include/net/inet_frag.h @@ -14,7 +14,9 @@ struct fqdir { int max_dist; struct inet_frags *f; struct net *net; - bool dead; + + /* We can't use boolean because this needs atomic writes. 
*/ + int dead; struct rhashtable rhashtable ____cacheline_aligned_in_smp; diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c index 35e9784fab4e..05aa7c145817 100644 --- a/net/ipv4/inet_fragment.c +++ b/net/ipv4/inet_fragment.c @@ -193,7 +193,7 @@ void fqdir_exit(struct fqdir *fqdir) { fqdir->high_thresh = 0; /* prevent creation of new frags */ - fqdir->dead = true; + fqdir->dead = 1; /* call_rcu is supposed to provide memory barrier semantics, * separating the setting of fqdir->dead with the destruction -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: inet: frags: Turn fqdir->dead into an int for old Alphas 2019-06-07 14:09 ` inet: frags: Turn fqdir->dead into an int for old Alphas Herbert Xu @ 2019-06-07 15:26 ` Eric Dumazet 2019-06-07 15:32 ` Herbert Xu 2019-06-07 16:19 ` Linus Torvalds 0 siblings, 2 replies; 54+ messages in thread From: Eric Dumazet @ 2019-06-07 15:26 UTC (permalink / raw) To: Herbert Xu, Linus Torvalds Cc: Alan Stern, Paul E. McKenney, Boqun Feng, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On 6/7/19 7:09 AM, Herbert Xu wrote: > On Tue, Jun 04, 2019 at 09:04:55AM -0700, Linus Torvalds wrote: >> >> In fact, the alpha port was always subtly buggy exactly because of the >> "byte write turns into a read-and-masked-write", even if I don't think >> anybody ever noticed (we did fix cases where people _did_ notice, >> though, and we might still have some cases where we use 'int' for >> booleans because of alpha issues.). > > This is in fact a real bug in the code in question that no amount > of READ_ONCE/WRITE_ONCE would have caught. The field fqdir->dead is > declared as boolean so writing to it is not atomic (on old Alphas). > > I don't think it currently matters because padding would ensure > that it is in fact 64 bits long. However, should someone add another > char/bool/bitfield in this struct in future it could become an issue. > > So let's fix it. There is common knowledge among us programmers that bit fields (or bool) sharing a common 'word' need to be protected with a common lock. Converting all bit fields to plain int/long would be quite a waste of memory. In this case, fqdir_exit() is called right before the whole struct fqdir is dismantled, and the only cpu that could possibly change the thing is ourself, and we are going to start an RCU grace period. Note that first cache line in 'struct fqdir' is read-only. Only ->dead field is flipped to one at exit time. 
Your patch would send a strong signal to programmers to not even try using bit fields. Do we really want that ? > > ---8<-- > The field fqdir->dead is meant to be written (and read) atomically. > As old Alpha CPUs can't write a single byte atomically, we need at > least an int for it to work. > > Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> > > diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h > index e91b79ad4e4a..8c458fba74ad 100644 > --- a/include/net/inet_frag.h > +++ b/include/net/inet_frag.h > @@ -14,7 +14,9 @@ struct fqdir { > int max_dist; > struct inet_frags *f; > struct net *net; > - bool dead; > + > + /* We can't use boolean because this needs atomic writes. */ > + int dead; > > struct rhashtable rhashtable ____cacheline_aligned_in_smp; > > diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c > index 35e9784fab4e..05aa7c145817 100644 > --- a/net/ipv4/inet_fragment.c > +++ b/net/ipv4/inet_fragment.c > @@ -193,7 +193,7 @@ void fqdir_exit(struct fqdir *fqdir) > { > fqdir->high_thresh = 0; /* prevent creation of new frags */ > > - fqdir->dead = true; > + fqdir->dead = 1; > > /* call_rcu is supposed to provide memory barrier semantics, > * separating the setting of fqdir->dead with the destruction > ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: inet: frags: Turn fqdir->dead into an int for old Alphas 2019-06-07 15:26 ` Eric Dumazet @ 2019-06-07 15:32 ` Herbert Xu 2019-06-07 16:13 ` Eric Dumazet 2019-06-07 16:19 ` Linus Torvalds 1 sibling, 1 reply; 54+ messages in thread From: Herbert Xu @ 2019-06-07 15:32 UTC (permalink / raw) To: Eric Dumazet Cc: Linus Torvalds, Alan Stern, Paul E. McKenney, Boqun Feng, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Fri, Jun 07, 2019 at 08:26:12AM -0700, Eric Dumazet wrote: > > There is common knowledge among us programmers that bit fields > (or bool) sharing a common 'word' need to be protected > with a common lock. > > Converting all bit fields to plain int/long would be quite a waste of memory. > > In this case, fqdir_exit() is called right before the whole > struct fqdir is dismantled, and the only cpu that could possibly > change the thing is ourself, and we are going to start an RCU grace period. > > Note that first cache line in 'struct fqdir' is read-only. > Only ->dead field is flipped to one at exit time. > > Your patch would send a strong signal to programmers to not even try using > bit fields. > > Do we really want that ? If this were a bitfield then I'd think it would be safer because anybody adding a new bitfield is unlikely to try modifying both fields without locking or atomic ops. However, because this is a boolean, I can certainly see someone else coming along and adding another bool right next to it and expecting writes to them to still be atomic. As it stands, my patch has zero impact on memory usage because it's simply using existing padding. Should this become an issue in future, we can always revisit this and use a more appropriate method of addressing it. But the point is to alert future developers that this field is not an ordinary boolean. 
Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: inet: frags: Turn fqdir->dead into an int for old Alphas 2019-06-07 15:32 ` Herbert Xu @ 2019-06-07 16:13 ` Eric Dumazet 0 siblings, 0 replies; 54+ messages in thread From: Eric Dumazet @ 2019-06-07 16:13 UTC (permalink / raw) To: Herbert Xu, Eric Dumazet Cc: Linus Torvalds, Alan Stern, Paul E. McKenney, Boqun Feng, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On 6/7/19 8:32 AM, Herbert Xu wrote: > On Fri, Jun 07, 2019 at 08:26:12AM -0700, Eric Dumazet wrote: >> >> There is common knowledge among us programmers that bit fields >> (or bool) sharing a common 'word' need to be protected >> with a common lock. >> >> Converting all bit fields to plain int/long would be quite a waste of memory. >> >> In this case, fqdir_exit() is called right before the whole >> struct fqdir is dismantled, and the only cpu that could possibly >> change the thing is ourself, and we are going to start an RCU grace period. >> >> Note that first cache line in 'struct fqdir' is read-only. >> Only ->dead field is flipped to one at exit time. >> >> Your patch would send a strong signal to programmers to not even try using >> bit fields. >> >> Do we really want that ? > > If this were a bitfield then I'd think it would be safer because > anybody adding a new bitfield is unlikely to try modifying both > fields without locking or atomic ops. > > However, because this is a boolean, I can certainly see someone > else coming along and adding another bool right next to it and > expecting writes them to still be atomic. > > As it stands, my patch has zero impact on memory usage because > it's simply using existing padding. Should this become an issue > in future, we can always revisit this and use a more appropriate > method of addressing it. > > But the point is to alert future developers that this field is > not an ordinary boolean. Okay, but you added a quite redundant comment. 
/* We can't use boolean because this needs atomic writes. */ Should we add a similar comment in front of all bit-fields, or could we factorize this in a proper Documentation perhaps ? Can we just add a proper bit-field and not the comment ? unsigned int dead:1; This way, next programmer can just apply normal rules to add a new bit. Thanks ! ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: inet: frags: Turn fqdir->dead into an int for old Alphas 2019-06-07 15:26 ` Eric Dumazet 2019-06-07 15:32 ` Herbert Xu @ 2019-06-07 16:19 ` Linus Torvalds 2019-06-08 15:27 ` Paul E. McKenney 1 sibling, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2019-06-07 16:19 UTC (permalink / raw) To: Eric Dumazet Cc: Herbert Xu, Alan Stern, Paul E. McKenney, Boqun Feng, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Fri, Jun 7, 2019 at 8:26 AM Eric Dumazet <eric.dumazet@gmail.com> wrote: > > There is common knowledge among us programmers that bit fields > (or bool) sharing a common 'word' need to be protected > with a common lock. > > Converting all bit fields to plain int/long would be quite a waste of memory. Yeah, and we really don't care about alpha. So 'char' should be safe. No compiler actually turns a 'bool' in a struct into a bitfield, afaik, because you're still supposed to be able to take the address of a boolean. But on the whole, I do not believe that we should ever use 'bool' in structures anyway, because it's such a badly defined type. I think it's 'char' in practice on just about all architectures, but there really were traditional use cases where 'bool' was int. But: - we shouldn't turn them into 'int' anyway - alpha is dead, and no sane architecture will make the same mistake anyway. People learnt. - we might want to make sure 'bool' really is 'char' in practice, to double-check that the compiler doesn't do anything stupid. - bitfields obviously do need locks. 'char' does not. If there's somebody who really notices the alpha issue in PRACTICE, we can then bother to fix it. But there is approximately one user, and it's not a heavy-duty one. Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: inet: frags: Turn fqdir->dead into an int for old Alphas 2019-06-07 16:19 ` Linus Torvalds @ 2019-06-08 15:27 ` Paul E. McKenney 2019-06-08 17:42 ` Linus Torvalds 0 siblings, 1 reply; 54+ messages in thread From: Paul E. McKenney @ 2019-06-08 15:27 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, Herbert Xu, Alan Stern, Boqun Feng, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Fri, Jun 07, 2019 at 09:19:42AM -0700, Linus Torvalds wrote: > On Fri, Jun 7, 2019 at 8:26 AM Eric Dumazet <eric.dumazet@gmail.com> wrote: > > > > There is common knowledge among us programmers that bit fields > > (or bool) sharing a common 'word' need to be protected > > with a common lock. > > > > Converting all bit fields to plain int/long would be quite a waste of memory. > > Yeah, and we really don't care about alpha. So 'char' should be safe. > > No compiler actually turns a 'bool' in a struct into a bitfield, > afaik, because you're still supposed to be able to take the address of > a boolean. > > But on the whole, I do not believe that we should ever use 'bool' in > structures anyway, because it's such a badly defined type. I think > it's 'char' in practice on just about all architectures, but there > really were traditional use cases where 'bool' was int. > > But: > > - we shouldn't turn them into 'int' anyway - alpha is dead, and no > sane architecture will make the same mistake anyway. People learnt. > > - we might want to make sure 'bool' really is 'char' in practice, to > double-check that fthe compiler doesn't do anything stupid. > > - bitfields obviously do need locks. 'char' does not. > > If there's somebody who really notices the alpha issue in PRACTICE, we > can then bother to fix it. But there is approximately one user, and > it's not a heavy-duty one. 
C11 and later compilers are supposed to use read-modify-write atomic operations in this sort of situation anyway because they are not supposed to introduce data races. So if this problem comes up, the fix should be in GCC rather than the Linux kernel, right? Thanx, Paul ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: inet: frags: Turn fqdir->dead into an int for old Alphas 2019-06-08 15:27 ` Paul E. McKenney @ 2019-06-08 17:42 ` Linus Torvalds 2019-06-08 17:50 ` Linus Torvalds 2019-06-08 18:14 ` Paul E. McKenney 0 siblings, 2 replies; 54+ messages in thread From: Linus Torvalds @ 2019-06-08 17:42 UTC (permalink / raw) To: Paul E. McKenney Cc: Eric Dumazet, Herbert Xu, Alan Stern, Boqun Feng, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Sat, Jun 8, 2019 at 8:32 AM Paul E. McKenney <paulmck@linux.ibm.com> wrote: > > On Fri, Jun 07, 2019 at 09:19:42AM -0700, Linus Torvalds wrote: > > > > - bitfields obviously do need locks. 'char' does not. > > > > If there's somebody who really notices the alpha issue in PRACTICE, we > > can then bother to fix it. But there is approximately one user, and > > it's not a heavy-duty one. > > C11 and later compilers are supposed to use read-modify-write atomic > operations in this sort of situation anyway because they are not supposed > to introduce data races. I don't think that's possible on any common architecture. The bitfields themselves will need locking, to serialize writes of different fields against each other. There are no atomic rmw sequences that have reasonable performance for the bitfield updates themselves. The fields *around* the bitfields had better be safe, but that's something we already depend on, and which falls under the heading of "we don't accept garbage compilers". Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: inet: frags: Turn fqdir->dead into an int for old Alphas 2019-06-08 17:42 ` Linus Torvalds @ 2019-06-08 17:50 ` Linus Torvalds 2019-06-08 18:50 ` Paul E. McKenney 2019-06-08 18:14 ` Paul E. McKenney 1 sibling, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2019-06-08 17:50 UTC (permalink / raw) To: Paul E. McKenney Cc: Eric Dumazet, Herbert Xu, Alan Stern, Boqun Feng, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Sat, Jun 8, 2019 at 10:42 AM Linus Torvalds <torvalds@linux-foundation.org> wrote: > > There are no atomic rmw sequences that have reasonable performance for > the bitfield updates themselves. Note that this is purely about the writing side. Reads of bitfield values can be (and generally _should_ be) atomic, and hopefully C11 means that you wouldn't see intermediate values. But I'm not convinced about that either: one natural way to update a bitfield is to first do the masking, and then do the insertion of new bits, so a bitfield assignment very easily exposes non-real values to a concurrent read on another CPU. What I think C11 is supposed to protect is from compilers doing horribly bad things, and accessing bitfields with bigger types than the field itself, ie when you have struct { char c; int field1:5; }; then a write to "field1" had better not touch "char c" as part of the rmw operation, because that would indeed introduce a data-race with a completely independent field that might have completely independent locking rules. But struct { int c:8; int field1:5; }; would not sanely have the same guarantees, even if the layout in memory might be identical. Once you have bitfields next to each other, and use a base type that means they can be combined together, they can't be sanely modified without locking. (And I don't know if C11 took up the "base type of the bitfield" thing. 
Maybe you still need to use the ":0" thing to force alignment, and maybe the C standards people still haven't made the underlying type be meaningful other than for sign handling). Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: inet: frags: Turn fqdir->dead into an int for old Alphas 2019-06-08 17:50 ` Linus Torvalds @ 2019-06-08 18:50 ` Paul E. McKenney 0 siblings, 0 replies; 54+ messages in thread From: Paul E. McKenney @ 2019-06-08 18:50 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, Herbert Xu, Alan Stern, Boqun Feng, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Sat, Jun 08, 2019 at 10:50:51AM -0700, Linus Torvalds wrote: > On Sat, Jun 8, 2019 at 10:42 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > There are no atomic rmw sequences that have reasonable performance for > > the bitfield updates themselves. > > Note that this is purely about the writing side. Reads of bitfield > values can be (and generally _should_ be) atomic, and hopefully C11 > means that you wouldn't see intermediate values. > > But I'm not convinced about that either: one natural way to update a > bitfield is to first do the masking, and then do the insertion of new > bits, so a bitfield assignment very easily exposes non-real values to > a concurrent read on another CPU. Agreed on the "not convinced" part (though perhaps most implementations would handle concurrent reads and writes involving different fields of the same bitfield). And the C standard does not guarantee this, because data races are defined in terms of memory locations. So as far as the C standard is concerned, if there are two concurrent accesses to fields within a bitfield that are not separated by ":0", there is a data race and so the compiler can do whatever it wants. But do we really care about this case? 
> What I think C11 is supposed to protect is from compilers doing > horribly bad things, and accessing bitfields with bigger types than > the field itself, ie when you have > > struct { > char c; > int field1:5; > }; > > then a write to "field1" had better not touch "char c" as part of the > rmw operation, because that would indeed introduce a data-race with a > completely independent field that might have completely independent > locking rules. > > But > > struct { > int c:8; > int field1:5; > }; > > would not sanely have the same guarantees, even if the layout in > memory might be identical. Once you have bitfields next to each other, > and use a base type that means they can be combined together, they > can't be sanely modified without locking. > > (And I don't know if C11 took up the "base type of the bitfield" > thing. Maybe you still need to use the ":0" thing to force alignment, > and maybe the C standards people still haven't made the underlying > type be meaningful other than for sign handling). The C standard draft (n2310) gives similar examples: EXAMPLE A structure declared as struct { char a; int b:5, c:11,:0, d:8; struct { int ee:8; } e; } contains four separate memory locations: The member a, and bit-fields d and e.ee are each separate memory locations, and can be modified concurrently without interfering with each other. The bit-fields b and c together constitute the fourth memory location. The bit-fields b and c cannot be concurrently modified, but b and a, for example, can be. So yes, ":0" still forces alignment to the next storage unit. And it can be used to allow concurrent accesses to fields within a bitfield, but only when those two fields are separated by ":0". On the underlying type, according to J.3.9 of the current C working draft, the following are implementation-specified behavior: - Whether a "plain" int bit-field is treated as a signed int bit-field or as an unsigned int bit-field (6.7.2, 6.7.2.1). 
- Whether atomic types are permitted for bit-fields (6.7.2.1). This last is strange because you are not allowed to take the address of a bit field, and the various operations on atomic types take addresses. Search me! Thanx, Paul ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: inet: frags: Turn fqdir->dead into an int for old Alphas 2019-06-08 17:42 ` Linus Torvalds 2019-06-08 17:50 ` Linus Torvalds @ 2019-06-08 18:14 ` Paul E. McKenney 1 sibling, 0 replies; 54+ messages in thread From: Paul E. McKenney @ 2019-06-08 18:14 UTC (permalink / raw) To: Linus Torvalds Cc: Eric Dumazet, Herbert Xu, Alan Stern, Boqun Feng, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Sat, Jun 08, 2019 at 10:42:41AM -0700, Linus Torvalds wrote: > On Sat, Jun 8, 2019 at 8:32 AM Paul E. McKenney <paulmck@linux.ibm.com> wrote: > > > > On Fri, Jun 07, 2019 at 09:19:42AM -0700, Linus Torvalds wrote: > > > > > > - bitfields obviously do need locks. 'char' does not. > > > > > > If there's somebody who really notices the alpha issue in PRACTICE, we > > > can then bother to fix it. But there is approximately one user, and > > > it's not a heavy-duty one. > > > > C11 and later compilers are supposed to use read-modify-write atomic > > operations in this sort of situation anyway because they are not supposed > > to introduce data races. Apologies, I should have explicitly stated that I was talking about char stores, not bitfield stores. And last I checked, the C11 standard's prohibition against data races did not extend to individual fields within a bitfield. So, yes, for bitfields, the programmer must use a lock or similar if it is necessary for updates to fields within a bitfield to be atomic. > I don't think that's possible on any common architecture. The > bitfields themselves will need locking, to serialize writes of > different fields against each other. Yes, and again the C standard doesn't make any atomicity guarantees regarding storing to different fields within a bitfield. The compiler is free to assume that nothing else is happening anywhere in the bitfield when storing to a field within that bitfield. 
Which gets back to your "bitfields obviously do need locks", and it is of course the developer (not the compiler) who must supply those locks. Plus a given lock must cover the entire bitfield -- having one lock for half the fields within a given bitfield and another lock for the other half will break. Switching from bitfields to char, the C standard -does- require that storing to one char must avoid even momentary corruption of adjacent char, so given an old Alpha the compiler would need to use something like an LL/SC loop. If it fails to do so, that compiler is failing to comply with the standard. > There are no atomic rmw sequences that have reasonable performance for > the bitfield updates themselves. Agreed, in the general case. In a few specific special cases, we do sometimes hand-craft bitfields using shifts and masks, and sometimes we use atomic RMW operations to update them. I suppose we could use unions as an alternative, but it is not clear to me that this would help anything. > The fields *around* the bitfields had better be safe, but that's > something we already depend on, and which falls under the heading of > "we don't accept garbage compilers". And the C standard does require the compiler to make that guarantee, so for once the standard is even on our side. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-04 14:44 ` Alan Stern 2019-06-04 16:04 ` Linus Torvalds @ 2019-06-06 4:51 ` Herbert Xu 2019-06-06 6:05 ` Paul E. McKenney 2019-06-06 8:16 ` Andrea Parri 2 siblings, 1 reply; 54+ messages in thread From: Herbert Xu @ 2019-06-06 4:51 UTC (permalink / raw) To: Alan Stern Cc: Paul E. McKenney, Boqun Feng, Linus Torvalds, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Tue, Jun 04, 2019 at 10:44:18AM -0400, Alan Stern wrote: > > Currently the LKMM says the test is allowed and there is a data race, > but this answer clearly is wrong since it would violate the RCU > guarantee. Thank you! This is what I tried to say all along in this thread but you expressed it in a much better way :) -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-06 4:51 ` rcu_read_lock lost its compiler barrier Herbert Xu @ 2019-06-06 6:05 ` Paul E. McKenney 2019-06-06 6:14 ` Herbert Xu 0 siblings, 1 reply; 54+ messages in thread From: Paul E. McKenney @ 2019-06-06 6:05 UTC (permalink / raw) To: Herbert Xu Cc: Alan Stern, Boqun Feng, Linus Torvalds, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Thu, Jun 06, 2019 at 12:51:09PM +0800, Herbert Xu wrote: > On Tue, Jun 04, 2019 at 10:44:18AM -0400, Alan Stern wrote: > > > > Currently the LKMM says the test is allowed and there is a data race, > > but this answer clearly is wrong since it would violate the RCU > > guarantee. > > Thank you! This is what I tried to say all along in this thread > but you expressed it in a much better way :) In case you were wondering, the reason that I was giving you such a hard time was that from what I could see, you were pushing for no {READ,WRITE}_ONCE() at all. ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-06 6:05 ` Paul E. McKenney @ 2019-06-06 6:14 ` Herbert Xu 2019-06-06 9:06 ` Paul E. McKenney 0 siblings, 1 reply; 54+ messages in thread From: Herbert Xu @ 2019-06-06 6:14 UTC (permalink / raw) To: Paul E. McKenney Cc: Alan Stern, Boqun Feng, Linus Torvalds, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Wed, Jun 05, 2019 at 11:05:11PM -0700, Paul E. McKenney wrote: > > In case you were wondering, the reason that I was giving you such > a hard time was that from what I could see, you were pushing for no > {READ,WRITE}_ONCE() at all. ;-) Hmm, that's exactly what it should be in net/ipv4/inet_fragment.c. We don't need the READ_ONCE/WRITE_ONCE (or volatile marking) at all. Even if the compiler dices and slices the reads/writes of "a" into a thousand pieces, it should still work if the RCU primitives are worth their salt. But I do concede that in the general RCU case you must have the READ_ONCE/WRITE_ONCE calls for rcu_dereference/rcu_assign_pointer. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-06 6:14 ` Herbert Xu @ 2019-06-06 9:06 ` Paul E. McKenney 2019-06-06 9:28 ` Herbert Xu 0 siblings, 1 reply; 54+ messages in thread From: Paul E. McKenney @ 2019-06-06 9:06 UTC (permalink / raw) To: Herbert Xu Cc: Alan Stern, Boqun Feng, Linus Torvalds, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Thu, Jun 06, 2019 at 02:14:38PM +0800, Herbert Xu wrote: > On Wed, Jun 05, 2019 at 11:05:11PM -0700, Paul E. McKenney wrote: > > > > In case you were wondering, the reason that I was giving you such > > a hard time was that from what I could see, you were pushing for no > > {READ,WRITE}_ONCE() at all. ;-) > > Hmm, that's exactly what it should be in net/ipv4/inet_fragment.c. > We don't need the READ_ONCE/WRITE_ONCE (or volatile marking) at > all. Even if the compiler dices and slices the reads/writes of > "a" into a thousand pieces, it should still work if the RCU > primitives are worth their salt. OK, so I take it that there is additional synchronization in there somewhere that is not captured in your simplified example code? Or is your point instead that given the initial value of "a" being zero and the value stored to "a" being one, there is no way that any possible load and store tearing (your slicing and dicing) could possibly mess up the test of the value loaded from "a"? > But I do concede that in the general RCU case you must have the > READ_ONCE/WRITE_ONCE calls for rcu_dereference/rcu_assign_pointer. OK, good that we are in agreement on this part, at least! ;-) Thanx, Paul ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-06 9:06 ` Paul E. McKenney @ 2019-06-06 9:28 ` Herbert Xu 2019-06-06 10:58 ` Paul E. McKenney 0 siblings, 1 reply; 54+ messages in thread From: Herbert Xu @ 2019-06-06 9:28 UTC (permalink / raw) To: Paul E. McKenney Cc: Alan Stern, Boqun Feng, Linus Torvalds, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Thu, Jun 06, 2019 at 02:06:19AM -0700, Paul E. McKenney wrote: > > Or is your point instead that given the initial value of "a" being > zero and the value stored to "a" being one, there is no way that > any possible load and store tearing (your slicing and dicing) could > possibly mess up the test of the value loaded from "a"? Exactly. If you can dream up of a scenario where the compiler can get this wrong I'm all ears. > > But I do concede that in the general RCU case you must have the > > READ_ONCE/WRITE_ONCE calls for rcu_dereference/rcu_assign_pointer. > > OK, good that we are in agreement on this part, at least! ;-) Well only because we're allowing crazy compilers that can turn a simple word-aligned word assignment (a = b) into two stores. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-06 9:28 ` Herbert Xu @ 2019-06-06 10:58 ` Paul E. McKenney 2019-06-06 13:38 ` Herbert Xu 0 siblings, 1 reply; 54+ messages in thread From: Paul E. McKenney @ 2019-06-06 10:58 UTC (permalink / raw) To: Herbert Xu Cc: Alan Stern, Boqun Feng, Linus Torvalds, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Thu, Jun 06, 2019 at 05:28:55PM +0800, Herbert Xu wrote: > On Thu, Jun 06, 2019 at 02:06:19AM -0700, Paul E. McKenney wrote: > > > > Or is your point instead that given the initial value of "a" being > > zero and the value stored to "a" being one, there is no way that > > any possible load and store tearing (your slicing and dicing) could > > possibly mess up the test of the value loaded from "a"? > > Exactly. If you can dream up of a scenario where the compiler can > get this wrong I'm all ears. I believe that this is safe in practice, as long as you exercise constant vigilance. (OK, OK, I might be overdramatizing...) I cannot immediately think of a way that the compiler could get this wrong even in theory, but similar code sequences can be messed up. The reason for this is that in theory, the compiler could use the stored-to location as temporary storage, like this: a = whatever; // Compiler uses "a" as a temporary do_something(); whatever = a; a = 1; // Intended store The compiler is allowed to do this (again, in theory and according to a strict interpretation of the standard) because you haven't told it that anything else is paying attention to variable "a". As a result, the compiler is within its rights to use "a" as temporary storage immediately prior to any plain C-language store to "a". In practice, I don't know of any compilers that actually do this, nor have I heard anyone suggesting that they might soon actually do this. 
And even if they could, your example would still work because your example doesn't care about anything other than zero and non-zero, so wouldn't get confused by the compiler storing a temporary value of 42 or whatever. > > > But I do concede that in the general RCU case you must have the > > > READ_ONCE/WRITE_ONCE calls for rcu_dereference/rcu_assign_pointer. > > > > OK, good that we are in agreement on this part, at least! ;-) > > Well only because we're allowing crazy compilers that can turn > a simple word-aligned word assignment (a = b) into two stores. In my experience, the insanity of compilers increases with time, but yes. Thanx, Paul ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-06 10:58 ` Paul E. McKenney @ 2019-06-06 13:38 ` Herbert Xu 2019-06-06 13:48 ` Paul E. McKenney 0 siblings, 1 reply; 54+ messages in thread From: Herbert Xu @ 2019-06-06 13:38 UTC (permalink / raw) To: Paul E. McKenney Cc: Alan Stern, Boqun Feng, Linus Torvalds, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Thu, Jun 06, 2019 at 03:58:17AM -0700, Paul E. McKenney wrote: > > I cannot immediately think of a way that the compiler could get this > wrong even in theory, but similar code sequences can be messed up. > The reason for this is that in theory, the compiler could use the > stored-to location as temporary storage, like this: > > a = whatever; // Compiler uses "a" as a temporary > do_something(); > whatever = a; > a = 1; // Intended store Well if the compiler is going to do this then surely it would continue to do this even if you used WRITE_ONCE. Remember a is not volatile, only the access of a through WRITE_ONCE is volatile. Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-06 13:38 ` Herbert Xu @ 2019-06-06 13:48 ` Paul E. McKenney 0 siblings, 0 replies; 54+ messages in thread From: Paul E. McKenney @ 2019-06-06 13:48 UTC (permalink / raw) To: Herbert Xu Cc: Alan Stern, Boqun Feng, Linus Torvalds, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Andrea Parri, Luc Maranget, Jade Alglave On Thu, Jun 06, 2019 at 09:38:24PM +0800, Herbert Xu wrote: > On Thu, Jun 06, 2019 at 03:58:17AM -0700, Paul E. McKenney wrote: > > > > I cannot immediately think of a way that the compiler could get this > > wrong even in theory, but similar code sequences can be messed up. > > The reason for this is that in theory, the compiler could use the > > stored-to location as temporary storage, like this: > > > > a = whatever; // Compiler uses "a" as a temporary > > do_something(); > > whatever = a; > > a = 1; // Intended store > > Well if the compiler is going to do this then surely it would > continue to do this even if you used WRITE_ONCE. Remember a is > not volatile, only the access of a through WRITE_ONCE is volatile. I disagree. Given a volatile store, the compiler cannot assume that the stored-to location is normal memory at that point in time, and therefore cannot assume that it is safe to invent a store to that location (as shown above). Thus far, the C++ standards committee seems on-board with this, though time will tell. http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p1382r1.pdf Thanx, Paul ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-04 14:44 ` Alan Stern 2019-06-04 16:04 ` Linus Torvalds 2019-06-06 4:51 ` rcu_read_lock lost its compiler barrier Herbert Xu @ 2019-06-06 8:16 ` Andrea Parri 2019-06-06 14:19 ` Alan Stern 2 siblings, 1 reply; 54+ messages in thread From: Andrea Parri @ 2019-06-06 8:16 UTC (permalink / raw) To: Alan Stern Cc: Paul E. McKenney, Boqun Feng, Herbert Xu, Linus Torvalds, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Luc Maranget, Jade Alglave > This example really does point out a weakness in the LKMM's handling of > data races. Herbert's litmus test is a great starting point: > > > C xu > > {} > > P0(int *a, int *b) > { > WRITE_ONCE(*a, 1); > synchronize_rcu(); > *b = 2; > } > > P1(int *a, int *b) > { > rcu_read_lock(); > if (READ_ONCE(*a) == 0) > *b = 1; > rcu_read_unlock(); > } > > exists (~b=2) > > > Currently the LKMM says the test is allowed and there is a data race, > but this answer clearly is wrong since it would violate the RCU > guarantee. > > The problem is that the LKMM currently requires all ordering/visibility > of plain accesses to be mediated by marked accesses. But in this case, > the visibility is mediated by RCU. Technically, we need to add a > relation like > > ([M] ; po ; rcu-fence ; po ; [M]) > > into the definitions of ww-vis, wr-vis, and rw-xbstar. Doing so > changes the litmus test's result to "not allowed" and no data race. > However, I'm not certain that this single change is the entire fix; > more thought is needed. This seems a sensible change to me: looking forward to seeing a patch, on top of -rcu/dev, for further review and testing! We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}() discussed in this thread (maybe once the RCU code and the informal doc will have settled in such direction). 
It seems worth stressing the fact that _neither_ of these changes will prevent the test below from being racy, considered the two accesses to "a" happen concurrently / without synchronization. Thanks, Andrea C xu-2 {} P0(int *a, int *b) { *a = 1; synchronize_rcu(); WRITE_ONCE(*b, 2); } P1(int *a, int *b) { rcu_read_lock(); if (*a == 0) WRITE_ONCE(*b, 1); rcu_read_unlock(); } exists (~b=2) ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-06 8:16 ` Andrea Parri @ 2019-06-06 14:19 ` Alan Stern 2019-06-08 15:19 ` Paul E. McKenney 0 siblings, 1 reply; 54+ messages in thread From: Alan Stern @ 2019-06-06 14:19 UTC (permalink / raw) To: Andrea Parri Cc: Paul E. McKenney, Boqun Feng, Herbert Xu, Linus Torvalds, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Luc Maranget, Jade Alglave On Thu, 6 Jun 2019, Andrea Parri wrote: > This seems a sensible change to me: looking forward to seeing a patch, > on top of -rcu/dev, for further review and testing! > > We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}() > discussed in this thread (maybe once the RCU code and the informal doc > will have settled in such direction). Yes. Also for SRCU. That point had not escaped me. Alan ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-06 14:19 ` Alan Stern @ 2019-06-08 15:19 ` Paul E. McKenney 2019-06-08 15:56 ` Alan Stern 0 siblings, 1 reply; 54+ messages in thread From: Paul E. McKenney @ 2019-06-08 15:19 UTC (permalink / raw) To: Alan Stern Cc: Andrea Parri, Boqun Feng, Herbert Xu, Linus Torvalds, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Luc Maranget, Jade Alglave On Thu, Jun 06, 2019 at 10:19:43AM -0400, Alan Stern wrote: > On Thu, 6 Jun 2019, Andrea Parri wrote: > > > This seems a sensible change to me: looking forward to seeing a patch, > > on top of -rcu/dev, for further review and testing! > > > > We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}() > > discussed in this thread (maybe once the RCU code and the informal doc > > will have settled in such direction). > > Yes. Also for SRCU. That point had not escaped me. And it does seem pretty settled. There are quite a few examples where there are normal accesses at either end of the RCU read-side critical sections, for example, the one in the requirements diffs below. For SRCU, srcu_read_lock() and srcu_read_unlock() have implied compiler barriers since 2006. ;-) Thanx, Paul ------------------------------------------------------------------------ diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html index 5a9238a2883c..080b39cc1dbb 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.html +++ b/Documentation/RCU/Design/Requirements/Requirements.html @@ -2129,6 +2129,8 @@ Some of the relevant points of interest are as follows: <li> <a href="#Hotplug CPU">Hotplug CPU</a>. <li> <a href="#Scheduler and RCU">Scheduler and RCU</a>. <li> <a href="#Tracing and RCU">Tracing and RCU</a>. +<li> <a href="#Accesses to User Mamory and RCU"> +Accesses to User Mamory and RCU</a>. <li> <a href="#Energy Efficiency">Energy Efficiency</a>. 
<li> <a href="#Scheduling-Clock Interrupts and RCU"> Scheduling-Clock Interrupts and RCU</a>. @@ -2521,6 +2523,75 @@ cannot be used. The tracing folks both located the requirement and provided the needed fix, so this surprise requirement was relatively painless. +<h3><a name="Accesses to User Mamory and RCU"> +Accesses to User Mamory and RCU</a></h3> + +<p> +The kernel needs to access user-space memory, for example, to access +data referenced by system-call parameters. +The <tt>get_user()</tt> macro does this job. + +<p> +However, user-space memory might well be paged out, which means +that <tt>get_user()</tt> might well page-fault and thus block while +waiting for the resulting I/O to complete. +It would be a very bad thing for the compiler to reorder +a <tt>get_user()</tt> invocation into an RCU read-side critical +section. +For example, suppose that the source code looked like this: + +<blockquote> +<pre> + 1 rcu_read_lock(); + 2 p = rcu_dereference(gp); + 3 v = p->value; + 4 rcu_read_unlock(); + 5 get_user(user_v, user_p); + 6 do_something_with(v, user_v); +</pre> +</blockquote> + +<p> +The compiler must not be permitted to transform this source code into +the following: + +<blockquote> +<pre> + 1 rcu_read_lock(); + 2 p = rcu_dereference(gp); + 3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!! + 4 v = p->value; + 5 rcu_read_unlock(); + 6 do_something_with(v, user_v); +</pre> +</blockquote> + +<p> +If the compiler did make this transformation in a +<tt>CONFIG_PREEMPT=n</tt> kernel build, and if <tt>get_user()</tt> did +page fault, the result would be a quiescent state in the middle +of an RCU read-side critical section. +This misplaced quiescent state could result in line 4 being +a use-after-free access, which could be bad for your kernel's +actuarial statistics. +Similar examples can be constructed with the call to <tt>get_user()</tt> +preceding the <tt>rcu_read_lock()</tt>. 
+ +<p> +Unfortunately, <tt>get_user()</tt> doesn't have any particular +ordering properties, and in some architectures the underlying <tt>asm</tt> +isn't even marked <tt>volatile</tt>. +And even if it was marked <tt>volatile</tt>, the above access to +<tt>p->value</tt> is not volatile, so the compiler would not have any +reason to keep those two accesses in order. + +<p> +Therefore, the Linux-kernel definitions of <tt>rcu_read_lock()</tt> +and <tt>rcu_read_unlock()</tt> must act as compiler barriers, +at least for outermost instances of <tt>rcu_read_lock()</tt> and +<tt>rcu_read_unlock()</tt> within a nested set of RCU read-side critical +sections. + <h3><a name="Energy Efficiency">Energy Efficiency</a></h3> <p> ^ permalink raw reply related [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-08 15:19 ` Paul E. McKenney @ 2019-06-08 15:56 ` Alan Stern 2019-06-08 16:31 ` Paul E. McKenney 0 siblings, 1 reply; 54+ messages in thread From: Alan Stern @ 2019-06-08 15:56 UTC (permalink / raw) To: Paul E. McKenney Cc: Andrea Parri, Boqun Feng, Herbert Xu, Linus Torvalds, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Luc Maranget, Jade Alglave On Sat, 8 Jun 2019, Paul E. McKenney wrote: > On Thu, Jun 06, 2019 at 10:19:43AM -0400, Alan Stern wrote: > > On Thu, 6 Jun 2019, Andrea Parri wrote: > > > > > This seems a sensible change to me: looking forward to seeing a patch, > > > on top of -rcu/dev, for further review and testing! > > > > > > We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}() > > > discussed in this thread (maybe once the RCU code and the informal doc > > > will have settled in such direction). > > > > Yes. Also for SRCU. That point had not escaped me. > > And it does seem pretty settled. There are quite a few examples where > there are normal accesses at either end of the RCU read-side critical > sections, for example, the one in the requirements diffs below. > > For SRCU, srcu_read_lock() and srcu_read_unlock() have implied compiler > barriers since 2006. ;-) > > Thanx, Paul > > ------------------------------------------------------------------------ > > diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html > index 5a9238a2883c..080b39cc1dbb 100644 > --- a/Documentation/RCU/Design/Requirements/Requirements.html > +++ b/Documentation/RCU/Design/Requirements/Requirements.html > @@ -2129,6 +2129,8 @@ Some of the relevant points of interest are as follows: > <li> <a href="#Hotplug CPU">Hotplug CPU</a>. > <li> <a href="#Scheduler and RCU">Scheduler and RCU</a>. > <li> <a href="#Tracing and RCU">Tracing and RCU</a>. 
> +<li> <a href="#Accesses to User Mamory and RCU"> ------------------------------------^ > +Accesses to User Mamory and RCU</a>. ---------------------^ > <li> <a href="#Energy Efficiency">Energy Efficiency</a>. > <li> <a href="#Scheduling-Clock Interrupts and RCU"> > Scheduling-Clock Interrupts and RCU</a>. > @@ -2521,6 +2523,75 @@ cannot be used. > The tracing folks both located the requirement and provided the > needed fix, so this surprise requirement was relatively painless. > > +<h3><a name="Accesses to User Mamory and RCU"> ----------------------------------^ > +Accesses to User Mamory and RCU</a></h3> ---------------------^ Are these issues especially notable for female programmers? :-) Alan ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-08 15:56 ` Alan Stern @ 2019-06-08 16:31 ` Paul E. McKenney 0 siblings, 0 replies; 54+ messages in thread From: Paul E. McKenney @ 2019-06-08 16:31 UTC (permalink / raw) To: Alan Stern Cc: Andrea Parri, Boqun Feng, Herbert Xu, Linus Torvalds, Frederic Weisbecker, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Luc Maranget, Jade Alglave On Sat, Jun 08, 2019 at 11:56:04AM -0400, Alan Stern wrote: > On Sat, 8 Jun 2019, Paul E. McKenney wrote: > > > On Thu, Jun 06, 2019 at 10:19:43AM -0400, Alan Stern wrote: > > > On Thu, 6 Jun 2019, Andrea Parri wrote: > > > > > > > This seems a sensible change to me: looking forward to seeing a patch, > > > > on top of -rcu/dev, for further review and testing! > > > > > > > > We could also add (to LKMM) the barrier() for rcu_read_{lock,unlock}() > > > > discussed in this thread (maybe once the RCU code and the informal doc > > > > will have settled in such direction). > > > > > > Yes. Also for SRCU. That point had not escaped me. > > > > And it does seem pretty settled. There are quite a few examples where > > there are normal accesses at either end of the RCU read-side critical > > sections, for example, the one in the requirements diffs below. > > > > For SRCU, srcu_read_lock() and srcu_read_unlock() have implied compiler > > barriers since 2006. ;-) > > > > Thanx, Paul > > > > ------------------------------------------------------------------------ > > > > diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html > > index 5a9238a2883c..080b39cc1dbb 100644 > > --- a/Documentation/RCU/Design/Requirements/Requirements.html > > +++ b/Documentation/RCU/Design/Requirements/Requirements.html > > @@ -2129,6 +2129,8 @@ Some of the relevant points of interest are as follows: > > <li> <a href="#Hotplug CPU">Hotplug CPU</a>. > > <li> <a href="#Scheduler and RCU">Scheduler and RCU</a>. 
> > <li> <a href="#Tracing and RCU">Tracing and RCU</a>. > > +<li> <a href="#Accesses to User Mamory and RCU"> > ------------------------------------^ > > +Accesses to User Mamory and RCU</a>. > ---------------------^ > > <li> <a href="#Energy Efficiency">Energy Efficiency</a>. > > <li> <a href="#Scheduling-Clock Interrupts and RCU"> > > Scheduling-Clock Interrupts and RCU</a>. > > @@ -2521,6 +2523,75 @@ cannot be used. > > The tracing folks both located the requirement and provided the > > needed fix, so this surprise requirement was relatively painless. > > > > +<h3><a name="Accesses to User Mamory and RCU"> > ----------------------------------^ > > +Accesses to User Mamory and RCU</a></h3> > ---------------------^ > > Are these issues especially notable for female programmers? :-) *Red face* Some days it just doesn't pay to get up in the morning... Well, those issues certainly seem a bit inconsiderate to non-mammalian programmers. :-/ How about the updated version shown below? Thanx, Paul ------------------------------------------------------------------------ diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html index 5a9238a2883c..f04c467e55c5 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.html +++ b/Documentation/RCU/Design/Requirements/Requirements.html @@ -2129,6 +2129,8 @@ Some of the relevant points of interest are as follows: <li> <a href="#Hotplug CPU">Hotplug CPU</a>. <li> <a href="#Scheduler and RCU">Scheduler and RCU</a>. <li> <a href="#Tracing and RCU">Tracing and RCU</a>. +<li> <a href="#Accesses to User Memory and RCU"> +Accesses to User Memory and RCU</a>. <li> <a href="#Energy Efficiency">Energy Efficiency</a>. <li> <a href="#Scheduling-Clock Interrupts and RCU"> Scheduling-Clock Interrupts and RCU</a>. @@ -2521,6 +2523,75 @@ cannot be used. 
The tracing folks both located the requirement and provided the needed fix, so this surprise requirement was relatively painless. +<h3><a name="Accesses to User Memory and RCU"> +Accesses to User Memory and RCU</a></h3> + +<p> +The kernel needs to access user-space memory, for example, to access +data referenced by system-call parameters. +The <tt>get_user()</tt> macro does this job. + +<p> +However, user-space memory might well be paged out, which means +that <tt>get_user()</tt> might well page-fault and thus block while +waiting for the resulting I/O to complete. +It would be a very bad thing for the compiler to reorder +a <tt>get_user()</tt> invocation into an RCU read-side critical +section. +For example, suppose that the source code looked like this: + +<blockquote> +<pre> + 1 rcu_read_lock(); + 2 p = rcu_dereference(gp); + 3 v = p->value; + 4 rcu_read_unlock(); + 5 get_user(user_v, user_p); + 6 do_something_with(v, user_v); +</pre> +</blockquote> + +<p> +The compiler must not be permitted to transform this source code into +the following: + +<blockquote> +<pre> + 1 rcu_read_lock(); + 2 p = rcu_dereference(gp); + 3 get_user(user_v, user_p); // BUG: POSSIBLE PAGE FAULT!!! + 4 v = p->value; + 5 rcu_read_unlock(); + 6 do_something_with(v, user_v); +</pre> +</blockquote> + +<p> +If the compiler did make this transformation in a +<tt>CONFIG_PREEMPT=n</tt> kernel build, and if <tt>get_user()</tt> did +page fault, the result would be a quiescent state in the middle +of an RCU read-side critical section. +This misplaced quiescent state could result in line 4 being +a use-after-free access, which could be bad for your kernel's +actuarial statistics. +Similar examples can be constructed with the call to <tt>get_user()</tt> +preceding the <tt>rcu_read_lock()</tt>. + +<p> +Unfortunately, <tt>get_user()</tt> doesn't have any particular +ordering properties, and in some architectures the underlying <tt>asm</tt> +isn't even marked <tt>volatile</tt>. 
+And even if it was marked <tt>volatile</tt>, the above access to +<tt>p->value</tt> is not volatile, so the compiler would not have any +reason to keep those two accesses in order. + +<p> +Therefore, the Linux-kernel definitions of <tt>rcu_read_lock()</tt> +and <tt>rcu_read_unlock()</tt> must act as compiler barriers, +at least for outermost instances of <tt>rcu_read_lock()</tt> and +<tt>rcu_read_unlock()</tt> within a nested set of RCU read-side critical +sections. + <h3><a name="Energy Efficiency">Energy Efficiency</a></h3> <p> ^ permalink raw reply related [flat|nested] 54+ messages in thread
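The hazard described in the documentation patch above can be illustrated with a toy model of a CONFIG_PREEMPT=n kernel. This is a sketch under deliberately crude assumptions: blocking in get_user() is the only quiescent state, and the updater's synchronize_rcu() returns (and the old object is freed) as soon as that quiescent state occurs. `Obj`, `run`, and the string-coded steps are hypothetical stand-ins, not kernel APIs.

```python
class Obj:
    """Stand-in for the RCU-protected structure reached via gp."""
    def __init__(self, value):
        self.value = value
        self.freed = False

def run(order):
    """Run the reader's two accesses in the given order and report whether
    a use-after-free occurred.  'get_user' models a page fault that blocks,
    which in a CONFIG_PREEMPT=n kernel is a quiescent state: at that point
    the updater's synchronize_rcu() returns and the old object is freed."""
    p = Obj(42)
    uaf = False
    for op in order:
        if op == 'load':            # v = p->value
            uaf = uaf or p.freed    # loading from a freed object is the bug
        elif op == 'get_user':      # blocks => quiescent state => GP ends
            p.freed = True          # updater frees the old object here
    return uaf

# Source order: the load is inside the critical section, get_user() after it.
assert run(['load', 'get_user']) is False
# Compiler-reordered order: get_user() pulled inside the critical section.
assert run(['get_user', 'load']) is True
```

The two assertions mirror the two HTML code blocks: the source order is safe, while the transformed order turns line 4 into a use-after-free, which is why rcu_read_lock() and rcu_read_unlock() must at least be compiler barriers here.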
* Re: rcu_read_lock lost its compiler barrier 2019-06-03 5:26 ` Herbert Xu 2019-06-03 6:42 ` Boqun Feng @ 2019-06-03 9:35 ` Paul E. McKenney 1 sibling, 0 replies; 54+ messages in thread From: Paul E. McKenney @ 2019-06-03 9:35 UTC (permalink / raw) To: Herbert Xu Cc: Linus Torvalds, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller On Mon, Jun 03, 2019 at 01:26:26PM +0800, Herbert Xu wrote: > On Sun, Jun 02, 2019 at 08:47:07PM -0700, Paul E. McKenney wrote: > > > > 1. These guarantees are of full memory barriers, -not- compiler > > barriers. > > What I'm saying is that wherever they are, they must come with > compiler barriers. I'm not aware of any synchronisation mechanism > in the kernel that gives a memory barrier without a compiler barrier. Yes, if a given synchronization mechanism requires that memory references need to be ordered, both the compiler and the CPU must maintain that ordering. > > 2. These rules don't say exactly where these full memory barriers > > go. SRCU is at one extreme, placing those full barriers in > > srcu_read_lock() and srcu_read_unlock(), and !PREEMPT Tree RCU > > at the other, placing these barriers entirely within the callback > > queueing/invocation, grace-period computation, and the scheduler. > > Preemptible Tree RCU is in the middle, with rcu_read_unlock() > > sometimes including a full memory barrier, but other times with > > the full memory barrier being confined as it is with !PREEMPT > > Tree RCU. > > The rules do say that the (full) memory barrier must precede any > RCU read-side that occur after the synchronize_rcu and after the > end of any RCU read-side that occur before the synchronize_rcu. > > All I'm arguing is that wherever that full mb is, as long as it > also carries with it a barrier() (which it must do if it's done > using an existing kernel mb/locking primitive), then we're fine. Fair enough, and smp_mb() does provide what is needed. 
> > Interleaving and inserting full memory barriers as per the rules above: > > > > CPU1: WRITE_ONCE(a, 1) > > CPU1: synchronize_rcu > > /* Could put a full memory barrier here, but it wouldn't help. */ > > CPU1: smp_mb(); > CPU2: smp_mb(); What is CPU2's smp_mb() ordering? In other words, what comment would you put on each of the above smp_mb() calls? > Let's put them in because I think they are critical. smp_mb() also > carries with it a barrier(). Again, agreed, smp_mb() implies barrier(). > > CPU2: rcu_read_lock(); > > CPU1: b = 2; > > CPU2: if (READ_ONCE(a) == 0) > > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */ > > CPU2: b = 1; > > CPU2: rcu_read_unlock > > > > In fact, CPU2's load from b might be moved up to race with CPU1's store, > > which (I believe) is why the model complains in this case. > > Let's put aside my doubt over how we're even allowing a compiler > to turn > > b = 1 > > into > > if (b != 1) > b = 1 > > Since you seem to be assuming that (a == 0) is true in this case > (as the assignment b = 1 is carried out), then because of the > presence of the full memory barrier, the RCU read-side section > must have started prior to the synchronize_rcu. This means that > synchronize_rcu is not allowed to return until at least the end > of the grace period, or at least until the end of rcu_read_unlock. > > So it actually should be: > > CPU1: WRITE_ONCE(a, 1) > CPU1: synchronize_rcu called > /* Could put a full memory barrier here, but it wouldn't help. */ > > CPU1: smp_mb(); > CPU2: smp_mb(); > > CPU2: grace period starts > ...time passes... > CPU2: rcu_read_lock(); > CPU2: if (READ_ONCE(a) == 0) > CPU2: if (b != 1) /* Weakly ordered CPU moved this up! */ > CPU2: b = 1; > CPU2: rcu_read_unlock > ...time passes... > CPU2: grace period ends > > /* This full memory barrier is also guaranteed by RCU. */ > CPU2: smp_mb(); But in this case, given that there are no more statements for CPU2, what is this smp_mb() ordering? 
Thanx, Paul > CPU1 synchronize_rcu returns > CPU1: b = 2; > > Cheers, > -- > Email: Herbert Xu <herbert@gondor.apana.org.au> > Home Page: http://gondor.apana.org.au/~herbert/ > PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt > ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-03 2:46 ` Herbert Xu 2019-06-03 3:47 ` Paul E. McKenney @ 2019-06-06 8:38 ` Andrea Parri 2019-06-06 9:32 ` Herbert Xu 1 sibling, 1 reply; 54+ messages in thread From: Andrea Parri @ 2019-06-06 8:38 UTC (permalink / raw) To: Herbert Xu Cc: Linus Torvalds, Paul E. McKenney, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Dmitry Vyukov, Alan Stern, Eric Dumazet On Mon, Jun 03, 2019 at 10:46:40AM +0800, Herbert Xu wrote: > The case we were discussing is from net/ipv4/inet_fragment.c from > the net-next tree: BTW, thank you for keeping me and other people who intervened in that discussion in Cc:... Andrea > > void fqdir_exit(struct fqdir *fqdir) > { > ... > fqdir->dead = true; > > /* call_rcu is supposed to provide memory barrier semantics, > * separating the setting of fqdir->dead with the destruction > * work. This implicit barrier is paired with inet_frag_kill(). > */ > > INIT_RCU_WORK(&fqdir->destroy_rwork, fqdir_rwork_fn); > queue_rcu_work(system_wq, &fqdir->destroy_rwork); > } > > and > > void inet_frag_kill(struct inet_frag_queue *fq) > { > ... > rcu_read_lock(); > /* The RCU read lock provides a memory barrier > * guaranteeing that if fqdir->dead is false then > * the hash table destruction will not start until > * after we unlock. Paired with inet_frags_exit_net(). > */ > if (!fqdir->dead) { > rhashtable_remove_fast(&fqdir->rhashtable, &fq->node, > fqdir->f->rhash_params); > ... > } > ... > rcu_read_unlock(); > ... > } > > I simplified this to > > Initial values: > > a = 0 > b = 0 > > CPU1 CPU2 > ---- ---- > a = 1 rcu_read_lock > synchronize_rcu if (a == 0) > b = 2 b = 1 > rcu_read_unlock > > On exit we want this to be true: > b == 2 ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-06 8:38 ` Andrea Parri @ 2019-06-06 9:32 ` Herbert Xu 0 siblings, 0 replies; 54+ messages in thread From: Herbert Xu @ 2019-06-06 9:32 UTC (permalink / raw) To: Andrea Parri Cc: Linus Torvalds, Paul E. McKenney, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller, Dmitry Vyukov, Alan Stern, Eric Dumazet On Thu, Jun 06, 2019 at 10:38:56AM +0200, Andrea Parri wrote: > On Mon, Jun 03, 2019 at 10:46:40AM +0800, Herbert Xu wrote: > > > The case we were discussing is from net/ipv4/inet_fragment.c from > > the net-next tree: > > BTW, thank you for keeping me and other people who intervened in that > discussion in Cc:... FWIW I didn't drop you from the Cc list. The email discussion was taken off-list by someone else and I simply kept that Cc list when I brought it back onto lkml. On a second look I did end up dropping Eric but I think he's had enough of this discussion :) -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-02 5:56 ` rcu_read_lock lost its compiler barrier Herbert Xu 2019-06-02 20:54 ` Linus Torvalds @ 2019-06-03 0:06 ` Paul E. McKenney 2019-06-03 3:03 ` Herbert Xu 1 sibling, 1 reply; 54+ messages in thread From: Paul E. McKenney @ 2019-06-03 0:06 UTC (permalink / raw) To: Herbert Xu Cc: Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, netdev, David S. Miller, Linus Torvalds On Sun, Jun 02, 2019 at 01:56:07PM +0800, Herbert Xu wrote: > Digging up an old email because I was not aware of this previously > but Paul pointed me to it during another discussion. > > On Mon, Sep 21, 2015 at 01:43:27PM -0700, Paul E. McKenney wrote: > > On Mon, Sep 21, 2015 at 09:30:49PM +0200, Frederic Weisbecker wrote: > > > > > > diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h > > > > index d63bb77..6c3cece 100644 > > > > --- a/include/linux/rcupdate.h > > > > +++ b/include/linux/rcupdate.h > > > > @@ -297,12 +297,14 @@ void synchronize_rcu(void); > > > > > > > > static inline void __rcu_read_lock(void) > > > > { > > > > - preempt_disable(); > > > > + if (IS_ENABLED(CONFIG_PREEMPT_COUNT)) > > > > + preempt_disable(); > > > > > > preempt_disable() is a no-op when !CONFIG_PREEMPT_COUNT, right? > > > Or rather it's a barrier(), which is anyway implied by rcu_read_lock(). > > > > > > So perhaps we can get rid of the IS_ENABLED() check? > > > > Actually, barrier() is not intended to be implied by rcu_read_lock(). > > In a non-preemptible RCU implementation, it doesn't help anything > > to have the compiler flush its temporaries upon rcu_read_lock() > > and rcu_read_unlock(). > > This is seriously broken. RCU has been around for years and is > used throughout the kernel while the compiler barrier existed. Please note that preemptible Tree RCU has lacked the compiler barrier on all but the outermost rcu_read_unlock() for years before Boqun's patch. 
So exactly where in the code that we are currently discussing are you relying on compiler barriers in either rcu_read_lock() or rcu_read_unlock()? The grace-period guarantee allows the compiler ordering to be either in the readers (SMP&&PREEMPT), in the grace-period mechanism (SMP&&!PREEMPT), or both (SRCU). > You can't then go and decide to remove the compiler barrier! To do > that you'd need to audit every single use of rcu_read_lock in the > kernel to ensure that they're not depending on the compiler barrier. > > This is also contrary to the definition of almost every other > *_lock primitive in the kernel where the compiler barrier is > included. > > So please revert this patch. I do not believe that reverting that patch will help you at all. But who knows? So please point me at the full code body that was being debated earlier on this thread. It will no doubt take me quite a while to dig through it, given my being on the road for the next couple of weeks, but so it goes. Thanx, Paul ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-03 0:06 ` Paul E. McKenney @ 2019-06-03 3:03 ` Herbert Xu 2019-06-03 9:27 ` Paul E. McKenney 2019-06-03 15:55 ` Linus Torvalds 0 siblings, 2 replies; 54+ messages in thread From: Herbert Xu @ 2019-06-03 3:03 UTC (permalink / raw) To: Paul E. McKenney Cc: Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, netdev, David S. Miller, Linus Torvalds On Sun, Jun 02, 2019 at 05:06:17PM -0700, Paul E. McKenney wrote: > > Please note that preemptible Tree RCU has lacked the compiler barrier on > all but the outermost rcu_read_unlock() for years before Boqun's patch. Actually this is not true. Boqun's patch (commit bb73c52bad36) does not add a barrier() to __rcu_read_lock. In fact I dug into the git history and this compiler barrier() has existed in preemptible tree RCU since the very start in 2009: : commit f41d911f8c49a5d65c86504c19e8204bb605c4fd : Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> : Date: Sat Aug 22 13:56:52 2009 -0700 : : rcu: Merge preemptable-RCU functionality into hierarchical RCU : : +/* : + * Tree-preemptable RCU implementation for rcu_read_lock(). : + * Just increment ->rcu_read_lock_nesting, shared state will be updated : + * if we block. : + */ : +void __rcu_read_lock(void) : +{ : + ACCESS_ONCE(current->rcu_read_lock_nesting)++; : + barrier(); /* needed if we ever invoke rcu_read_lock in rcutree.c */ : +} : +EXPORT_SYMBOL_GPL(__rcu_read_lock); However, you are correct that in the non-preempt tree RCU case, the compiler barrier in __rcu_read_lock was not always present. In fact it was added by: : commit 386afc91144b36b42117b0092893f15bc8798a80 : Author: Linus Torvalds <torvalds@linux-foundation.org> : Date: Tue Apr 9 10:48:33 2013 -0700 : : spinlocks and preemption points need to be at least compiler barriers I suspect this is what prompted you to remove it in 2015. > I do not believe that reverting that patch will help you at all. > > But who knows? 
So please point me at the full code body that was being > debated earlier on this thread. It will no doubt take me quite a while to > dig through it, given my being on the road for the next couple of weeks, > but so it goes. Please refer to my response to Linus for the code in question. In any case, I am now even more certain that compiler barriers are not needed in the code in question. The reasoning is quite simple. If you need those compiler barriers then you surely need real memory barriers. Vice versa, if real memory barriers are already present thanks to RCU, then you don't need those compiler barriers. In fact this calls into question the use of READ_ONCE/WRITE_ONCE in RCU primitives such as rcu_dereference and rcu_assign_pointer. IIRC when RCU was first added to the Linux kernel we did not have compiler barriers in rcu_dereference and rcu_assign_pointer. They were added later on. As compiler barriers per se are useless, these are surely meant to be coupled with the memory barriers provided by RCU grace periods and synchronize_rcu. But then those real memory barriers would have compiler barriers too. So why do we need the compiler barriers in rcu_dereference and rcu_assign_pointer? Cheers, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-03 3:03 ` Herbert Xu @ 2019-06-03 9:27 ` Paul E. McKenney 2019-06-03 15:55 ` Linus Torvalds 1 sibling, 0 replies; 54+ messages in thread From: Paul E. McKenney @ 2019-06-03 9:27 UTC (permalink / raw) To: Herbert Xu Cc: Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, netdev, David S. Miller, Linus Torvalds On Mon, Jun 03, 2019 at 11:03:24AM +0800, Herbert Xu wrote: > On Sun, Jun 02, 2019 at 05:06:17PM -0700, Paul E. McKenney wrote: > > > > Please note that preemptible Tree RCU has lacked the compiler barrier on > > all but the outermost rcu_read_unlock() for years before Boqun's patch. > > Actually this is not true. Boqun's patch (commit bb73c52bad36) does > not add a barrier() to __rcu_read_lock. In fact I dug into the git > history and this compiler barrier() has existed in preemptible tree > RCU since the very start in 2009: I said rcu_read_unlock() and you said __rcu_read_lock(). > : commit f41d911f8c49a5d65c86504c19e8204bb605c4fd > : Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com> > : Date: Sat Aug 22 13:56:52 2009 -0700 > : > : rcu: Merge preemptable-RCU functionality into hierarchical RCU > : > : +/* > : + * Tree-preemptable RCU implementation for rcu_read_lock(). > : + * Just increment ->rcu_read_lock_nesting, shared state will be updated > : + * if we block. > : + */ > : +void __rcu_read_lock(void) > : +{ > : + ACCESS_ONCE(current->rcu_read_lock_nesting)++; > : + barrier(); /* needed if we ever invoke rcu_read_lock in rcutree.c */ > : +} > : +EXPORT_SYMBOL_GPL(__rcu_read_lock); Thank you for finding this! 
This particular version does have an unconditional barrier() in __rcu_read_unlock(), for whatever that is worth: +void __rcu_read_unlock(void) +{ + struct task_struct *t = current; + + barrier(); /* needed if we ever invoke rcu_read_unlock in rcutree.c */ + if (--ACCESS_ONCE(t->rcu_read_lock_nesting) == 0 && + unlikely(ACCESS_ONCE(t->rcu_read_unlock_special))) + rcu_read_unlock_special(t); +} I would not have seen the point of a compiler barrier in the non-outermost __rcu_read_unlock(), since the completion of an inner __rcu_read_unlock() does not permit the grace period to complete. > However, you are correct that in the non-preempt tree RCU case, > the compiler barrier in __rcu_read_lock was not always present. > In fact it was added by: > > : commit 386afc91144b36b42117b0092893f15bc8798a80 > : Author: Linus Torvalds <torvalds@linux-foundation.org> > : Date: Tue Apr 9 10:48:33 2013 -0700 > : > : spinlocks and preemption points need to be at least compiler barriers > > I suspect this is what prompted you to remove it in 2015. If I remember correctly, it was pointed out to me that in !PREEMPT kernels, the compiler barrier in the preempt_disable() invoked in rcu_read_lock() (and similar on the rcu_read_unlock() side) wasn't helping anything. > > I do not believe that reverting that patch will help you at all. > > > > But who knows? So please point me at the full code body that was being > > debated earlier on this thread. It will no doubt take me quite a while to > > dig through it, given my being on the road for the next couple of weeks, > > but so it goes. > > Please refer to my response to Linus for the code in question. > > In any case, I am now even more certain that compiler barriers are > not needed in the code in question. The reasoning is quite simple. > If you need those compiler barriers then you surely need real memory > barriers. OK, we are in agreement on that point, then!
> Vice versa, if real memory barriers are already present thanks to > RCU, then you don't need those compiler barriers. For users of RCU, this seems reasonable. On the other hand, the compiler barriers in PREEMPT Tree RCU's outermost __rcu_read_lock() and __rcu_read_unlock() invocations really are needed by RCU internals. This is because RCU makes use of interrupt handlers that access per-task and per-CPU variables, and these need to be able to sense the edges of the nested set of RCU read-side critical sections. It is OK for these interrupt handlers to think that the critical section is larger than it really is, but fatal for them to think that the critical sections are smaller than they really are. > In fact this calls into question the use of READ_ONCE/WRITE_ONCE in > RCU primitives such as rcu_dereference and rcu_assign_pointer. No, these are -not- called into question, or if they are, the question gets quickly answered in a way that supports current Linux-kernel code. As mentioned in earlier emails, the traditional uses of RCU that involve rcu_dereference(), rcu_assign_pointer(), and synchronize_rcu() all work just fine. In fact, from what I can see, the issue stems from having developed intuitions from working with the traditional rcu_dereference(), rcu_assign_pointer(), and synchronize_rcu() linked-structure use cases, and then attempting to apply these intuitions to use cases that have neither rcu_dereference() nor rcu_assign_pointer(). Don't get me wrong, it is only natural to try to extend your intuitions to something that admittedly looks pretty similar to the traditional use cases. But this is one of those cases where "pretty similar" might not be good enough. > IIRC > when RCU was first added to the Linux kernel we did not have compiler > barriers in rcu_dereference and rcu_assign_pointer. They were added > later on. From what I can see, rcu_dereference() still does not have a compiler barrier.
Please note that the pair of barrier() calls in READ_ONCE() only apply when READ_ONCE()ing something larger than the machine can load. And if your platform cannot load and store pointers with a single access, the Linux kernel isn't going to do very well regardless. Ditto for WRITE_ONCE().

> As compiler barriers per se are useless, these are surely meant to
> be coupled with the memory barriers provided by RCU grace periods
> and synchronize_rcu. But then those real memory barriers would have
> compiler barriers too. So why do we need the compiler barriers in
> rcu_dereference and rcu_assign_pointer?

In rcu_dereference(), RCU does not need them. They are instead inherited from READ_ONCE() for when it is used on a data structure too big for any single load instruction available on the system in question. These barrier() calls are in a case that rcu_dereference() had better not be using -- after all, using them would mean that the hardware didn't have a load instruction big enough to handle a pointer.

In rcu_assign_pointer(), RCU just needs this to act like a release store, that is, the store itself must not be reordered with any earlier memory accesses. The Linux kernel's smp_store_release() currently over-approximates this using a barrier() or equivalent inline-assembly directive, which enforces compiler ordering not only for the release store, but also for all memory accesses following the release store. Obviously, barrier() is not enough for weakly ordered systems, which must also emit an appropriate memory-barrier instruction (or a special store instruction for architectures like ARMv8 providing such a thing).

The compiler barriers in __rcu_read_lock() and __rcu_read_unlock() are there so that preemptible Tree RCU can use its various tricks to make readers perform and scale well.
Read-side state is confined to the CPU and/or task in the common case, thus avoiding heavy synchronization overhead (or, in the case of !PREEMPT RCU, avoiding -any- synchronization overhead in the common case). For example, the compiler barriers ensure that RCU's scheduler-clock code and softirq code can trust per-CPU/task state indicating whether or not there is an RCU read-side critical section in effect.

Does that help? Or am I missing your point?

Thanx, Paul

^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-03 3:03 ` Herbert Xu 2019-06-03 9:27 ` Paul E. McKenney @ 2019-06-03 15:55 ` Linus Torvalds 2019-06-03 16:07 ` Linus Torvalds 1 sibling, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2019-06-03 15:55 UTC (permalink / raw) To: Herbert Xu Cc: Paul E. McKenney, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller

On Sun, Jun 2, 2019 at 8:03 PM Herbert Xu <herbert@gondor.apana.org.au> wrote:
>
> In any case, I am now even more certain that compiler barriers are
> not needed in the code in question. The reasoning is quite simple.
> If you need those compiler barriers then you surely need real memory
> barriers.

So the above statement is not necessarily correct. Compiler barriers very much can have real effects even in the absence of "real" memory barriers. But those effects are obviously not about multiple CPU's - they are about code generation and can be about ordering on _one_ CPU. Those effects definitely happen, though.

So a compiler barrier without a memory barrier may make a difference if you

 (a) compile for UP (when a compiler barrier basically _is_ a memory barrier)

 (b) have deadlock or ordering avoidance with only the local CPU taking interrupts.

 (c) need to re-load a value in a loop, but ordering isn't a concern

and possibly other situations. In the above, (a) may be pointless and trivial, but (b) and (c) are issues even on SMP. Some things only matter for the local CPU - an interrupt or a code sequence that happens on another CPU can just block, but an interrupt that comes in on the same CPU may dead-lock and depend on particular access ordering. And (c) is for things like cpu_relax() in loops that read stuff (although honestly, READ_ONCE() is generally a much better pattern).
But it sounds like in this case at least, Herbert's and Paul's disagreements aren't really so much about the memory barriers and locking as about the fact that in general the only thing that RCU protects against is single accesses, and thus any RCU region technically should use something that guarantees that the compiler will not do stupid things.

We do require a _minimum_ of compiler sanity, but the compiler turning a non-marked read into two reads has definitely happened, and can be very subtle. Even if on a C level it looks like a single access, and correctness doesn't care about _what_ the value we read is, it might be turned by the compiler into two separate accesses that get two different values, and then the end result may be insane and unreliable.

So on the whole, I do think that it's usually a good idea to show _which_ access is protected by RCU. Perhaps with a READ_ONCE/WRITE_ONCE pattern, although it can be other things.

I don't believe that it would necessarily help to turn a rcu_read_lock() into a compiler barrier, because for the non-preempt case rcu_read_lock() doesn't need to actually _do_ anything, and anything that matters for the RCU read lock will already be a compiler barrier for other reasons (ie a function call that can schedule).

Anyway, I suspect the code is correct in practice, but I do have some sympathy for the "RCU protected accesses should be marked" argument.

Linus

^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-03 15:55 ` Linus Torvalds @ 2019-06-03 16:07 ` Linus Torvalds 2019-06-03 19:53 ` Paul E. McKenney 0 siblings, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2019-06-03 16:07 UTC (permalink / raw) To: Herbert Xu Cc: Paul E. McKenney, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller

On Mon, Jun 3, 2019 at 8:55 AM Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> I don't believe that it would necessarily help to turn a
> rcu_read_lock() into a compiler barrier, because for the non-preempt
> case rcu_read_lock() doesn't need to actually _do_ anything, and
> anything that matters for the RCU read lock will already be a compiler
> barrier for other reasons (ie a function call that can schedule).

Actually, thinking a bit more about this, and trying to come up with special cases, I'm not at all convinced.

Even if we don't have preemption enabled, it turns out that we *do* have things that can cause scheduling without being compiler barriers.

In particular, user accesses are not necessarily full compiler barriers. One common pattern (x86) is

	asm volatile("call __get_user_%P4"

which explicitly has an "asm volatile" so that it doesn't re-order wrt other asms (and thus other user accesses), but it does *not* have a "memory" clobber, because the user access doesn't actually change kernel memory. Not even if it's a "put_user()". So we've made those fairly relaxed on purpose.

And they might be relaxed enough that they'd allow re-ordering wrt something that does a rcu read lock, unless the rcu read lock has some compiler barrier in it.

IOW, imagine completely made up code like

	get_user(val, ptr)
	rcu_read_lock();
	WRITE_ONCE(state, 1);

and unless the rcu lock has a barrier in it, I actually think that write to 'state' could migrate to *before* the get_user().
I'm not convinced we have anything that remotely looks like the above, but I'm actually starting to think that yes, all RCU barriers had better be compiler barriers. Because this is very much an example of something where you don't necessarily need a memory barrier, but there's a code generation barrier needed because of local ordering requirements. The possible faulting behavior of "get_user()" must not migrate into the RCU critical region. Paul? So I think the rule really should be: every single form of locking that has any semantic meaning at all, absolutely needs to be at least a compiler barrier. (That "any semantic meaning" weaselwording is because I suspect that we have locking that truly and intentionally becomes no-ops because it's based on things that aren't relevant in some configurations. But generally compiler barriers are really pretty damn cheap, even from a code generation standpoint, and can help make the resulting code more legible, so I think we should not try to aggressively remove them without _very_ good reasons) Linus ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-03 16:07 ` Linus Torvalds @ 2019-06-03 19:53 ` Paul E. McKenney 2019-06-03 20:24 ` Linus Torvalds 0 siblings, 1 reply; 54+ messages in thread From: Paul E. McKenney @ 2019-06-03 19:53 UTC (permalink / raw) To: Linus Torvalds Cc: Herbert Xu, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller On Mon, Jun 03, 2019 at 09:07:29AM -0700, Linus Torvalds wrote: > On Mon, Jun 3, 2019 at 8:55 AM Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > > > I don't believe that it would necessarily help to turn a > > rcu_read_lock() into a compiler barrier, because for the non-preempt > > case rcu_read_lock() doesn't need to actually _do_ anything, and > > anything that matters for the RCU read lock will already be a compiler > > barrier for other reasons (ie a function call that can schedule). > > Actually, thinking a bit more about this, and trying to come up with > special cases, I'm not at all convinced. > > Even if we don't have preemption enabled, it turns out that we *do* > have things that can cause scheduling without being compiler barriers. > > In particular, user accesses are not necessarily full compiler > barriers. One common pattern (x86) is > > asm volatile("call __get_user_%P4" > > which explicitly has a "asm volaile" so that it doesn't re-order wrt > other asms (and thus other user accesses), but it does *not* have a > "memory" clobber, because the user access doesn't actually change > kernel memory. Not even if it's a "put_user()". > > So we've made those fairly relaxed on purpose. And they might be > relaxed enough that they'd allow re-ordering wrt something that does a > rcu read lock, unless the rcu read lock has some compiler barrier in > it. 
> > IOW, imagine completely made up code like
> >
> >	get_user(val, ptr)
> >	rcu_read_lock();
> >	WRITE_ONCE(state, 1);
> >
> > and unless the rcu lock has a barrier in it, I actually think that
> > write to 'state' could migrate to *before* the get_user().
> >
> > I'm not convinced we have anything that remotely looks like the above,
> > but I'm actually starting to think that yes, all RCU barriers had
> > better be compiler barriers.
> >
> > Because this is very much an example of something where you don't
> > necessarily need a memory barrier, but there's a code generation
> > barrier needed because of local ordering requirements. The possible
> > faulting behavior of "get_user()" must not migrate into the RCU
> > critical region.
> >
> > Paul?

I agree that !PREEMPT rcu_read_lock() would not affect compiler code generation, but given that get_user() is a volatile asm, isn't the compiler already forbidden from reordering it with the volatile-casted WRITE_ONCE() access, even if there was nothing at all between them? Or are asms an exception to the rule that volatile executions cannot be reordered?

> So I think the rule really should be: every single form of locking
> that has any semantic meaning at all, absolutely needs to be at least
> a compiler barrier.
>
> (That "any semantic meaning" weaselwording is because I suspect that
> we have locking that truly and intentionally becomes no-ops because
> it's based on things that aren't relevant in some configurations. But
> generally compiler barriers are really pretty damn cheap, even from a
> code generation standpoint, and can help make the resulting code more
> legible, so I think we should not try to aggressively remove them
> without _very_ good reasons)

We can of course put them back in, but this won't help in the typical rcu_assign_pointer(), rcu_dereference(), and synchronize_rcu() situation (nor do I see how it helps in Herbert's example).
And in other RCU use cases, the accesses analogous to the rcu_assign_pointer() and rcu_dereference() (in Herbert's example, the accesses to variable "a") really need to be READ_ONCE()/WRITE_ONCE() or stronger, correct?

Thanx, Paul

^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-03 19:53 ` Paul E. McKenney @ 2019-06-03 20:24 ` Linus Torvalds 2019-06-04 21:14 ` Paul E. McKenney 0 siblings, 1 reply; 54+ messages in thread From: Linus Torvalds @ 2019-06-03 20:24 UTC (permalink / raw) To: Paul E. McKenney Cc: Herbert Xu, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller

On Mon, Jun 3, 2019 at 12:53 PM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
>
> I agree that !PREEMPT rcu_read_lock() would not affect compiler code
> generation, but given that get_user() is a volatile asm, isn't the
> compiler already forbidden from reordering it with the volatile-casted
> WRITE_ONCE() access, even if there was nothing at all between them?
> Or are asms an exception to the rule that volatile executions cannot
> be reordered?

Paul, you MAKE NO SENSE. What is wrong with you?

I just showed you an example of where rcu_read_lock() needs to be a compiler barrier, and then you make incoherent noises about WRITE_ONCE() that do not even exist in that example.

Forget about your READ_ONCE/WRITE_ONCE theories. Herbert already showed code that doesn't have those accessors, so reality doesn't match your fevered imagination. And sometimes it's not even possible, since you can't do a bitfield access, for example, with READ_ONCE().

> We can of course put them back in,

Stop the craziness. It's not "we can". It is a "we will". So I will add that barrier, and you need to stop arguing against it based on specious theoretical arguments that do not match reality. And we will not ever remove that barrier again. Herbert already pointed to me having to do this once before in commit 386afc91144b ("spinlocks and preemption points need to be at least compiler barriers"), and rcu_read_lock() clearly has at a minimum that same preemption point issue.

Linus

^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-03 20:24 ` Linus Torvalds @ 2019-06-04 21:14 ` Paul E. McKenney 2019-06-05 2:21 ` Herbert Xu 0 siblings, 1 reply; 54+ messages in thread From: Paul E. McKenney @ 2019-06-04 21:14 UTC (permalink / raw) To: Linus Torvalds Cc: Herbert Xu, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller On Mon, Jun 03, 2019 at 01:24:32PM -0700, Linus Torvalds wrote: > On Mon, Jun 3, 2019 at 12:53 PM Paul E. McKenney <paulmck@linux.ibm.com> wrote: > > > > I agree that !PREEMPT rcu_read_lock() would not affect compiler code > > generation, but given that get_user() is a volatile asm, isn't the > > compiler already forbidden from reordering it with the volatile-casted > > WRITE_ONCE() access, even if there was nothing at all between them? > > Or are asms an exception to the rule that volatile executions cannot > > be reordered? > > Paul, you MAKE NO SENSE. > > What is wrong with you? Mostly that I didn't check all architectures' definitions of get_user(). Had I done so, I would have seen that not all of the corresponding asms have the "volatile" keyword. And of course, without that keyword, there is absolutely nothing preventing the compiler from reordering the asm with pretty much anything. The only things that would be absolutely guaranteed to prevent reordering would be things like memory clobbers (barrier()) or accesses that overlap the asm's input/output list. Yeah, I know, even with the "volatile" keyword, it is not entirely clear how much reordering the compiler is allowed to do. I was relying on https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html, which says: Qualifiers volatile The typical use of extended asm statements is to manipulate input values to produce output values. However, your asm statements may also produce side effects. If so, you may need to use the volatile qualifier to disable certain optimizations. See Volatile. 
But the linked-to "Volatile" section later in that same web page mostly talks about the compiler's ability to hoist asms out of loops. > I just showed you an example of where rcu_read_lock() needs to be a > compiler barrier, and then you make incoherent noises about > WRITE_ONCE() that do not even exist in that example. I thought we were discussing this example, but it doesn't matter because I was missing your point about get_user() and page faults: get_user(val, ptr) rcu_read_lock(); WRITE_ONCE(state, 1); But regardless, given that some architectures omit volatile from their asms implementing get_user(), even an optimistic interpretation of that part of the GCC documentation would still permit reordering the above. And again, I was missing your point about get_user() causing page faults and thus context switches. > Forget about your READ_ONCE/WRITE_ONCE theories. Herbert already > showed code that doesn't have those accessors, so reality doesn't > match your fevered imagination. I get the feeling that you believe that I want LKMM to be some sort of final judge and absolute arbiter of what code is legal and not from a memory-ordering perspective. This is absolutely -not- the case. The current state of the art, despite the recent impressive progress, simply cannot reasonably do this. So all I can claim is that LKMM dispenses advice, hopefully good advice. (It is early days for LKMM's handling of plain accesses, so some work might be required to deliver on the "good advice" promise, but we have to start somewhere. Plus it is progressing nicely.) The places where long-standing RCU patterns require rcu_dereference() and rcu_assign_pointer() do require some attention to avoid compiler optimizations, and {READ,WRITE}_ONCE() is one way of addressing this. But not the only way, nor always the best way. For example, some fastpaths might need the optimizations that {READ,WRITE}_ONCE() suppresses. 
Therefore, Linux kernel hackers have a number of other ways of paying attention. For example, accesses might be constrained via barrier() and friends. For another example, some developers might check assembly output (hopefully scripted somehow). Again, the Linux-kernel memory model dispenses advice, not absolutes. Furthermore, the way it dispenses advice is currently a bit limited. It can currently say that it is nervous about lack of {READ,WRITE}_ONCE(), as in "Flag data-race", but it would be difficult to make it recommend the other options in an intelligent way. So we should interpret "Flag data-race" as LKMM saying "I am nervous about your unmarked accesses" rather than "You absolutely must call {READ,WRITE}_ONCE() more often!!!" Again, advice, not absolutes. So the idea is that you add and remove {READ,WRITE}_ONCE() to/from the -litmus- -tests- to determine which accesses LKMM is nervous about. But that doesn't necessarily mean that {READ,WRITE}_ONCE() goes into the corresponding places in the Linux kernel. Does that help, or am I still confused? > And sometimes it's not even possible, since you can't do a bitfield > access, for example, with READ_ONCE(). Ah, good point. So the Linux kernel uses bitfields to communicate between mainline and interrupt handlers. New one on me. :-/ > > We can of course put them back in, > > Stop the craziness. It's not "we can". It is a "we will". > > So I will add that barrier, and you need to stop arguing against it > based on specious theoretical arguments that do not match reality. And > we will not ever remove that barrier again. Herbert already pointed to > me having to do this once before in commit 386afc91144b ("spinlocks > and preemption points need to be at least compiler barriers"), and > rcu_read_lock() clearly has at a minimum that same preemption point > issue. And the lack of "volatile" allows get_user() to migrate page faults (and thus context switches) into RCU read-side critical sections in CONFIG_PREEMPT=n. 
Yes, this would be very bad. OK, I finally got it, so please accept my apologies for my earlier confusion. I don't yet see a commit from you, so I queued the one below locally and started testing.

Thanx, Paul

------------------------------------------------------------------------

commit 9b4766c5523efb8d3d52b2ba2a29fd69cdfc65bb
Author: Paul E. McKenney <paulmck@linux.ibm.com>
Date:   Tue Jun 4 14:05:52 2019 -0700

    rcu: Restore barrier() to rcu_read_lock() and rcu_read_unlock()

    Commit bb73c52bad36 ("rcu: Don't disable preemption for Tiny and Tree
    RCU readers") removed the barrier() calls from rcu_read_lock() and
    rcu_write_lock() in CONFIG_PREEMPT=n&&CONFIG_PREEMPT_COUNT=n kernels.
    Within RCU, this commit was OK, but it failed to account for things
    like get_user() that can pagefault and that can be reordered by the
    compiler.  Lack of the barrier() calls in rcu_read_lock() and
    rcu_read_unlock() can cause these page faults to migrate into RCU
    read-side critical sections, which in CONFIG_PREEMPT=n kernels could
    result in too-short grace periods and arbitrary misbehavior.  Please
    see commit 386afc91144b ("spinlocks and preemption points need to be
    at least compiler barriers") for more details.

    This commit therefore restores the barrier() call to both
    rcu_read_lock() and rcu_read_unlock().  It also removes them from
    places in the RCU update machinery that used to need compensatory
    barrier() calls, effectively reverting commit bb73c52bad36 ("rcu:
    Don't disable preemption for Tiny and Tree RCU readers").

    Reported-by: Herbert Xu <herbert@gondor.apana.org.au>
    Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
    Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>

diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 0c9b92799abc..8f7167478c1d 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -56,14 +56,12 @@ void __rcu_read_unlock(void);

 static inline void __rcu_read_lock(void)
 {
-	if (IS_ENABLED(CONFIG_PREEMPT_COUNT))
-		preempt_disable();
+	preempt_disable();
 }

 static inline void __rcu_read_unlock(void)
 {
-	if (IS_ENABLED(CONFIG_PREEMPT_COUNT))
-		preempt_enable();
+	preempt_enable();
 }

 static inline int rcu_preempt_depth(void)
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 3f52d8438e0f..841060fce33c 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -288,7 +288,6 @@ void rcu_note_context_switch(bool preempt)
 	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
 	struct rcu_node *rnp;

-	barrier(); /* Avoid RCU read-side critical sections leaking down. */
 	trace_rcu_utilization(TPS("Start context switch"));
 	lockdep_assert_irqs_disabled();
 	WARN_ON_ONCE(!preempt && t->rcu_read_lock_nesting > 0);
@@ -340,7 +339,6 @@ void rcu_note_context_switch(bool preempt)
 	if (rdp->exp_deferred_qs)
 		rcu_report_exp_rdp(rdp);
 	trace_rcu_utilization(TPS("End context switch"));
-	barrier(); /* Avoid RCU read-side critical sections leaking up. */
 }
 EXPORT_SYMBOL_GPL(rcu_note_context_switch);

@@ -828,11 +826,6 @@ static void rcu_qs(void)
  * dyntick-idle quiescent state visible to other CPUs, which will in
  * some cases serve for expedited as well as normal grace periods.
  * Either way, register a lightweight quiescent state.
- *
- * The barrier() calls are redundant in the common case when this is
- * called externally, but just in case this is called from within this
- * file.
- *
  */
 void rcu_all_qs(void)
 {
@@ -847,14 +840,12 @@ void rcu_all_qs(void)
 		return;
 	}
 	this_cpu_write(rcu_data.rcu_urgent_qs, false);
-	barrier(); /* Avoid RCU read-side critical sections leaking down. */
 	if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs))) {
 		local_irq_save(flags);
 		rcu_momentary_dyntick_idle();
 		local_irq_restore(flags);
 	}
 	rcu_qs();
-	barrier(); /* Avoid RCU read-side critical sections leaking up. */
 	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(rcu_all_qs);
@@ -864,7 +855,6 @@ EXPORT_SYMBOL_GPL(rcu_all_qs);
  */
 void rcu_note_context_switch(bool preempt)
 {
-	barrier(); /* Avoid RCU read-side critical sections leaking down. */
 	trace_rcu_utilization(TPS("Start context switch"));
 	rcu_qs();
 	/* Load rcu_urgent_qs before other flags. */
@@ -877,7 +867,6 @@ void rcu_note_context_switch(bool preempt)
 	rcu_tasks_qs(current);
 out:
 	trace_rcu_utilization(TPS("End context switch"));
-	barrier(); /* Avoid RCU read-side critical sections leaking up. */
 }
 EXPORT_SYMBOL_GPL(rcu_note_context_switch);

^ permalink raw reply related [flat|nested] 54+ messages in thread
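[The grace-period guarantee that motivates this patch can be written down as a litmus test for the Linux-kernel memory model. The test below is a hypothetical example in the herd7 litmus format, not taken from the thread: if the reader misses the pre-grace-period store to x (r0=0), its entire critical section must complete before synchronize_rcu() returns, so it cannot then observe the post-grace-period store to y (r1=1).]

```
C rcu-gp-guarantee

{}

P0(int *x, int *y)
{
	int r0;
	int r1;

	rcu_read_lock();
	r0 = READ_ONCE(*x);
	r1 = READ_ONCE(*y);
	rcu_read_unlock();
}

P1(int *x, int *y)
{
	WRITE_ONCE(*x, 1);
	synchronize_rcu();
	WRITE_ONCE(*y, 1);
}

exists (0:r0=0 /\ 0:r1=1) (* forbidden by the grace-period guarantee *)
```

Without compiler barriers at the boundaries of the critical section, accesses (such as a faulting get_user()) could migrate across rcu_read_lock()/rcu_read_unlock(), effectively shrinking the region this guarantee covers.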
* Re: rcu_read_lock lost its compiler barrier 2019-06-04 21:14 ` Paul E. McKenney @ 2019-06-05 2:21 ` Herbert Xu 2019-06-05 3:30 ` Paul E. McKenney 0 siblings, 1 reply; 54+ messages in thread From: Herbert Xu @ 2019-06-05 2:21 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller On Tue, Jun 04, 2019 at 02:14:49PM -0700, Paul E. McKenney wrote: > > Yeah, I know, even with the "volatile" keyword, it is not entirely clear > how much reordering the compiler is allowed to do. I was relying on > https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html, which says: The volatile keyword doesn't give any guarantees of this kind. The key to ensuring ordering between unrelated variable/register reads/writes is the memory clobber: 6.47.2.6 Clobbers and Scratch Registers ... "memory" The "memory" clobber tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands (for example, accessing the memory pointed to by one of the input parameters). To ensure memory contains correct values, GCC may need to flush specific register values to memory before executing the asm. Further, the compiler does not assume that any values read from memory before an asm remain unchanged after that asm; it reloads them as needed. Using the "memory" clobber effectively forms a read/write memory barrier for the compiler. Note that this clobber does not prevent the processor from doing speculative reads past the asm statement. To prevent that, you need processor-specific fence instructions. IOW you need a barrier(). Thanks, -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-05 2:21 ` Herbert Xu @ 2019-06-05 3:30 ` Paul E. McKenney 2019-06-06 4:37 ` Herbert Xu 0 siblings, 1 reply; 54+ messages in thread From: Paul E. McKenney @ 2019-06-05 3:30 UTC (permalink / raw) To: Herbert Xu Cc: Linus Torvalds, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller On Wed, Jun 05, 2019 at 10:21:17AM +0800, Herbert Xu wrote: > On Tue, Jun 04, 2019 at 02:14:49PM -0700, Paul E. McKenney wrote: > > > > Yeah, I know, even with the "volatile" keyword, it is not entirely clear > > how much reordering the compiler is allowed to do. I was relying on > > https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html, which says: > > The volatile keyword doesn't give any guarantees of this kind. > The key to ensuring ordering between unrelated variable/register > reads/writes is the memory clobber: > > 6.47.2.6 Clobbers and Scratch Registers > > ... > > "memory" The "memory" clobber tells the compiler that the assembly > code performs memory reads or writes to items other than those > listed in the input and output operands (for example, accessing > the memory pointed to by one of the input parameters). To ensure > memory contains correct values, GCC may need to flush specific > register values to memory before executing the asm. Further, > the compiler does not assume that any values read from memory > before an asm remain unchanged after that asm; it reloads them as > needed. Using the "memory" clobber effectively forms a read/write > memory barrier for the compiler. > > Note that this clobber does not prevent the processor from > doing speculative reads past the asm statement. To prevent that, > you need processor-specific fence instructions. > > IOW you need a barrier(). Understood. Does the patch I sent out a few hours ago cover it? Or is something else needed? Other than updates to the RCU requirements documentation, which is forthcoming. 
Thanx, Paul ^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: rcu_read_lock lost its compiler barrier 2019-06-05 3:30 ` Paul E. McKenney @ 2019-06-06 4:37 ` Herbert Xu 0 siblings, 0 replies; 54+ messages in thread From: Herbert Xu @ 2019-06-06 4:37 UTC (permalink / raw) To: Paul E. McKenney Cc: Linus Torvalds, Frederic Weisbecker, Boqun Feng, Fengguang Wu, LKP, LKML, Netdev, David S. Miller On Tue, Jun 04, 2019 at 08:30:39PM -0700, Paul E. McKenney wrote: > > Understood. Does the patch I sent out a few hours ago cover it? Or is > something else needed? It looks good to me. > Other than updates to the RCU requirements documentation, which is > forthcoming. Thanks Paul. -- Email: Herbert Xu <herbert@gondor.apana.org.au> Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt ^ permalink raw reply [flat|nested] 54+ messages in thread