* memory barrier question
@ 2010-09-15 14:36 Miklos Szeredi
  2010-09-15 19:12 ` Rafael J. Wysocki
  2010-09-16 11:55 ` David Howells
  0 siblings, 2 replies; 41+ messages in thread
From: Miklos Szeredi @ 2010-09-15 14:36 UTC (permalink / raw)
  To: dhowells; +Cc: linux-kernel, linux-arch

Hi,

I'm trying to understand memory barriers but not quite succeeding.

Consider the following example:

Start:
	p = NULL;
	x = 0;

CPU1:
	atomic_inc(&x);
	p = &x;

CPU2:
	if (p)
		z = atomic_read(p);

Is it possible to end up with z == 0?  What if there's a lock/unlock
before setting "p"?  What if there's a write barrier before setting
"p"?

Thanks,
Miklos

* Re: memory barrier question
  2010-09-15 14:36 memory barrier question Miklos Szeredi
@ 2010-09-15 19:12 ` Rafael J. Wysocki
  2010-09-16 11:55 ` David Howells
  1 sibling, 0 replies; 41+ messages in thread
From: Rafael J. Wysocki @ 2010-09-15 19:12 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: dhowells, linux-kernel, linux-arch

On Wednesday, September 15, 2010, Miklos Szeredi wrote:
> Hi,
> 
> I'm trying to understand memory barriers but not quite succeeding.
> 
> Consider the following example:
> 
> Start:
> 	p = NULL;
> 	x = 0;
> 
> CPU1:
> 	atomic_inc(&x);
> 	p = &x;
> 
> CPU2:
> 	if (p)
> 		z = atomic_read(p);
> 
> Is it possible to end up with z == 0?

Yes, it is.  In theory CPU1 can reorder the two operations, and if
the atomic_read() on CPU2 happens between them, it may return 0.

> What if there's a lock/unlock before setting "p"?  What if there's a write
> barrier before setting "p"?

A write barrier should help, and locking functions act as implicit memory barriers.
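
For concreteness, a sketch of the barrier placement, using the generic
smp_wmb()/smp_rmb() primitives and the variable names from the example
above (the read side needs a matching barrier too, as discussed further
down the thread):

	CPU1:
		atomic_inc(&x);
		smp_wmb();	/* order the increment before publishing p */
		p = &x;

	CPU2:
		q = p;
		smp_rmb();	/* order the load of p before the load of *q */
		if (q)
			z = atomic_read(q);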

Thanks,
Rafael

* Re: memory barrier question
  2010-09-15 14:36 memory barrier question Miklos Szeredi
  2010-09-15 19:12 ` Rafael J. Wysocki
@ 2010-09-16 11:55 ` David Howells
  2010-09-16 13:42     ` Miklos Szeredi
                     ` (3 more replies)
  1 sibling, 4 replies; 41+ messages in thread
From: David Howells @ 2010-09-16 11:55 UTC (permalink / raw)
  To: Miklos Szeredi, Paul E. McKenney; +Cc: dhowells, linux-kernel, linux-arch

Miklos Szeredi <miklos@szeredi.hu> wrote:

> Consider the following example:
> 
> Start:
> 	p = NULL;
> 	x = 0;
> 
> CPU1:
> 	atomic_inc(&x);
> 	p = &x;
> 
> CPU2:
> 	if (p)
> 		z = atomic_read(p);
> 
> Is it possible to end up with z == 0?

I think so.  I'm not sure that you can assume that CPU1 does its two
'operations' in the same order.  You can guarantee that the read of x,
increment, and write of x will be done in an order, and that no one else will
see an intermediate state, but you can't guarantee that CPU2 will see x
changed before p is changed.

In Documentation/memory-barriers.txt, it says:

	The following also do _not_ imply memory barriers, and so may require
	explicit memory barriers under some circumstances
	(smp_mb__before_atomic_dec() for instance):

		atomic_add();
		atomic_sub();
		atomic_inc();
		atomic_dec();

so you need _two_ memory barriers, e.g.:

	CPU1:
		atomic_inc(&x);
		smp_mb__after_atomic_inc()
		p = &x;

	CPU2:
		q = p;
		smp_rmb();
		if (q)
			z = atomic_read(q);

Note that atomic_inc() may imply a suitable memory barrier on some arches, and
so has special variant barrier functions of its own.

> What if there's a lock/unlock before setting "p"?

If there's a lock+unlock between, then this counts as a full memory barrier:

	CPU1:
		atomic_inc(&x);
		spin_lock(&foo);
		spin_unlock(&foo);
		p = &x;

but you still need the matching smp_rmb() on CPU2.

> What if there's a write barrier before setting "p"?

That's fine, but you still need the matching smp_rmb() on CPU2.

David

* Re: memory barrier question
  2010-09-16 11:55 ` David Howells
@ 2010-09-16 13:42     ` Miklos Szeredi
  2010-09-16 16:18     ` Peter Zijlstra
  2010-09-16 17:59   ` David Howells
  3 siblings, 0 replies; 41+ messages in thread
From: Miklos Szeredi @ 2010-09-16 13:42 UTC (permalink / raw)
  To: David Howells; +Cc: miklos, paulmck, dhowells, linux-kernel, linux-arch

On Thu, 16 Sep 2010, David Howells wrote:
> Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> > Consider the following example:
> > 
> > Start:
> > 	p = NULL;
> > 	x = 0;
> > 
> > CPU1:
> > 	atomic_inc(&x);
> > 	p = &x;
> > 
> > CPU2:
> > 	if (p)
> > 		z = atomic_read(p);
> > 
> > Is it possible to end up with z == 0?
> 
> I think so.  I'm not sure that you can assume that CPU1 does its two
> 'operations' in the same order.  You can guarantee that the read of x,
> increment, and write of x will be done in an order, and that no one else will
> see an intermediate state, but you can't guarantee that CPU2 will see x
> changed before p is changed.
> 
> In Documentation/memory-barriers.txt, it says:
> 
> 	The following also do _not_ imply memory barriers, and so may require
> 	explicit memory barriers under some circumstances
> 	(smp_mb__before_atomic_dec() for instance):
> 
> 		atomic_add();
> 		atomic_sub();
> 		atomic_inc();
> 		atomic_dec();
> 
> so you need _two_ memory barriers, e.g.:
> 
> 	CPU1:
> 		atomic_inc(&x);
> 		smp_mb__after_atomic_inc()
> 		p = &x;
> 
> 	CPU2:
> 		q = p;
> 		smp_rmb();
> 		if (q)
> 			z = atomic_read(q);
> 
> Note that atomic_inc() may imply a suitable memory barrier on some arches, and
> so has special variant barrier functions of its own.

Is the rmb() really needed?

Take this code from fs/namei.c for example:

		inode = next.dentry->d_inode;
		if (!inode)
			goto out_dput;

		if (inode->i_op->follow_link) {

It happily dereferences dentry->d_inode without a barrier after
checking it for non-null, while that d_inode might have just been
initialized on another CPU with a freshly created inode.  There's
absolutely no synchronization with that on this side.

Isn't the fact that we check the pointer for being non-null (together
with locking/barrier on the other side) enough to ensure that it's
safe to dereference it?

Thanks,
Miklos

* Re: memory barrier question
  2010-09-16 11:55 ` David Howells
  2010-09-16 13:42     ` Miklos Szeredi
@ 2010-09-16 14:30   ` David Howells
  2010-09-16 15:03     ` Paul E. McKenney
  2010-09-16 16:18     ` Peter Zijlstra
  2010-09-16 17:59   ` David Howells
  3 siblings, 1 reply; 41+ messages in thread
From: David Howells @ 2010-09-16 14:30 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: dhowells, paulmck, linux-kernel, linux-arch

Miklos Szeredi <miklos@szeredi.hu> wrote:

> Is the rmb() really needed?
> 
> Take this code from fs/namei.c for example:
> 
> 		inode = next.dentry->d_inode;
> 		if (!inode)
> 			goto out_dput;
> 
> 		if (inode->i_op->follow_link) {
> 
> It happily dereferences dentry->d_inode without a barrier after
> checking it for non-null, while that d_inode might have just been
> initialized on another CPU with a freshly created inode.  There's
> absolutely no synchronization with that on this side.

Perhaps it's not necessary; once set, how likely is i_op to be changed once
I_NEW is cleared?

> Isn't the fact that we check the pointer for being non-null (together
> with locking/barrier on the other side) enough to ensure that it's
> safe to dereference it?

It's possible that, since there's a dependency between the variables on the
reading CPU, the barrier is not required.

I think I have to refer that question to Paul.

David

* Re: memory barrier question
  2010-09-16 14:30   ` David Howells
@ 2010-09-16 15:03     ` Paul E. McKenney
  2010-09-16 16:06       ` Miklos Szeredi
  0 siblings, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2010-09-16 15:03 UTC (permalink / raw)
  To: David Howells; +Cc: Miklos Szeredi, linux-kernel, linux-arch

On Thu, Sep 16, 2010 at 03:30:56PM +0100, David Howells wrote:
> Miklos Szeredi <miklos@szeredi.hu> wrote:
> 
> > Is the rmb() really needed?
> > 
> > Take this code from fs/namei.c for example:
> > 
> > 		inode = next.dentry->d_inode;
> > 		if (!inode)
> > 			goto out_dput;
> > 
> > 		if (inode->i_op->follow_link) {
> > 
> > It happily dereferences dentry->d_inode without a barrier after
> > checking it for non-null, while that d_inode might have just been
> > initialized on another CPU with a freshly created inode.  There's
> > absolutely no synchronization with that on this side.
> 
> Perhaps it's not necessary; once set, how likely is i_op to be changed once
> I_NEW is cleared?

Are the path_get()s protecting this?

If there is no protection, then something like rcu_dereference() is
needed for the assignment from next.dentry->d_inode.

> > Isn't the fact that we check the pointer for being non-null (together
> > with locking/barrier on the other side) enough to ensure that it's
> > safe to dereference it?
> 
> It's possible that, since there's a dependency between the variables on the
> reading CPU, the barrier is not required.
> 
> I think I have to refer that question to Paul.

We would need either one of the rcu_dereference() or smp_read_barrier_depends()
APIs to enforce the dependency, for example, against the compiler.
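
A sketch of what that would look like at the use site, keeping the field
names from the fragment quoted above (not a real patch):

	/*
	 * rcu_dereference() supplies the dependency barrier
	 * (smp_read_barrier_depends()) and a volatile load, neither of
	 * which a plain assignment gives you.
	 */
	inode = rcu_dereference(next.dentry->d_inode);
	if (!inode)
		goto out_dput;

	if (inode->i_op->follow_link) {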

							Thanx, Paul

* Re: memory barrier question
  2010-09-16 15:03     ` Paul E. McKenney
@ 2010-09-16 16:06       ` Miklos Szeredi
  2010-09-16 16:37         ` Paul E. McKenney
  2010-09-16 16:50         ` Jamie Lokier
  0 siblings, 2 replies; 41+ messages in thread
From: Miklos Szeredi @ 2010-09-16 16:06 UTC (permalink / raw)
  To: paulmck; +Cc: dhowells, miklos, linux-kernel, linux-arch

On Thu, 16 Sep 2010, Paul E. McKenney wrote:
> On Thu, Sep 16, 2010 at 03:30:56PM +0100, David Howells wrote:
> > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > 
> > > Is the rmb() really needed?
> > > 
> > > Take this code from fs/namei.c for example:
> > > 
> > > 		inode = next.dentry->d_inode;
> > > 		if (!inode)
> > > 			goto out_dput;
> > > 
> > > 		if (inode->i_op->follow_link) {
> > > 
> > > It happily dereferences dentry->d_inode without a barrier after
> > > checking it for non-null, while that d_inode might have just been
> > > initialized on another CPU with a freshly created inode.  There's
> > > absolutely no synchronization with that on this side.
> > 
> > Perhaps it's not necessary; once set, how likely is i_op to be changed once
> > I_NEW is cleared?
> 
> Are the path_get()s protecting this?

No, when creating a file the dentry will go from negative to positive
independently from lookup.  The dentry can get instantiated with an
inode between the path_get() and dereferencing ->d_inode.

> 
> If there is no protection, then something like rcu_dereference() is
> needed for the assignment from next.dentry->d_inode.

Do I understand correctly that the problem is that a CPU may have a
stale cache associated with *inode, one that was loaded before the
write barrier took effect?

Funny that such a bug could stay unnoticed in so often exercised
code.  Yeah I know it's alpha only.  I wonder how much of this pattern
exists in the kernel elsewhere without the necessary barriers.

Thanks,
Miklos

* Re: memory barrier question
  2010-09-16 11:55 ` David Howells
@ 2010-09-16 16:18     ` Peter Zijlstra
  2010-09-16 14:30   ` David Howells
                       ` (2 subsequent siblings)
  3 siblings, 0 replies; 41+ messages in thread
From: Peter Zijlstra @ 2010-09-16 16:18 UTC (permalink / raw)
  To: David Howells; +Cc: Miklos Szeredi, Paul E. McKenney, linux-kernel, linux-arch

On Thu, 2010-09-16 at 12:55 +0100, David Howells wrote:
> If there's a lock+unlock between, then this counts as a full memory
> barrier

I thought it was unlock+lock.

From memory-barriers.txt:

And a couple of implicit varieties:

 (5) LOCK operations.

     This acts as a one-way permeable barrier.  It guarantees that all memory
     operations after the LOCK operation will appear to happen after the LOCK
     operation with respect to the other components of the system.

     Memory operations that occur before a LOCK operation may appear to happen
     after it completes.

     A LOCK operation should almost always be paired with an UNLOCK operation.


 (6) UNLOCK operations.

     This also acts as a one-way permeable barrier.  It guarantees that all
     memory operations before the UNLOCK operation will appear to happen before
     the UNLOCK operation with respect to the other components of the system.

     Memory operations that occur after an UNLOCK operation may appear to
     happen before it completes.

     LOCK and UNLOCK operations are guaranteed to appear with respect to each
     other strictly in the order specified.

     The use of LOCK and UNLOCK operations generally precludes the need for
     other sorts of memory barrier (but note the exceptions mentioned in the
     subsection "MMIO write barrier").


Therefore:

  A = 5
  LOCK

  UNLOCK
  B = 6

May be observed as:

   LOCK
   B = 6
   A = 5
   UNLOCK

The sequence:

  A = 5
  UNLOCK
  LOCK
  B = 6

Is fully ordered.
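
Mapped onto the example in this thread (purely to illustrate the
ordering rules, not as meaningful locking, and assuming some spinlock
"foo"):

  atomic_inc(&x);
  spin_lock(&foo);
  spin_unlock(&foo);
  p = &x;

may be observed as:

  spin_lock(&foo);
  p = &x;
  atomic_inc(&x);
  spin_unlock(&foo);

because both accesses are allowed to migrate into the critical section,
whereas

  atomic_inc(&x);
  spin_unlock(&foo);
  spin_lock(&foo);
  p = &x;

keeps the update of x ordered before the store to p.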

* Re: memory barrier question
  2010-09-16 16:06       ` Miklos Szeredi
@ 2010-09-16 16:37         ` Paul E. McKenney
  2010-09-16 16:56           ` Miklos Szeredi
  2010-09-16 16:50         ` Jamie Lokier
  1 sibling, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2010-09-16 16:37 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: dhowells, linux-kernel, linux-arch

On Thu, Sep 16, 2010 at 06:06:53PM +0200, Miklos Szeredi wrote:
> On Thu, 16 Sep 2010, Paul E. McKenney wrote:
> > On Thu, Sep 16, 2010 at 03:30:56PM +0100, David Howells wrote:
> > > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > 
> > > > Is the rmb() really needed?
> > > > 
> > > > Take this code from fs/namei.c for example:
> > > > 
> > > > 		inode = next.dentry->d_inode;
> > > > 		if (!inode)
> > > > 			goto out_dput;
> > > > 
> > > > 		if (inode->i_op->follow_link) {
> > > > 
> > > > It happily dereferences dentry->d_inode without a barrier after
> > > > checking it for non-null, while that d_inode might have just been
> > > > initialized on another CPU with a freshly created inode.  There's
> > > > absolutely no synchronization with that on this side.
> > > 
> > > Perhaps it's not necessary; once set, how likely is i_op to be changed once
> > > I_NEW is cleared?
> > 
> > Are the path_get()s protecting this?
> 
> No, when creating a file the dentry will go from negative to positive
> independently from lookup.  The dentry can get instantiated with an
> inode between the path_get() and dereferencing ->d_inode.
> 
> > If there is no protection, then something like rcu_dereference() is
> > needed for the assignment from next.dentry->d_inode.
> 
> Do I understand correctly that the problem is that a CPU may have a
> stale cache associated with *inode, one that was loaded before the
> write barrier took effect?

Yes, especially if the compiler is aggressively optimizing.

> Funny that such a bug could stay unnoticed in so often excercised
> code.  Yeah I know it's alpha only.  I wonder how much of this pattern
> exists in the kernel elswhere without the necessary barriers.

The probability of failure is low, but non-zero.  :-/

							Thanx, Paul

* Re: memory barrier question
  2010-09-16 16:06       ` Miklos Szeredi
  2010-09-16 16:37         ` Paul E. McKenney
@ 2010-09-16 16:50         ` Jamie Lokier
  1 sibling, 0 replies; 41+ messages in thread
From: Jamie Lokier @ 2010-09-16 16:50 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: paulmck, dhowells, linux-kernel, linux-arch

Miklos Szeredi wrote:
> On Thu, 16 Sep 2010, Paul E. McKenney wrote:
> > On Thu, Sep 16, 2010 at 03:30:56PM +0100, David Howells wrote:
> > > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > 
> > > > Is the rmb() really needed?
> > > > 
> > > > Take this code from fs/namei.c for example:
> > > > 
> > > > 		inode = next.dentry->d_inode;
> > > > 		if (!inode)
> > > > 			goto out_dput;
> > > > 
> > > > 		if (inode->i_op->follow_link) {
> > > > 
> > > > It happily dereferences dentry->d_inode without a barrier after
> > > > checking it for non-null, while that d_inode might have just been
> > > > initialized on another CPU with a freshly created inode.  There's
> > > > absolutely no synchronization with that on this side.
> > > 
> > > Perhaps it's not necessary; once set, how likely is i_op to be changed once
> > > I_NEW is cleared?
> > 
> > Are the path_get()s protecting this?
> 
> No, when creating a file the dentry will go from negative to positive
> independently from lookup.  The dentry can get instantiated with an
> inode between the path_get() and dereferencing ->d_inode.
> 
> > 
> > If there is no protection, then something like rcu_dereference() is
> > needed for the assignment from next.dentry->d_inode.
> 
> Do I understand correctly that the problem is that a CPU may have a
> stale cache associated with *inode, one that was loaded before the
> write barrier took effect?
> 
> Funny that such a bug could stay unnoticed in so often exercised
> code.  Yeah I know it's alpha only.

When I first saw read_barrier_depends(), I thought it must be Alpha's
speculative execution, fetching memory out of order and confirming
it's valid later.  I was really surprised to find out it's not that -
it's a quirk of the Alpha's cache/forwarding protocol.  Others
presumably don't have it because they were designed with awareness of
this coding pattern.

But...

I wonder if it can happen on IA64 with its funky memory-alias
compiler optimisations.

I wonder if it can happen on x86 and others, if the compiler decides
this is a valid transformation (it is with a single CPU):

Original code:

    foo = global_ptr_to_foo;
    foo_x = foo->x;

    bar = global_ptr_to_bar;
    bar_y = bar->y;
    // use bar_y;

Transformed by compiler:

    foo = global_ptr_to_foo;
    foo_x = foo->x;

    bar = global_ptr_to_bar;
    bar_y = (__typeof__(bar->y))foo_x;
    if ((void *)bar != (void *)foo)
        bar_y = bar->y;

    // use bar_y;

In other words, without a barrier, the compiler doesn't have to order
the executed bar->y dereference *instruction* after the bar =
global_ptr_to_bar instruction.  Thus making it a compiler property,
not a CPU one.

There is no danger of dereferencing NULL in that example, but
dereferencing the values from the wrong object is just as wrong.

-- Jamie

* Re: memory barrier question
  2010-09-16 16:37         ` Paul E. McKenney
@ 2010-09-16 16:56           ` Miklos Szeredi
  2010-09-16 17:09             ` James Bottomley
  0 siblings, 1 reply; 41+ messages in thread
From: Miklos Szeredi @ 2010-09-16 16:56 UTC (permalink / raw)
  To: paulmck; +Cc: miklos, dhowells, linux-kernel, linux-arch

On Thu, 16 Sep 2010, Paul E. McKenney wrote:
> On Thu, Sep 16, 2010 at 06:06:53PM +0200, Miklos Szeredi wrote:
> > On Thu, 16 Sep 2010, Paul E. McKenney wrote:
> > > On Thu, Sep 16, 2010 at 03:30:56PM +0100, David Howells wrote:
> > > > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > > 
> > > > > Is the rmb() really needed?
> > > > > 
> > > > > Take this code from fs/namei.c for example:
> > > > > 
> > > > > 		inode = next.dentry->d_inode;
> > > > > 		if (!inode)
> > > > > 			goto out_dput;
> > > > > 
> > > > > 		if (inode->i_op->follow_link) {
> > > > > 
> > > > > It happily dereferences dentry->d_inode without a barrier after
> > > > > checking it for non-null, while that d_inode might have just been
> > > > > initialized on another CPU with a freshly created inode.  There's
> > > > > absolutely no synchronization with that on this side.
> > > > 
> > > > Perhaps it's not necessary; once set, how likely is i_op to be changed once
> > > > I_NEW is cleared?
> > > 
> > > Are the path_get()s protecting this?
> > 
> > No, when creating a file the dentry will go from negative to positive
> > independently from lookup.  The dentry can get instantiated with an
> > inode between the path_get() and dereferencing ->d_inode.
> > 
> > > If there is no protection, then something like rcu_dereference() is
> > > needed for the assignment from next.dentry->d_inode.
> > 
> > Do I understand correctly that the problem is that a CPU may have a
> > stale cache associated with *inode, one that was loaded before the
> > write barrier took effect?
> 
> Yes, especially if the compiler is aggressively optimizing.

How do compiler optimizations make a difference?

Thanks,
Miklos

* Re: memory barrier question
  2010-09-16 16:56           ` Miklos Szeredi
@ 2010-09-16 17:09             ` James Bottomley
  2010-09-16 17:17               ` Miklos Szeredi
  0 siblings, 1 reply; 41+ messages in thread
From: James Bottomley @ 2010-09-16 17:09 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: paulmck, dhowells, linux-kernel, linux-arch

On Thu, 2010-09-16 at 18:56 +0200, Miklos Szeredi wrote:
> On Thu, 16 Sep 2010, Paul E. McKenney wrote:
> > On Thu, Sep 16, 2010 at 06:06:53PM +0200, Miklos Szeredi wrote:
> > > On Thu, 16 Sep 2010, Paul E. McKenney wrote:
> > > > On Thu, Sep 16, 2010 at 03:30:56PM +0100, David Howells wrote:
> > > > > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > > > 
> > > > > > Is the rmb() really needed?
> > > > > > 
> > > > > > Take this code from fs/namei.c for example:
> > > > > > 
> > > > > > 		inode = next.dentry->d_inode;
> > > > > > 		if (!inode)
> > > > > > 			goto out_dput;
> > > > > > 
> > > > > > 		if (inode->i_op->follow_link) {
> > > > > > 
> > > > > > It happily dereferences dentry->d_inode without a barrier after
> > > > > > checking it for non-null, while that d_inode might have just been
> > > > > > initialized on another CPU with a freshly created inode.  There's
> > > > > > absolutely no synchronization with that on this side.
> > > > > 
> > > > > Perhaps it's not necessary; once set, how likely is i_op to be changed once
> > > > > I_NEW is cleared?
> > > > 
> > > > Are the path_get()s protecting this?
> > > 
> > > No, when creating a file the dentry will go from negative to positive
> > > independently from lookup.  The dentry can get instantiated with an
> > > inode between the path_get() and dereferencing ->d_inode.
> > > 
> > > > If there is no protection, then something like rcu_dereference() is
> > > > needed for the assignment from next.dentry->d_inode.
> > > 
> > > Do I understand correctly that the problem is that a CPU may have a
> > > stale cache associated with *inode, one that was loaded before the
> > > write barrier took effect?
> > 
> > Yes, especially if the compiler is aggressively optimizing.
> 
> How do compiler optimizations make a difference?

There are two types of reorderings that cause problems if you expect the
bus visible ordering to matter.  One is CPU issue reordering, where the
CPU decides to output loads and stores in a different order than the
input instruction stream actually said.  The other is compiler
re-ordering where the compiler actually reorders the instructions to
execute in a different order from what you'd expect by simply reading
the C code.

We have compiler barrier instructions for the latter and barriers which
issue CPU primitives for the former.
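
In kernel terms, a sketch of the two kinds of primitive:

	/*
	 * Compiler barrier only: stops the compiler reordering or caching
	 * memory accesses across this point, but emits no instruction and
	 * places no constraint on the CPU.
	 */
	barrier();

	/*
	 * Full memory barrier: implies a compiler barrier and also emits
	 * the CPU's ordering instruction(s), constraining the order in
	 * which accesses become visible to other CPUs.
	 */
	smp_mb();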

James



* Re: memory barrier question
  2010-09-16 17:09             ` James Bottomley
@ 2010-09-16 17:17               ` Miklos Szeredi
  2010-09-16 17:40                 ` James Bottomley
  2010-09-17 21:49                 ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 41+ messages in thread
From: Miklos Szeredi @ 2010-09-16 17:17 UTC (permalink / raw)
  To: James Bottomley; +Cc: miklos, paulmck, dhowells, linux-kernel, linux-arch

On Thu, 16 Sep 2010, James Bottomley wrote:
> On Thu, 2010-09-16 at 18:56 +0200, Miklos Szeredi wrote:
> > On Thu, 16 Sep 2010, Paul E. McKenney wrote:
> > > On Thu, Sep 16, 2010 at 06:06:53PM +0200, Miklos Szeredi wrote:
> > > > On Thu, 16 Sep 2010, Paul E. McKenney wrote:
> > > > > On Thu, Sep 16, 2010 at 03:30:56PM +0100, David Howells wrote:
> > > > > > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > > > > 
> > > > > > > Is the rmb() really needed?
> > > > > > > 
> > > > > > > Take this code from fs/namei.c for example:
> > > > > > > 
> > > > > > > 		inode = next.dentry->d_inode;
> > > > > > > 		if (!inode)
> > > > > > > 			goto out_dput;
> > > > > > > 
> > > > > > > 		if (inode->i_op->follow_link) {
> > > > > > > 
> > > > > > > It happily dereferences dentry->d_inode without a barrier after
> > > > > > > checking it for non-null, while that d_inode might have just been
> > > > > > > initialized on another CPU with a freshly created inode.  There's
> > > > > > > absolutely no synchronization with that on this side.
> > > > > > 
> > > > > > Perhaps it's not necessary; once set, how likely is i_op to be changed once
> > > > > > I_NEW is cleared?
> > > > > 
> > > > > Are the path_get()s protecting this?
> > > > 
> > > > No, when creating a file the dentry will go from negative to positive
> > > > independently from lookup.  The dentry can get instantiated with an
> > > > inode between the path_get() and dereferencing ->d_inode.
> > > > 
> > > > > If there is no protection, then something like rcu_dereference() is
> > > > > needed for the assignment from next.dentry->d_inode.
> > > > 
> > > > Do I understand correctly that the problem is that a CPU may have a
> > > > stale cache associated with *inode, one that was loaded before the
> > > > write barrier took effect?
> > > 
> > > Yes, especially if the compiler is aggressively optimizing.
> > 
> > How do compiler optimizations make a difference?
> 
> There are two types of reorderings that cause problems if you expect the
> bus visible ordering to matter.  One is CPU issue reordering, where the
> cpu decides to output loads and stores in a different order than the
> input instruction stream actually said.  The other is compiler
> re-ordering where the compiler actually reorders the instructions to
> execute in a different order from what you'd expect by simply reading
> the C code.
> 
> We have compiler barrier instructions for the latter and barriers which
> issue CPU primitives for the former.

Right but in the concrete namei example I can't see how a compiler
optimization can make a difference.  The order of the loads is quite
clear:

   LOAD inode = next.dentry->inode
   if (inode != NULL)
   	LOAD inode->f_op

What is there the compiler can optimize?

Thanks,
Miklos

* Re: memory barrier question
  2010-09-16 17:17               ` Miklos Szeredi
@ 2010-09-16 17:40                 ` James Bottomley
  2010-09-17 21:49                 ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 41+ messages in thread
From: James Bottomley @ 2010-09-16 17:40 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: paulmck, dhowells, linux-kernel, linux-arch

On Thu, 2010-09-16 at 19:17 +0200, Miklos Szeredi wrote:
> On Thu, 16 Sep 2010, James Bottomley wrote:
> > On Thu, 2010-09-16 at 18:56 +0200, Miklos Szeredi wrote:
> > > On Thu, 16 Sep 2010, Paul E. McKenney wrote:
> > > > On Thu, Sep 16, 2010 at 06:06:53PM +0200, Miklos Szeredi wrote:
> > > > > On Thu, 16 Sep 2010, Paul E. McKenney wrote:
> > > > > > On Thu, Sep 16, 2010 at 03:30:56PM +0100, David Howells wrote:
> > > > > > > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > > > > > 
> > > > > > > > Is the rmb() really needed?
> > > > > > > > 
> > > > > > > > Take this code from fs/namei.c for example:
> > > > > > > > 
> > > > > > > > 		inode = next.dentry->d_inode;
> > > > > > > > 		if (!inode)
> > > > > > > > 			goto out_dput;
> > > > > > > > 
> > > > > > > > 		if (inode->i_op->follow_link) {
> > > > > > > > 
> > > > > > > > It happily dereferences dentry->d_inode without a barrier after
> > > > > > > > checking it for non-null, while that d_inode might have just been
> > > > > > > > initialized on another CPU with a freshly created inode.  There's
> > > > > > > > absolutely no synchronization with that on this side.
> > > > > > > 
> > > > > > > Perhaps it's not necessary; once set, how likely is i_op to be changed once
> > > > > > > I_NEW is cleared?
> > > > > > 
> > > > > > Are the path_get()s protecting this?
> > > > > 
> > > > > No, when creating a file the dentry will go from negative to positive
> > > > > independently from lookup.  The dentry can get instantiated with an
> > > > > inode between the path_get() and dereferencing ->d_inode.
> > > > > 
> > > > > > If there is no protection, then something like rcu_dereference() is
> > > > > > needed for the assignment from next.dentry->d_inode.
> > > > > 
> > > > > Do I understand correctly that the problem is that a CPU may have a
> > > > > stale cache associated with *inode, one that was loaded before the
> > > > > write barrier took effect?
> > > > 
> > > > Yes, especially if the compiler is aggressively optimizing.
> > > 
> > > How do compiler optimizations make a difference?
> > 
> > There are two types of reorderings that cause problems if you expect the
> > bus visible ordering to matter.  One is CPU issue reordering, where the
> > cpu decides to output loads and stores in a different order than the
> > input instruction stream actually said.  The other is compiler
> > re-ordering where the compiler actually reorders the instructions to
> > execute in a different order from what you'd expect by simply reading
> > the C code.
> > 
> > We have compiler barrier instructions for the latter and barriers which
> > issue CPU primitives for the former.
> 
> Right but in the concrete namei example I can't see how a compiler
> optimization can make a difference.  The order of the loads is quite
> clear:
> 
>    LOAD inode = next.dentry->inode
>    if (inode != NULL)
>    	LOAD inode->f_op
> 
> What is there the compiler can optimize?

In this example, it can't: the if is conditional on the previous
load, so even a speculating CPU is required to preserve the ordering.  If
the compiler could prove inode wasn't NULL (which I don't think it can in
this case) it might then re-order.  In your original question, there were
ways the compiler could behave strangely (for instance, it would likely
start by initialising p to &x rather than NULL).

James



* Re: memory barrier question
  2010-09-16 11:55 ` David Howells
                     ` (2 preceding siblings ...)
  2010-09-16 16:18     ` Peter Zijlstra
@ 2010-09-16 17:59   ` David Howells
  3 siblings, 0 replies; 41+ messages in thread
From: David Howells @ 2010-09-16 17:59 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: dhowells, Miklos Szeredi, Paul E. McKenney, linux-kernel, linux-arch

Peter Zijlstra <peterz@infradead.org> wrote:

> Therefore:
> 
>   A = 5
>   LOCK
> 
>   UNLOCK
>   B = 6
> 
> May be observed as:
> 
>    LOCK
>    B = 6
>    A = 5
>    UNLOCK

Yeah, you're right.

David

* Re: memory barrier question
  2010-09-16 17:17               ` Miklos Szeredi
  2010-09-16 17:40                 ` James Bottomley
@ 2010-09-17 21:49                 ` Benjamin Herrenschmidt
  2010-09-17 23:12                   ` Paul E. McKenney
  2010-09-18  1:12                   ` Alan Cox
  1 sibling, 2 replies; 41+ messages in thread
From: Benjamin Herrenschmidt @ 2010-09-17 21:49 UTC (permalink / raw)
  To: Miklos Szeredi
  Cc: James Bottomley, paulmck, dhowells, linux-kernel, linux-arch


> Right but in the concrete namei example I can't see how a compiler
> optimization can make a difference.  The order of the loads is quite
> clear:
> 
>    LOAD inode = next.dentry->inode
>    if (inode != NULL)
>    	LOAD inode->f_op
> 
> What is there the compiler can optimize?

Those two loads depend on each other, I don't think any implementation
can re-order them. In fact, such data dependency is typically what is
used to avoid having barriers in some cases. The second load cannot be
issued until the value from the first one is returned.

Cheers,
Ben.



* Re: memory barrier question
  2010-09-17 21:49                 ` Benjamin Herrenschmidt
@ 2010-09-17 23:12                   ` Paul E. McKenney
  2010-09-19  2:47                     ` Benjamin Herrenschmidt
  2010-09-18  1:12                   ` Alan Cox
  1 sibling, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2010-09-17 23:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Miklos Szeredi, James Bottomley, dhowells, linux-kernel, linux-arch

On Sat, Sep 18, 2010 at 07:49:08AM +1000, Benjamin Herrenschmidt wrote:
> 
> > Right but in the concrete namei example I can't see how a compiler
> > optimization can make a difference.  The order of the loads is quite
> > clear:
> > 
> >    LOAD inode = next.dentry->inode
> >    if (inode != NULL)
> >    	LOAD inode->f_op
> > 
> > What is there the compiler can optimize?
> 
> Those two loads depend on each other, I don't think any implementation
> can re-order them. In fact, such data dependency is typically what is
> used to avoid having barriers in some cases. The second load cannot be
> issued until the value from the first one is returned.

Sufficiently sadistic compiler and CPU implementations could do value
speculation, for example, driven by profile-feedback optimization.
Then the guess might initially be incorrect, but then a store by some other
CPU could make the subsequent test decide (wrongly) that the guess had
in fact been correct.

Needless to say, I am not a fan of value speculation.  But other people
do like it a lot.
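
As a purely hypothetical illustration (not something gcc is known to do),
value speculation applied to the fragment above could amount to the
compiler emitting something like:

	inode = predicted_inode;	/* guess from profile feedback (made up) */
	fop = inode->f_op;		/* dependent load issued early */
	if (next.dentry->inode == predicted_inode) {
		/*
		 * The check can pass because another CPU stored the
		 * predicted pointer *after* fop was loaded, so fop may
		 * predate that inode's initialisation.
		 */
	} else {
		inode = next.dentry->inode;	/* guess was wrong: redo */
		fop = inode->f_op;
	}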

							Thanx, Paul

* Re: memory barrier question
  2010-09-17 21:49                 ` Benjamin Herrenschmidt
  2010-09-17 23:12                   ` Paul E. McKenney
@ 2010-09-18  1:12                   ` Alan Cox
  1 sibling, 0 replies; 41+ messages in thread
From: Alan Cox @ 2010-09-18  1:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Miklos Szeredi, James Bottomley, paulmck, dhowells, linux-kernel,
	linux-arch

> >    LOAD inode = next.dentry->inode
> >    if (inode != NULL)
> >    	LOAD inode->f_op
> > 
> > What is there the compiler can optimize?
> 
> Those two loads depend on each other, I don't think any implementation
> can re-order them. In fact, such data dependency is typically what is
> used to avoid having barriers in some cases. The second load cannot be
> issued until the value from the first one is returned.

Actually I've met several compilers which will optimise them, but not gcc
and not in a Linux environment.  Some compilers know how to optimise the
== NULL case for pointer dereferences, provided their programming
environment has a couple of low pages mapped read-only and reading as 0.

Alan

* Re: memory barrier question
  2010-09-17 23:12                   ` Paul E. McKenney
@ 2010-09-19  2:47                     ` Benjamin Herrenschmidt
  2010-09-19 15:26                       ` Paul E. McKenney
  0 siblings, 1 reply; 41+ messages in thread
From: Benjamin Herrenschmidt @ 2010-09-19  2:47 UTC (permalink / raw)
  To: paulmck
  Cc: Miklos Szeredi, James Bottomley, dhowells, linux-kernel, linux-arch

On Fri, 2010-09-17 at 16:12 -0700, Paul E. McKenney wrote:
> On Sat, Sep 18, 2010 at 07:49:08AM +1000, Benjamin Herrenschmidt wrote:
> > 
> > > Right but in the concrete namei example I can't see how a compiler
> > > optimization can make a difference.  The order of the loads is quite
> > > clear:
> > > 
> > >    LOAD inode = next.dentry->inode
> > >    if (inode != NULL)
> > >    	LOAD inode->f_op
> > > 
> > > What is there the compiler can optimize?
> > 
> > Those two loads depend on each other, I don't think any implementation
> > can re-order them. In fact, such data dependency is typically what is
> > used to avoid having barriers in some cases. The second load cannot be
> > issued until the value from the first one is returned.
> 
> Sufficiently sadistic compiler and CPU implementations could do value
> speculation, for example, driven by profile-feedback optimization.
> Then the guess might initially be incorrect, but then a store by some other
> CPU could make the subsequent test decide (wrongly) that the guess had
> in fact been correct.
> 
> Needless to say, I am not a fan of value speculation.  But other people
> do like it a lot.

Well, this verges on insanity... we get to a point where nobody's going
to get any code right :-)

I don't think the powerpc arch allows that, that leaves us with the
compiler, but so far I don't think gcc is -that- crazy. Those constructs
are common enough...

Cheers,
Ben.



* Re: memory barrier question
  2010-09-19  2:47                     ` Benjamin Herrenschmidt
@ 2010-09-19 15:26                       ` Paul E. McKenney
  2010-09-19 20:15                         ` Miklos Szeredi
  0 siblings, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2010-09-19 15:26 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Miklos Szeredi, James Bottomley, dhowells, linux-kernel, linux-arch

On Sun, Sep 19, 2010 at 12:47:01PM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2010-09-17 at 16:12 -0700, Paul E. McKenney wrote:
> > On Sat, Sep 18, 2010 at 07:49:08AM +1000, Benjamin Herrenschmidt wrote:
> > > 
> > > > Right but in the concrete namei example I can't see how a compiler
> > > > optimization can make a difference.  The order of the loads is quite
> > > > clear:
> > > > 
> > > >    LOAD inode = next.dentry->inode
> > > >    if (inode != NULL)
> > > >    	LOAD inode->f_op
> > > > 
> > > > What is there the compiler can optimize?
> > > 
> > > Those two loads depend on each other, I don't think any implementation
> > > can re-order them. In fact, such data dependency is typically what is
> > > used to avoid having barriers in some cases. The second load cannot be
> > > issued until the value from the first one is returned.
> > 
> > Sufficiently sadistic compiler and CPU implementations could do value
> > speculation, for example, driven by profile-feedback optimization.
> > Then the guess might initially be incorrect, but then a store by some other
> > CPU could make the subsequent test decide (wrongly) that the guess had
> > in fact been correct.
> > 
> > Needless to say, I am not a fan of value speculation.  But other people
> > do like it a lot.
> 
> Well, this verges on insanity... we get to a point where nobody's going
> to get any code right :-)
> 
> I don't think the powerpc arch allows that, that leaves us with the
> compiler, but so far I don't think gcc is -that- crazy. Those constructs
> are common enough...

Give it a few years.  There are reportedly already other compilers that do
this, which is not too surprising given that the perception of insanity
is limited to lockless parallel code.  If you have single-threaded code,
such as code and data under a lock (where the data is never accessed
without holding that lock), then this sort of optimization is pretty safe.
I still don't like it, but the compiler guys would argue that this is
because I am one of those insane parallel-programming guys.

Furthermore, there are other ways to get into trouble.  If the code
continued as follows:

   LOAD inode = next.dentry->inode
   if (inode != NULL)
   	LOAD inode->f_op
	do_something_using_lots_of_registers();
	LOAD inode->some_other_field

and if the code expected ->f_op and ->some_other_field to be from the
same inode structure, severe disappointment could ensue.  This is because
the compiler is within its rights to reload from next.dentry->inode,
especially given register pressure.  In fact, the compiler would be within
its rights to reload from next.dentry->inode in the "LOAD inode->f_op"
statement.  And it might well get NULL from such a reload.

This code sequence therefore needs -at- -least- an ACCESS_ONCE() to keep
the compiler in line.

Remember, by default, the compiler is permitted to assume that it is
generating single-threaded code.
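
Concretely, the ACCESS_ONCE()-based version of the pseudocode above
would be along these lines (a sketch; fop and other are just
illustrative locals):

	inode = ACCESS_ONCE(next.dentry->inode);
	if (inode != NULL) {
		fop = inode->f_op;
		do_something_using_lots_of_registers();
		/* still guaranteed to be the same inode as above */
		other = inode->some_other_field;
	}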

							Thanx, Paul

* Re: memory barrier question
  2010-09-19 15:26                       ` Paul E. McKenney
@ 2010-09-19 20:15                         ` Miklos Szeredi
  2010-09-19 21:59                           ` Paul E. McKenney
  0 siblings, 1 reply; 41+ messages in thread
From: Miklos Szeredi @ 2010-09-19 20:15 UTC (permalink / raw)
  To: paulmck; +Cc: benh, miklos, James.Bottomley, dhowells, linux-kernel, linux-arch

On Sun, 19 Sep 2010, Paul E. McKenney wrote:
> Give it a few years.  There are reportedly already other compilers that do
> this, which is not too surprising given that the perception of insanity
> is limited to lockless parallel code.  If you have single-threaded code,
> such as code and data under a lock (where the data is never accessed
> without holding that lock), then this sort of optimization is pretty safe.
> I still don't like it, but the compiler guys would argue that this is
> because I am one of those insane parallel-programming guys.
> 
> Furthermore, there are other ways to get into trouble.  If the code
> continued as follows:
> 
>    LOAD inode = next.dentry->inode
>    if (inode != NULL)
>    	LOAD inode->f_op
> 	do_something_using_lots_of_registers();
> 	LOAD inode->some_other_field
> 
> and if the code expected ->f_op and ->some_other_field to be from the
> same inode structure, severe disappointment could ensue.  This is because
> the compiler is within its rights to reload from next.dentry->inode,
> especially given register pressure.  In fact, the compiler would be within
> its rights to reload from next.dentry->inode in the "LOAD inode->f_op"
> statement.  And it might well get NULL from such a reload.

Except the VFS doesn't allow that.  dentry->inode can go from NULL to
non-NULL anytime but will only go from non-NULL to NULL when there are
no possible external references to the dentry.

The compiler and the CPU cannot move the "LOAD inode->some_field"
before the "LOAD dentry->inode", because of the conditional, right?

Thanks,
Miklos

* Re: memory barrier question
  2010-09-19 20:15                         ` Miklos Szeredi
@ 2010-09-19 21:59                           ` Paul E. McKenney
  2010-09-20  0:58                             ` James Bottomley
  0 siblings, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2010-09-19 21:59 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: benh, James.Bottomley, dhowells, linux-kernel, linux-arch

On Sun, Sep 19, 2010 at 10:15:51PM +0200, Miklos Szeredi wrote:
> On Sun, 19 Sep 2010, Paul E. McKenney wrote:
> > Give it a few years.  There are reportedly already other compilers that do
> > this, which is not too surprising given that the perception of insanity
> > is limited to lockless parallel code.  If you have single-threaded code,
> > such as code and data under a lock (where the data is never accessed
> > without holding that lock), then this sort of optimization is pretty safe.
> > I still don't like it, but the compiler guys would argue that this is
> > because I am one of those insane parallel-programming guys.
> > 
> > Furthermore, there are other ways to get into trouble.  If the code
> > continued as follows:
> > 
> >    LOAD inode = next.dentry->inode
> >    if (inode != NULL)
> >    	LOAD inode->f_op
> > 	do_something_using_lots_of_registers();
> > 	LOAD inode->some_other_field
> > 
> > and if the code expected ->f_op and ->some_other_field to be from the
> > same inode structure, severe disappointment could ensue.  This is because
> > the compiler is within its rights to reload from next.dentry->inode,
> > especially given register pressure.  In fact, the compiler would be within
> > its rights to reload from next.dentry->inode in the "LOAD inode->f_op"
> > statement.  And it might well get NULL from such a reload.
> 
> Except the VFS doesn't allow that.  dentry->inode can go from NULL to
> non-NULL anytime but will only go from non-NULL to NULL when there are
> no possible external references to the dentry.
> 
> The compiler and the CPU cannot move the "LOAD inode->some_field"
> before the "LOAD dentry->inode", because of the conditional, right?

Other than Alpha, the CPU cannot.  The standard -does- permit the
compiler to guess the value of the pointer, thus effectively moving the
load prior to the conditional.  At present, as far as I know, gcc does
not actually do this.

Again, please put at least an ACCESS_ONCE() in.  Trivial to do now,
possibly saving much pain and headache later on.

							Thanx, Paul

* Re: memory barrier question
  2010-09-19 21:59                           ` Paul E. McKenney
@ 2010-09-20  0:58                             ` James Bottomley
  2010-09-20  1:29                               ` Paul E. McKenney
  0 siblings, 1 reply; 41+ messages in thread
From: James Bottomley @ 2010-09-20  0:58 UTC (permalink / raw)
  To: paulmck; +Cc: Miklos Szeredi, benh, dhowells, linux-kernel, linux-arch

On Sun, 2010-09-19 at 14:59 -0700, Paul E. McKenney wrote:
> On Sun, Sep 19, 2010 at 10:15:51PM +0200, Miklos Szeredi wrote:
> > On Sun, 19 Sep 2010, Paul E. McKenney wrote:
> > > Give it a few years.  There are reportedly already other compilers that do
> > > this, which is not too surprising given that the perception of insanity
> > > is limited to lockless parallel code.  If you have single-threaded code,
> > > such as code and data under a lock (where the data is never accessed
> > > without holding that lock), then this sort of optimization is pretty safe.
> > > I still don't like it, but the compiler guys would argue that this is
> > > because I am one of those insane parallel-programming guys.
> > > 
> > > Furthermore, there are other ways to get into trouble.  If the code
> > > continued as follows:
> > > 
> > >    LOAD inode = next.dentry->inode
> > >    if (inode != NULL)
> > >    	LOAD inode->f_op
> > > 	do_something_using_lots_of_registers();
> > > 	LOAD inode->some_other_field
> > > 
> > > and if the code expected ->f_op and ->some_other_field to be from the
> > > same inode structure, severe disappointment could ensue.  This is because
> > > the compiler is within its rights to reload from next.dentry->inode,
> > > especially given register pressure.  In fact, the compiler would be within
> > > its rights to reload from next.dentry->inode in the "LOAD inode->f_op"
> > > statement.  And it might well get NULL from such a reload.
> > 
> > Except the VFS doesn't allow that.  dentry->inode can go from NULL to
> > non-NULL anytime but will only go from non-NULL to NULL when there are
> > no possible external references to the dentry.
> > 
> > The compiler and the CPU cannot move the "LOAD inode->some_field"
> > before the "LOAD dentry->inode", because of the conditional, right?
> 
> Other than Alpha, the CPU cannot.  The standard -does- permit the
> compiler to guess the value of the pointer, thus effectively moving the
> load prior to the conditional.  At present, as far as I know, gcc does
> not actually do this.

I believe the standard only allows this if the compiler can *prove* (not
guess) the result of the if test.  Even on Parisc where this can be
expressed as a nullified instruction pair, the architectural rules
forbid speculation of the latter before the former (the only time they
can be executed out of order is when the CPU can prove the values
internally, i.e. they're both cached)

> Again, please put at least an ACCESS_ONCE() in.  Trivial to do now,
> possibly saving much pain and headache later on.

OK, lost you here.  ACCESS_ONCE() is only needed in certain situations
(like list traversal) because some compilers can reload cached values
across an explicit barrier (which isn't here).

James



* Re: memory barrier question
  2010-09-20  0:58                             ` James Bottomley
@ 2010-09-20  1:29                               ` Paul E. McKenney
  2010-09-20 16:01                                 ` Miklos Szeredi
  0 siblings, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2010-09-20  1:29 UTC (permalink / raw)
  To: James Bottomley; +Cc: Miklos Szeredi, benh, dhowells, linux-kernel, linux-arch

On Sun, Sep 19, 2010 at 07:58:36PM -0500, James Bottomley wrote:
> On Sun, 2010-09-19 at 14:59 -0700, Paul E. McKenney wrote:
> > On Sun, Sep 19, 2010 at 10:15:51PM +0200, Miklos Szeredi wrote:
> > > On Sun, 19 Sep 2010, Paul E. McKenney wrote:
> > > > Give it a few years.  There are reportedly already other compilers that do
> > > > this, which is not too surprising given that the perception of insanity
> > > > is limited to lockless parallel code.  If you have single-threaded code,
> > > > such as code and data under a lock (where the data is never accessed
> > > > without holding that lock), then this sort of optimization is pretty safe.
> > > > I still don't like it, but the compiler guys would argue that this is
> > > > because I am one of those insane parallel-programming guys.
> > > > 
> > > > Furthermore, there are other ways to get into trouble.  If the code
> > > > continued as follows:
> > > > 
> > > >    LOAD inode = next.dentry->inode
> > > >    if (inode != NULL)
> > > >    	LOAD inode->f_op
> > > > 	do_something_using_lots_of_registers();
> > > > 	LOAD inode->some_other_field
> > > > 
> > > > and if the code expected ->f_op and ->some_other_field to be from the
> > > > same inode structure, severe disappointment could ensue.  This is because
> > > > the compiler is within its rights to reload from next.dentry->inode,
> > > > especially given register pressure.  In fact, the compiler would be within
> > > > its rights to reload from next.dentry->inode in the "LOAD inode->f_op"
> > > > statement.  And it might well get NULL from such a reload.
> > > 
> > > Except the VFS doesn't allow that.  dentry->inode can go from NULL to
> > > non-NULL anytime but will only go from non-NULL to NULL when there are
> > > no possible external references to the dentry.
> > > 
> > > The compiler and the CPU cannot move the "LOAD inode->some_field"
> > > before the "LOAD dentry->inode", because of the conditional, right?
> > 
> > Other than Alpha, the CPU cannot.  The standard -does- permit the
> > compiler to guess the value of the pointer, thus effectively moving the
> > load prior to the conditional.  At present, as far as I know, gcc does
> > not actually do this.
> 
> I believe the standard only allows this if the compiler can *prove* (not
> guess) the result of the if test.  Even on PA-RISC, where this can be
> expressed as a nullified instruction pair, the architectural rules
> forbid speculation of the latter before the former (the only time they
> can be executed out of order is when the CPU can prove the values
> internally, i.e. they're both cached).

Agreed, but unless you tell it otherwise, the compiler is allowed to
assume that the code it is generating is the only thread running in
the address space in question.  That is most definitely not the case
here, so we might need to let the compiler in on this information.

> > Again, please put at least an ACCESS_ONCE() in.  Trivial to do now,
> > possibly saving much pain and headache later on.
> 
> OK, lost you here.  ACCESS_ONCE() is only needed in certain situations
> (like list traversal) because some compilers can reload cached values
> across an explicit barrier (which isn't here).

ACCESS_ONCE() also tells the compiler not to try to guess.
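
For reference, ACCESS_ONCE() amounts to little more than a volatile
cast -- roughly the following sketch of the include/linux/compiler.h
definition of this era:

	#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

The volatile access forbids the compiler from omitting, repeating, or
inventing a value for the load, which is what rules out the guessing
discussed above.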

							Thanx, Paul

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: memory barrier question
  2010-09-20  1:29                               ` Paul E. McKenney
@ 2010-09-20 16:01                                 ` Miklos Szeredi
  2010-09-20 18:25                                     ` Paul E. McKenney
  0 siblings, 1 reply; 41+ messages in thread
From: Miklos Szeredi @ 2010-09-20 16:01 UTC (permalink / raw)
  To: paulmck; +Cc: James.Bottomley, miklos, benh, dhowells, linux-kernel, linux-arch

On Sun, 19 Sep 2010, Paul E. McKenney wrote:
> > > Again, please put at least an ACCESS_ONCE() in.  Trivial to do now,
> > > possibly saving much pain and headache later on.
> > 
> > OK, lost you here.  ACCESS_ONCE() is only needed in certain situations
> > (like list traversal) because some compilers can reload cached values
> > across an explicit barrier (which isn't here).
> 
> ACCESS_ONCE() also tells the compiler not to try to guess.

If the code is written like this:

	if (ACCESS_ONCE(dentry->d_inode)) {
		blah = dentry->d_inode->i_some_field
		...
	}

does the compiler guarantee anything or does it need a full compiler
barrier to prevent reordering?

Because that pattern is, again, pretty much all over the place.  Yeah
it can be rewritten but that's not always feasible since it's
difficult to audit, would possibly need extra function arguments,
etc...

Thanks,
Miklos

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: memory barrier question
  2010-09-20 16:01                                 ` Miklos Szeredi
@ 2010-09-20 18:25                                     ` Paul E. McKenney
  0 siblings, 0 replies; 41+ messages in thread
From: Paul E. McKenney @ 2010-09-20 18:25 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: James.Bottomley, benh, dhowells, linux-kernel, linux-arch

On Mon, Sep 20, 2010 at 06:01:58PM +0200, Miklos Szeredi wrote:
> On Sun, 19 Sep 2010, Paul E. McKenney wrote:
> > > > Again, please put at least an ACCESS_ONCE() in.  Trivial to do now,
> > > > possibly saving much pain and headache later on.
> > > 
> > > OK, lost you here.  ACCESS_ONCE() is only needed in certain situations
> > > (like list traversal) because some compilers can reload cached values
> > > across an explicit barrier (which isn't here).
> > 
> > ACCESS_ONCE() also tells the compiler not to try to guess.
> 
> If the code is written like this:
> 
> 	if (ACCESS_ONCE(dentry->d_inode)) {
> 		blah = dentry->d_inode->i_some_field
> 		...
> 	}
> 
> does the compiler guarantee anything or does it need a full compiler
> barrier to prevent reordering?

From what I understand, this could do strange things.  The compiler
is forced to access dentry->d_inode for the "if" check, but would be
free to use some previously fetched value for the assignment to "blah".
Unless of course this code was under a lock that prevented any
change to dentry->d_inode.

If the code is to execute in a lockless manner, I would instead suggest
something like the following:

	p = ACCESS_ONCE(dentry->d_inode);
	if (p) {
		blah = p->i_some_field
		...
	}

This would force the compiler to actually fetch dentry->d_inode
and only then dereference it.

This would -not- constrain the CPU in any way, but the only CPU that
I know of that misbehaves in this case is DEC Alpha.

So my version of the above code would do what you expect on most CPUs,
but really could fail on DEC Alpha.  If you don't believe me, please feel
free to take a look at http://www.openvms.compaq.com/wizard/wiz_2637.html.
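
For completeness, the Alpha-safe form of the same pattern loads the
pointer with rcu_dereference(), which folds in the
smp_read_barrier_depends() that Alpha needs and costs essentially
nothing on other architectures -- a sketch, reusing the names from the
example above:

	p = rcu_dereference(dentry->d_inode);
	if (p) {
		blah = p->i_some_field;
		...
	}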

But do we really care about Alpha anymore?  (I can see it now... The
Alpha portion of the kernel tree moves to staging...)

> Because that pattern is, again, pretty much all over the place.  Yeah
> it can be rewritten but that's not always feasible since it's
> difficult to audit, would possibly need extra function arguments,
> etc...

Again, the pattern is OK if you are preventing the pointer from changing.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: memory barrier question
  2010-09-20 18:25                                     ` Paul E. McKenney
  (?)
@ 2010-09-20 18:57                                     ` Paul E. McKenney
  -1 siblings, 0 replies; 41+ messages in thread
From: Paul E. McKenney @ 2010-09-20 18:57 UTC (permalink / raw)
  To: Miklos Szeredi; +Cc: James.Bottomley, benh, dhowells, linux-kernel, linux-arch

On Mon, Sep 20, 2010 at 11:25:04AM -0700, Paul E. McKenney wrote:
> On Mon, Sep 20, 2010 at 06:01:58PM +0200, Miklos Szeredi wrote:
> > On Sun, 19 Sep 2010, Paul E. McKenney wrote:
> > > > > Again, please put at least an ACCESS_ONCE() in.  Trivial to do now,
> > > > > possibly saving much pain and headache later on.
> > > > 
> > > > OK, lost you here.  ACCESS_ONCE() is only needed in certain situations
> > > > (like list traversal) because some compilers can reload cached values
> > > > across an explicit barrier (which isn't here).
> > > 
> > > ACCESS_ONCE() also tells the compiler not to try to guess.
> > 
> > If the code is written like this:
> > 
> > 	if (ACCESS_ONCE(dentry->d_inode)) {
> > 		blah = dentry->d_inode->i_some_field
> > 		...
> > 	}
> > 
> > does the compiler guarantee anything or does it need a full compiler
> > barrier to prevent reordering?
> 
> From what I understand, this could do strange things.  The compiler
> is forced to access dentry->d_inode for the "if" check, but would be
> free to use some previously fetched value for the assignment to "blah".
> Unless of course this code was under a lock that prevented any
> change to dentry->d_inode.
> 
> If the code is to execute in a lockless manner, I would instead suggest
> something like the following:
> 
> 	p = ACCESS_ONCE(dentry->d_inode);
> 	if (p) {
> 		blah = p->i_some_field
> 		...
> 	}
> 
> This would force the compiler to actually fetch dentry->d_inode
> and only then dereference it.

Are the initial check and the assignment to "blah" in different
functions or something?  If so, the following might be easier to
deal with:

	if (ACCESS_ONCE(dentry->d_inode)) {
		blah = ACCESS_ONCE(dentry->d_inode)->i_some_field
		...
	}

The compiler is forbidden to reorder volatile accesses (at least assuming
that they have a well-defined order, which they do in this case).  The
CPU is required to make successive accesses to the same memory location
be in order.

So as long as the only change to dentry->d_inode is a NULL-to-non-NULL
transition, the above should work, other than on DEC Alpha.

						Thanx, Paul

> This would -not- constrain the CPU in any way, but the only CPU that
> I know of that misbehaves in this case is DEC Alpha.
> 
> So my version of the above code would do what you expect on most CPUs,
> but really could fail on DEC Alpha.  If you don't believe me, please feel
> free to take a look at http://www.openvms.compaq.com/wizard/wiz_2637.html.
> 
> But do we really care about Alpha anymore?  (I can see it now... The
> Alpha portion of the kernel tree moves to staging...)
> 
> > Because that pattern is, again, pretty much all over the place.  Yeah
> > it can be rewritten but that's not always feasible since it's
> > difficult to audit, would possibly need extra function arguments,
> > etc...
> 
> Again, the pattern is OK if you are preventing the pointer from changing.
> 
> 							Thanx, Paul

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: memory barrier question
  2010-09-20 18:25                                     ` Paul E. McKenney
  (?)
  (?)
@ 2010-09-20 20:26                                     ` Michael Cree
  2010-09-20 20:40                                       ` Paul E. McKenney
  -1 siblings, 1 reply; 41+ messages in thread
From: Michael Cree @ 2010-09-20 20:26 UTC (permalink / raw)
  To: paulmck
  Cc: Miklos Szeredi, James.Bottomley, benh, dhowells, linux-kernel,
	linux-arch

On 21/09/10 06:25, Paul E. McKenney wrote:
> But do we really care about Alpha anymore?  (I can see it now... The
> Alpha portion of the kernel tree moves to staging...)

I do.

Sent from my Compaq Alpha XP1000 :-)

Cheers
Michael.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: memory barrier question
  2010-09-20 20:26                                     ` Michael Cree
@ 2010-09-20 20:40                                       ` Paul E. McKenney
  2010-09-21 14:59                                         ` Paul E. McKenney
  0 siblings, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2010-09-20 20:40 UTC (permalink / raw)
  To: Michael Cree
  Cc: Miklos Szeredi, James.Bottomley, benh, dhowells, linux-kernel,
	linux-arch

On Tue, Sep 21, 2010 at 08:26:14AM +1200, Michael Cree wrote:
> On 21/09/10 06:25, Paul E. McKenney wrote:
> >But do we really care about Alpha anymore?  (I can see it now... The
> >Alpha portion of the kernel tree moves to staging...)
> 
> I do.
> 
> Sent from my Compaq Alpha XP1000 :-)

In that case, an smp_read_barrier_depends() or rcu_dereference() is
also required...

							Thanx, Paul

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: memory barrier question
  2010-09-20 20:40                                       ` Paul E. McKenney
@ 2010-09-21 14:59                                         ` Paul E. McKenney
  2010-09-22 18:41                                           ` Paul E. McKenney
  0 siblings, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2010-09-21 14:59 UTC (permalink / raw)
  To: Michael Cree
  Cc: Miklos Szeredi, James.Bottomley, benh, dhowells, linux-kernel,
	linux-arch

On Mon, Sep 20, 2010 at 01:40:55PM -0700, Paul E. McKenney wrote:
> On Tue, Sep 21, 2010 at 08:26:14AM +1200, Michael Cree wrote:
> > On 21/09/10 06:25, Paul E. McKenney wrote:
> > >But do we really care about Alpha anymore?  (I can see it now... The
> > >Alpha portion of the kernel tree moves to staging...)
> > 
> > I do.
> > 
> > Sent from my Compaq Alpha XP1000 :-)
> 
> In that case, an smp_read_barrier_depends() or rcu_dereference() is
> also required...

Wait a minute...  I finally got a chance to look up the system.  This
system has only a single Alpha CPU, right?

If so, just build a UP kernel and be happy.  No need to worry about
any of these SMP issues.

So, an alternative proposal -- support only UP kernels on Alpha.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: memory barrier question
  2010-09-21 14:59                                         ` Paul E. McKenney
@ 2010-09-22 18:41                                           ` Paul E. McKenney
  0 siblings, 0 replies; 41+ messages in thread
From: Paul E. McKenney @ 2010-09-22 18:41 UTC (permalink / raw)
  To: Michael Cree
  Cc: Miklos Szeredi, James.Bottomley, benh, dhowells, linux-kernel,
	linux-arch

On Tue, Sep 21, 2010 at 07:59:10AM -0700, Paul E. McKenney wrote:
> On Mon, Sep 20, 2010 at 01:40:55PM -0700, Paul E. McKenney wrote:
> > On Tue, Sep 21, 2010 at 08:26:14AM +1200, Michael Cree wrote:
> > > On 21/09/10 06:25, Paul E. McKenney wrote:
> > > >But do we really care about Alpha anymore?  (I can see it now... The
> > > >Alpha portion of the kernel tree moves to staging...)
> > > 
> > > I do.
> > > 
> > > Sent from my Compaq Alpha XP1000 :-)
> > 
> > In that case, an smp_read_barrier_depends() or rcu_dereference() is
> > also required...
> 
> Wait a minute...  I finally got a chance to look up the system.  This
> system has only a single Alpha CPU, right?
> 
> If so, just build a UP kernel and be happy.  No need to worry about
> any of these SMP issues.
> 
> So, an alternative proposal -- support only UP kernels on Alpha.  ;-)

And there do seem to be some SMP Alpha systems still in use:

	http://www.alphalinux.org/wiki/index.php/User:Armin76/HW_list

							Thanx, Paul

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Memory barrier question.
  2010-10-30  6:11     ` Eric Dumazet
@ 2010-10-30 12:40       ` Tetsuo Handa
  0 siblings, 0 replies; 41+ messages in thread
From: Tetsuo Handa @ 2010-10-30 12:40 UTC (permalink / raw)
  To: eric.dumazet; +Cc: paulmck, linux-kernel

Hello.

Eric Dumazet wrote:
> On Saturday, 30 October 2010 at 14:48 +0900, Tetsuo Handa wrote:
> > Then, what about list heads that are dynamically allocated/initialized?
> > 
> > struct hashed_word {
> > 	struct list_head list;
> > 	void *key;
> > 	char *buf;
> > };
> > static struct list_head tables[1u << TABLEBIT];
> > static void (*some_callback) (void *);
> > 
> 
> typedef void (*some_callback_t) (void *);
> 
> static some_callback_t __rcu *some_callback;
> 
> > static void init(void)
> > {
> > 	int i;
> > 	/* Initialize the table. */
> > 	for (i = 0; i < (1u << TABLEBIT); i++)
> > 		INIT_LIST_HEAD(&tables[i]);
> > 	smp_wmb();
> > 	/* Allow other CPUs to access the table. */
> > 	some_callback = func;
> 
> rcu_assign_pointer(some_callback, func);
> 
This one is OK. But a bit different from what I want to do.

> > }
> > 
> > /*
> >  * This function is called by
> >  *
> >  *	if (some_callback)
> >  *		some_callback(some_ptr);
> >  *
> >  * rather than
> >  *
> >  *	func(some_ptr);
> 
> well no, see later.
> 
I want to split the built-in part from the module part.  I'm happy to embed

	if (some_callback)
		some_callback(some_ptr);

into the built-in part, but I'm not happy to expose the tables to the
built-in part.  I'm using a function pointer to achieve this (like LSM).

> >  *
> >  * .
> >  */
> > static void func(void *some_ptr)
> > {
> > 	struct hashed_word *ptr;
> > 	struct list_head *table = &tables[hash_ptr(some_ptr, TABLEBIT)];
> > 	/* We must make sure that readers see table->next != NULL. */
> 
> 
> > 	smp_rmb();
> Nope.. delete this
> 
> > 	rcu_read_lock();
> > 	list_for_each_entry_rcu(ptr, table, list) {
> > 		if (ptr->key != some_ptr)
> > 			continue;
> > 		printk("%s\n", ptr->buf);
> > 		break;
> > 	}
> > 	rcu_read_unlock();
> > }
> > 
> > How can I make sure INIT_LIST_HEAD() in init() takes effect?
> > Is smp_rmb() appropriate?
> 
> Please avoid smp_rmb()/smb_wmb() and use RCU api only.
> 
> some_callback_t *ptr;
> 
> rcu_read_lock();
> ptr = rcu_dereference(some_callback);
> if (ptr) {

I want to know how I can guarantee that "table->next != NULL"
while exposing only "some_callback" to the built-in part.
Once the result of INIT_LIST_HEAD() becomes visible to readers,
readers no longer need RCU to guarantee table->next != NULL.

> 	list_for_each_entry_rcu(aux, table, list) {
>  		if (aux->key != ptr)
>  			continue;
>  		printk("%s\n", aux->buf);
>  		break;
>  	}
>  }
> rcu_read_unlock();

Regards.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Memory barrier question.
  2010-10-30  5:48   ` Tetsuo Handa
@ 2010-10-30  6:11     ` Eric Dumazet
  2010-10-30 12:40       ` Tetsuo Handa
  0 siblings, 1 reply; 41+ messages in thread
From: Eric Dumazet @ 2010-10-30  6:11 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: paulmck, linux-kernel

On Saturday, 30 October 2010 at 14:48 +0900, Tetsuo Handa wrote:
> Paul E. McKenney wrote:
> > For the CPU, let's take it by type of architecture...
> > 
> > First, let's get the UP-only architectures out of the way.  These would
> > always see their changes in order, so would always see "hello".
> 
> Of course.
> 
> > Second, let's consider the TSO architectures, including x86, SPARC,
> > PA-RISC, and IBM Mainframe.  On these architectures, reads are not
> > reordered by the CPU, so if they see the new pointer, they will also
> > see the new characters -- hence "hello".
> 
> Yes.
> 
> > Next, let's consider weakly ordered systems that respect dependency
> > ordering (ARM, PowerPC, Itanium).  The load of the pointer would
> > always be ordered with respect to any dereference of the pointer,
> > so they would always see "hello".
> 
> I'm relieved to hear that.
> 
> > This leave DEC Alpha.  In this architecture, smp_read_barrier_depends()
> > expands to smp_rmb(), which forces the ordering as required.  So
> > Alpha also sees "hello."
> 
> Yes.
> 
> > I believe that this covers all of the cases.
> > 
> > Am I missing anything?
> 
> You are right. Regarding list elements, they are appropriately protected.
> 
> 
> 
> Then, what about list heads that are dynamically allocated/initialized?
> 
> struct hashed_word {
> 	struct list_head list;
> 	void *key;
> 	char *buf;
> };
> static struct list_head tables[1u << TABLEBIT];
> static void (*some_callback) (void *);
> 

typedef void (*some_callback_t) (void *);

static some_callback_t __rcu *some_callback;

> static void init(void)
> {
> 	int i;
> 	/* Initialize the table. */
> 	for (i = 0; i < (1u << TABLEBIT); i++)
> 		INIT_LIST_HEAD(&tables[i]);
> 	smp_wmb();
> 	/* Allow other CPUs to access the table. */
> 	some_callback = func;

rcu_assign_pointer(some_callback, func);

> }
> 
> /*
>  * This function is called by
>  *
>  *	if (some_callback)
>  *		some_callback(some_ptr);
>  *
>  * rather than
>  *
>  *	func(some_ptr);

well no, see later.

>  *
>  * .
>  */
> static void func(void *some_ptr)
> {
> 	struct hashed_word *ptr;
> 	struct list_head *table = &tables[hash_ptr(some_ptr, TABLEBIT)];
> 	/* We must make sure that readers see table->next != NULL. */


> 	smp_rmb();
Nope.. delete this

> 	rcu_read_lock();
> 	list_for_each_entry_rcu(ptr, table, list) {
> 		if (ptr->key != some_ptr)
> 			continue;
> 		printk("%s\n", ptr->buf);
> 		break;
> 	}
> 	rcu_read_unlock();
> }
> 
> How can I make sure INIT_LIST_HEAD() in init() takes effect?
> Is smp_rmb() appropriate?

Please avoid smp_rmb()/smb_wmb() and use RCU api only.

some_callback_t *ptr;

rcu_read_lock();
ptr = rcu_dereference(some_callback);
if (ptr) {
	list_for_each_entry_rcu(aux, table, list) {
 		if (aux->key != ptr)
 			continue;
 		printk("%s\n", aux->buf);
 		break;
 	}
 }
rcu_read_unlock();
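
Putting the writer side together the same way -- a sketch based on the
init() above, where rcu_assign_pointer() provides the write barrier
that the open-coded smp_wmb() was supplying:

static void init(void)
{
	int i;

	/* Initialize the table. */
	for (i = 0; i < (1u << TABLEBIT); i++)
		INIT_LIST_HEAD(&tables[i]);
	/* Publish the callback only after the table is initialized. */
	rcu_assign_pointer(some_callback, func);
}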




^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Memory barrier question.
  2010-10-29 15:09 ` Paul E. McKenney
@ 2010-10-30  5:48   ` Tetsuo Handa
  2010-10-30  6:11     ` Eric Dumazet
  0 siblings, 1 reply; 41+ messages in thread
From: Tetsuo Handa @ 2010-10-30  5:48 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel

Paul E. McKenney wrote:
> For the CPU, let's take it by type of architecture...
> 
> First, let's get the UP-only architectures out of the way.  These would
> always see their changes in order, so would always see "hello".

Of course.

> Second, let's consider the TSO architectures, including x86, SPARC,
> PA-RISC, and IBM Mainframe.  On these architectures, reads are not
> reordered by the CPU, so if they see the new pointer, they will also
> see the new characters -- hence "hello".

Yes.

> Next, let's consider weakly ordered systems that respect dependency
> ordering (ARM, PowerPC, Itanium).  The load of the pointer would
> always be ordered with respect to any dereference of the pointer,
> so they would always see "hello".

I'm relieved to hear that.

> This leave DEC Alpha.  In this architecture, smp_read_barrier_depends()
> expands to smp_rmb(), which forces the ordering as required.  So
> Alpha also sees "hello."

Yes.

> I believe that this covers all of the cases.
> 
> Am I missing anything?

You are right. Regarding list elements, they are appropriately protected.



Then, what about list heads that are dynamically allocated/initialized?

struct hashed_word {
	struct list_head list;
	void *key;
	char *buf;
};
static struct list_head tables[1u << TABLEBIT];
static void (*some_callback) (void *);

static void init(void)
{
	int i;
	/* Initialize the table. */
	for (i = 0; i < (1u << TABLEBIT); i++)
		INIT_LIST_HEAD(&tables[i]);
	smp_wmb();
	/* Allow other CPUs to access the table. */
	some_callback = func;
}

/*
 * This function is called by
 *
 *	if (some_callback)
 *		some_callback(some_ptr);
 *
 * rather than
 *
 *	func(some_ptr);
 *
 * .
 */
static void func(void *some_ptr)
{
	struct hashed_word *ptr;
	struct list_head *table = &tables[hash_ptr(some_ptr, TABLEBIT)];
	/* We must make sure that readers see table->next != NULL. */
	smp_rmb();
	rcu_read_lock();
	list_for_each_entry_rcu(ptr, table, list) {
		if (ptr->key != some_ptr)
			continue;
		printk("%s\n", ptr->buf);
		break;
	}
	rcu_read_unlock();
}

How can I make sure INIT_LIST_HEAD() in init() takes effect?
Is smp_rmb() appropriate?

Well, I wish we could wait for all CPUs to see table->next != NULL
before proceeding to "some_callback = func;", so that readers would be
guaranteed to see table->next != NULL without issuing memory barriers...

Regards.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Memory barrier question.
  2010-10-29 13:23 Memory " Tetsuo Handa
@ 2010-10-29 15:09 ` Paul E. McKenney
  2010-10-30  5:48   ` Tetsuo Handa
  0 siblings, 1 reply; 41+ messages in thread
From: Paul E. McKenney @ 2010-10-29 15:09 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: linux-kernel

On Fri, Oct 29, 2010 at 10:23:41PM +0900, Tetsuo Handa wrote:
> Hello.
> 
> I got a question regarding memory barrier.
> 
> struct word {
> 	struct list_head list;
> 	char *buf;
> };
> 
> static LIST_HEAD(wordlist);
> static DEFINE_SPINLOCK(wordlist_lock);
> 
> --- On CPU 0 ---
> 
> struct word *hello = kzalloc(sizeof(*hello), GFP_KERNEL);
> hello->buf = kstrdup("hello", GFP_KERNEL);
> spin_lock(&wordlist_lock);
> list_add_rcu(&hello->list, &wordlist);
> spin_unlock(&wordlist_lock);
> 
> --- On CPU 1 ---
> 
> struct word *ptr;
> rcu_read_lock();
> list_for_each_entry_rcu(ptr, &wordlist, list) {
> 	char *str = ptr->buf;
> 	printk("%s\n", str);
> }
> rcu_read_unlock();
> 
> Use of rcu_assign_pointer() and rcu_dereference() guarantees that
> CPU 1 reads &hello->list from wordlist.next only after the store of
> the kstrdup()ed pointer to hello->buf is visible to it.
> But what guarantees that CPU 1 gets "hello" when it reads through that
> kstrdup()ed pointer?
> 
> Say, kstrdup("hello", GFP_KERNEL) stores
> 
>   'h' -> 0xC0000000
>   'e' -> 0xC0000001
>   'l' -> 0xC0000002
>   'l' -> 0xC0000003
>   'o' -> 0xC0000004
>   '\0' -> 0xC0000005
> 
> and hello->buf = kstrdup() stores
> 
>   0xC0000000 -> hello->buf
> 
> .
> 
> If ordered by smp_wmb() by CPU 0 and smp_rmb() by CPU 1,
> str = ptr->buf will load
> 
>   0xC0000000 -> str
> 
> and printk("%s\n", str) will load
> 
>   0xC0000000 -> 'h'
>   0xC0000001 -> 'e'
>   0xC0000002 -> 'l'
>   0xC0000003 -> 'l'
>   0xC0000004 -> 'o'
>   0xC0000005 -> '\0'
> 
> .
> 
> Since CPU 0 issued smp_wmb() (inside list_add_rcu()) but CPU 1 did not issue
> smp_rmb() (inside list_for_each_entry_rcu()), I think CPU 1 would see bogus
> values like
> 
>   0xC0000000 -> 'h'
>   0xC0000001 -> 'a'
>   0xC0000002 -> 'l'
>   0xC0000003 -> '1'
>   0xC0000004 -> 'o'
>   0xC0000005 -> 'w'
>   0xC0000006 -> 'e'
>   0xC0000007 -> 'e'
>   0xC0000008 -> 'n'
>   0xC0000009 -> '\0'

Timely string value, given this coming Sunday in USA.  ;-)

> .
> 
> It seems to me that people do not call smp_rmb() before reading memory
> which was dynamically allocated/initialized. What am I missing?

There is the compiler and the CPU.  The compiler is prohibited from
reordering the reads due to the ACCESS_ONCE().
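
Stripped of the debugging annotations, the rcu_dereference() variant
used by list_for_each_entry_rcu() boils down to roughly this sketch,
which is where that ACCESS_ONCE() -- and, on Alpha, the dependency
barrier -- comes from:

	#define rcu_dereference(p) \
		({ \
			typeof(p) _p = ACCESS_ONCE(p); \
			smp_read_barrier_depends(); \
			_p; \
		})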

For the CPU, let's take it by type of architecture...

First, let's get the UP-only architectures out of the way.  These would
always see their changes in order, so would always see "hello".

Second, let's consider the TSO architectures, including x86, SPARC,
PA-RISC, and IBM Mainframe.  On these architectures, reads are not
reordered by the CPU, so if they see the new pointer, they will also
see the new characters -- hence "hello".

Next, let's consider weakly ordered systems that respect dependency
ordering (ARM, PowerPC, Itanium).  The load of the pointer would
always be ordered with respect to any dereference of the pointer,
so they would always see "hello".

This leave DEC Alpha.  In this architecture, smp_read_barrier_depends()
expands to smp_rmb(), which forces the ordering as required.  So
Alpha also sees "hello."

I believe that this covers all of the cases.

Am I missing anything?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Memory barrier question.
@ 2010-10-29 13:23 Tetsuo Handa
  2010-10-29 15:09 ` Paul E. McKenney
  0 siblings, 1 reply; 41+ messages in thread
From: Tetsuo Handa @ 2010-10-29 13:23 UTC (permalink / raw)
  To: paulmck; +Cc: linux-kernel

Hello.

I got a question regarding memory barrier.

struct word {
	struct list_head list;
	char *buf;
};

static LIST_HEAD(wordlist);
static DEFINE_SPINLOCK(wordlist_lock);

--- On CPU 0 ---

struct word *hello = kzalloc(sizeof(*hello), GFP_KERNEL);
hello->buf = kstrdup("hello", GFP_KERNEL);
spin_lock(&wordlist_lock);
list_add_rcu(&hello->list, &wordlist);
spin_unlock(&wordlist_lock);

--- On CPU 1 ---

struct word *ptr;
rcu_read_lock();
list_for_each_entry_rcu(ptr, &wordlist, list) {
	char *str = ptr->buf;
	printk("%s\n", str);
}
rcu_read_unlock();

Use of rcu_assign_pointer() and rcu_dereference() guarantees that
CPU 1 reads &hello->list from wordlist.next only after the store of
the kstrdup()ed pointer to hello->buf is visible to it.
But what guarantees that CPU 1 gets "hello" when it reads through that
kstrdup()ed pointer?

Say, kstrdup("hello", GFP_KERNEL) stores

  'h' -> 0xC0000000
  'e' -> 0xC0000001
  'l' -> 0xC0000002
  'l' -> 0xC0000003
  'o' -> 0xC0000004
  '\0' -> 0xC0000005

and hello->buf = kstrdup() stores

  0xC0000000 -> hello->buf

.

If ordered by smp_wmb() by CPU 0 and smp_rmb() by CPU 1,
str = ptr->buf will load

  0xC0000000 -> str

and printk("%s\n", str) will load

  0xC0000000 -> 'h'
  0xC0000001 -> 'e'
  0xC0000002 -> 'l'
  0xC0000003 -> 'l'
  0xC0000004 -> 'o'
  0xC0000005 -> '\0'

.

Since CPU 0 issued smp_wmb() (inside list_add_rcu()) but CPU 1 did not issue
smp_rmb() (inside list_for_each_entry_rcu()), I think CPU 1 would see bogus
values like

  0xC0000000 -> 'h'
  0xC0000001 -> 'a'
  0xC0000002 -> 'l'
  0xC0000003 -> '1'
  0xC0000004 -> 'o'
  0xC0000005 -> 'w'
  0xC0000006 -> 'e'
  0xC0000007 -> 'e'
  0xC0000008 -> 'n'
  0xC0000009 -> '\0'

.

It seems to me that people do not call smp_rmb() before reading memory
which was dynamically allocated/initialized. What am I missing?

Regards.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: memory barrier question
@ 2010-09-20 10:34 George Spelvin
  0 siblings, 0 replies; 41+ messages in thread
From: George Spelvin @ 2010-09-20 10:34 UTC (permalink / raw)
  To: benh, miklos; +Cc: linux, linux-arch, linux-kernel

>>    LOAD inode = next.dentry->inode
>>    if (inode != NULL)
>>    	LOAD inode->f_op
>> 
>> What is there the compiler can optimize?

> Those two loads depend on each other, I don't think any implementation
> can re-order them. In fact, such data dependency is typically what is
> used to avoid having barriers in some cases. The second load cannot be
> issued until the value from the first one is returned.

It's quite hard for the *compiler* to do the loads out of order, but it's
vaguely possible on, say, an ia64 system which supports non-faulting loads.

That is, the compiled code could be:

"I guess the next inode will be at address X"
retry:
LOAD preloaded = *X
LOAD inode = next.dentry->inode
if (inode == X)
	TRAP if fetch of preloaded had a page fault
	LOAD preloaded.f_op (guaranteed non blocking)
else
	update guess X = inode
	goto retry;
	
However, it's *very* easy for a processor to do exactly that!  These
days, hardware cache prefetchers look for sequential access patterns,
and if your inodes happen to get allocated to consecutive addresses,
it's possible that the prefetcher could have done the read ahead of you.

Even if reads are generated in one order, there's no reason that one can't
be stalled due to some resource conflict (say a bank conflict in the L1
cache) while the second is not stalled and ends up completing first.


So, for example, the following is not safe without barriers:

START:
	dentry->inode == a
	a->f_op == a_op
	b->f_op = INVALID
CPU 1:
	b->f_op = b_op
	dentry->inode = b
CPU 2:
	LOAD inode = dentry->inode
	LOAD op = inode->f_op
	assert(op != INVALID)

While this is a FALSE OVERSIMPLIFICATION that will lead you astray if
you take it too far, you can describe hazards in terms of a serialized stream
of "external observer" memory updates.

The above code has two hazards:
- CPU 1 might write dentry->inode before the b->f_op write is complete, or
- CPU 2 might read inode->f_op before it reads dentry->inode.  This is
  difficult, but a hardware prefetcher might make a lucky guess.

Either way, CPU 2 ends up reading an invalid op vector.
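
A sketch of where the barriers would go to close both hazards (the
same thing rcu_assign_pointer()/rcu_dereference() would package up):

CPU 1:
	b->f_op = b_op;
	smp_wmb();			/* order the f_op store before publishing b */
	dentry->inode = b;
CPU 2:
	LOAD inode = dentry->inode
	smp_read_barrier_depends();	/* needed on Alpha, a no-op elsewhere */
	LOAD op = inode->f_op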


Although the single serialized stream conceptual model works fine between
two processors, it is not guaranteed to work between three.  That is,
there is no guarantee that the following won't happen:

START:
	a = b = 0
CPU 1:
	a = 1;
	/* No barrier! */
	b = 1;
CPU 2:
	LOAD a;	(loads 1)
	-- read barrier --
	LOAD b; (loads 0)
CPU 3:
	LOAD b;	(loads 1)
	-- read barrier --
	LOAD a; (loads 0)

This is obviously inconsistent with the existence of a "true" sequential
ordering, but is entirely legal according to a less-strict memory ordering.

The lack of a write barrier on CPU 1 means that the order in which the
writes will be seen by other processor is not only undefined, it might
be different for different processors!
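
For contrast, a sketch with the pairing write barrier added:

CPU 1:
	a = 1;
	smp_wmb();	/* pairs with the read barriers on CPUs 2 and 3 */
	b = 1;

With that in place, any CPU whose barrier-separated loads see b == 1
must also see a == 1, so CPU 3's outcome above becomes impossible.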

The Alpha processor manual had a good description of various ugly corner
cases.

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2010-10-30 12:40 UTC | newest]

Thread overview: 41+ messages
2010-09-15 14:36 memory barrier question Miklos Szeredi
2010-09-15 19:12 ` Rafael J. Wysocki
2010-09-16 11:55 ` David Howells
2010-09-16 13:42   ` Miklos Szeredi
2010-09-16 13:42     ` Miklos Szeredi
2010-09-16 13:42     ` Miklos Szeredi
2010-09-16 14:30   ` David Howells
2010-09-16 15:03     ` Paul E. McKenney
2010-09-16 16:06       ` Miklos Szeredi
2010-09-16 16:37         ` Paul E. McKenney
2010-09-16 16:56           ` Miklos Szeredi
2010-09-16 17:09             ` James Bottomley
2010-09-16 17:17               ` Miklos Szeredi
2010-09-16 17:40                 ` James Bottomley
2010-09-17 21:49                 ` Benjamin Herrenschmidt
2010-09-17 23:12                   ` Paul E. McKenney
2010-09-19  2:47                     ` Benjamin Herrenschmidt
2010-09-19 15:26                       ` Paul E. McKenney
2010-09-19 20:15                         ` Miklos Szeredi
2010-09-19 21:59                           ` Paul E. McKenney
2010-09-20  0:58                             ` James Bottomley
2010-09-20  1:29                               ` Paul E. McKenney
2010-09-20 16:01                                 ` Miklos Szeredi
2010-09-20 18:25                                   ` Paul E. McKenney
2010-09-20 18:25                                     ` Paul E. McKenney
2010-09-20 18:57                                     ` Paul E. McKenney
2010-09-20 20:26                                     ` Michael Cree
2010-09-20 20:40                                       ` Paul E. McKenney
2010-09-21 14:59                                         ` Paul E. McKenney
2010-09-22 18:41                                           ` Paul E. McKenney
2010-09-18  1:12                   ` Alan Cox
2010-09-16 16:50         ` Jamie Lokier
2010-09-16 16:18   ` Peter Zijlstra
2010-09-16 16:18     ` Peter Zijlstra
2010-09-16 17:59   ` David Howells
2010-09-20 10:34 George Spelvin
2010-10-29 13:23 Memory " Tetsuo Handa
2010-10-29 15:09 ` Paul E. McKenney
2010-10-30  5:48   ` Tetsuo Handa
2010-10-30  6:11     ` Eric Dumazet
2010-10-30 12:40       ` Tetsuo Handa
