linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Re: blk-throttle: Correct the placement of smp_rmb()
       [not found]                 ` <20101208213331.GA4895@redhat.com>
@ 2010-12-08 22:06                   ` Oleg Nesterov
  2010-12-09  1:45                     ` Paul E. McKenney
  0 siblings, 1 reply; 7+ messages in thread
From: Oleg Nesterov @ 2010-12-08 22:06 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Vivek Goyal, linux-kernel

On 12/08, Oleg Nesterov wrote:
>
> Unfortunately, I can't prove this. You can ask
> Paul McKenney if you want the authoritative answer.

Well. I think we should ask ;) This is interesting.

Paul could you please shed a light?

Suppose we have 2 variables, A = 0 and B = 0.

	CPU0 does:

		A = 1;
		wmb();
		B = 1;

	CPU1 does:

		B = 0;
		mb();
		if (A)
			A = 2;

My understanding is: after that we can safely assume that

	B == 1 || A == 2

IOW. Either CPU1 notices that A was changed, or CPU0 "wins"
and sets B = 1 "after" CPU1. But, it is not possible that
CPU1 clears B "after" it was set by CPU0 _and_ sees A == 0.

Is it true? I think it should be true, but can't prove. This
reminds me the old (and long) discussion about STORE-MB-LOAD.
Iirc, finally it was decided that

	CPU0:				CPU1:

	A = 1;				B = 1;
	mb();				mb();
	if (B)				if (A)
		printf("Yes");			printf("Yes");

should print "Yes" at least once. This looks very similar to
the the previous example.

Thanks,

Oleg.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: blk-throttle: Correct the placement of smp_rmb()
  2010-12-08 22:06                   ` blk-throttle: Correct the placement of smp_rmb() Oleg Nesterov
@ 2010-12-09  1:45                     ` Paul E. McKenney
  2010-12-09  2:37                       ` Vivek Goyal
  2010-12-09  9:26                       ` Oleg Nesterov
  0 siblings, 2 replies; 7+ messages in thread
From: Paul E. McKenney @ 2010-12-09  1:45 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Vivek Goyal, linux-kernel

On Wed, Dec 08, 2010 at 11:06:40PM +0100, Oleg Nesterov wrote:
> On 12/08, Oleg Nesterov wrote:
> >
> > Unfortunately, I can't prove this. You can ask
> > Paul McKenney if you want the authoritative answer.
> 
> Well. I think we should ask ;) This is interesting.
> 
> Paul could you please shed a light?
> 
> Suppose we have 2 variables, A = 0 and B = 0.
> 
> 	CPU0 does:
> 
> 		A = 1;
> 		wmb();
> 		B = 1;
> 
> 	CPU1 does:
> 
> 		B = 0;
> 		mb();
> 		if (A)
> 			A = 2;
> 
> My understanding is: after that we can safely assume that
> 
> 	B == 1 || A == 2
> 
> IOW. Either CPU1 notices that A was changed, or CPU0 "wins"
> and sets B = 1 "after" CPU1. But, it is not possible that
> CPU1 clears B "after" it was set by CPU0 _and_ sees A == 0.
> 
> Is it true? I think it should be true, but can't prove.

I was afraid that a question like this might be coming...  ;-)

The question is whether you can rely on the modification order of the
stores to B to deduce anything useful about the order in which the
accesses to A occurred.  The answer currently is I believe you can
for a simple example such as the one above, but I am checking with
the hardware guys.  In addition, please note that I am not sure if
all possible generalizations do what you want.  For example, imagine a
1024-CPU system in which the first 1023 CPUs do:

	A[smp_processor_id()] = 1;
	wmb();
	B = smp_processor_id();

where the elements of A are cache-line aligned and padded.  Suppose
that the remaining CPU does:

	i = random() % 1023;
	B = -1;
	mb();
	if (A[i])
		A[i] = 2;

Are we guaranteed that B!=-1||A[i]==2?

In this case, it could take all of the CPUs quite some time to come to
agreement on the order of all 1024 assignments to B.  I am bugging some
hardware guys about this.  It has been awhile, so they forgot to run
away when they saw me coming.  ;-)

>                                                         This
> reminds me the old (and long) discussion about STORE-MB-LOAD.
> Iirc, finally it was decided that
> 
> 	CPU0:				CPU1:
> 
> 	A = 1;				B = 1;
> 	mb();				mb();
> 	if (B)				if (A)
> 		printf("Yes");			printf("Yes");
> 
> should print "Yes" at least once. This looks very similar to
> the the previous example.

>From a hardware point of view, this example is very different than the
earlier one.  You are not using the order of independent CPUs' stores to a
single variable here and in addition are using mb() everywhere instead of
a combination of mb() and wmb().  So, yes, this one is guaranteed to work.

But what the heck are you guys really trying to do, anyway?  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: blk-throttle: Correct the placement of smp_rmb()
  2010-12-09  1:45                     ` Paul E. McKenney
@ 2010-12-09  2:37                       ` Vivek Goyal
  2010-12-09  9:26                       ` Oleg Nesterov
  1 sibling, 0 replies; 7+ messages in thread
From: Vivek Goyal @ 2010-12-09  2:37 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Oleg Nesterov, linux-kernel

On Wed, Dec 08, 2010 at 05:45:19PM -0800, Paul E. McKenney wrote:
> On Wed, Dec 08, 2010 at 11:06:40PM +0100, Oleg Nesterov wrote:
> > On 12/08, Oleg Nesterov wrote:
> > >
> > > Unfortunately, I can't prove this. You can ask
> > > Paul McKenney if you want the authoritative answer.
> > 
> > Well. I think we should ask ;) This is interesting.
> > 
> > Paul could you please shed a light?
> > 
> > Suppose we have 2 variables, A = 0 and B = 0.
> > 
> > 	CPU0 does:
> > 
> > 		A = 1;
> > 		wmb();
> > 		B = 1;
> > 
> > 	CPU1 does:
> > 
> > 		B = 0;
> > 		mb();
> > 		if (A)
> > 			A = 2;
> > 
> > My understanding is: after that we can safely assume that
> > 
> > 	B == 1 || A == 2
> > 
> > IOW. Either CPU1 notices that A was changed, or CPU0 "wins"
> > and sets B = 1 "after" CPU1. But, it is not possible that
> > CPU1 clears B "after" it was set by CPU0 _and_ sees A == 0.
> > 
> > Is it true? I think it should be true, but can't prove.
> 
> I was afraid that a question like this might be coming...  ;-)
> 
> The question is whether you can rely on the modification order of the
> stores to B to deduce anything useful about the order in which the
> accesses to A occurred.  The answer currently is I believe you can
> for a simple example such as the one above, but I am checking with
> the hardware guys.  In addition, please note that I am not sure if
> all possible generalizations do what you want.  For example, imagine a
> 1024-CPU system in which the first 1023 CPUs do:
> 
> 	A[smp_processor_id()] = 1;
> 	wmb();
> 	B = smp_processor_id();
> 
> where the elements of A are cache-line aligned and padded.  Suppose
> that the remaining CPU does:
> 
> 	i = random() % 1023;
> 	B = -1;
> 	mb();
> 	if (A[i])
> 		A[i] = 2;
> 
> Are we guaranteed that B!=-1||A[i]==2?
> 
> In this case, it could take all of the CPUs quite some time to come to
> agreement on the order of all 1024 assignments to B.  I am bugging some
> hardware guys about this.  It has been awhile, so they forgot to run
> away when they saw me coming.  ;-)
> 
> >                                                         This
> > reminds me the old (and long) discussion about STORE-MB-LOAD.
> > Iirc, finally it was decided that
> > 
> > 	CPU0:				CPU1:
> > 
> > 	A = 1;				B = 1;
> > 	mb();				mb();
> > 	if (B)				if (A)
> > 		printf("Yes");			printf("Yes");
> > 
> > should print "Yes" at least once. This looks very similar to
> > the the previous example.
> 
> >From a hardware point of view, this example is very different than the
> earlier one.  You are not using the order of independent CPUs' stores to a
> single variable here and in addition are using mb() everywhere instead of
> a combination of mb() and wmb().  So, yes, this one is guaranteed to work.
> 
> But what the heck are you guys really trying to do, anyway?  ;-)

Hi Paul,

I pulled oleg into reviewing some block/blk-throttle.c code which I was
not very sure about and that discussion led to above situation. Anyway,
following is my requirement.

There is a hash list on which many throtl_groups (tg) are linked. Each
throtl group has some read/write limits which can be updated with the
help of cgroup files. If limit of any of the groups is updated, we
need to detect and process the change. Looking at current code, oleg
suggested that there is a simpler way to do that. Some thing like
following. 

limit change side (throtl_update_blkio_group_read_bps())
-----------------
tg->bps[READ] = X;
smp_wmb();

tg->limits_changed = true;
smp_wmb();

td->limits_changed = true;
throtl_schedule_delayed_work();

process limit change (throtl_process_limit_change())
----------------------------------------------------

if (!td->limits_changed)
	return;

td->limits_changed = false;
smp_mb();

hlist_for_each_entry_safe() {
	if (!tg->limits_changed)
		return;
	tg->limits_changed = false
	smp_mb();
	process_group_limit_change();
}

And we were wondering if above is correct.

Thanks
Vivek

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: blk-throttle: Correct the placement of smp_rmb()
  2010-12-09  1:45                     ` Paul E. McKenney
  2010-12-09  2:37                       ` Vivek Goyal
@ 2010-12-09  9:26                       ` Oleg Nesterov
  2010-12-09 19:47                         ` Paul E. McKenney
  1 sibling, 1 reply; 7+ messages in thread
From: Oleg Nesterov @ 2010-12-09  9:26 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Vivek Goyal, linux-kernel

On 12/08, Paul E. McKenney wrote:
>
> On Wed, Dec 08, 2010 at 11:06:40PM +0100, Oleg Nesterov wrote:
>
> > 	CPU0 does:
> >
> > 		A = 1;
> > 		wmb();
> > 		B = 1;
> >
> > 	CPU1 does:
> >
> > 		B = 0;
> > 		mb();
> > 		if (A)
> > 			A = 2;
> >
> > My understanding is: after that we can safely assume that
> >
> > 	B == 1 || A == 2
> >
> > IOW. Either CPU1 notices that A was changed, or CPU0 "wins"
> > and sets B = 1 "after" CPU1. But, it is not possible that
> > CPU1 clears B "after" it was set by CPU0 _and_ sees A == 0.
> >
> > Is it true? I think it should be true, but can't prove.
>
> I was afraid that a question like this might be coming...  ;-)

I am proud of myself!

> The question is whether you can rely on the modification order of the
> stores to B to deduce anything useful about the order in which the
> accesses to A occurred.  The answer currently is I believe you can
> for a simple example such as the one above, but I am checking with
> the hardware guys.  In addition, please note that I am not sure if
> all possible generalizations do what you want.  For example, imagine a
> 1024-CPU system in which the first 1023 CPUs do:
>
> 	A[smp_processor_id()] = 1;
> 	wmb();
> 	B = smp_processor_id();
>
> where the elements of A are cache-line aligned and padded.  Suppose
> that the remaining CPU does:
>
> 	i = random() % 1023;
> 	B = -1;
> 	mb();
> 	if (A[i])
> 		A[i] = 2;
>
> Are we guaranteed that B!=-1||A[i]==2?
>
> In this case, it could take all of the CPUs quite some time to come to
> agreement on the order of all 1024 assignments to B.  I am bugging some
> hardware guys about this.

Yes, thanks a lot. Of course, my example was intentionally oversimplified,
this generalization is closer to the real life.

> It has been awhile, so they forgot to run
> away when they saw me coming.  ;-)

Hehe. I am very glad you didn't run away when you saw another question
from me ;)

> > 	CPU0:				CPU1:
> >
> > 	A = 1;				B = 1;
> > 	mb();				mb();
> > 	if (B)				if (A)
> > 		printf("Yes");			printf("Yes");
> >
> > should print "Yes" at least once. This looks very similar to
> > the the previous example.
>
> From a hardware point of view, this example is very different than the
> earlier one. You are not using the order of independent CPUs' stores to a
> single variable here and in addition are using mb() everywhere instead of
> a combination of mb() and wmb().  So, yes, this one is guaranteed to work.

OK, thanks.

> But what the heck are you guys really trying to do, anyway?  ;-)

Vivek has already answered.

Basically, we have

	update_object(obj)
	{
		modify_obj(obj);

		wmb();

		obj->was_changed = true;
	}

It can be called many times. Sooner or later, we will call

	process_object(obj)
	{
		if (!obj->was_changed)
			return;

		obj->was_changed = false;

		mb();

		do_process_object(obj);
	}

All we need is to ensure that eventually do_process_object(obj)
will see the result of the last invocation of modify_obj().

IOW. It is fine to miss the changes in obj, but only if
obj->was_changed continues to be T, in this case process_object()
will be called again.

However, if process_object() clears ->was_changed, we must be sure
that do_process_object() can't miss the result of the previous
modify_obj().

Thanks Paul,

Oleg.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: blk-throttle: Correct the placement of smp_rmb()
  2010-12-09  9:26                       ` Oleg Nesterov
@ 2010-12-09 19:47                         ` Paul E. McKenney
  2010-12-10 16:46                           ` Oleg Nesterov
  0 siblings, 1 reply; 7+ messages in thread
From: Paul E. McKenney @ 2010-12-09 19:47 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Vivek Goyal, linux-kernel

On Thu, Dec 09, 2010 at 10:26:59AM +0100, Oleg Nesterov wrote:
> On 12/08, Paul E. McKenney wrote:
> >
> > On Wed, Dec 08, 2010 at 11:06:40PM +0100, Oleg Nesterov wrote:
> >
> > > 	CPU0 does:
> > >
> > > 		A = 1;
> > > 		wmb();
> > > 		B = 1;
> > >
> > > 	CPU1 does:
> > >
> > > 		B = 0;
> > > 		mb();
> > > 		if (A)
> > > 			A = 2;
> > >
> > > My understanding is: after that we can safely assume that
> > >
> > > 	B == 1 || A == 2
> > >
> > > IOW. Either CPU1 notices that A was changed, or CPU0 "wins"
> > > and sets B = 1 "after" CPU1. But, it is not possible that
> > > CPU1 clears B "after" it was set by CPU0 _and_ sees A == 0.
> > >
> > > Is it true? I think it should be true, but can't prove.
> >
> > I was afraid that a question like this might be coming...  ;-)
> 
> I am proud of myself!

;-)

> > The question is whether you can rely on the modification order of the
> > stores to B to deduce anything useful about the order in which the
> > accesses to A occurred.  The answer currently is I believe you can
> > for a simple example such as the one above, but I am checking with
> > the hardware guys.  In addition, please note that I am not sure if
> > all possible generalizations do what you want.  For example, imagine a
> > 1024-CPU system in which the first 1023 CPUs do:
> >
> > 	A[smp_processor_id()] = 1;
> > 	wmb();
> > 	B = smp_processor_id();
> >
> > where the elements of A are cache-line aligned and padded.  Suppose
> > that the remaining CPU does:
> >
> > 	i = random() % 1023;
> > 	B = -1;
> > 	mb();
> > 	if (A[i])
> > 		A[i] = 2;
> >
> > Are we guaranteed that B!=-1||A[i]==2?
> >
> > In this case, it could take all of the CPUs quite some time to come to
> > agreement on the order of all 1024 assignments to B.  I am bugging some
> > hardware guys about this.
> 
> Yes, thanks a lot. Of course, my example was intentionally oversimplified,
> this generalization is closer to the real life.
> 
> > It has been awhile, so they forgot to run
> > away when they saw me coming.  ;-)
> 
> Hehe. I am very glad you didn't run away when you saw another question
> from me ;)

I did seriously consider doing so.  ;-)

> > > 	CPU0:				CPU1:
> > >
> > > 	A = 1;				B = 1;
> > > 	mb();				mb();
> > > 	if (B)				if (A)
> > > 		printf("Yes");			printf("Yes");
> > >
> > > should print "Yes" at least once. This looks very similar to
> > > the the previous example.
> >
> > From a hardware point of view, this example is very different than the
> > earlier one. You are not using the order of independent CPUs' stores to a
> > single variable here and in addition are using mb() everywhere instead of
> > a combination of mb() and wmb().  So, yes, this one is guaranteed to work.
> 
> OK, thanks.
> 
> > But what the heck are you guys really trying to do, anyway?  ;-)
> 
> Vivek has already answered.
> 
> Basically, we have
> 
> 	update_object(obj)
> 	{
> 		modify_obj(obj);
> 
> 		wmb();
> 
> 		obj->was_changed = true;
> 	}
> 
> It can be called many times. Sooner or later, we will call
> 
> 	process_object(obj)
> 	{
> 		if (!obj->was_changed)
> 			return;

Ah, and if you have a huge number of CPUs executing update_object()
at just this point, we have the scenario you showed my in your initial
email.

> 		obj->was_changed = false;
> 
> 		mb();
> 
> 		do_process_object(obj);
> 	}
> 
> All we need is to ensure that eventually do_process_object(obj)
> will see the result of the last invocation of modify_obj().
> 
> IOW. It is fine to miss the changes in obj, but only if
> obj->was_changed continues to be T, in this case process_object()
> will be called again.
> 
> However, if process_object() clears ->was_changed, we must be sure
> that do_process_object() can't miss the result of the previous
> modify_obj().

I am assuming that process_object() executes reasonably infrequently,
or alternatively that there is usually nothing for it to do.  If my
assumption is correct, I suggest the following, which should not add
much additional overhead and should work regardless of what the hardware
guys end up telling me.  But I am checking with the hardware guys about
this one as well, just in case I have managed to get myself confused.
It wouldn't be the first time...  :-/

	update_object(obj)
	{
		modify_obj(obj);

		wmb();

		atomic_set(&obj->was_changed, true);
	}

	process_object(obj)
	{
		if (!atomic_read(&obj->was_changed))
			return;

		if (!atomic_xchg(&obj->was_changed, false))
			return;

		/* mb(); implied by atomic_xchg(), so no longer needed. */

		do_process_object(obj);
	}

One caution:  The wmb() in update_object() means that modify_object()
might read some variable and get a -newer- value for that variable than
would a subsequent read from that same variable in do_process_object().
If this would cause a problem, the wmb() should instead be an mb().
(See http://paulmck.livejournal.com/20312.html for explanation.)

The reason that I say that this should not take much additional
overhead is that all of the writes were taking cache-miss latencies,
and you had lots of memory barriers that make it difficult for the
CPUs' store buffers to hide that latency.  The only added overhead
is from the atomic instruction, but given that you already had a
cache miss from the original write and a memory barrier, I would not
expect it to be noticeable.

But enough time on my soapbox...  Would this do what you need it to?
If so, hopefully it really does what I think it does.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: blk-throttle: Correct the placement of smp_rmb()
  2010-12-09 19:47                         ` Paul E. McKenney
@ 2010-12-10 16:46                           ` Oleg Nesterov
  2010-12-10 23:35                             ` Paul E. McKenney
  0 siblings, 1 reply; 7+ messages in thread
From: Oleg Nesterov @ 2010-12-10 16:46 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Vivek Goyal, linux-kernel

On 12/09, Paul E. McKenney wrote:
>
> On Thu, Dec 09, 2010 at 10:26:59AM +0100, Oleg Nesterov wrote:
> >
> > 	update_object(obj)
> > 	{
> > 		modify_obj(obj);
> >
> > 		wmb();
> >
> > 		obj->was_changed = true;
> > 	}
> >
> > It can be called many times. Sooner or later, we will call
> >
> > 	process_object(obj)
> > 	{
> > 		if (!obj->was_changed)
> > 			return;
>
> Ah, and if you have a huge number of CPUs executing update_object()
> at just this point, we have the scenario you showed my in your initial
> email.

Yes.

> 	update_object(obj)
> 	{
> 		modify_obj(obj);
>
> 		wmb();
>
> 		atomic_set(&obj->was_changed, true);
> 	}
>
> 	process_object(obj)
> 	{
> 		if (!atomic_read(&obj->was_changed))
> 			return;
>
> 		if (!atomic_xchg(&obj->was_changed, false))
> 			return;
>
> 		/* mb(); implied by atomic_xchg(), so no longer needed. */
>
> 		do_process_object(obj);
> 	}

This is what we were going to do initially. Except, I think the
plain bool/xchg can be used instead of atomic_t/atomic_xchg ?

But then we decided to discuss the alternatives. Partly because
this looked like the interesting question, but mostly to keep
you busy ;)

> One caution:  The wmb() in update_object() means that modify_object()
> might read some variable and get a -newer- value for that variable than
> would a subsequent read from that same variable in do_process_object().
> If this would cause a problem, the wmb() should instead be an mb().

Yes. And in this case I even _seem_ to understand why we need
s/wmb/mb/ change.

But the original code (I mean, the code we are trying to fix/change)
doesn't have the load-load dependency, so I think wmb() is enough.

> The reason that I say that this should not take much additional
> overhead is that all of the writes were taking cache-miss latencies,
> and you had lots of memory barriers that make it difficult for the
> CPUs' store buffers to hide that latency.  The only added overhead
> is from the atomic instruction, but given that you already had a
> cache miss from the original write and a memory barrier, I would not
> expect it to be noticeable.
>
> But enough time on my soapbox...  Would this do what you need it to?
> If so, hopefully it really does what I think it does.  ;-)

OK, thanks Paul.

So I guess it would be safer to return to initial plan and use xchg().

> (See http://paulmck.livejournal.com/20312.html for explanation.)

Oh. Very interesting. Transitive memory barriers.

You know, I always wanted to understand this aspect. May be you can
look at

	http://marc.info/?l=linux-kernel&m=118944341320665

starting from "To simplify the example above". This pseudo-code tries
to resemble the real-life code we discussed, that is why it uses the
pointers (dereference lack read_barrier_depends(), please ignore).

And no, I can't understand why foo_1() needs the full barrier :/
Or may be I can? Suppose that CPU 0 and CPU 1 share the store-buffer
(no, no, I do not pretend I _really_ understand what this actually
 means;). In this case, perhaps we can forget abou CPU 0 and rewrite
this code as

	void foo_1(void)
	{
		X = 1;	/* it was actually written by CPU 0 */

		r1 = x;
		smp_rmb();  /* The only change. */
		r2 = y;
	}

	void foo_2(void)
	{
		y = 1;
		smp_mb();
		r3 = x;
	}

In this case smp_rmb() obviously can't help. Does it make any sense?


But, when I look at the link I sent you again, I feel I am totatlly
confused. Nothing new, I always knew that memory barriers were specially
designed to drive me crazy.

Oleg.


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: blk-throttle: Correct the placement of smp_rmb()
  2010-12-10 16:46                           ` Oleg Nesterov
@ 2010-12-10 23:35                             ` Paul E. McKenney
  0 siblings, 0 replies; 7+ messages in thread
From: Paul E. McKenney @ 2010-12-10 23:35 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: Vivek Goyal, linux-kernel

On Fri, Dec 10, 2010 at 05:46:45PM +0100, Oleg Nesterov wrote:
> On 12/09, Paul E. McKenney wrote:
> >
> > On Thu, Dec 09, 2010 at 10:26:59AM +0100, Oleg Nesterov wrote:
> > >
> > > 	update_object(obj)
> > > 	{
> > > 		modify_obj(obj);
> > >
> > > 		wmb();
> > >
> > > 		obj->was_changed = true;
> > > 	}
> > >
> > > It can be called many times. Sooner or later, we will call
> > >
> > > 	process_object(obj)
> > > 	{
> > > 		if (!obj->was_changed)
> > > 			return;
> >
> > Ah, and if you have a huge number of CPUs executing update_object()
> > at just this point, we have the scenario you showed my in your initial
> > email.
> 
> Yes.
> 
> > 	update_object(obj)
> > 	{
> > 		modify_obj(obj);
> >
> > 		wmb();
> >
> > 		atomic_set(&obj->was_changed, true);
> > 	}
> >
> > 	process_object(obj)
> > 	{
> > 		if (!atomic_read(&obj->was_changed))
> > 			return;
> >
> > 		if (!atomic_xchg(&obj->was_changed, false))
> > 			return;
> >
> > 		/* mb(); implied by atomic_xchg(), so no longer needed. */
> >
> > 		do_process_object(obj);
> > 	}
> 
> This is what we were going to do initially. Except, I think the
> plain bool/xchg can be used instead of atomic_t/atomic_xchg ?
> 
> But then we decided to discuss the alternatives. Partly because
> this looked like the interesting question, but mostly to keep
> you busy ;)

As long as it uses the relevant atomic instructions on all architectures,
should be fine.  Sigh.  Might need ACCESS_ONCE() on the reads and writes.

> > One caution:  The wmb() in update_object() means that modify_object()
> > might read some variable and get a -newer- value for that variable than
> > would a subsequent read from that same variable in do_process_object().
> > If this would cause a problem, the wmb() should instead be an mb().
> 
> Yes. And in this case I even _seem_ to understand why we need
> s/wmb/mb/ change.
> 
> But the original code (I mean, the code we are trying to fix/change)
> doesn't have the load-load dependency, so I think wmb() is enough.

If you don't need transitivity, wmb() will be OK.

> > The reason that I say that this should not take much additional
> > overhead is that all of the writes were taking cache-miss latencies,
> > and you had lots of memory barriers that make it difficult for the
> > CPUs' store buffers to hide that latency.  The only added overhead
> > is from the atomic instruction, but given that you already had a
> > cache miss from the original write and a memory barrier, I would not
> > expect it to be noticeable.
> >
> > But enough time on my soapbox...  Would this do what you need it to?
> > If so, hopefully it really does what I think it does.  ;-)
> 
> OK, thanks Paul.
> 
> So I guess it would be safer to return to initial plan and use xchg().

I believe so, but it will take the hardware guys a few more days to
chase their stuff down.

> > (See http://paulmck.livejournal.com/20312.html for explanation.)
> 
> Oh. Very interesting. Transitive memory barriers.
> 
> You know, I always wanted to understand this aspect. May be you can
> look at
> 
> 	http://marc.info/?l=linux-kernel&m=118944341320665
> 
> starting from "To simplify the example above". This pseudo-code tries
> to resemble the real-life code we discussed, that is why it uses the
> pointers (dereference lack read_barrier_depends(), please ignore).

Yes, your example at the above URL looks to me to depend on transitivity,
which wmb() and rmb() will not provide.  To be safe, both should be mb().

> And no, I can't understand why foo_1() needs the full barrier :/
> Or may be I can? Suppose that CPU 0 and CPU 1 share the store-buffer
> (no, no, I do not pretend I _really_ understand what this actually
>  means;). In this case, perhaps we can forget abou CPU 0 and rewrite
> this code as
> 
> 	void foo_1(void)
> 	{
> 		X = 1;	/* it was actually written by CPU 0 */
> 
> 		r1 = x;
> 		smp_rmb();  /* The only change. */
> 		r2 = y;
> 	}
> 
> 	void foo_2(void)
> 	{
> 		y = 1;
> 		smp_mb();
> 		r3 = x;
> 	}
> 
> In this case smp_rmb() obviously can't help. Does it make any sense?

Yes, there are a number of architectures where that transformation is
exactly the right analysis tool -- If a CPU loads something just before
a memory barrier, it is as if that same CPU did the store.

Very good!!!  ;-)

> But, when I look at the link I sent you again, I feel I am totatlly
> confused. Nothing new, I always knew that memory barriers were specially
> designed to drive me crazy.

Only if you are too sane for code without barriers to drive you crazy.  ;-)

But yes, this is why it is best to bury barriers behind other APIs.  They
are very subtle and difficult to get right.  The bit about them being
the union of all architectures can be especially painful!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2010-12-10 23:35 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20101207123454.GA11997@redhat.com>
     [not found] ` <20101207160102.GB16363@redhat.com>
     [not found]   ` <20101208184507.GA30071@redhat.com>
     [not found]     ` <20101208190918.GI31703@redhat.com>
     [not found]       ` <20101208191600.GA32753@redhat.com>
     [not found]         ` <20101208193031.GJ31703@redhat.com>
     [not found]           ` <20101208193308.GA1044@redhat.com>
     [not found]             ` <20101208200750.GA2202@redhat.com>
     [not found]               ` <20101208204624.GK31703@redhat.com>
     [not found]                 ` <20101208213331.GA4895@redhat.com>
2010-12-08 22:06                   ` blk-throttle: Correct the placement of smp_rmb() Oleg Nesterov
2010-12-09  1:45                     ` Paul E. McKenney
2010-12-09  2:37                       ` Vivek Goyal
2010-12-09  9:26                       ` Oleg Nesterov
2010-12-09 19:47                         ` Paul E. McKenney
2010-12-10 16:46                           ` Oleg Nesterov
2010-12-10 23:35                             ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).