linux-kernel.vger.kernel.org archive mirror
* Memory-ordering recipes
@ 2017-09-17 23:05 Paul E. McKenney
  2017-09-18  7:52 ` Boqun Feng
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Paul E. McKenney @ 2017-09-17 23:05 UTC (permalink / raw)
  To: j.alglave, luc.maranget, parri.andrea, stern, dhowells, peterz,
	will.deacon, boqun.feng, npiggin
  Cc: linux-kernel

Hello!

The topic of memory-ordering recipes came up at the Linux Plumbers
Conference microconference on Friday, so I thought that I should summarize
what is currently "out there":

1.	memory-barriers.txt:  A bit rambling and diffuse for a recipes
	document.

2.	https://www.kernel.org/pub/linux/kernel/people/paulmck/LWNLinuxMM/Examples.html
	Many of the examples are on-point, but this is aimed more
	at understanding the memory model than at an organized set
	of recipes.

3.	https://www.kernel.org/pub/linux/kernel/people/paulmck/LWNLinuxMM/Examples.html
	Slides 15-20.  Again, some of the litmus tests are on-point,
	but the focus is more on understanding the memory model than on
	an organized set of recipes.

So what litmus tests are needed?  Here is my initial set:

1.	Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB

	Lots of variety here, can in some cases substitute:
	
	a.	READ_ONCE() for smp_load_acquire()
	b.	WRITE_ONCE() for smp_store_release()
	c.	Dependencies for both smp_load_acquire() and
		smp_store_release().
	d.	smp_wmb() for smp_store_release() in first thread
		of ISA2 and Z6.2.
	e.	smp_rmb() for smp_load_acquire() in last thread of ISA2.
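
	For example, the fully release-acquire version of ISA2 might look
	like the following sketch (illustration only, not taken from any
	particular kernel code; x, y, and z are initially zero):

		CPU 0				CPU 1				CPU 2

		WRITE_ONCE(x, 1);		r1 = smp_load_acquire(&y);	r2 = smp_load_acquire(&z);
		smp_store_release(&y, 1);	smp_store_release(&z, 1);	r3 = READ_ONCE(x);

		BUG_ON(r1 == 1 && r2 == 1 && r3 == 0);

	The sub-items above list substitutions that can, in some cases,
	weaken individual links of such a chain without allowing the
	BUG_ON() to fire.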

2.	MP (see test6.pdf for nickname translation)

	a.	smp_store_release() / smp_load_acquire()
	b.	rcu_assign_pointer() / rcu_dereference()
	c.	smp_wmb() / smp_rmb()
	d.	Replacing either of the above with smp_mb()

3.	SB

	a.	smp_mb(), as in lockless wait-wakeup coordination.
		And as in sys_membarrier()-scheduler coordination,
		for that matter.

Others?

							Thanx, Paul


* Re: Memory-ordering recipes
  2017-09-17 23:05 Memory-ordering recipes Paul E. McKenney
@ 2017-09-18  7:52 ` Boqun Feng
  2017-09-18 14:25   ` Paul E. McKenney
  2017-09-21 12:45 ` Peter Zijlstra
  2017-10-17 21:01 ` Paul E. McKenney
  2 siblings, 1 reply; 13+ messages in thread
From: Boqun Feng @ 2017-09-18  7:52 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: j.alglave, luc.maranget, parri.andrea, stern, dhowells, peterz,
	will.deacon, npiggin, linux-kernel


On Sun, Sep 17, 2017 at 04:05:09PM -0700, Paul E. McKenney wrote:
> Hello!
> 

Hi Paul,

> The topic of memory-ordering recipes came up at the Linux Plumbers
> Conference microconference on Friday, so I thought that I should summarize
> what is currently "out there":
> 
> 1.	memory-barriers.txt:  A bit rambling and diffuse for a recipes
> 	document.
> 
> 2.	https://www.kernel.org/pub/linux/kernel/people/paulmck/LWNLinuxMM/Examples.html
> 	Many of the examples are on-point, but this is aimed more
> 	at understanding the memory model than at an organized set
> 	of recipes.
> 
> 3.	https://www.kernel.org/pub/linux/kernel/people/paulmck/LWNLinuxMM/Examples.html

Duplicate links ;-) This should be a link to some slides?

> 	Slides 15-20.  Again, some of the litmus tests are on-point,
> 	but the focus is more on understanding the memory model than on
> 	an organized set of recipes.
> 
> So what litmus tests are needed?  Here is my initial set:
> 
> 1.	Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
> 
> 	Lots of variety here, can in some cases substitute:
> 	
> 	a.	READ_ONCE() for smp_load_acquire()
> 	b.	WRITE_ONCE() for smp_store_release()
> 	c.	Dependencies for both smp_load_acquire() and
> 		smp_store_release().
> 	d.	smp_wmb() for smp_store_release() in first thread
> 		of ISA2 and Z6.2.
> 	e.	smp_rmb() for smp_load_acquire() in last thread of ISA2.
> 
> 2.	MP (see test6.pdf for nickname translation)
> 
> 	a.	smp_store_release() / smp_load_acquire()
> 	b.	rcu_assign_pointer() / rcu_dereference()
> 	c.	smp_wmb() / smp_rmb()
> 	d.	Replacing either of the above with smp_mb()
> 
> 3.	SB
> 
> 	a.	smp_mb(), as in lockless wait-wakeup coordination.
> 		And as in sys_membarrier()-scheduler coordination,
> 		for that matter.

	b.	replace smp_mb() with smp_mb__before_atomic() followed
		by a _relaxed cmpxchg? As in pv_kick_node():

		https://marc.info/?l=linux-kernel&m=150274124711012
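
		Roughly, the substitution would give an SB-like pattern
		along these lines (a sketch only, not the actual
		pv_kick_node() code; x and y are initially zero):

			CPU 0					CPU 1

			WRITE_ONCE(x, 1);			WRITE_ONCE(y, 1);
			smp_mb__before_atomic();		smp_mb();
			r1 = cmpxchg_relaxed(&y, 0, 2);		r2 = READ_ONCE(x);

			BUG_ON(r1 == 0 && r2 == 0);

		The idea being that smp_mb__before_atomic() still gives
		CPU 0 the store->load ordering that SB requires.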

Besides, do we also want to add Co* into the set? I think there may be
some people who are still confused and think per-location SC does not
hold, so they may add unnecessary barriers in their code. Those (Co*)
recipes could serve as a guide for state-machine style programming.
Thoughts?

Regards,
Boqun

> 
> Others?
> 
> 							Thanx, Paul
> 



* Re: Memory-ordering recipes
  2017-09-18  7:52 ` Boqun Feng
@ 2017-09-18 14:25   ` Paul E. McKenney
  2017-09-19  2:00     ` Boqun Feng
  0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2017-09-18 14:25 UTC (permalink / raw)
  To: Boqun Feng
  Cc: j.alglave, luc.maranget, parri.andrea, stern, dhowells, peterz,
	will.deacon, npiggin, linux-kernel

On Mon, Sep 18, 2017 at 03:52:42PM +0800, Boqun Feng wrote:
> On Sun, Sep 17, 2017 at 04:05:09PM -0700, Paul E. McKenney wrote:
> > Hello!
> > 
> 
> Hi Paul,
> 
> > The topic of memory-ordering recipes came up at the Linux Plumbers
> > Conference microconference on Friday, so I thought that I should summarize
> > what is currently "out there":
> > 
> > 1.	memory-barriers.txt:  A bit rambling and diffuse for a recipes
> > 	document.
> > 
> > 2.	https://www.kernel.org/pub/linux/kernel/people/paulmck/LWNLinuxMM/Examples.html
> > 	Many of the examples are on-point, but this is aimed more
> > 	at understanding the memory model than at an organized set
> > 	of recipes.
> > 
> > 3.	https://www.kernel.org/pub/linux/kernel/people/paulmck/LWNLinuxMM/Examples.html
> 
> Duplicate links ;-) This should be a link to some slides?

Indeed!  How about this one?

http://www.linuxplumbersconf.org/2017/ocw//system/presentations/4708/original/LKMM-overview.2017.09.15b.pdf

> > 	Slides 15-20.  Again, some of the litmus tests are on-point,
> > 	but the focus is more on understanding the memory model than on
> > 	an organized set of recipes.
> > 
> > So what litmus tests are needed?  Here is my initial set:
> > 
> > 1.	Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
> > 
> > 	Lots of variety here, can in some cases substitute:
> > 	
> > 	a.	READ_ONCE() for smp_load_acquire()
> > 	b.	WRITE_ONCE() for smp_store_release()
> > 	c.	Dependencies for both smp_load_acquire() and
> > 		smp_store_release().
> > 	d.	smp_wmb() for smp_store_release() in first thread
> > 		of ISA2 and Z6.2.
> > 	e.	smp_rmb() for smp_load_acquire() in last thread of ISA2.
> > 
> > 2.	MP (see test6.pdf for nickname translation)
> > 
> > 	a.	smp_store_release() / smp_load_acquire()
> > 	b.	rcu_assign_pointer() / rcu_dereference()
> > 	c.	smp_wmb() / smp_rmb()
> > 	d.	Replacing either of the above with smp_mb()
> > 
> > 3.	SB
> > 
> > 	a.	smp_mb(), as in lockless wait-wakeup coordination.
> > 		And as in sys_membarrier()-scheduler coordination,
> > 		for that matter.
> 
> 	b.	replace smp_mb() with smp_mb__before_atomic() followed
> 		by a _relaxed cmpxchg? As in pv_kick_node():
> 
> 		https://marc.info/?l=linux-kernel&m=150274124711012
> 
> Besides, do we also want to add Co* into the set? I think there may be
> some people who are still confused and think per-location SC does not
> hold, so they may add unnecessary barriers in their code. Those (Co*)
> recipes could serve as a guide for state-machine style programming.
> Thoughts?

Indeed, it would be good to have some single-variable-SC recipes.

And single-variable-SC holds only if you use READ_ONCE().  ;-)
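
For example (just a sketch of the usual compiler mischief), given a plain
load the compiler is within its rights to fuse it with other loads or to
hoist it out of a loop:

	while (!need_to_stop)	/* plain load: may be hoisted out of the loop */
		do_something();

whereas READ_ONCE(need_to_stop) forces a full-sized load to be emitted on
each iteration.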

							Thanx, Paul


* Re: Memory-ordering recipes
  2017-09-18 14:25   ` Paul E. McKenney
@ 2017-09-19  2:00     ` Boqun Feng
  0 siblings, 0 replies; 13+ messages in thread
From: Boqun Feng @ 2017-09-19  2:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: j.alglave, luc.maranget, parri.andrea, stern, dhowells, peterz,
	will.deacon, npiggin, linux-kernel


On Mon, Sep 18, 2017 at 07:25:48AM -0700, Paul E. McKenney wrote:
> On Mon, Sep 18, 2017 at 03:52:42PM +0800, Boqun Feng wrote:
> > On Sun, Sep 17, 2017 at 04:05:09PM -0700, Paul E. McKenney wrote:
> > > Hello!
> > > 
> > 
> > Hi Paul,
> > 
> > > The topic of memory-ordering recipes came up at the Linux Plumbers
> > > Conference microconference on Friday, so I thought that I should summarize
> > > what is currently "out there":
> > > 
> > > 1.	memory-barriers.txt:  A bit rambling and diffuse for a recipes
> > > 	document.
> > > 
> > > 2.	https://www.kernel.org/pub/linux/kernel/people/paulmck/LWNLinuxMM/Examples.html
> > > 	Many of the examples are on-point, but this is aimed more
> > > 	at understanding the memory model than at an organized set
> > > 	of recipes.
> > > 
> > > 3.	https://www.kernel.org/pub/linux/kernel/people/paulmck/LWNLinuxMM/Examples.html
> > 
> > Duplicate links ;-) This should be a link to some slides?
> 
> Indeed!  How about this one?
> 
> http://www.linuxplumbersconf.org/2017/ocw//system/presentations/4708/original/LKMM-overview.2017.09.15b.pdf
> 

Got it.

Thanks for the link ;-)

Regards,
Boqun

> > > 	Slides 15-20.  Again, some of the litmus tests are on-point,
> > > 	but the focus is more on understanding the memory model than on
> > > 	an organized set of recipes.
> > > 
> > > So what litmus tests are needed?  Here is my initial set:
> > > 
> > > 1.	Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
> > > 
> > > 	Lots of variety here, can in some cases substitute:
> > > 	
> > > 	a.	READ_ONCE() for smp_load_acquire()
> > > 	b.	WRITE_ONCE() for smp_store_release()
> > > 	c.	Dependencies for both smp_load_acquire() and
> > > 		smp_store_release().
> > > 	d.	smp_wmb() for smp_store_release() in first thread
> > > 		of ISA2 and Z6.2.
> > > 	e.	smp_rmb() for smp_load_acquire() in last thread of ISA2.
> > > 
> > > 2.	MP (see test6.pdf for nickname translation)
> > > 
> > > 	a.	smp_store_release() / smp_load_acquire()
> > > 	b.	rcu_assign_pointer() / rcu_dereference()
> > > 	c.	smp_wmb() / smp_rmb()
> > > 	d.	Replacing either of the above with smp_mb()
> > > 
> > > 3.	SB
> > > 
> > > 	a.	smp_mb(), as in lockless wait-wakeup coordination.
> > > 		And as in sys_membarrier()-scheduler coordination,
> > > 		for that matter.
> > 
> > 	b.	replace smp_mb() with smp_mb__before_atomic() followed
> > 		by a _relaxed cmpxchg? As in pv_kick_node():
> > 
> > 		https://marc.info/?l=linux-kernel&m=150274124711012
> > 
> > Besides, do we also want to add Co* into the set? I think there may be
> > some people who are still confused and think per-location SC does not
> > hold, so they may add unnecessary barriers in their code. Those (Co*)
> > recipes could serve as a guide for state-machine style programming.
> > Thoughts?
> 
> Indeed, it would be good to have some single-variable-SC recipes.
> 
> And single-variable-SC holds only if you use READ_ONCE().  ;-)
> 
> 							Thanx, Paul
> 



* Re: Memory-ordering recipes
  2017-09-17 23:05 Memory-ordering recipes Paul E. McKenney
  2017-09-18  7:52 ` Boqun Feng
@ 2017-09-21 12:45 ` Peter Zijlstra
  2017-09-21 15:26   ` Paul E. McKenney
  2017-10-17 21:01 ` Paul E. McKenney
  2 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2017-09-21 12:45 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: j.alglave, luc.maranget, parri.andrea, stern, dhowells,
	will.deacon, boqun.feng, npiggin, linux-kernel

On Sun, Sep 17, 2017 at 04:05:09PM -0700, Paul E. McKenney wrote:

> So what litmus tests are needed?  Here is my initial set:
> 
> 1.	Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
> 
> 	Lots of variety here, can in some cases substitute:
> 	
> 	a.	READ_ONCE() for smp_load_acquire()
> 	b.	WRITE_ONCE() for smp_store_release()
> 	c.	Dependencies for both smp_load_acquire() and
> 		smp_store_release().
> 	d.	smp_wmb() for smp_store_release() in first thread
> 		of ISA2 and Z6.2.
> 	e.	smp_rmb() for smp_load_acquire() in last thread of ISA2.
> 
> 2.	MP (see test6.pdf for nickname translation)
> 
> 	a.	smp_store_release() / smp_load_acquire()
> 	b.	rcu_assign_pointer() / rcu_dereference()
> 	c.	smp_wmb() / smp_rmb()
> 	d.	Replacing either of the above with smp_mb()
> 
> 3.	SB
> 
> 	a.	smp_mb(), as in lockless wait-wakeup coordination.
> 		And as in sys_membarrier()-scheduler coordination,
> 		for that matter.
> 
> Others?

So I have no idea what you're proposing here. The above is just a bunch
of words without meaning :-(


* Re: Memory-ordering recipes
  2017-09-21 12:45 ` Peter Zijlstra
@ 2017-09-21 15:26   ` Paul E. McKenney
  2017-09-21 16:15     ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2017-09-21 15:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: j.alglave, luc.maranget, parri.andrea, stern, dhowells,
	will.deacon, boqun.feng, npiggin, linux-kernel

On Thu, Sep 21, 2017 at 02:45:31PM +0200, Peter Zijlstra wrote:
> On Sun, Sep 17, 2017 at 04:05:09PM -0700, Paul E. McKenney wrote:
> 
> > So what litmus tests are needed?  Here is my initial set:
> > 
> > 1.	Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB

I grouped the expansions of all of these names at the end.

> > 	Lots of variety here, can in some cases substitute:
> > 	
> > 	a.	READ_ONCE() for smp_load_acquire()
> > 	b.	WRITE_ONCE() for smp_store_release()
> > 	c.	Dependencies for both smp_load_acquire() and
> > 		smp_store_release().
> > 	d.	smp_wmb() for smp_store_release() in first thread
> > 		of ISA2 and Z6.2.
> > 	e.	smp_rmb() for smp_load_acquire() in last thread of ISA2.

The key point of these is to illustrate patterns that require only
the lightest ordering.

> > 2.	MP (see test6.pdf for nickname translation)
> > 
> > 	a.	smp_store_release() / smp_load_acquire()
> > 	b.	rcu_assign_pointer() / rcu_dereference()
> > 	c.	smp_wmb() / smp_rmb()
> > 	d.	Replacing either of the above with smp_mb()

This one is a workhorse, used very frequently.

> > 3.	SB
> > 
> > 	a.	smp_mb(), as in lockless wait-wakeup coordination.
> > 		And as in sys_membarrier()-scheduler coordination,
> > 		for that matter.

This is also used frequently, for example, in lockless wait-wakeup
situations.

> > Others?

Someone (quite rightly) suggested adding some single-variable SC
examples, which I will do.

> So I have no idea what you're proposing here. The above is just a bunch
> of words without meaning :-(

As Boqun said on IRC, this URL is your friend:

	https://www.cl.cam.ac.uk/~pes20/ppc-supplemental/test6.pdf

Many of the names seem quite arbitrary, but the abbreviation can be useful.

ISA2 is the first one on page 2, and has this pattern of reads and
writes:

	CPU 0			CPU 1			CPU 2

	WRITE_ONCE(x, 1);	r1 = READ_ONCE(y);	r2 = READ_ONCE(z);
	WRITE_ONCE(y, 1);	WRITE_ONCE(z, 1);	r3 = READ_ONCE(x);

	BUG_ON(r1 == 1 && r2 == 1 && r3 == 0);

Arbitrary ordering can be added to all of these litmus-test patterns,
for example, the writes to y and z might become smp_store_release()
and the read from z might become smp_load_acquire().  Or, alternatively,
smp_mb() might be placed between accesses on all thread CPUs.  The key
point is that "ISA2" identifies not a specific litmus test, but instead a
family of them with the same pattern of reads, writes and RMW operations,
but with different ordering properties.

Z6.3 is the second one on page 2:

	CPU 0			CPU 1			CPU 2

	WRITE_ONCE(x, 2);	r1 = READ_ONCE(y);	r2 = READ_ONCE(z);
	WRITE_ONCE(y, 1);	WRITE_ONCE(z, 1);	WRITE_ONCE(x, 1);

	BUG_ON(r1 == 1 && r2 == 1 && x == 2);

LB is the last one on the extreme left of page 1.  "LB" stands for
"load buffering", and each CPU's first access is a load and last
access is a store:

	CPU 0			CPU 1

	r1 = READ_ONCE(x);	r2 = READ_ONCE(y);
	WRITE_ONCE(y, 1);	WRITE_ONCE(x, 1);

	BUG_ON(r1 == 1 && r2 == 1);

3.LB is simply a three-CPU extension of LB, and "LB" by itself is
sometimes used to cover an arbitrary number of CPUs:

	CPU 0			CPU 1			CPU 2

	r1 = READ_ONCE(x);	r2 = READ_ONCE(y);	r3 = READ_ONCE(z);
	WRITE_ONCE(y, 1);	WRITE_ONCE(z, 1);	WRITE_ONCE(x, 1);

	BUG_ON(r1 == 1 && r2 == 1 && r3 == 1);

MP is the second on the extreme left of page 1.  "MP" stands for "message
passing", and is used very heavily.  The idea is that "x" is the message
(sent by CPU 0), and "y" is a flag saying that the message is ready to
be received (by CPU 1).

	CPU 0			CPU 1

	WRITE_ONCE(x, 1);	r1 = READ_ONCE(y);
	WRITE_ONCE(y, 1);	r2 = READ_ONCE(x);

	BUG_ON(r1 == 1 && r2 == 0);

SB is the fourth on the extreme left of page 1.  "SB" stands for "store
buffering" because systems without store buffers won't reorder this one.

	CPU 0			CPU 1

	WRITE_ONCE(x, 1);	WRITE_ONCE(y, 1);
	r1 = READ_ONCE(y);	r2 = READ_ONCE(x);

	BUG_ON(r1 == 0 && r2 == 0);

Does that help?

Oh, and the actual recipes would include ordering as indicated by
the sub-bullets.

							Thanx, Paul


* Re: Memory-ordering recipes
  2017-09-21 15:26   ` Paul E. McKenney
@ 2017-09-21 16:15     ` Peter Zijlstra
  2017-09-21 16:40       ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2017-09-21 16:15 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: j.alglave, luc.maranget, parri.andrea, stern, dhowells,
	will.deacon, boqun.feng, npiggin, linux-kernel

On Thu, Sep 21, 2017 at 08:26:42AM -0700, Paul E. McKenney wrote:
> ISA2 is the first one on page 2, and has this pattern of reads and
> writes:
> 
> 	CPU 0			CPU 1			CPU 2
> 
> 	WRITE_ONCE(x, 1);	r1 = READ_ONCE(y);	r2 = READ_ONCE(z);
> 	WRITE_ONCE(y, 1);	WRITE_ONCE(z, 1);	r3 = READ_ONCE(x);
> 
> 	BUG_ON(r1 == 1 && r2 == 1 && r3 == 0);
> 
> Arbitrary ordering can be added to all of these litmus-test patterns,
> for example, the writes to y and z might become smp_store_release()
> and the read from z might become smp_load_acquire().  Or, alternatively,
> smp_mb() might be placed between accesses on all thread CPUs.  The key
> point is that "ISA2" identifies not a specific litmus test, but instead a
> family of them with the same pattern of reads, writes and RMW operations,
> but with different ordering properties.
> 
> Z6.3 is the second one on page 2:
> 
> 	CPU 0			CPU 1			CPU 2
> 
> 	WRITE_ONCE(x, 2);	r1 = READ_ONCE(y);	r2 = READ_ONCE(z);
> 	WRITE_ONCE(y, 1);	WRITE_ONCE(z, 1);	WRITE_ONCE(x, 1);
> 
> 	BUG_ON(r1 == 1 && r2 == 1 && x == 2);

But why are these useful to include in a recipes list? I would imagine
those should cover the simple 2 threads stuff. Once you go fancy and
need 3 CPUs I feel people had better know wth they're on about.

> LB is the last one on the extreme left of page 1.  "LB" stands for
> "load buffering", and each CPU's first access is a load and last
> access is a store:
> 
> 	CPU 0			CPU 1
> 
> 	r1 = READ_ONCE(x);	r2 = READ_ONCE(y);
> 	WRITE_ONCE(y, 1);	WRITE_ONCE(x, 1);
> 
> 	BUG_ON(r1 == 1 && r2 == 1);

> MP is the second on the extreme left of page 1.  "MP" stands for "message
> passing", and is used very heavily.  The idea is that "x" is the message
> (sent by CPU 0), and "y" is a flag saying that the message is ready to
> be received (by CPU 1).
> 
> 	CPU 0			CPU 1
> 
> 	WRITE_ONCE(x, 1);	r1 = READ_ONCE(y);
> 	WRITE_ONCE(y, 1);	r2 = READ_ONCE(x);
> 
> 	BUG_ON(r1 == 1 && r2 == 0);

Right, these two are fairly common patterns.

> SB is the fourth on the extreme left of page 1.  "SB" stands for "store
> buffering" because systems without store buffers won't reorder this one.
> 
> 	CPU 0			CPU 1
> 
> 	WRITE_ONCE(x, 1);	WRITE_ONCE(y, 1);
> 	r1 = READ_ONCE(y);	r2 = READ_ONCE(x);
> 
> 	BUG_ON(r1 == 0 && r2 == 0);
> 
> Does that help?
> 
> Oh, and the actual recipes would include ordering as indicated by
> the sub-bullets.

Which just generates a terrible lot of noise. Why would people be
interested in these permutations? Why not the minimal set that makes the
guarantee?

Also, none of these cover 'simple' stuff like a ring-buffer.

So I have to ask, what is the purpose of this recipes list?


* Re: Memory-ordering recipes
  2017-09-21 16:15     ` Peter Zijlstra
@ 2017-09-21 16:40       ` Paul E. McKenney
  2017-09-22  9:29         ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2017-09-21 16:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: j.alglave, luc.maranget, parri.andrea, stern, dhowells,
	will.deacon, boqun.feng, npiggin, linux-kernel

On Thu, Sep 21, 2017 at 06:15:47PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 21, 2017 at 08:26:42AM -0700, Paul E. McKenney wrote:
> > ISA2 is the first one on page 2, and has this pattern of reads and
> > writes:
> > 
> > 	CPU 0			CPU 1			CPU 2
> > 
> > 	WRITE_ONCE(x, 1);	r1 = READ_ONCE(y);	r2 = READ_ONCE(z);
> > 	WRITE_ONCE(y, 1);	WRITE_ONCE(z, 1);	r3 = READ_ONCE(x);
> > 
> > 	BUG_ON(r1 == 1 && r2 == 1 && r3 == 0);
> > 
> > Arbitrary ordering can be added to all of these litmus-test patterns,
> > for example, the writes to y and z might become smp_store_release()
> > and the read from z might become smp_load_acquire().  Or, alternatively,
> > smp_mb() might be placed between accesses on all thread CPUs.  The key
> > point is that "ISA2" identifies not a specific litmus test, but instead a
> > family of them with the same pattern of reads, writes and RMW operations,
> > but with different ordering properties.
> > 
> > Z6.3 is the second one on page 2:
> > 
> > 	CPU 0			CPU 1			CPU 2
> > 
> > 	WRITE_ONCE(x, 2);	r1 = READ_ONCE(y);	r2 = READ_ONCE(z);
> > 	WRITE_ONCE(y, 1);	WRITE_ONCE(z, 1);	WRITE_ONCE(x, 1);
> > 
> > 	BUG_ON(r1 == 1 && r2 == 1 && x == 2);
> 
> But why are these useful to include in a recipes list? I would imagine
> those should cover the simple 2 threads stuff. Once you go fancy and
> need 3 CPUs I feel people had better know wth they're on about.

Release-acquire chains are good material for beginners, and they are
one of the few memory-ordering patterns that extend nicely to more than
two CPUs, hence the three-CPU examples.

> > LB is the last one on the extreme left of page 1.  "LB" stands for
> > "load buffering", and each CPU's first access is a load and last
> > access is a store:
> > 
> > 	CPU 0			CPU 1
> > 
> > 	r1 = READ_ONCE(x);	r2 = READ_ONCE(y);
> > 	WRITE_ONCE(y, 1);	WRITE_ONCE(x, 1);
> > 
> > 	BUG_ON(r1 == 1 && r2 == 1);
> 
> > MP is the second on the extreme left of page 1.  "MP" stands for "message
> > passing", and is used very heavily.  The idea is that "x" is the message
> > (sent by CPU 0), and "y" is a flag saying that the message is ready to
> > be received (by CPU 1).
> > 
> > 	CPU 0			CPU 1
> > 
> > 	WRITE_ONCE(x, 1);	r1 = READ_ONCE(y);
> > 	WRITE_ONCE(y, 1);	r2 = READ_ONCE(x);
> > 
> > 	BUG_ON(r1 == 1 && r2 == 0);
> 
> Right, these two are fairly common patterns.
> 
> > SB is the fourth on the extreme left of page 1.  "SB" stands for "store
> > buffering" because systems without store buffers won't reorder this one.
> > 
> > 	CPU 0			CPU 1
> > 
> > 	WRITE_ONCE(x, 1);	WRITE_ONCE(y, 1);
> > 	r1 = READ_ONCE(y);	r2 = READ_ONCE(x);
> > 
> > 	BUG_ON(r1 == 0 && r2 == 0);
> > 
> > Does that help?
> > 
> > Oh, and the actual recipes would include ordering as indicated by
> > the sub-bullets.
> 
> Which just generates a terrible lot of noise. Why would people be
> interested in these permutations? Why not the minimal set that makes the
> guarantee?

Hmmm...  Exactly what do you consider this minimal set to be?
Exactly which of the above scenarios would you leave out?  Are there
other scenarios that you would want to include?

> Also, none of these cover 'simple' stuff like a ring-buffer.

Control dependencies are now 'simple'?  ;-)

> So I have to ask, what is the purpose of this recipes list?

To give people common usage patterns to commit to memory, as an
alternative to running the tool every time they turn around.  For the
people who generate code quickly, frequently running the tool would be
a huge productivity hit.  But it is a probabilities game.  If a given
usage pattern is rare enough, it is better to leave it to the tool.

Also, to guide code style, urging people to use straightforward designs
instead of the much more strange ones that they might otherwise
gravitate to.

							Thanx, Paul


* Re: Memory-ordering recipes
  2017-09-21 16:40       ` Paul E. McKenney
@ 2017-09-22  9:29         ` Peter Zijlstra
  2017-09-22 21:06           ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2017-09-22  9:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: j.alglave, luc.maranget, parri.andrea, stern, dhowells,
	will.deacon, boqun.feng, npiggin, linux-kernel

On Thu, Sep 21, 2017 at 09:40:46AM -0700, Paul E. McKenney wrote:
> > Also, none of these cover 'simple' stuff like a ring-buffer.
> 
> Control dependencies are now 'simple'?  ;-)

They're not particularly harder than any other barrier I find.

But again, this comes back to the purpose of this recipes thing. I'm
thinking it's meant to be a how-to crib sheet for people who really don't
want to understand this stuff.

So it should include common, robust patterns and leave it at that. And
explain them through the code, not through litmus tests -- because those
are a royal friggin pain to learn to read in any case ;-)

So, 'how do I do a lockless ring-buffer' is a simple enough question
with a fairly straightforward answer. The 'how do I prove that answer is
correct' part is a whole other thing, but one I suspect most people really
don't care about.

So, once again, what is the intended purpose of this document? A gentle
introduction to memory ordering, or a crib sheet for the working code
monkey?

Because if people _want_ to understand this stuff, memory-barriers.txt
should be their document. If it's become too hard to read, we should look
at fixing that, but adding another document doesn't seem like a solution
to that problem.

For the people who really don't care and just want to get their thing
done; we could have a document explaining common patterns.. maybe.

So once again, what's the purpose of this new document?


* Re: Memory-ordering recipes
  2017-09-22  9:29         ` Peter Zijlstra
@ 2017-09-22 21:06           ` Paul E. McKenney
  0 siblings, 0 replies; 13+ messages in thread
From: Paul E. McKenney @ 2017-09-22 21:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: j.alglave, luc.maranget, parri.andrea, stern, dhowells,
	will.deacon, boqun.feng, npiggin, linux-kernel

On Fri, Sep 22, 2017 at 11:29:18AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 21, 2017 at 09:40:46AM -0700, Paul E. McKenney wrote:
> > > Also, none of these cover 'simple' stuff like a ring-buffer.
> > 
> > Control dependencies are now 'simple'?  ;-)
> 
> They're not particularly harder than any other barrier I find.

Other than keeping the compiler from breaking them, I suppose.  ;-)

> But again, this comes back to the purpose of this recipes thing. I'm
> thinking it's meant to be a how-to crib sheet for people who really don't
> want to understand this stuff.

Other use cases include learning about memory ordering to begin with,
a convenient repository of the most common cases, and helping guide
people to the more maintainable concurrency designs.

> So it should include common, robust patterns and leave it at that. And
> explain them through the code, not through litmus tests -- because those
> are a royal friggin pain to learn to read in any case ;-)

Litmus tests are no harder to learn than is kernel code, and we do need
to allow people starting from scratch.  Plus the big advantage of the
litmus-test code is that you can check it with the tool, which is not yet
feasible for general kernel code.  Though I believe we will get there,
just not in the next 5 years.

But I agree that pointing to examples in the kernel is a good thing,
though left to myself, I would include the kernel version and not sign
up to change the examples as the kernel changes.

> So, 'how do I do a lockless ring-buffer' is a simple enough question
> with a fairly straightforward answer. The 'how do I prove that answer is
> correct' part is a whole other thing, but one I suspect most people really
> don't care about.

At least some of us had jolly better care about showing that it is
correct!  But yes, most people are going to cargo-cult known-good
patterns.  And these people are an important audience for the
recipes document.

And "how do I do a lockless ring buffer" might have a straightforward
answer now, but that code had at least its share of bugs along the way.
On the other hand, if your point is that the memory model would cover
more cases if it handled arrays, no argument here.  I would prioritize
efficiently handling locking higher, but arrays would be good.

> So, once again, what is the intended purpose of this document? A gentle
> introduction to memory ordering, or a crib sheet for the working code
> monkey?

Those are definitely two of the use cases.

> Because if people _want_ to understand this stuff, memory-barriers.txt
> should be their document. If it's become too hard to read, we should look
> at fixing that, but adding another document doesn't seem like a solution
> to that problem.

There is no denying that memory-barriers.txt exceeded critical mass some
years ago (https://lwn.net/Articles/575835/), and that was before the
_acquire, _release, and _relaxed variants of atomic RMW operations,
among other things.  Plus it is incomplete, as it is essentially a
compendium of answers to specific questions that have come up over
the years.  For example, right now if you want to know how some
specific memory-ordering mechanism interacts with RCU, too bad, it
is not documented at all.  I suspect that most people do understand at
some level that synchronize_rcu() acts like a full memory barrier, and
there are the comment headers for the various grace-period functions,
but how does all this interact with (say) an atomic_add_return_acquire()
in the middle of some random RCU read-side critical section?  This sort
of thing transforms memory-barriers.txt into "Combinatorial Explosion
Я Us", hence the formal model.

This whole memory-model effort is in fact the way we are looking to fix
this.  From what I can see, memory-barriers.txt (or a document derived
from it) would have some explanation of what the hardware gets up to,
discussion of I/O barriers, care and feeding of control dependencies,
and perhaps a few other things.

> For the people who really don't care and just want to get their thing
> done; we could have a document explaining common patterns.. maybe.

And that is another use case for this recipes document.

> So once again, what's the purpose of this new document?

Let's see...

1.	By-example introduction to the most common memory-ordering
	patterns for people starting out.

2.	Repository of common cases for easy reference.

3.	Repository of good examples for "don't make up some strange
	thing if these common cases work for you" purposes.

4.	Source of cargo-culting for people who just want to get their
	job done and get on with their lives, without necessarily
	understanding the full glory of memory ordering.

This document is not expected to stand alone.  More complex cases, up
to a point, can be handled by the memory model.  And comments in the
source code are very good things as well.

Does that help, or am I missing your point?

							Thanx, Paul


* Re: Memory-ordering recipes
  2017-09-17 23:05 Memory-ordering recipes Paul E. McKenney
  2017-09-18  7:52 ` Boqun Feng
  2017-09-21 12:45 ` Peter Zijlstra
@ 2017-10-17 21:01 ` Paul E. McKenney
  2017-10-18  1:07   ` Andrea Parri
  2 siblings, 1 reply; 13+ messages in thread
From: Paul E. McKenney @ 2017-10-17 21:01 UTC (permalink / raw)
  To: j.alglave, luc.maranget, parri.andrea, stern, dhowells, peterz,
	will.deacon, boqun.feng, npiggin
  Cc: linux-kernel


On Sun, Sep 17, 2017 at 04:05:09PM -0700, Paul E. McKenney wrote:
> Hello!
> 
> The topic of memory-ordering recipes came up at the Linux Plumbers
> Conference microconference on Friday, so I thought that I should summarize
> what is currently "out there":

And here is an updated list of potential Linux-kernel examples for a
"recipes" document, and thank you for the feedback.  Please feel free
to counterpropose better examples.  In addition, if there is some other
pattern that is commonplace and important enough to be included in a
recipes document, please point it out.

							Thanx, Paul

------------------------------------------------------------------------

This document lists the litmus-test patterns that we have been discussing,
along with examples from the Linux kernel.  This is intended to feed into
the recipes document.  All examples are from v4.13.

0.	Simple special cases

	If there is only one CPU on the one hand or only one variable
	on the other, the code will execute in order.  There are (as
	usual) some things to be careful of:

	a.	There are some aspects of the C language that are
		unordered.  For example, in the expression "f(x) + g(y)",
		the order in which f and g are called is not defined;
		the object code is allowed to use either order or even
		to interleave the computations.

	b.	Compilers are permitted to use the "as-if" rule.  That is,
		a compiler can emit whatever code it likes, as long as
		the results of a single-threaded execution appear just
		as if the compiler had followed all the relevant rules.
		To see this, compile with a high level of optimization
		and run the debugger on the resulting binary.

	c.	If there is only one variable but multiple CPUs, then
		that variable must be properly aligned and all accesses
		to that variable must be full sized.  Variables that
		straddle cachelines or pages void your full-ordering
		warranty, as do undersized accesses that load from or
		store to only part of the variable.
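
	As a sketch of the single-variable guarantee (illustration only;
	x is properly aligned and initially zero):

		CPU 0			CPU 1

		WRITE_ONCE(x, 1);	r1 = READ_ONCE(x);
		WRITE_ONCE(x, 2);	r2 = READ_ONCE(x);

		BUG_ON(r1 == 2 && r2 == 1);

	Each CPU's accesses to x appear in program order, and all CPUs
	agree on the order of the stores to x, so r2 cannot see an older
	value than r1 did.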

1.	Another simple case: Locking.  [ Assuming you don't think too
	hard about it, that is! ]

	Any CPU that has acquired a given lock sees any changes previously
	made by any CPU prior to having released that same lock.

	[ Should we discuss chaining back through different locks,
	  sort of like release-acquire chains? ]
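
	As a sketch (illustration only; x and y are initially zero and
	mylock is an ordinary spinlock):

		CPU 0				CPU 1

		spin_lock(&mylock);		spin_lock(&mylock);
		WRITE_ONCE(x, 1);		r1 = READ_ONCE(x);
		WRITE_ONCE(y, 1);		r2 = READ_ONCE(y);
		spin_unlock(&mylock);		spin_unlock(&mylock);

		BUG_ON(r1 != r2);

	Whichever CPU acquires mylock second sees everything done by the
	other CPU within (and before) its critical section, so CPU 1 sees
	either both stores or neither.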

2.	MP (see test6.pdf for nickname translation)

	a.	smp_store_release() / smp_load_acquire()

		init_stack_slab() in lib/stackdepot.c uses release-acquire
		to handle initialization of a slab of the stack.  Working
		out the mutual-exclusion design is left as an exercise for
		the reader.
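
		The generic shape (a sketch, not the actual stackdepot
		code; msg and flag are initially zero) is:

			CPU 0				CPU 1

			WRITE_ONCE(msg, 1);		r1 = smp_load_acquire(&flag);
			smp_store_release(&flag, 1);	r2 = READ_ONCE(msg);

			BUG_ON(r1 == 1 && r2 == 0);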

	b.	rcu_assign_pointer() / rcu_dereference()

		expand_to_next_prime() does the rcu_assign_pointer(),
		and next_prime_number() does the rcu_dereference().
		This mediates access to a bit vector that is expanded
		as additional primes are needed.  These two functions
		are in lib/prime_numbers.c.
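
		The generic shape (a sketch, not the actual prime_numbers
		code; gp initially points to a valid zero-filled structure
		and p to a second structure) is:

			CPU 0				CPU 1

			p->data = 1;			q = rcu_dereference(gp);
			rcu_assign_pointer(gp, p);	r1 = READ_ONCE(q->data);

			BUG_ON(q == p && r1 == 0);

		In real code the reader would also be within an RCU
		read-side critical section to keep the structure from
		being freed out from under it.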

	c.	smp_wmb() / smp_rmb()

		xlog_state_switch_iclogs() contains the following:

			log->l_curr_block -= log->l_logBBsize;
			ASSERT(log->l_curr_block >= 0);
			smp_wmb();
			log->l_curr_cycle++;

		And xlog_valid_lsn() contains the following:

			cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
			smp_rmb();
			cur_block = ACCESS_ONCE(log->l_curr_block);

		Alternatively, from the comment in perf_output_put_handle()
		in kernel/events/ring_buffer.c:

		 *   kernel				user
		 *
		 *   if (LOAD ->data_tail) {		LOAD ->data_head
		 *			(A)		smp_rmb()	(C)
		 *	STORE $data			LOAD $data
		 *	smp_wmb()	(B)		smp_mb()	(D)
		 *	STORE ->data_head		STORE ->data_tail
		 *   }
		 *
		 * Where A pairs with D, and B pairs with C.

		The B/C pairing is MP with smp_wmb() and smp_rmb().

	d.	Replacing either of the above with smp_mb()

		Holding off on this one for the moment...

3.	LB

	a.	LB+ctrl+mb

		Again, from the comment in perf_output_put_handle()
		in kernel/events/ring_buffer.c:

		 *   kernel				user
		 *
		 *   if (LOAD ->data_tail) {		LOAD ->data_head
		 *			(A)		smp_rmb()	(C)
		 *	STORE $data			LOAD $data
		 *	smp_wmb()	(B)		smp_mb()	(D)
		 *	STORE ->data_head		STORE ->data_tail
		 *   }
		 *
		 * Where A pairs with D, and B pairs with C.

		The A/D pairing covers this one.
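
		Distilled to a litmus-test sketch (with x standing in,
		roughly, for ->data_tail and y for the data; both are
		initially zero), the A/D pairing looks something like:

			CPU 0				CPU 1

			r1 = READ_ONCE(x);		r2 = READ_ONCE(y);
			if (r1)				smp_mb();
				WRITE_ONCE(y, 1);	WRITE_ONCE(x, 1);

			BUG_ON(r1 == 1 && r2 == 1);

		The control dependency orders CPU 0's load against its
		store, and the smp_mb() does the same on CPU 1, so the
		two loads cannot both see the other CPU's store.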

4.	Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB

	Lots of variety here, can in some cases substitute:
	
	a.	READ_ONCE() for smp_load_acquire()
	b.	WRITE_ONCE() for smp_store_release()
	c.	Dependencies for both smp_load_acquire() and
		smp_store_release().
	d.	smp_wmb() for smp_store_release() in first thread
		of ISA2 and Z6.2.
	e.	smp_rmb() for smp_load_acquire() in last thread of ISA2.

	The canonical illustration of LB involves the various memory
	allocators, where you don't want a load from about-to-be-freed
	memory to see a store initializing a later incarnation of that
	same memory area.  But the per-CPU caches make this a very
	long and complicated example.

	I am not aware of any three-CPU release-acquire chains in the
	Linux kernel.  There are three-CPU lock-based chains in RCU,
	but these are not at all simple, either.

	Thoughts?

5.	SB

	a.	smp_mb(), as in lockless wait-wakeup coordination.
		And as in sys_membarrier()-scheduler coordination,
		for that matter.

		Examples seem to be lacking.  Most cases use locking.
		Here is one rather strange one from RCU:

		void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
		{
			unsigned long flags;
			bool needwake;
			bool havetask = READ_ONCE(rcu_tasks_kthread_ptr);

			rhp->next = NULL;
			rhp->func = func;
			raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
			needwake = !rcu_tasks_cbs_head;
			*rcu_tasks_cbs_tail = rhp;
			rcu_tasks_cbs_tail = &rhp->next;
			raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
			/* We can't create the thread unless interrupts are enabled. */
			if ((needwake && havetask) ||
			    (!havetask && !irqs_disabled_flags(flags))) {
				rcu_spawn_tasks_kthread();
				wake_up(&rcu_tasks_cbs_wq);
			}
		}

		And for the wait side, using synchronize_sched() to supply
		the barrier for both ends, with the preemption disabling
		due to raw_spin_lock_irqsave() serving as the read-side
		critical section:

		if (!list) {
			wait_event_interruptible(rcu_tasks_cbs_wq,
						 rcu_tasks_cbs_head);
			if (!rcu_tasks_cbs_head) {
				WARN_ON(signal_pending(current));
				schedule_timeout_interruptible(HZ/10);
			}
			continue;
		}
		synchronize_sched();

		-----------------

		Here is another one that uses atomic_cmpxchg() as a
		full memory barrier:

		if (!wait_event_timeout(*wait, !atomic_read(stopping),
					msecs_to_jiffies(1000))) {
			atomic_set(stopping, 0);
			smp_mb();
			return -ETIMEDOUT;
		}

		int omap3isp_module_sync_is_stopping(wait_queue_head_t *wait,
						     atomic_t *stopping)
		{
			if (atomic_cmpxchg(stopping, 1, 0)) {
				wake_up(wait);
				return 1;
			}

			return 0;
		}

		-----------------

		And here is the generic pattern for the above two examples
		taken from waitqueue_active() in include/linux/wait.h:

 *      CPU0 - waker                    CPU1 - waiter
 *
 *                                      for (;;) {
 *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
 *      smp_mb();                         // smp_mb() from set_current_state()
 *      if (waitqueue_active(wq_head))         if (@cond)
 *        wake_up(wq_head);                      break;
 *                                        schedule();
 *                                      }
 *                                      finish_wait(&wq_head, &wait);

		Note that prepare_to_wait() does both the write
		and the set_current_state() that contains the smp_mb().
		The read is the waitqueue_active() on the one hand and
		the "if (@cond)" on the other.

6.	W+RWC+porel+mb+mb

	See recipes-LKcode-63cae12bce986.txt.

	Mostly of historical interest -- as far as I know, this commit
	was the first to contain a litmus test.

7.	Context switch and migration.  A bit specialized, so might leave
	this one out.

	When a thread moves from one CPU to another, the
	scheduler is required to do whatever is necessary for the thread
	to see any prior accesses that it executed on other CPUs.  This
	includes "interesting" interactions with wake_up() shown in the
	following comment from try_to_wake_up() in kernel/sched/core.c:

 * Notes on Program-Order guarantees on SMP systems.
 *
 *  MIGRATION
 *
 * The basic program-order guarantee on SMP systems is that when a task [t]
 * migrates, all its activity on its old CPU [c0] happens-before any subsequent
 * execution on its new CPU [c1].
 *
 * For migration (of runnable tasks) this is provided by the following means:
 *
 *  A) UNLOCK of the rq(c0)->lock scheduling out task t
 *  B) migration for t is required to synchronize *both* rq(c0)->lock and
 *     rq(c1)->lock (if not at the same time, then in that order).
 *  C) LOCK of the rq(c1)->lock scheduling in task
 *
 * Transitivity guarantees that B happens after A and C after B.
 * Note: we only require RCpc transitivity.
 * Note: the CPU doing B need not be c0 or c1
 *
 * Example:
 *
 *   CPU0            CPU1            CPU2
 *
 *   LOCK rq(0)->lock
 *   sched-out X
 *   sched-in Y
 *   UNLOCK rq(0)->lock
 *
 *                                   LOCK rq(0)->lock // orders against CPU0
 *                                   dequeue X
 *                                   UNLOCK rq(0)->lock
 *
 *                                   LOCK rq(1)->lock
 *                                   enqueue X
 *                                   UNLOCK rq(1)->lock
 *
 *                   LOCK rq(1)->lock // orders against CPU2
 *                   sched-out Z
 *                   sched-in X
 *                   UNLOCK rq(1)->lock
 *
 *
 *  BLOCKING -- aka. SLEEP + WAKEUP
 *
 * For blocking we (obviously) need to provide the same guarantee as for
 * migration. However the means are completely different as there is no lock
 * chain to provide order. Instead we do:
 *
 *   1) smp_store_release(X->on_cpu, 0)
 *   2) smp_cond_load_acquire(!X->on_cpu)
 *
 * Example:
 *
 *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
 *
 *   LOCK rq(0)->lock LOCK X->pi_lock
 *   dequeue X
 *   sched-out X
 *   smp_store_release(X->on_cpu, 0);
 *
 *                    smp_cond_load_acquire(&X->on_cpu, !VAL);
 *                    X->state = WAKING
 *                    set_task_cpu(X,2)
 *
 *                    LOCK rq(2)->lock
 *                    enqueue X
 *                    X->state = RUNNING
 *                    UNLOCK rq(2)->lock
 *
 *                                          LOCK rq(2)->lock // orders against CPU1
 *                                          sched-out Z
 *                                          sched-in X
 *                                          UNLOCK rq(2)->lock
 *
 *                    UNLOCK X->pi_lock
 *   UNLOCK rq(0)->lock
 *
 *
 * However; for wakeups there is a second guarantee we must provide, namely we
 * must observe the state that lead to our wakeup. That is, not only must our
 * task observe its own prior state, it must also observe the stores prior to
 * its wakeup.
 *
 * This means that any means of doing remote wakeups must order the CPU doing
 * the wakeup against the CPU the task is going to end up running on. This,
 * however, is already required for the regular Program-Order guarantee above,
 * since the waking CPU is the one issueing the ACQUIRE (smp_cond_load_acquire).

[-- Attachment #2: recipes-LKcode-63cae12bce986.txt --]
[-- Type: text/plain, Size: 9202 bytes --]

commit 63cae12bce9861cec309798d34701cf3da20bc71
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Dec 9 14:59:00 2016 +0100

    perf/core: Fix sys_perf_event_open() vs. hotplug
    
    There is problem with installing an event in a task that is 'stuck' on
    an offline CPU.
    
    Blocked tasks are not dis-assosciated from offlined CPUs, after all, a
    blocked task doesn't run and doesn't require a CPU etc.. Only on
    wakeup do we ammend the situation and place the task on a available
    CPU.
    
    If we hit such a task with perf_install_in_context() we'll loop until
    either that task wakes up or the CPU comes back online, if the task
    waking depends on the event being installed, we're stuck.
    
    While looking into this issue, I also spotted another problem, if we
    hit a task with perf_install_in_context() that is in the middle of
    being migrated, that is we observe the old CPU before sending the IPI,
    but run the IPI (on the old CPU) while the task is already running on
    the new CPU, things also go sideways.
    
    Rework things to rely on task_curr() -- outside of rq->lock -- which
    is rather tricky. Imagine the following scenario where we're trying to
    install the first event into our task 't':
    
    CPU0            CPU1            CPU2
    
                    (current == t)
    
    t->perf_event_ctxp[] = ctx;
    smp_mb();
    cpu = task_cpu(t);
    
                    switch(t, n);
                                    migrate(t, 2);
                                    switch(p, t);
    
                                    ctx = t->perf_event_ctxp[]; // must not be NULL
    
    smp_function_call(cpu, ..);
    
                    generic_exec_single()
                      func();
                        spin_lock(ctx->lock);
                        if (task_curr(t)) // false
    
                        add_event_to_ctx();
                        spin_unlock(ctx->lock);
    
                                    perf_event_context_sched_in();
                                      spin_lock(ctx->lock);
                                      // sees event
    
    So its CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
    Because if CPU2's load of that variable were to observe NULL, it would
    not try to schedule the ctx and we'd have a task running without its
    counter, which would be 'bad'.
    
    As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
    first and not see the event yet, then CPU0 must observe task_curr()
    and retry. If the install happens first, then we must see the event on
    sched-in and all is well.
    
    I think we can translate the first part (until the 'must not be NULL')
    of the scenario to a litmus test like:
    
      C C-peterz
    
      {
      }
    
      P0(int *x, int *y)
      {
              int r1;
    
              WRITE_ONCE(*x, 1);
              smp_mb();
              r1 = READ_ONCE(*y);
      }
    
      P1(int *y, int *z)
      {
              WRITE_ONCE(*y, 1);
              smp_store_release(z, 1);
      }
    
      P2(int *x, int *z)
      {
              int r1;
              int r2;
    
              r1 = smp_load_acquire(z);
    	  smp_mb();
              r2 = READ_ONCE(*x);
      }
    
      exists
      (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
    
    Where:
      x is perf_event_ctxp[],
      y is our tasks's CPU, and
      z is our task being placed on the rq of CPU2.
    
    The P0 smp_mb() is the one added by this patch, ordering the store to
    perf_event_ctxp[] from find_get_context() and the load of task_cpu()
    in task_function_call().
    
    The smp_store_release/smp_load_acquire model the RCpc locking of the
    rq->lock and the smp_mb() of P2 is the context switch switching from
    whatever CPU2 was running to our task 't'.
    
    This litmus test evaluates into:
    
      Test C-peterz Allowed
      States 7
      0:r1=0; 2:r1=0; 2:r2=0;
      0:r1=0; 2:r1=0; 2:r2=1;
      0:r1=0; 2:r1=1; 2:r2=1;
      0:r1=1; 2:r1=0; 2:r2=0;
      0:r1=1; 2:r1=0; 2:r2=1;
      0:r1=1; 2:r1=1; 2:r2=0;
      0:r1=1; 2:r1=1; 2:r2=1;
      No
      Witnesses
      Positive: 0 Negative: 7
      Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
      Observation C-peterz Never 0 7
      Hash=e427f41d9146b2a5445101d3e2fcaa34
    
    And the strong and weak model agree.
    
    Reported-by: Mark Rutland <mark.rutland@arm.com>
    Tested-by: Mark Rutland <mark.rutland@arm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
    Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
    Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
    Cc: Jiri Olsa <jolsa@redhat.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
    Cc: Stephane Eranian <eranian@google.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Vince Weaver <vincent.weaver@maine.edu>
    Cc: Will Deacon <will.deacon@arm.com>
    Cc: jeremy.linton@arm.com
    Link: http://lkml.kernel.org/r/20161209135900.GU3174@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

diff --git a/kernel/events/core.c b/kernel/events/core.c
index ab15509fab8c..72ce7d63e561 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -2249,7 +2249,7 @@ static int  __perf_install_in_context(void *info)
 	struct perf_event_context *ctx = event->ctx;
 	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
 	struct perf_event_context *task_ctx = cpuctx->task_ctx;
-	bool activate = true;
+	bool reprogram = true;
 	int ret = 0;
 
 	raw_spin_lock(&cpuctx->ctx.lock);
@@ -2257,27 +2257,26 @@ static int  __perf_install_in_context(void *info)
 		raw_spin_lock(&ctx->lock);
 		task_ctx = ctx;
 
-		/* If we're on the wrong CPU, try again */
-		if (task_cpu(ctx->task) != smp_processor_id()) {
-			ret = -ESRCH;
-			goto unlock;
-		}
+		reprogram = (ctx->task == current);
 
 		/*
-		 * If we're on the right CPU, see if the task we target is
-		 * current, if not we don't have to activate the ctx, a future
-		 * context switch will do that for us.
+		 * If the task is running, it must be running on this CPU,
+		 * otherwise we cannot reprogram things.
+		 *
+		 * If its not running, we don't care, ctx->lock will
+		 * serialize against it becoming runnable.
 		 */
-		if (ctx->task != current)
-			activate = false;
-		else
-			WARN_ON_ONCE(cpuctx->task_ctx && cpuctx->task_ctx != ctx);
+		if (task_curr(ctx->task) && !reprogram) {
+			ret = -ESRCH;
+			goto unlock;
+		}
 
+		WARN_ON_ONCE(reprogram && cpuctx->task_ctx && cpuctx->task_ctx != ctx);
 	} else if (task_ctx) {
 		raw_spin_lock(&task_ctx->lock);
 	}
 
-	if (activate) {
+	if (reprogram) {
 		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
 		add_event_to_ctx(event, ctx);
 		ctx_resched(cpuctx, task_ctx);
@@ -2328,13 +2327,36 @@ perf_install_in_context(struct perf_event_context *ctx,
 	/*
 	 * Installing events is tricky because we cannot rely on ctx->is_active
 	 * to be set in case this is the nr_events 0 -> 1 transition.
+	 *
+	 * Instead we use task_curr(), which tells us if the task is running.
+	 * However, since we use task_curr() outside of rq::lock, we can race
+	 * against the actual state. This means the result can be wrong.
+	 *
+	 * If we get a false positive, we retry, this is harmless.
+	 *
+	 * If we get a false negative, things are complicated. If we are after
+	 * perf_event_context_sched_in() ctx::lock will serialize us, and the
+	 * value must be correct. If we're before, it doesn't matter since
+	 * perf_event_context_sched_in() will program the counter.
+	 *
+	 * However, this hinges on the remote context switch having observed
+	 * our task->perf_event_ctxp[] store, such that it will in fact take
+	 * ctx::lock in perf_event_context_sched_in().
+	 *
+	 * We do this by task_function_call(), if the IPI fails to hit the task
+	 * we know any future context switch of task must see the
+	 * perf_event_ctpx[] store.
 	 */
-again:
+
 	/*
-	 * Cannot use task_function_call() because we need to run on the task's
-	 * CPU regardless of whether its current or not.
+	 * This smp_mb() orders the task->perf_event_ctxp[] store with the
+	 * task_cpu() load, such that if the IPI then does not find the task
+	 * running, a future context switch of that task must observe the
+	 * store.
 	 */
-	if (!cpu_function_call(task_cpu(task), __perf_install_in_context, event))
+	smp_mb();
+again:
+	if (!task_function_call(task, __perf_install_in_context, event))
 		return;
 
 	raw_spin_lock_irq(&ctx->lock);
@@ -2348,12 +2370,16 @@ again:
 		raw_spin_unlock_irq(&ctx->lock);
 		return;
 	}
-	raw_spin_unlock_irq(&ctx->lock);
 	/*
-	 * Since !ctx->is_active doesn't mean anything, we must IPI
-	 * unconditionally.
+	 * If the task is not running, ctx->lock will avoid it becoming so,
+	 * thus we can safely install the event.
 	 */
-	goto again;
+	if (task_curr(task)) {
+		raw_spin_unlock_irq(&ctx->lock);
+		goto again;
+	}
+	add_event_to_ctx(event, ctx);
+	raw_spin_unlock_irq(&ctx->lock);
 }
 
 /*


* Re: Memory-ordering recipes
  2017-10-17 21:01 ` Paul E. McKenney
@ 2017-10-18  1:07   ` Andrea Parri
  2017-10-18 20:48     ` Paul E. McKenney
  0 siblings, 1 reply; 13+ messages in thread
From: Andrea Parri @ 2017-10-18  1:07 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: j.alglave, luc.maranget, stern, dhowells, peterz, will.deacon,
	boqun.feng, npiggin, linux-kernel

On Tue, Oct 17, 2017 at 02:01:37PM -0700, Paul E. McKenney wrote:
> On Sun, Sep 17, 2017 at 04:05:09PM -0700, Paul E. McKenney wrote:
> > Hello!
> > 
> > The topic of memory-ordering recipes came up at the Linux Plumbers
> > Conference microconference on Friday, so I thought that I should summarize
> > what is currently "out there":
> 
> And here is an updated list of potential Linux-kernel examples for a
> "recipes" document, and thank you for the feedback.  Please feel free
> to counterpropose better examples.  In addition, if there is some other
> pattern that is commonplace and important enough to be included in a
> recipes document, please point it out.
> 
> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> This document lists the litmus-test patterns that we have been discussing,
> along with examples from the Linux kernel.  This is intended to feed into
> the recipes document.  All examples are from v4.13.
> 
> 0.	Simple special cases
> 
> 	If there is only one CPU on the one hand or only one variable
> 	on the other, the code will execute in order.  There are (as
> 	usual) some things to be careful of:
> 
> 	a.	There are some aspects of the C language that are
> 		unordered.  For example, in the expression "f(x) + g(y)",
> 		the order in which f and g are called is not defined;
> 		the object code is allowed to use either order or even
> 		to interleave the computations.
> 
> 	b.	Compilers are permitted to use the "as-if" rule.  That is,
> 		a compiler can emit whatever code it likes, as long as
> 		the results of a single-threaded execution appear just
> 		as if the compiler had followed all the relevant rules.
> 		To see this, compile with a high level of optimization
> 		and run the debugger on the resulting binary.
> 
> 	c.	If there is only one variable but multiple CPUs, then
> 		that variable must be properly aligned and all accesses
> 		to that variable must be full sized.  Variables that
> 		straddle cachelines or pages void your full-ordering
> 		warranty, as do undersized accesses that load from or
> 		store to only part of the variable.
> 
> 1.	Another simple case: Locking.  [ Assuming you don't think too
> 	hard about it, that is! ]
> 
> 	Any CPU that has acquired a given lock sees any changes previously
> 	made by any CPU prior to having released that same lock.
> 
> 	[ Should we discuss chaining back through different locks,
> 	  sort of like release-acquire chains? ]
> 
> 2.	MP (see test6.pdf for nickname translation)
> 
> 	a.	smp_store_release() / smp_load_acquire()
> 
> 		init_stack_slab() in lib/stackdepot.c uses release-acquire
> 		to handle initialization of a slab of the stack.  Working
> 		out the mutual-exclusion design is left as an exercise for
> 		the reader.
> 
> 	b.	rcu_assign_pointer() / rcu_dereference()
> 
> 		expand_to_next_prime() does the rcu_assign_pointer(),
> 		and next_prime_number() does the rcu_dereference().
> 		This mediates access to a bit vector that is expanded
> 		as additional primes are needed.  These two functions
> 		are in lib/prime_numbers.c.
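> 
> 		In litmus-test form, this is the pointer-publication
> 		variant of MP; the sketch below uses made-up variable
> 		names, with p initially pointing to a zero-filled a:
> 
> 		C MP+rcuassign+rcuderef-sketch
> 
> 		{
> 			p=a;
> 			a=0;
> 		}
> 
> 		P0(int *x, int **p)
> 		{
> 			WRITE_ONCE(*x, 1);
> 			rcu_assign_pointer(*p, x);
> 		}
> 
> 		P1(int **p, int *x)
> 		{
> 			int *r0;
> 			int r1;
> 
> 			rcu_read_lock();
> 			r0 = rcu_dereference(*p);
> 			r1 = READ_ONCE(*r0);
> 			rcu_read_unlock();
> 		}
> 
> 		exists (1:r0=x /\ 1:r1=0)
> 
> 		A reader that dereferences the newly published pointer
> 		must see the initialization of the pointed-to data.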
> 
> 	c.	smp_wmb() / smp_rmb()
> 
> 		xlog_state_switch_iclogs() contains the following:
> 
> 			log->l_curr_block -= log->l_logBBsize;
> 			ASSERT(log->l_curr_block >= 0);
> 			smp_wmb();
> 			log->l_curr_cycle++;
> 
> 		And xlog_valid_lsn() contains the following:
> 
> 			cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
> 			smp_rmb();
> 			cur_block = ACCESS_ONCE(log->l_curr_block);
> 
> 		Alternatively, from the comment in perf_output_put_handle()
> 		in kernel/events/ring_buffer.c:
> 
> 		 *   kernel				user
> 		 *
> 		 *   if (LOAD ->data_tail) {		LOAD ->data_head
> 		 *			(A)		smp_rmb()	(C)
> 		 *	STORE $data			LOAD $data
> 		 *	smp_wmb()	(B)		smp_mb()	(D)
> 		 *	STORE ->data_head		STORE ->data_tail
> 		 *   }
> 		 *
> 		 * Where A pairs with D, and B pairs with C.
> 
> 		The B/C pairing is MP with smp_wmb() and smp_rmb().
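> 
> 		In litmus-test form, both of the above boil down to MP
> 		with smp_wmb()/smp_rmb() (hypothetical variable names):
> 
> 		C MP+wmb+rmb-sketch
> 
> 		{
> 		}
> 
> 		P0(int *block, int *cycle)
> 		{
> 			WRITE_ONCE(*block, 1);
> 			smp_wmb();
> 			WRITE_ONCE(*cycle, 1);
> 		}
> 
> 		P1(int *block, int *cycle)
> 		{
> 			int r0;
> 			int r1;
> 
> 			r0 = READ_ONCE(*cycle);
> 			smp_rmb();
> 			r1 = READ_ONCE(*block);
> 		}
> 
> 		exists (1:r0=1 /\ 1:r1=0)
> 
> 		As with the release-acquire variant, the "exists" clause
> 		should never trigger.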
> 
> 	d.	Replacing either of the above with smp_mb()
> 
> 		Holding off on this one for the moment...
> 
> 3.	LB
> 
> 	a.	LB+ctrl+mb
> 
> 		Again, from the comment in perf_output_put_handle()
> 		in kernel/events/ring_buffer.c:
> 
> 		 *   kernel				user
> 		 *
> 		 *   if (LOAD ->data_tail) {		LOAD ->data_head
> 		 *			(A)		smp_rmb()	(C)
> 		 *	STORE $data			LOAD $data
> 		 *	smp_wmb()	(B)		smp_mb()	(D)
> 		 *	STORE ->data_head		STORE ->data_tail
> 		 *   }
> 		 *
> 		 * Where A pairs with D, and B pairs with C.
> 
> 		The A/D pairing covers this one.
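> 
> 		Boiled down to a two-variable sketch (hypothetical
> 		variable names), the A/D pairing is LB with a control
> 		dependency on one side and smp_mb() on the other:
> 
> 		C LB+ctrl+mb-sketch
> 
> 		{
> 		}
> 
> 		P0(int *tail, int *head)
> 		{
> 			int r0;
> 
> 			r0 = READ_ONCE(*tail);
> 			if (r0)
> 				WRITE_ONCE(*head, 1);
> 		}
> 
> 		P1(int *head, int *tail)
> 		{
> 			int r1;
> 
> 			r1 = READ_ONCE(*head);
> 			smp_mb();
> 			WRITE_ONCE(*tail, 1);
> 		}
> 
> 		exists (0:r0=1 /\ 1:r1=1)
> 
> 		The control dependency and the smp_mb() each order the
> 		load before the subsequent store, which rules out the
> 		cycle in the "exists" clause.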
> 
> 4.	Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
> 
> 	Lots of variety here, can in some cases substitute:
> 	
> 	a.	READ_ONCE() for smp_load_acquire()
> 	b.	WRITE_ONCE() for smp_store_release()
> 	c.	Dependencies for both smp_load_acquire() and
> 		smp_store_release().
> 	d.	smp_wmb() for smp_store_release() in first thread
> 		of ISA2 and Z6.2.
> 	e.	smp_rmb() for smp_load_acquire() in last thread of ISA2.
> 
> 	The canonical illustration of LB involves the various memory
> 	allocators, where you don't want a load from about-to-be-freed
> 	memory to see a store initializing a later incarnation of that
> 	same memory area.  But the per-CPU caches make this a very
> 	long and complicated example.
> 
> 	I am not aware of any three-CPU release-acquire chains in the
> 	Linux kernel.  There are three-CPU lock-based chains in RCU,
> 	but these are not at all simple, either.
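> 
> 	For reference, a purely hypothetical three-CPU ISA2 chain built
> 	entirely from release and acquire would look like this sketch:
> 
> 	C ISA2+rel+acq+acq-sketch
> 
> 	{
> 	}
> 
> 	P0(int *x, int *y)
> 	{
> 		WRITE_ONCE(*x, 1);
> 		smp_store_release(y, 1);
> 	}
> 
> 	P1(int *y, int *z)
> 	{
> 		int r0;
> 
> 		r0 = smp_load_acquire(y);
> 		smp_store_release(z, 1);
> 	}
> 
> 	P2(int *z, int *x)
> 	{
> 		int r1;
> 		int r2;
> 
> 		r1 = smp_load_acquire(z);
> 		r2 = READ_ONCE(*x);
> 	}
> 
> 	exists (1:r0=1 /\ 2:r1=1 /\ 2:r2=0)
> 
> 	The release-acquire chain from P0 through P1 to P2 forbids the
> 	"exists" clause.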
> 
> 	Thoughts?
> 
> 5.	SB
> 
> 	a.	smp_mb(), as in lockless wait-wakeup coordination.
> 		And as in sys_membarrier()-scheduler coordination,
> 		for that matter.
> 
> 		Examples seem to be lacking.  Most cases use locking.
> 		Here is one rather strange one from RCU:
> 
> 		void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
> 		{
> 			unsigned long flags;
> 			bool needwake;
> 			bool havetask = READ_ONCE(rcu_tasks_kthread_ptr);
> 
> 			rhp->next = NULL;
> 			rhp->func = func;
> 			raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> 			needwake = !rcu_tasks_cbs_head;
> 			*rcu_tasks_cbs_tail = rhp;
> 			rcu_tasks_cbs_tail = &rhp->next;
> 			raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> 			/* We can't create the thread unless interrupts are enabled. */
> 			if ((needwake && havetask) ||
> 			    (!havetask && !irqs_disabled_flags(flags))) {
> 				rcu_spawn_tasks_kthread();
> 				wake_up(&rcu_tasks_cbs_wq);
> 			}
> 		}
> 
> 		And for the wait side, using synchronize_sched() to supply
> 		the barrier for both ends, with the preemption disabling
> 		due to raw_spin_lock_irqsave() serving as the read-side
> 		critical section:
> 
> 		if (!list) {
> 			wait_event_interruptible(rcu_tasks_cbs_wq,
> 						 rcu_tasks_cbs_head);
> 			if (!rcu_tasks_cbs_head) {
> 				WARN_ON(signal_pending(current));
> 				schedule_timeout_interruptible(HZ/10);
> 			}
> 			continue;
> 		}
> 		synchronize_sched();
> 
> 		-----------------
> 
> 		Here is another one that uses atomic_cmpxchg() as a
> 		full memory barrier:
> 
> 		if (!wait_event_timeout(*wait, !atomic_read(stopping),
> 					msecs_to_jiffies(1000))) {
> 			atomic_set(stopping, 0);
> 			smp_mb();
> 			return -ETIMEDOUT;
> 		}
> 
> 		int omap3isp_module_sync_is_stopping(wait_queue_head_t *wait,
> 						     atomic_t *stopping)
> 		{
> 			if (atomic_cmpxchg(stopping, 1, 0)) {
> 				wake_up(wait);
> 				return 1;
> 			}
> 
> 			return 0;
> 		}
> 
> 		-----------------
> 
> 		And here is the generic pattern for the above two examples
> 		taken from waitqueue_active() in include/linux/wait.h:
> 
>  *      CPU0 - waker                    CPU1 - waiter
>  *
>  *                                      for (;;) {
>  *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
>  *      smp_mb();                         // smp_mb() from set_current_state()
>  *      if (waitqueue_active(wq_head))         if (@cond)
>  *        wake_up(wq_head);                      break;
>  *                                        schedule();
>  *                                      }
>  *                                      finish_wait(&wq_head, &wait);
> 
> 		Note that prepare_to_wait() does both the write and the
> 		set_current_state() that contains the smp_mb().  The read
> 		is the waitqueue_active() on the one hand and the
> 		"if (@cond)" on the other.
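> 
> 		Boiled down to a sketch (hypothetical variable names),
> 		the waker/waiter pattern above is the classic SB test
> 		with smp_mb() on both sides:
> 
> 		C SB+mb+mb-sketch
> 
> 		{
> 		}
> 
> 		P0(int *cond, int *waiters)
> 		{
> 			int r0;
> 
> 			WRITE_ONCE(*cond, 1);
> 			smp_mb();
> 			r0 = READ_ONCE(*waiters);
> 		}
> 
> 		P1(int *cond, int *waiters)
> 		{
> 			int r1;
> 
> 			WRITE_ONCE(*waiters, 1);
> 			smp_mb();
> 			r1 = READ_ONCE(*cond);
> 		}
> 
> 		exists (0:r0=0 /\ 1:r1=0)
> 
> 		At least one of the two CPUs must see the other's store,
> 		so the waiter cannot miss both the condition and the
> 		wakeup.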
> 
> 6.	W+RWC+porel+mb+mb
> 
> 	See recipes-LKcode-63cae12bce986.txt.
> 
> 	Mostly of historical interest -- as far as I know, this commit
> 	was the first to contain a litmus test.
> 
> 7.	Context switch and migration.  A bit specialized, so might leave
> 	this one out.
> 
> 	When a thread migrates from one CPU to another, and possibly on
> 	to yet another, the scheduler is required to do whatever is
> 	necessary for the thread to see any prior accesses that it
> 	executed on other CPUs.  This includes "interesting"
> 	interactions with wake_up() shown in the following comment from
> 	try_to_wake_up() in kernel/sched/core.c:
> 
>  * Notes on Program-Order guarantees on SMP systems.
>  *
>  *  MIGRATION
>  *
>  * The basic program-order guarantee on SMP systems is that when a task [t]
>  * migrates, all its activity on its old CPU [c0] happens-before any subsequent
>  * execution on its new CPU [c1].
>  *
>  * For migration (of runnable tasks) this is provided by the following means:
>  *
>  *  A) UNLOCK of the rq(c0)->lock scheduling out task t
>  *  B) migration for t is required to synchronize *both* rq(c0)->lock and
>  *     rq(c1)->lock (if not at the same time, then in that order).
>  *  C) LOCK of the rq(c1)->lock scheduling in task
>  *
>  * Transitivity guarantees that B happens after A and C after B.
>  * Note: we only require RCpc transitivity.
>  * Note: the CPU doing B need not be c0 or c1
>  *
>  * Example:
>  *
>  *   CPU0            CPU1            CPU2
>  *
>  *   LOCK rq(0)->lock
>  *   sched-out X
>  *   sched-in Y
>  *   UNLOCK rq(0)->lock
>  *
>  *                                   LOCK rq(0)->lock // orders against CPU0
>  *                                   dequeue X
>  *                                   UNLOCK rq(0)->lock
>  *
>  *                                   LOCK rq(1)->lock
>  *                                   enqueue X
>  *                                   UNLOCK rq(1)->lock
>  *
>  *                   LOCK rq(1)->lock // orders against CPU2
>  *                   sched-out Z
>  *                   sched-in X
>  *                   UNLOCK rq(1)->lock

We might want to "specialize" this snippet to emphasize (or illustrate)
a particular _cycle_; one possible way of "closing the chain" would be:

   CPU0

   smp_store_release(&X->on_cpu, 0); /* from sched-out X */
   UNLOCK rq(0)->lock


   CPU2

   LOCK rq(0)->lock
   UNLOCK rq(1)->lock


   CPU1

   LOCK rq(1)->lock
   X->on_cpu = 1; /* from sched-in X */


   BUG_ON("each LOCK reads from its UNLOCK" && X->on_cpu == 0);

(c.f., Z6.2 in test6.pdf)


>  *
>  *
>  *  BLOCKING -- aka. SLEEP + WAKEUP
>  *
>  * For blocking we (obviously) need to provide the same guarantee as for
>  * migration. However the means are completely different as there is no lock
>  * chain to provide order. Instead we do:
>  *
>  *   1) smp_store_release(X->on_cpu, 0)
>  *   2) smp_cond_load_acquire(!X->on_cpu)
>  *
>  * Example:
>  *
>  *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
>  *
>  *   LOCK rq(0)->lock LOCK X->pi_lock
>  *   dequeue X
>  *   sched-out X
>  *   smp_store_release(X->on_cpu, 0);
>  *
>  *                    smp_cond_load_acquire(&X->on_cpu, !VAL);
>  *                    X->state = WAKING
>  *                    set_task_cpu(X,2)
>  *
>  *                    LOCK rq(2)->lock
>  *                    enqueue X
>  *                    X->state = RUNNING
>  *                    UNLOCK rq(2)->lock
>  *
>  *                                          LOCK rq(2)->lock // orders against CPU1
>  *                                          sched-out Z
>  *                                          sched-in X
>  *                                          UNLOCK rq(2)->lock
>  *
>  *                    UNLOCK X->pi_lock
>  *   UNLOCK rq(0)->lock

Similarly:

   CPU0

   smp_store_release(&X->on_cpu, 0); /* from sched-out X */


   CPU1

   smp_cond_load_acquire(&X->on_cpu, !VAL);
   UNLOCK rq(2)->lock


   CPU2

   LOCK rq(2)->lock
   X->on_cpu = 1; // from sched-in X


   BUG_ON("each LOCK reads from its UNLOCK" && X->on_cpu == 0);

(c.f., WWC in test6.pdf)

  Andrea


>  *
>  *
>  * However; for wakeups there is a second guarantee we must provide, namely we
>  * must observe the state that lead to our wakeup. That is, not only must our
>  * task observe its own prior state, it must also observe the stores prior to
>  * its wakeup.
>  *
>  * This means that any means of doing remote wakeups must order the CPU doing
>  * the wakeup against the CPU the task is going to end up running on. This,
>  * however, is already required for the regular Program-Order guarantee above,
>  * since the waking CPU is the one issueing the ACQUIRE (smp_cond_load_acquire).

> commit 63cae12bce9861cec309798d34701cf3da20bc71
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Fri Dec 9 14:59:00 2016 +0100
> 
>     perf/core: Fix sys_perf_event_open() vs. hotplug
>     
>     There is a problem with installing an event in a task that is 'stuck' on
>     an offline CPU.
>     
>     Blocked tasks are not dis-associated from offlined CPUs; after all, a
>     blocked task doesn't run and doesn't require a CPU, etc.  Only on
>     wakeup do we amend the situation and place the task on an available
>     CPU.
>     
>     If we hit such a task with perf_install_in_context() we'll loop until
>     either that task wakes up or the CPU comes back online, if the task
>     waking depends on the event being installed, we're stuck.
>     
>     While looking into this issue, I also spotted another problem, if we
>     hit a task with perf_install_in_context() that is in the middle of
>     being migrated, that is we observe the old CPU before sending the IPI,
>     but run the IPI (on the old CPU) while the task is already running on
>     the new CPU, things also go sideways.
>     
>     Rework things to rely on task_curr() -- outside of rq->lock -- which
>     is rather tricky. Imagine the following scenario where we're trying to
>     install the first event into our task 't':
>     
>     CPU0            CPU1            CPU2
>     
>                     (current == t)
>     
>     t->perf_event_ctxp[] = ctx;
>     smp_mb();
>     cpu = task_cpu(t);
>     
>                     switch(t, n);
>                                     migrate(t, 2);
>                                     switch(p, t);
>     
>                                     ctx = t->perf_event_ctxp[]; // must not be NULL
>     
>     smp_function_call(cpu, ..);
>     
>                     generic_exec_single()
>                       func();
>                         spin_lock(ctx->lock);
>                         if (task_curr(t)) // false
>     
>                         add_event_to_ctx();
>                         spin_unlock(ctx->lock);
>     
>                                     perf_event_context_sched_in();
>                                       spin_lock(ctx->lock);
>                                       // sees event
>     
>     So it's CPU0's store of t->perf_event_ctxp[] that must not go 'missing'.
>     Because if CPU2's load of that variable were to observe NULL, it would
>     not try to schedule the ctx and we'd have a task running without its
>     counter, which would be 'bad'.
>     
>     As long as we observe !NULL, we'll acquire ctx->lock. If we acquire it
>     first and not see the event yet, then CPU0 must observe task_curr()
>     and retry. If the install happens first, then we must see the event on
>     sched-in and all is well.
>     
>     I think we can translate the first part (until the 'must not be NULL')
>     of the scenario to a litmus test like:
>     
>       C C-peterz
>     
>       {
>       }
>     
>       P0(int *x, int *y)
>       {
>               int r1;
>     
>               WRITE_ONCE(*x, 1);
>               smp_mb();
>               r1 = READ_ONCE(*y);
>       }
>     
>       P1(int *y, int *z)
>       {
>               WRITE_ONCE(*y, 1);
>               smp_store_release(z, 1);
>       }
>     
>       P2(int *x, int *z)
>       {
>               int r1;
>               int r2;
>     
>               r1 = smp_load_acquire(z);
>     	  smp_mb();
>               r2 = READ_ONCE(*x);
>       }
>     
>       exists
>       (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
>     
>     Where:
>       x is perf_event_ctxp[],
>       y is our task's CPU, and
>       z is our task being placed on the rq of CPU2.
>     
>     The P0 smp_mb() is the one added by this patch, ordering the store to
>     perf_event_ctxp[] from find_get_context() and the load of task_cpu()
>     in task_function_call().
>     
>     The smp_store_release/smp_load_acquire model the RCpc locking of the
>     rq->lock and the smp_mb() of P2 is the context switch switching from
>     whatever CPU2 was running to our task 't'.
>     
>     This litmus test evaluates into:
>     
>       Test C-peterz Allowed
>       States 7
>       0:r1=0; 2:r1=0; 2:r2=0;
>       0:r1=0; 2:r1=0; 2:r2=1;
>       0:r1=0; 2:r1=1; 2:r2=1;
>       0:r1=1; 2:r1=0; 2:r2=0;
>       0:r1=1; 2:r1=0; 2:r2=1;
>       0:r1=1; 2:r1=1; 2:r2=0;
>       0:r1=1; 2:r1=1; 2:r2=1;
>       No
>       Witnesses
>       Positive: 0 Negative: 7
>       Condition exists (0:r1=0 /\ 2:r1=1 /\ 2:r2=0)
>       Observation C-peterz Never 0 7
>       Hash=e427f41d9146b2a5445101d3e2fcaa34
>     
>     And the strong and weak model agree.
>     
>     Reported-by: Mark Rutland <mark.rutland@arm.com>
>     Tested-by: Mark Rutland <mark.rutland@arm.com>
>     Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>     Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
>     Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
>     Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
>     Cc: Jiri Olsa <jolsa@redhat.com>
>     Cc: Linus Torvalds <torvalds@linux-foundation.org>
>     Cc: Peter Zijlstra <peterz@infradead.org>
>     Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
>     Cc: Stephane Eranian <eranian@google.com>
>     Cc: Thomas Gleixner <tglx@linutronix.de>
>     Cc: Vince Weaver <vincent.weaver@maine.edu>
>     Cc: Will Deacon <will.deacon@arm.com>
>     Cc: jeremy.linton@arm.com
>     Link: http://lkml.kernel.org/r/20161209135900.GU3174@twins.programming.kicks-ass.net
>     Signed-off-by: Ingo Molnar <mingo@kernel.org>
> 
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index ab15509fab8c..72ce7d63e561 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -2249,7 +2249,7 @@ static int  __perf_install_in_context(void *info)
>  	struct perf_event_context *ctx = event->ctx;
>  	struct perf_cpu_context *cpuctx = __get_cpu_context(ctx);
>  	struct perf_event_context *task_ctx = cpuctx->task_ctx;
> -	bool activate = true;
> +	bool reprogram = true;
>  	int ret = 0;
>  
>  	raw_spin_lock(&cpuctx->ctx.lock);
> @@ -2257,27 +2257,26 @@ static int  __perf_install_in_context(void *info)
>  		raw_spin_lock(&ctx->lock);
>  		task_ctx = ctx;
>  
> -		/* If we're on the wrong CPU, try again */
> -		if (task_cpu(ctx->task) != smp_processor_id()) {
> -			ret = -ESRCH;
> -			goto unlock;
> -		}
> +		reprogram = (ctx->task == current);
>  
>  		/*
> -		 * If we're on the right CPU, see if the task we target is
> -		 * current, if not we don't have to activate the ctx, a future
> -		 * context switch will do that for us.
> +		 * If the task is running, it must be running on this CPU,
> +		 * otherwise we cannot reprogram things.
> +		 *
> +		 * If its not running, we don't care, ctx->lock will
> +		 * serialize against it becoming runnable.
>  		 */
> -		if (ctx->task != current)
> -			activate = false;
> -		else
> -			WARN_ON_ONCE(cpuctx->task_ctx && cpuctx->task_ctx != ctx);
> +		if (task_curr(ctx->task) && !reprogram) {
> +			ret = -ESRCH;
> +			goto unlock;
> +		}
>  
> +		WARN_ON_ONCE(reprogram && cpuctx->task_ctx && cpuctx->task_ctx != ctx);
>  	} else if (task_ctx) {
>  		raw_spin_lock(&task_ctx->lock);
>  	}
>  
> -	if (activate) {
> +	if (reprogram) {
>  		ctx_sched_out(ctx, cpuctx, EVENT_TIME);
>  		add_event_to_ctx(event, ctx);
>  		ctx_resched(cpuctx, task_ctx);
> @@ -2328,13 +2327,36 @@ perf_install_in_context(struct perf_event_context *ctx,
>  	/*
>  	 * Installing events is tricky because we cannot rely on ctx->is_active
>  	 * to be set in case this is the nr_events 0 -> 1 transition.
> +	 *
> +	 * Instead we use task_curr(), which tells us if the task is running.
> +	 * However, since we use task_curr() outside of rq::lock, we can race
> +	 * against the actual state. This means the result can be wrong.
> +	 *
> +	 * If we get a false positive, we retry, this is harmless.
> +	 *
> +	 * If we get a false negative, things are complicated. If we are after
> +	 * perf_event_context_sched_in() ctx::lock will serialize us, and the
> +	 * value must be correct. If we're before, it doesn't matter since
> +	 * perf_event_context_sched_in() will program the counter.
> +	 *
> +	 * However, this hinges on the remote context switch having observed
> +	 * our task->perf_event_ctxp[] store, such that it will in fact take
> +	 * ctx::lock in perf_event_context_sched_in().
> +	 *
> +	 * We do this by task_function_call(), if the IPI fails to hit the task
> +	 * we know any future context switch of task must see the
> +	 * perf_event_ctpx[] store.
>  	 */
> -again:
> +
>  	/*
> -	 * Cannot use task_function_call() because we need to run on the task's
> -	 * CPU regardless of whether its current or not.
> +	 * This smp_mb() orders the task->perf_event_ctxp[] store with the
> +	 * task_cpu() load, such that if the IPI then does not find the task
> +	 * running, a future context switch of that task must observe the
> +	 * store.
>  	 */
> -	if (!cpu_function_call(task_cpu(task), __perf_install_in_context, event))
> +	smp_mb();
> +again:
> +	if (!task_function_call(task, __perf_install_in_context, event))
>  		return;
>  
>  	raw_spin_lock_irq(&ctx->lock);
> @@ -2348,12 +2370,16 @@ again:
>  		raw_spin_unlock_irq(&ctx->lock);
>  		return;
>  	}
> -	raw_spin_unlock_irq(&ctx->lock);
>  	/*
> -	 * Since !ctx->is_active doesn't mean anything, we must IPI
> -	 * unconditionally.
> +	 * If the task is not running, ctx->lock will avoid it becoming so,
> +	 * thus we can safely install the event.
>  	 */
> -	goto again;
> +	if (task_curr(task)) {
> +		raw_spin_unlock_irq(&ctx->lock);
> +		goto again;
> +	}
> +	add_event_to_ctx(event, ctx);
> +	raw_spin_unlock_irq(&ctx->lock);
>  }
>  
>  /*

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: Memory-ordering recipes
  2017-10-18  1:07   ` Andrea Parri
@ 2017-10-18 20:48     ` Paul E. McKenney
  0 siblings, 0 replies; 13+ messages in thread
From: Paul E. McKenney @ 2017-10-18 20:48 UTC (permalink / raw)
  To: Andrea Parri
  Cc: j.alglave, luc.maranget, stern, dhowells, peterz, will.deacon,
	boqun.feng, npiggin, linux-kernel

On Wed, Oct 18, 2017 at 03:07:48AM +0200, Andrea Parri wrote:
> On Tue, Oct 17, 2017 at 02:01:37PM -0700, Paul E. McKenney wrote:
> > On Sun, Sep 17, 2017 at 04:05:09PM -0700, Paul E. McKenney wrote:
> > > Hello!
> > > 
> > > The topic of memory-ordering recipes came up at the Linux Plumbers
> > > Conference microconference on Friday, so I thought that I should summarize
> > > what is currently "out there":
> > 
> > And here is an updated list of potential Linux-kernel examples for a
> > "recipes" document, and thank you for the feedback.  Please feel free
> > to counterpropose better examples.  In addition, if there is some other
> > pattern that is commonplace and important enough to be included in a
> > recipes document, please point it out.
> > 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > This document lists the litmus-test patterns that we have been discussing,
> > along with examples from the Linux kernel.  This is intended to feed into
> > the recipes document.  All examples are from v4.13.
> > 
> > 0.	Simple special cases
> > 
> > 	If there is only one CPU on the one hand or only one variable
> > 	on the other, the code will execute in order.  There are (as
> > 	usual) some things to be careful of:
> > 
> > 	a.	There are some aspects of the C language that are
> > 		unordered.  For example, in the expression "f(x) + g(y)",
> > 		the order in which f and g are called is not defined;
> > 		the object code is allowed to use either order or even
> > 		to interleave the computations.
> > 
> > 	b.	Compilers are permitted to use the "as-if" rule.  That is,
> > 		a compiler can emit whatever code it likes, as long as
> > 		the results of a single-threaded execution appear just
> > 		as if the compiler had followed all the relevant rules.
> > 		To see this, compile with a high level of optimization
> > 		and run the debugger on the resulting binary.
> > 
> > 	c.	If there is only one variable but multiple CPUs, all
> > 		that variable must be properly aligned and all accesses
> > 		to that variable must be full sized.  Variables that
> > 		straddle cachelines or pages void your full-ordering
> > 		warranty, as do undersized accesses that load from or
> > 		store to only part of the variable.
> > 
> > 1.	Another simple case: Locking.  [ Assuming you don't think too
> > 	hard about it, that is! ]
> > 
> > 	Any CPU that has acquired a given lock sees any changes previously
> > 	made by any CPU prior to having released that same lock.
> > 
> > 	[ Should we discuss chaining back through different locks,
> > 	  sort of like release-acquire chains? ]
> > 
> > 2.	MP (see test6.pdf for nickname translation)
> > 
> > 	a.	smp_store_release() / smp_load_acquire()
> > 
> > 		init_stack_slab() in lib/stackdepot.c uses release-acquire
> > 		to handle initialization of a slab of the stack.  Working
> > 		out the mutual-exclusion design is left as an exercise for
> > 		the reader.
> > 
> > 	b.	rcu_assign_pointer() / rcu_dereference()
> > 
> > 		expand_to_next_prime() does the rcu_assign_pointer(),
> > 		and next_prime_number() does the rcu_dereference().
> > 		This mediates access to a bit vector that is expanded
> > 		as additional primes are needed.  These two functions
> > 		are in lib/prime_numbers.c.
> > 
> > 	c.	smp_wmb() / smp_rmb()
> > 
> > 		xlog_state_switch_iclogs() contains the following:
> > 
> > 			log->l_curr_block -= log->l_logBBsize;
> > 			ASSERT(log->l_curr_block >= 0);
> > 			smp_wmb();
> > 			log->l_curr_cycle++;
> > 
> > 		And xlog_valid_lsn() contains the following:
> > 
> > 			cur_cycle = ACCESS_ONCE(log->l_curr_cycle);
> > 			smp_rmb();
> > 			cur_block = ACCESS_ONCE(log->l_curr_block);
> > 
> > 		Alternatively, from the comment in perf_output_put_handle()
> > 		in kernel/events/ring_buffer.c:
> > 
> > 		 *   kernel				user
> > 		 *
> > 		 *   if (LOAD ->data_tail) {		LOAD ->data_head
> > 		 *			(A)		smp_rmb()	(C)
> > 		 *	STORE $data			LOAD $data
> > 		 *	smp_wmb()	(B)		smp_mb()	(D)
> > 		 *	STORE ->data_head		STORE ->data_tail
> > 		 *   }
> > 		 *
> > 		 * Where A pairs with D, and B pairs with C.
> > 
> > 		The B/C pairing is MP with smp_wmb() and smp_rmb().
> > 
> > 	d.	Replacing either of the above with smp_mb()
> > 
> > 		Holding off on this one for the moment...
> > 
> > 3.	LB
> > 
> > 	a.	LB+ctrl+mb
> > 
> > 		Again, from the comment in perf_output_put_handle()
> > 		in kernel/events/ring_buffer.c:
> > 
> > 		 *   kernel				user
> > 		 *
> > 		 *   if (LOAD ->data_tail) {		LOAD ->data_head
> > 		 *			(A)		smp_rmb()	(C)
> > 		 *	STORE $data			LOAD $data
> > 		 *	smp_wmb()	(B)		smp_mb()	(D)
> > 		 *	STORE ->data_head		STORE ->data_tail
> > 		 *   }
> > 		 *
> > 		 * Where A pairs with D, and B pairs with C.
> > 
> > 		The A/D pairing covers this one.
> > 
> > 4.	Release-acquire chains, AKA ISA2, Z6.2, LB, and 3.LB
> > 
> > 	Lots of variety here, can in some cases substitute:
> > 	
> > 	a.	READ_ONCE() for smp_load_acquire()
> > 	b.	WRITE_ONCE() for smp_store_release()
> > 	c.	Dependencies for both smp_load_acquire() and
> > 		smp_store_release().
> > 	d.	smp_wmb() for smp_store_release() in first thread
> > 		of ISA2 and Z6.2.
> > 	e.	smp_rmb() for smp_load_acquire() in last thread of ISA2.
> > 
> > 	The canonical illustration of LB involves the various memory
> > 	allocators, where you don't want a load from about-to-be-freed
> > 	memory to see a store initializing a later incarnation of that
> > 	same memory area.  But the per-CPU caches make this a very
> > 	long and complicated example.
> > 
> > 	I am not aware of any three-CPU release-acquire chains in the
> > 	Linux kernel.  There are three-CPU lock-based chains in RCU,
> > 	but these are not at all simple, either.
> > 
> > 	Thoughts?
> > 
> > 5.	SB
> > 
> > 	a.	smp_mb(), as in lockless wait-wakeup coordination.
> > 		And as in sys_membarrier()-scheduler coordination,
> > 		for that matter.
> > 
> > 		Examples seem to be lacking.  Most cases use locking.
> > 		Here is one rather strange one from RCU:
> > 
> > 		void call_rcu_tasks(struct rcu_head *rhp, rcu_callback_t func)
> > 		{
> > 			unsigned long flags;
> > 			bool needwake;
> > 			bool havetask = READ_ONCE(rcu_tasks_kthread_ptr);
> > 
> > 			rhp->next = NULL;
> > 			rhp->func = func;
> > 			raw_spin_lock_irqsave(&rcu_tasks_cbs_lock, flags);
> > 			needwake = !rcu_tasks_cbs_head;
> > 			*rcu_tasks_cbs_tail = rhp;
> > 			rcu_tasks_cbs_tail = &rhp->next;
> > 			raw_spin_unlock_irqrestore(&rcu_tasks_cbs_lock, flags);
> > 			/* We can't create the thread unless interrupts are enabled. */
> > 			if ((needwake && havetask) ||
> > 			    (!havetask && !irqs_disabled_flags(flags))) {
> > 				rcu_spawn_tasks_kthread();
> > 				wake_up(&rcu_tasks_cbs_wq);
> > 			}
> > 		}
> > 
> > 		And for the wait side, using synchronize_sched() to supply
> > 		the barrier for both ends, with the preemption disabling
> > 		due to raw_spin_lock_irqsave() serving as the read-side
> > 		critical section:
> > 
> > 		if (!list) {
> > 			wait_event_interruptible(rcu_tasks_cbs_wq,
> > 						 rcu_tasks_cbs_head);
> > 			if (!rcu_tasks_cbs_head) {
> > 				WARN_ON(signal_pending(current));
> > 				schedule_timeout_interruptible(HZ/10);
> > 			}
> > 			continue;
> > 		}
> > 		synchronize_sched();
> > 
> > 		-----------------
> > 
> > 		Here is another one that uses atomic_cmpxchg() as a
> > 		full memory barrier:
> > 
> > 		if (!wait_event_timeout(*wait, !atomic_read(stopping),
> > 					msecs_to_jiffies(1000))) {
> > 			atomic_set(stopping, 0);
> > 			smp_mb();
> > 			return -ETIMEDOUT;
> > 		}
> > 
> > 		int omap3isp_module_sync_is_stopping(wait_queue_head_t *wait,
> > 						     atomic_t *stopping)
> > 		{
> > 			if (atomic_cmpxchg(stopping, 1, 0)) {
> > 				wake_up(wait);
> > 				return 1;
> > 			}
> > 
> > 			return 0;
> > 		}
> > 
> > 		-----------------
> > 
> > 		And here is the generic pattern for the above two examples
> > 		taken from waitqueue_active() in include/linux/wait.h:
> > 
> >  *      CPU0 - waker                    CPU1 - waiter
> >  *
> >  *                                      for (;;) {
> >  *      @cond = true;                     prepare_to_wait(&wq_head, &wait, state);
> >  *      smp_mb();                         // smp_mb() from set_current_state()
> >  *      if (waitqueue_active(wq_head))         if (@cond)
> >  *        wake_up(wq_head);                      break;
> >  *                                        schedule();
> >  *                                      }
> >  *                                      finish_wait(&wq_head, &wait);
> > 
> > 		Note that prepare_to_wait() does the both the write
> > 		and the set_current_state() that contains the smp_mb().
> > 		The read is the waitqueue_active() on the one hand and
> > 		the "if (@cond)" on the other.
> > 
> > 6.	W+RWC+porel+mb+mb
> > 
> > 	See recipes-LKcode-63cae12bce986.txt.
> > 
> > 	Mostly of historical interest -- as far as I know, this commit
> > 	was the first to contain a litmus test.
> > 
> > 7.	Context switch and migration.  A bit specialized, so might leave
> > 	this one out.
> > 
> > 	When a thread moves from one CPU to another to another, the
> > 	scheduler is required to do whatever is necessary for the thread
> > 	to see any prior accesses that it executed on other CPUs.  This
> > 	includes "interesting" interactions with wake_up() shown in the
> > 	following comment from try_to_wake_up() in kernel/sched/core.c:
> > 
> >  * Notes on Program-Order guarantees on SMP systems.
> >  *
> >  *  MIGRATION
> >  *
> >  * The basic program-order guarantee on SMP systems is that when a task [t]
> >  * migrates, all its activity on its old CPU [c0] happens-before any subsequent
> >  * execution on its new CPU [c1].
> >  *
> >  * For migration (of runnable tasks) this is provided by the following means:
> >  *
> >  *  A) UNLOCK of the rq(c0)->lock scheduling out task t
> >  *  B) migration for t is required to synchronize *both* rq(c0)->lock and
> >  *     rq(c1)->lock (if not at the same time, then in that order).
> >  *  C) LOCK of the rq(c1)->lock scheduling in task
> >  *
> >  * Transitivity guarantees that B happens after A and C after B.
> >  * Note: we only require RCpc transitivity.
> >  * Note: the CPU doing B need not be c0 or c1
> >  *
> >  * Example:
> >  *
> >  *   CPU0            CPU1            CPU2
> >  *
> >  *   LOCK rq(0)->lock
> >  *   sched-out X
> >  *   sched-in Y
> >  *   UNLOCK rq(0)->lock
> >  *
> >  *                                   LOCK rq(0)->lock // orders against CPU0
> >  *                                   dequeue X
> >  *                                   UNLOCK rq(0)->lock
> >  *
> >  *                                   LOCK rq(1)->lock
> >  *                                   enqueue X
> >  *                                   UNLOCK rq(1)->lock
> >  *
> >  *                   LOCK rq(1)->lock // orders against CPU2
> >  *                   sched-out Z
> >  *                   sched-in X
> >  *                   UNLOCK rq(1)->lock
> 
> We might want to "specialize" this snippet to emphasize (or illustrate)
> a particular _cycle_; one possible way of "closing the chain" would be:
> 
>    CPU0
> 
>    smp_store_release(&X->on_cpu, 0); /* from sched-out X */
>    UNLOCK rq(0)->lock
> 
> 
>    CPU2
> 
>    LOCK rq(0)->lock
>    UNLOCK rq(1)->lock
> 
> 
>    CPU1
> 
>    LOCK rq(1)->lock
>    X->on_cpu = 1; /* from sched-in X */
> 
> 
>    BUG_ON("each LOCK reads from its UNLOCK" && X->on_cpu == 0);
> 
> (c.f., Z6.2 in test6.pdf)

I added this as a summary, pretty much in the form you have above.

> >  *
> >  *
> >  *  BLOCKING -- aka. SLEEP + WAKEUP
> >  *
> >  * For blocking we (obviously) need to provide the same guarantee as for
> >  * migration. However the means are completely different as there is no lock
> >  * chain to provide order. Instead we do:
> >  *
> >  *   1) smp_store_release(X->on_cpu, 0)
> >  *   2) smp_cond_load_acquire(!X->on_cpu)
> >  *
> >  * Example:
> >  *
> >  *   CPU0 (schedule)  CPU1 (try_to_wake_up) CPU2 (schedule)
> >  *
> >  *   LOCK rq(0)->lock LOCK X->pi_lock
> >  *   dequeue X
> >  *   sched-out X
> >  *   smp_store_release(X->on_cpu, 0);
> >  *
> >  *                    smp_cond_load_acquire(&X->on_cpu, !VAL);
> >  *                    X->state = WAKING
> >  *                    set_task_cpu(X,2)
> >  *
> >  *                    LOCK rq(2)->lock
> >  *                    enqueue X
> >  *                    X->state = RUNNING
> >  *                    UNLOCK rq(2)->lock
> >  *
> >  *                                          LOCK rq(2)->lock // orders against CPU1
> >  *                                          sched-out Z
> >  *                                          sched-in X
> >  *                                          UNLOCK rq(2)->lock
> >  *
> >  *                    UNLOCK X->pi_lock
> >  *   UNLOCK rq(0)->lock
> 
> Similarly:
> 
>    CPU0
> 
>    smp_store_release(&X->on_cpu, 0); /* from sched-out X */
> 
> 
>    CPU1
> 
>    smp_cond_load_acquire(&X->on_cpu, !VAL);
>    UNLOCK rq(2)->lock
> 
> 
>    CPU2
> 
>    LOCK rq(2)->lock
>    X->on_cpu = 1; // from sched-in X
> 
> 
>    BUG_ON("each LOCK reads from its UNLOCK" && X->on_cpu == 0);
> 
> (c.f., WWC in test6.pdf)

And same here.

							Thanx, Paul

>   Andrea
> 
> 
^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread

Thread overview: 13+ messages
2017-09-17 23:05 Memory-ordering recipes Paul E. McKenney
2017-09-18  7:52 ` Boqun Feng
2017-09-18 14:25   ` Paul E. McKenney
2017-09-19  2:00     ` Boqun Feng
2017-09-21 12:45 ` Peter Zijlstra
2017-09-21 15:26   ` Paul E. McKenney
2017-09-21 16:15     ` Peter Zijlstra
2017-09-21 16:40       ` Paul E. McKenney
2017-09-22  9:29         ` Peter Zijlstra
2017-09-22 21:06           ` Paul E. McKenney
2017-10-17 21:01 ` Paul E. McKenney
2017-10-18  1:07   ` Andrea Parri
2017-10-18 20:48     ` Paul E. McKenney
