linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC git tree] Userspace RCU (urcu) for Linux
@ 2009-02-06  3:05 Mathieu Desnoyers
  2009-02-06  4:58 ` [RFC git tree] Userspace RCU (urcu) for Linux (repost) Mathieu Desnoyers
  2009-02-06  8:55 ` [RFC git tree] Userspace RCU (urcu) for Linux Bert Wesarg
  0 siblings, 2 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-06  3:05 UTC (permalink / raw)
  To: Paul E. McKenney, ltt-dev, linux-kernel

Hi Paul,

I figured out I needed some userspace RCU for the userspace tracing part
of LTTng (for quick read access to the control variables) to trace
userspace pthread applications. So I've done a quick-and-dirty userspace
RCU implementation.

It works so far, but I have not gone through any formal verification
phase. It seems to work on paper, and the tests are also OK (so far),
but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
want to comment on it, it would be welcome. It's a userland-only
library. It's also currently x86-only, but only a few basic definitions
must be adapted in urcu.h to port it.

Here is the link to my git tree :

git://lttng.org/userspace-rcu.git

http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-06  3:05 [RFC git tree] Userspace RCU (urcu) for Linux Mathieu Desnoyers
@ 2009-02-06  4:58 ` Mathieu Desnoyers
  2009-02-06 13:06   ` Paul E. McKenney
  2009-02-07 22:56   ` Kyle Moffett
  2009-02-06  8:55 ` [RFC git tree] Userspace RCU (urcu) for Linux Bert Wesarg
  1 sibling, 2 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-06  4:58 UTC (permalink / raw)
  To: Paul E. McKenney, ltt-dev, linux-kernel; +Cc: Robert Wisniewski

(sorry for repost, I got the ltt-dev email wrong in the previous one)

Hi Paul,

I figured out I needed some userspace RCU for the userspace tracing part
of LTTng (for quick read access to the control variables) to trace
userspace pthread applications. So I've done a quick-and-dirty userspace
RCU implementation.

It works so far, but I have not gone through any formal verification
phase. It seems to work on paper, and the tests are also OK (so far),
but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
want to comment on it, it would be welcome. It's a userland-only
library. It's also currently x86-only, but only a few basic definitions
must be adapted in urcu.h to port it.

Here is the link to my git tree :

git://lttng.org/userspace-rcu.git

http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux
  2009-02-06  3:05 [RFC git tree] Userspace RCU (urcu) for Linux Mathieu Desnoyers
  2009-02-06  4:58 ` [RFC git tree] Userspace RCU (urcu) for Linux (repost) Mathieu Desnoyers
@ 2009-02-06  8:55 ` Bert Wesarg
  2009-02-06 11:36   ` Mathieu Desnoyers
  1 sibling, 1 reply; 116+ messages in thread
From: Bert Wesarg @ 2009-02-06  8:55 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Paul E. McKenney, ltt-dev, linux-kernel

On Fri, Feb 6, 2009 at 04:05, Mathieu Desnoyers
<compudj@krystal.dyndns.org> wrote:
> Hi Paul,
>
> I figured out I needed some userspace RCU for the userspace tracing part
> of LTTng (for quick read access to the control variables) to trace
> userspace pthread applications. So I've done a quick-and-dirty userspace
> RCU implementation.
>
> It works so far, but I have not gone through any formal verification
> phase. It seems to work on paper, and the tests are also OK (so far),
> but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> want to comment on it, it would be welcome. It's a userland-only
> library. It's also currently x86-only, but only a few basic definitions
> must be adapted in urcu.h to port it.
>
> Here is the link to my git tree :
>
> git://lttng.org/userspace-rcu.git
>
> http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary
>
Really interesting, thanks.

But you should use pthread_equal() for you equality test of pthread_t.

Regards,
Bert

> Thanks,
>
> Mathieu

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux
  2009-02-06  8:55 ` [RFC git tree] Userspace RCU (urcu) for Linux Bert Wesarg
@ 2009-02-06 11:36   ` Mathieu Desnoyers
  0 siblings, 0 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-06 11:36 UTC (permalink / raw)
  To: Bert Wesarg; +Cc: Paul E. McKenney, ltt-dev, linux-kernel

* Bert Wesarg (bert.wesarg@googlemail.com) wrote:
> On Fri, Feb 6, 2009 at 04:05, Mathieu Desnoyers
> <compudj@krystal.dyndns.org> wrote:
> > Hi Paul,
> >
> > I figured out I needed some userspace RCU for the userspace tracing part
> > of LTTng (for quick read access to the control variables) to trace
> > userspace pthread applications. So I've done a quick-and-dirty userspace
> > RCU implementation.
> >
> > It works so far, but I have not gone through any formal verification
> > phase. It seems to work on paper, and the tests are also OK (so far),
> > but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> > want to comment on it, it would be welcome. It's a userland-only
> > library. It's also currently x86-only, but only a few basic definitions
> > must be adapted in urcu.h to port it.
> >
> > Here is the link to my git tree :
> >
> > git://lttng.org/userspace-rcu.git
> >
> > http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary
> >
> Really interesting, thanks.
> 
> But you should use pthread_equal() for you equality test of pthread_t.
> 

It's merged, thanks !

Mathieu

> Regards,
> Bert
> 
> > Thanks,
> >
> > Mathieu
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-06  4:58 ` [RFC git tree] Userspace RCU (urcu) for Linux (repost) Mathieu Desnoyers
@ 2009-02-06 13:06   ` Paul E. McKenney
  2009-02-06 16:34     ` Paul E. McKenney
  2009-02-07 22:56   ` Kyle Moffett
  1 sibling, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-06 13:06 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel, Robert Wisniewski

On Thu, Feb 05, 2009 at 11:58:41PM -0500, Mathieu Desnoyers wrote:
> (sorry for repost, I got the ltt-dev email wrong in the previous one)
> 
> Hi Paul,
> 
> I figured out I needed some userspace RCU for the userspace tracing part
> of LTTng (for quick read access to the control variables) to trace
> userspace pthread applications. So I've done a quick-and-dirty userspace
> RCU implementation.
> 
> It works so far, but I have not gone through any formal verification
> phase. It seems to work on paper, and the tests are also OK (so far),
> but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> want to comment on it, it would be welcome. It's a userland-only
> library. It's also currently x86-only, but only a few basic definitions
> must be adapted in urcu.h to port it.
> 
> Here is the link to my git tree :
> 
> git://lttng.org/userspace-rcu.git
> 
> http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary

Very cool!!!  I will take a look!

I will also point you at a few that I have put together:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git

(In the CodeSamples/defer directory.)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-06 13:06   ` Paul E. McKenney
@ 2009-02-06 16:34     ` Paul E. McKenney
  2009-02-07 15:10       ` Paul E. McKenney
  2009-02-08 22:44       ` Mathieu Desnoyers
  0 siblings, 2 replies; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-06 16:34 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel, Robert Wisniewski

On Fri, Feb 06, 2009 at 05:06:40AM -0800, Paul E. McKenney wrote:
> On Thu, Feb 05, 2009 at 11:58:41PM -0500, Mathieu Desnoyers wrote:
> > (sorry for repost, I got the ltt-dev email wrong in the previous one)
> > 
> > Hi Paul,
> > 
> > I figured out I needed some userspace RCU for the userspace tracing part
> > of LTTng (for quick read access to the control variables) to trace
> > userspace pthread applications. So I've done a quick-and-dirty userspace
> > RCU implementation.
> > 
> > It works so far, but I have not gone through any formal verification
> > phase. It seems to work on paper, and the tests are also OK (so far),
> > but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> > want to comment on it, it would be welcome. It's a userland-only
> > library. It's also currently x86-only, but only a few basic definitions
> > must be adapted in urcu.h to port it.
> > 
> > Here is the link to my git tree :
> > 
> > git://lttng.org/userspace-rcu.git
> > 
> > http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary
> 
> Very cool!!!  I will take a look!
> 
> I will also point you at a few that I have put together:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> 
> (In the CodeSamples/defer directory.)

Interesting approach, using the signal to force memory-barrier execution!

o	One possible optimization would be to avoid sending a signal to
	a blocked thread, as the context switch leading to blocking
	will have implied a memory barrier -- otherwise it would not
	be safe to resume the thread on some other CPU.  That said,
	not sure whether checking to see whether a thread is blocked is
	any faster than sending it a signal and forcing it to wake up.

	Of course, this approach does require that the enclosing
	application be willing to give up a signal.  I suspect that most
	applications would be OK with this, though some might not.

	Of course, I cannot resist pointing to an old LKML thread:

		http://lkml.org/lkml/2001/10/8/189

	But I think that the time is now right.  ;-)

o	I don't understand the purpose of rcu_write_lock() and
	rcu_write_unlock().  I am concerned that it will lead people
	to decide that a single global lock must protect RCU updates,
	which is of course absolutely not the case.  I strongly
	suggest making these internal to the urcu.c file.  Yes,
	uses of urcu_publish_content() would then hit two locks (the
	internal-to-urcu.c one and whatever they are using to protect
	their data structure), but let's face it, if you are sending a
	signal to each and every thread, the additional overhead of the
	extra lock is the least of your worries.

	If you really want to heavily optimize this, I would suggest
	setting up a state machine that permits multiple concurrent
	calls to urcu_publish_content() to share the same set of signal
	invocations.  That way, if the caller has partitioned the
	data structure, global locking might be avoided completely
	(or at least greatly restricted in scope).

	Of course, if updates are rare, the optimization would not
	help, but in that case, acquiring two locks would be even less
	of a problem.

o	Is urcu_qparity relying on initialization to zero?  Or on the
	fact that, for all x, 1-x!=x mod 2^32?  Ah, given that this is
	used to index urcu_active_readers[], you must be relying on
	initialization to zero.

o	In rcu_read_lock(), why is a non-atomic increment of the
	urcu_active_readers[urcu_parity] element safe?  Are you
	relying on the compiler generating an x86 add-to-memory
	instruction?

	Ditto for rcu_read_unlock().

	Ah, never mind!!!  I now see the __thread specification,
	and the keeping of references to it in the reader_data list.

o	Combining the equivalent of rcu_assign_pointer() and
	synchronize_rcu() into urcu_publish_content() is an interesting
	approach.  Not yet sure whether or not it is a good idea.  I
	guess trying it out on several applications would be the way
	to find out.  ;-)

	That said, I suspect that it would be very convenient in a
	number of situations.

o	It would be good to avoid having to pass the return value
	of rcu_read_lock() into rcu_read_unlock().  It should be
	possible to avoid this via counter value tricks, though this
	would add a bit more code in rcu_read_lock() on 32-bit machines.
	(64-bit machines don't have to worry about counter overflow.)

	See the recently updated version of CodeSamples/defer/rcu_nest.[ch]
	in the aforementioned git archive for a way to do this.
	(And perhaps I should apply this change to SRCU...)

o	Your test looks a bit strange, not sure why you test all the
	different variables.  It would be nice to take a test duration
	as an argument and run the test for that time.

	I killed the test after better part of an hour on my laptop,
	will retry on a larger machine (after noting the 18 threads
	created!).  (And yes, I first tried Power, which objected
	strenously to the "mfence" and "lock; incl" instructions,
	so getting an x86 machine to try on.)

Again, looks interesting!  Looks plausible, although I have not 100%
convinced myself that it is perfectly bug-free.  But I do maintain
a healthy skepticism of purported RCU algorithms, especially ones that
I have written.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-06 16:34     ` Paul E. McKenney
@ 2009-02-07 15:10       ` Paul E. McKenney
  2009-02-07 22:16         ` Paul E. McKenney
  2009-02-07 23:38         ` Mathieu Desnoyers
  2009-02-08 22:44       ` Mathieu Desnoyers
  1 sibling, 2 replies; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-07 15:10 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel, Robert Wisniewski

On Fri, Feb 06, 2009 at 08:34:32AM -0800, Paul E. McKenney wrote:
> On Fri, Feb 06, 2009 at 05:06:40AM -0800, Paul E. McKenney wrote:
> > On Thu, Feb 05, 2009 at 11:58:41PM -0500, Mathieu Desnoyers wrote:
> > > (sorry for repost, I got the ltt-dev email wrong in the previous one)
> > > 
> > > Hi Paul,
> > > 
> > > I figured out I needed some userspace RCU for the userspace tracing part
> > > of LTTng (for quick read access to the control variables) to trace
> > > userspace pthread applications. So I've done a quick-and-dirty userspace
> > > RCU implementation.
> > > 
> > > It works so far, but I have not gone through any formal verification
> > > phase. It seems to work on paper, and the tests are also OK (so far),
> > > but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> > > want to comment on it, it would be welcome. It's a userland-only
> > > library. It's also currently x86-only, but only a few basic definitions
> > > must be adapted in urcu.h to port it.
> > > 
> > > Here is the link to my git tree :
> > > 
> > > git://lttng.org/userspace-rcu.git
> > > 
> > > http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary
> > 
> > Very cool!!!  I will take a look!
> > 
> > I will also point you at a few that I have put together:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > 
> > (In the CodeSamples/defer directory.)
> 
> Interesting approach, using the signal to force memory-barrier execution!
> 
> o	One possible optimization would be to avoid sending a signal to
> 	a blocked thread, as the context switch leading to blocking
> 	will have implied a memory barrier -- otherwise it would not
> 	be safe to resume the thread on some other CPU.  That said,
> 	not sure whether checking to see whether a thread is blocked is
> 	any faster than sending it a signal and forcing it to wake up.
> 
> 	Of course, this approach does require that the enclosing
> 	application be willing to give up a signal.  I suspect that most
> 	applications would be OK with this, though some might not.
> 
> 	Of course, I cannot resist pointing to an old LKML thread:
> 
> 		http://lkml.org/lkml/2001/10/8/189
> 
> 	But I think that the time is now right.  ;-)
> 
> o	I don't understand the purpose of rcu_write_lock() and
> 	rcu_write_unlock().  I am concerned that it will lead people
> 	to decide that a single global lock must protect RCU updates,
> 	which is of course absolutely not the case.  I strongly
> 	suggest making these internal to the urcu.c file.  Yes,
> 	uses of urcu_publish_content() would then hit two locks (the
> 	internal-to-urcu.c one and whatever they are using to protect
> 	their data structure), but let's face it, if you are sending a
> 	signal to each and every thread, the additional overhead of the
> 	extra lock is the least of your worries.
> 
> 	If you really want to heavily optimize this, I would suggest
> 	setting up a state machine that permits multiple concurrent
> 	calls to urcu_publish_content() to share the same set of signal
> 	invocations.  That way, if the caller has partitioned the
> 	data structure, global locking might be avoided completely
> 	(or at least greatly restricted in scope).
> 
> 	Of course, if updates are rare, the optimization would not
> 	help, but in that case, acquiring two locks would be even less
> 	of a problem.
> 
> o	Is urcu_qparity relying on initialization to zero?  Or on the
> 	fact that, for all x, 1-x!=x mod 2^32?  Ah, given that this is
> 	used to index urcu_active_readers[], you must be relying on
> 	initialization to zero.
> 
> o	In rcu_read_lock(), why is a non-atomic increment of the
> 	urcu_active_readers[urcu_parity] element safe?  Are you
> 	relying on the compiler generating an x86 add-to-memory
> 	instruction?
> 
> 	Ditto for rcu_read_unlock().
> 
> 	Ah, never mind!!!  I now see the __thread specification,
> 	and the keeping of references to it in the reader_data list.
> 
> o	Combining the equivalent of rcu_assign_pointer() and
> 	synchronize_rcu() into urcu_publish_content() is an interesting
> 	approach.  Not yet sure whether or not it is a good idea.  I
> 	guess trying it out on several applications would be the way
> 	to find out.  ;-)
> 
> 	That said, I suspect that it would be very convenient in a
> 	number of situations.
> 
> o	It would be good to avoid having to pass the return value
> 	of rcu_read_lock() into rcu_read_unlock().  It should be
> 	possible to avoid this via counter value tricks, though this
> 	would add a bit more code in rcu_read_lock() on 32-bit machines.
> 	(64-bit machines don't have to worry about counter overflow.)
> 
> 	See the recently updated version of CodeSamples/defer/rcu_nest.[ch]
> 	in the aforementioned git archive for a way to do this.
> 	(And perhaps I should apply this change to SRCU...)
> 
> o	Your test looks a bit strange, not sure why you test all the
> 	different variables.  It would be nice to take a test duration
> 	as an argument and run the test for that time.
> 
> 	I killed the test after better part of an hour on my laptop,
> 	will retry on a larger machine (after noting the 18 threads
> 	created!).  (And yes, I first tried Power, which objected
> 	strenously to the "mfence" and "lock; incl" instructions,
> 	so getting an x86 machine to try on.)
> 
> Again, looks interesting!  Looks plausible, although I have not 100%
> convinced myself that it is perfectly bug-free.  But I do maintain
> a healthy skepticism of purported RCU algorithms, especially ones that
> I have written.  ;-)

OK, here is one sequence of concern...

o	Thread 0 starts rcu_read_lock(), picking up the current
	get_urcu_qparity() into the local variable urcu_parity().
	Assume that the value returned is zero.

o	Thread 0 is now preempted.

o	Thread 1 invokes urcu_publish_content():

	o	It substitutes the pointer.

	o	It forces all threads to execute a memory barrier
		(thread 0 runs just long enough to process its signal
		and then is immediately preempted again).

	o	It switches the parity, which is now one.

	o	It waits for all readers on parity zero, and there are
		none, because thread 0 has not yet registered itself.

	o	It therefore returns the old pointer.  So far, so good.

o	Thread 0 now resumes:

	o	It increments its urcu_active_readers[0].

	o	It forces a compiler barrier.

	o	It returns zero (why not store this in thread-local
		storage rather than returning?).

	o	It enters its critical section, obtaining a reference
		to the new pointer that thread 1 just published.

o	Thread 1 now again invokes urcu_publish_content():
 
	o	It substitutes the pointer.

	o	It forces all threads to execute a memory barrier,
		including thread 0.

	o	It switches the parity, which is now zero.

	o	It waits for all readers on parity one, and there are
		none, because thread 0 has registered itself on parity
		zero!!!

	o	Thread 1 therefore returns the old pointer.

	o	Thread 1 frees the old pointer, which thread 0 is still
		using!!!

So, how to fix?  Here are some approaches:

o	Make urcu_publish_content() do two parity flips rather than one.
	I use this approach in my rcu_rcpg, rcu_rcpl, and rcu_rcpls
	algorithms in CodeSamples/defer.

o	Use a single free-running counter, in a manner similar to rcu_nest,
	as suggested earlier.  This one is interesting, as I rely on a
	read-side memory barrier to handle the long-preemption case.
	However, if you believe that any thread that waits several minutes
	between executing adjacent instructions must have been preempted
	(which the memory barriers that are required to do a context
	switch), then a compiler barrier suffices.  ;-)

Of course, the probability of seeing this failure during test is quite
low, since it is unlikely that thread 0 would run just long enough to
execute its signal handler.  However, it could happen.  And if you were
to adapt this algorithm for use in a real-time application, then priority
boosting could cause this to happen naturally.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-07 15:10       ` Paul E. McKenney
@ 2009-02-07 22:16         ` Paul E. McKenney
  2009-02-08  0:19           ` Mathieu Desnoyers
  2009-02-07 23:38         ` Mathieu Desnoyers
  1 sibling, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-07 22:16 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel, Robert Wisniewski

On Sat, Feb 07, 2009 at 07:10:28AM -0800, Paul E. McKenney wrote:
> So, how to fix?  Here are some approaches:
> 
> o	Make urcu_publish_content() do two parity flips rather than one.
> 	I use this approach in my rcu_rcpg, rcu_rcpl, and rcu_rcpls
> 	algorithms in CodeSamples/defer.
> 
> o	Use a single free-running counter, in a manner similar to rcu_nest,
> 	as suggested earlier.  This one is interesting, as I rely on a
> 	read-side memory barrier to handle the long-preemption case.
> 	However, if you believe that any thread that waits several minutes
> 	between executing adjacent instructions must have been preempted
> 	(which the memory barriers that are required to do a context
> 	switch), then a compiler barrier suffices.  ;-)
> 
> Of course, the probability of seeing this failure during test is quite
> low, since it is unlikely that thread 0 would run just long enough to
> execute its signal handler.  However, it could happen.  And if you were
> to adapt this algorithm for use in a real-time application, then priority
> boosting could cause this to happen naturally.

And here is a patch, taking the first approach.  It also exposes a
synchronize_rcu() API that is used by the existing urcu_publish_content()
API.  This allows easier handling of structures that are referenced by
more than one pointer.  And should also allow to be more easily plugged
into my rcutorture test.  ;-)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 urcu.c |   39 ++++++++++++++++++++++++++-------------
 1 file changed, 26 insertions(+), 13 deletions(-)

diff --git a/urcu.c b/urcu.c
index e401d8d..1a276ce 100644
--- a/urcu.c
+++ b/urcu.c
@@ -113,13 +113,35 @@ void wait_for_quiescent_state(int parity)
 	force_mb_all_threads();
 }
 
+static void switch_qparity(void)
+{
+	int prev_parity;
+
+	/* All threads should read qparity before accessing data structure. */
+	/* Write ptr before changing the qparity */
+	force_mb_all_threads();
+	prev_parity = switch_next_urcu_qparity();
+
+	/*
+	 * Wait for previous parity to be empty of readers.
+	 */
+	wait_for_quiescent_state(prev_parity);
+}
+
+void synchronize_rcu(void)
+{
+	rcu_write_lock();
+	switch_qparity();
+	switch_qparity();
+	rcu_write_unlock();
+}
+
 /*
  * Return old pointer, OK to free, no more reference exist.
  * Called under rcu_write_lock.
  */
 void *urcu_publish_content(void **ptr, void *new)
 {
-	int prev_parity;
 	void *oldptr;
 
 	/*
@@ -134,19 +156,10 @@ void *urcu_publish_content(void **ptr, void *new)
 	 */
 	oldptr = *ptr;
 	*ptr = new;
-	/* All threads should read qparity before ptr */
-	/* Write ptr before changing the qparity */
-	force_mb_all_threads();
-	prev_parity = switch_next_urcu_qparity();
 
-	/*
-	 * Wait for previous parity to be empty of readers.
-	 */
-	wait_for_quiescent_state(prev_parity);
-	/*
-	 * Deleting old data is ok !
-	 */
-	
+	switch_qparity();
+	switch_qparity();
+
 	return oldptr;
 }
 

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-06  4:58 ` [RFC git tree] Userspace RCU (urcu) for Linux (repost) Mathieu Desnoyers
  2009-02-06 13:06   ` Paul E. McKenney
@ 2009-02-07 22:56   ` Kyle Moffett
  2009-02-07 23:50     ` Mathieu Desnoyers
  2009-02-08  0:13     ` Paul E. McKenney
  1 sibling, 2 replies; 116+ messages in thread
From: Kyle Moffett @ 2009-02-07 22:56 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, ltt-dev, linux-kernel, Robert Wisniewski

On Thu, Feb 5, 2009 at 11:58 PM, Mathieu Desnoyers
<compudj@krystal.dyndns.org> wrote:
> I figured out I needed some userspace RCU for the userspace tracing part
> of LTTng (for quick read access to the control variables) to trace
> userspace pthread applications. So I've done a quick-and-dirty userspace
> RCU implementation.
>
> It works so far, but I have not gone through any formal verification
> phase. It seems to work on paper, and the tests are also OK (so far),
> but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> want to comment on it, it would be welcome. It's a userland-only
> library. It's also currently x86-only, but only a few basic definitions
> must be adapted in urcu.h to port it.

I have actually been fiddling with an RCU-esque design for a
multithreaded event-driven userspace server process.  Essentially all
threads using RCU-protected data run through a central event loop
which drives my entirely-userspace RCU state machine.  I actually have
a cooperative scheduler for groups of events to allow me to
load-balance a large number of clients without the full overhead of a
kernel thread per client.  This does rely on
clock_gettime(CLOCK_THREAD_CPUTIME_ID) returning a useful monotonic
value, however.

By building the whole internal system as an
event-driven-state-machine, I don't need to keep a stack for blocked
events.  The events which do large amounts of work call a
"need_resched()"-ish function every so often, and if it returns true
they return up the stack.  Relatively few threads (1 per physical CPU,
plus a few for blocking event polling) are needed to completely
saturate the system.

For RCU I simply treat event-handler threads the way the kernel treats
CPUs, I report a Quiescent State every so often in-between processing
events.

The event-handling mechanism is entirely agnostic to the way that
events are generated.  It has built-in mechanisms for FD, signal, and
AIO-based events, and it's trivial to add another event-polling thread
for GTK/Qt/etc.

I'm still only halfway through laying out the framework for this
library, but once it's done I'll make sure to post it somewhere for
those who are interested.

Cheers,
Kyle Moffett

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-07 15:10       ` Paul E. McKenney
  2009-02-07 22:16         ` Paul E. McKenney
@ 2009-02-07 23:38         ` Mathieu Desnoyers
  2009-02-08  0:44           ` Paul E. McKenney
  1 sibling, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-07 23:38 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel, Robert Wisniewski

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Fri, Feb 06, 2009 at 08:34:32AM -0800, Paul E. McKenney wrote:
> > On Fri, Feb 06, 2009 at 05:06:40AM -0800, Paul E. McKenney wrote:
> > > On Thu, Feb 05, 2009 at 11:58:41PM -0500, Mathieu Desnoyers wrote:
> > > > (sorry for repost, I got the ltt-dev email wrong in the previous one)
> > > > 
> > > > Hi Paul,
> > > > 
> > > > I figured out I needed some userspace RCU for the userspace tracing part
> > > > of LTTng (for quick read access to the control variables) to trace
> > > > userspace pthread applications. So I've done a quick-and-dirty userspace
> > > > RCU implementation.
> > > > 
> > > > It works so far, but I have not gone through any formal verification
> > > > phase. It seems to work on paper, and the tests are also OK (so far),
> > > > but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> > > > want to comment on it, it would be welcome. It's a userland-only
> > > > library. It's also currently x86-only, but only a few basic definitions
> > > > must be adapted in urcu.h to port it.
> > > > 
> > > > Here is the link to my git tree :
> > > > 
> > > > git://lttng.org/userspace-rcu.git
> > > > 
> > > > http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary
> > > 
> > > Very cool!!!  I will take a look!
> > > 
> > > I will also point you at a few that I have put together:
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > 
> > > (In the CodeSamples/defer directory.)
> > 
> > Interesting approach, using the signal to force memory-barrier execution!
> > 
> > o	One possible optimization would be to avoid sending a signal to
> > 	a blocked thread, as the context switch leading to blocking
> > 	will have implied a memory barrier -- otherwise it would not
> > 	be safe to resume the thread on some other CPU.  That said,
> > 	not sure whether checking to see whether a thread is blocked is
> > 	any faster than sending it a signal and forcing it to wake up.
> > 
> > 	Of course, this approach does require that the enclosing
> > 	application be willing to give up a signal.  I suspect that most
> > 	applications would be OK with this, though some might not.
> > 
> > 	Of course, I cannot resist pointing to an old LKML thread:
> > 
> > 		http://lkml.org/lkml/2001/10/8/189
> > 
> > 	But I think that the time is now right.  ;-)
> > 
> > o	I don't understand the purpose of rcu_write_lock() and
> > 	rcu_write_unlock().  I am concerned that it will lead people
> > 	to decide that a single global lock must protect RCU updates,
> > 	which is of course absolutely not the case.  I strongly
> > 	suggest making these internal to the urcu.c file.  Yes,
> > 	uses of urcu_publish_content() would then hit two locks (the
> > 	internal-to-urcu.c one and whatever they are using to protect
> > 	their data structure), but let's face it, if you are sending a
> > 	signal to each and every thread, the additional overhead of the
> > 	extra lock is the least of your worries.
> > 
> > 	If you really want to heavily optimize this, I would suggest
> > 	setting up a state machine that permits multiple concurrent
> > 	calls to urcu_publish_content() to share the same set of signal
> > 	invocations.  That way, if the caller has partitioned the
> > 	data structure, global locking might be avoided completely
> > 	(or at least greatly restricted in scope).
> > 
> > 	Of course, if updates are rare, the optimization would not
> > 	help, but in that case, acquiring two locks would be even less
> > 	of a problem.
> > 
> > o	Is urcu_qparity relying on initialization to zero?  Or on the
> > 	fact that, for all x, 1-x!=x mod 2^32?  Ah, given that this is
> > 	used to index urcu_active_readers[], you must be relying on
> > 	initialization to zero.
> > 
> > o	In rcu_read_lock(), why is a non-atomic increment of the
> > 	urcu_active_readers[urcu_parity] element safe?  Are you
> > 	relying on the compiler generating an x86 add-to-memory
> > 	instruction?
> > 
> > 	Ditto for rcu_read_unlock().
> > 
> > 	Ah, never mind!!!  I now see the __thread specification,
> > 	and the keeping of references to it in the reader_data list.
> > 
> > o	Combining the equivalent of rcu_assign_pointer() and
> > 	synchronize_rcu() into urcu_publish_content() is an interesting
> > 	approach.  Not yet sure whether or not it is a good idea.  I
> > 	guess trying it out on several applications would be the way
> > 	to find out.  ;-)
> > 
> > 	That said, I suspect that it would be very convenient in a
> > 	number of situations.
> > 
> > o	It would be good to avoid having to pass the return value
> > 	of rcu_read_lock() into rcu_read_unlock().  It should be
> > 	possible to avoid this via counter value tricks, though this
> > 	would add a bit more code in rcu_read_lock() on 32-bit machines.
> > 	(64-bit machines don't have to worry about counter overflow.)
> > 
> > 	See the recently updated version of CodeSamples/defer/rcu_nest.[ch]
> > 	in the aforementioned git archive for a way to do this.
> > 	(And perhaps I should apply this change to SRCU...)
> > 
> > o	Your test looks a bit strange, not sure why you test all the
> > 	different variables.  It would be nice to take a test duration
> > 	as an argument and run the test for that time.
> > 
> > 	I killed the test after better part of an hour on my laptop,
> > 	will retry on a larger machine (after noting the 18 threads
> > 	created!).  (And yes, I first tried Power, which objected
> > 	strenously to the "mfence" and "lock; incl" instructions,
> > 	so getting an x86 machine to try on.)
> > 
> > Again, looks interesting!  Looks plausible, although I have not 100%
> > convinced myself that it is perfectly bug-free.  But I do maintain
> > a healthy skepticism of purported RCU algorithms, especially ones that
> > I have written.  ;-)
> 
> OK, here is one sequence of concern...
> 

Let's see..

> o	Thread 0 starts rcu_read_lock(), picking up the current
> 	get_urcu_qparity() into the local variable urcu_parity().
> 	Assume that the value returned is zero.
> 
> o	Thread 0 is now preempted.
> 
> o	Thread 1 invokes urcu_publish_content():
> 
> 	o	It substitutes the pointer.
> 
> 	o	It forces all threads to execute a memory barrier
> 		(thread 0 runs just long enough to process its signal
> 		and then is immediately preempted again).
> 
> 	o	It switches the parity, which is now one.
> 
> 	o	It waits for all readers on parity zero, and there are
> 		none, because thread 0 has not yet registered itself.
> 
> 	o	It therefore returns the old pointer.  So far, so good.
> 
> o	Thread 0 now resumes:
> 
> 	o	It increments its urcu_active_readers[0].
> 
> 	o	It forces a compiler barrier.
> 
> 	o	It returns zero (why not store this in thread-local
> 		storage rather than returning?).
> 

To support nested rcu_read_locks. (that's the only reason)

> 	o	It enters its critical section, obtaining a reference
> 		to the new pointer that thread 1 just published.
> 
> o	Thread 1 now again invokes urcu_publish_content():
>  
> 	o	It substitutes the pointer.
> 
> 	o	It forces all threads to execute a memory barrier,
> 		including thread 0.
> 
> 	o	It switches the parity, which is now zero.
> 
> 	o	It waits for all readers on parity one, and there are
> 		none, because thread 0 has registered itself on parity
> 		zero!!!
> 
> 	o	Thread 1 therefore returns the old pointer.
> 
> 	o	Thread 1 frees the old pointer, which thread 0 is still
> 		using!!!
> 

Ah, yes, you are right.

> So, how to fix?  Here are some approaches:
> 
> o	Make urcu_publish_content() do two parity flips rather than one.
> 	I use this approach in my rcu_rcpg, rcu_rcpl, and rcu_rcpls
> 	algorithms in CodeSamples/defer.
> 

This approach seems very interesting.

> o	Use a single free-running counter, in a manner similar to rcu_nest,
> 	as suggested earlier.  This one is interesting, as I rely on a
> 	read-side memory barrier to handle the long-preemption case.
> 	However, if you believe that any thread that waits several minutes
> 	between executing adjacent instructions must have been preempted
> 	(which the memory barriers that are required to do a context
> 	switch), then a compiler barrier suffices.  ;-)

Hrm, I'm trying to figure out what kind of memory backend you need to
put your counters for each quiescent state period. Is this free-running
counter indexing a very large array ? I doubt it does. Then how does it
make sure we don't roll back to the old array entries ?

This latter solution could break jump-based probing of programs
soon-to-be available in gcc. The probes are meant to be of short
duration, but the fact is that this design lets the debugger inject code
without resorting to a breakpoint, which might therefore break your
"short time between instructions" assumption. It's very unlikely, but
possible.


> 
> Of course, the probability of seeing this failure during test is quite
> low, since it is unlikely that thread 0 would run just long enough to
> execute its signal handler.  However, it could happen.  And if you were
> to adapt this algorithm for use in a real-time application, then priority
> boosting could cause this to happen naturally.
> 

Yes. It's not because we are not able to create the faulty condition
that it will _never_ happen. It must therefore be taken care of.

Mathieu

> 							Thanx, Paul
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-07 22:56   ` Kyle Moffett
@ 2009-02-07 23:50     ` Mathieu Desnoyers
  2009-02-08  0:13     ` Paul E. McKenney
  1 sibling, 0 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-07 23:50 UTC (permalink / raw)
  To: Kyle Moffett; +Cc: Paul E. McKenney, ltt-dev, linux-kernel, Robert Wisniewski

* Kyle Moffett (kyle@moffetthome.net) wrote:
> On Thu, Feb 5, 2009 at 11:58 PM, Mathieu Desnoyers
> <compudj@krystal.dyndns.org> wrote:
> > I figured out I needed some userspace RCU for the userspace tracing part
> > of LTTng (for quick read access to the control variables) to trace
> > userspace pthread applications. So I've done a quick-and-dirty userspace
> > RCU implementation.
> >
> > It works so far, but I have not gone through any formal verification
> > phase. It seems to work on paper, and the tests are also OK (so far),
> > but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> > want to comment on it, it would be welcome. It's a userland-only
> > library. It's also currently x86-only, but only a few basic definitions
> > must be adapted in urcu.h to port it.
> 
> I have actually been fiddling with an RCU-esque design for a
> multithreaded event-driven userspace server process.  Essentially all
> threads using RCU-protected data run through a central event loop
> which drives my entirely-userspace RCU state machine.  I actually have
> a cooperative scheduler for groups of events to allow me to
> load-balance a large number of clients without the full overhead of a
> kernel thread per client.  This does rely on
> clock_gettime(CLOCK_THREAD_CPUTIME_ID) returning a useful monotonic
> value, however.
> 
> By building the whole internal system as an
> event-driven-state-machine, I don't need to keep a stack for blocked
> events.  The events which do large amounts of work call a
> "need_resched()"-ish function every so often, and if it returns true
> they return up the stack.  Relatively few threads (1 per physical CPU,
> plus a few for blocking event polling) are needed to completely
> saturate the system.
> 
> For RCU I simply treat event-handler threads the way the kernel treats
> CPUs, I report a Quiescent State every so often in-between processing
> events.
> 
> The event-handling mechanism is entirely agnostic to the way that
> events are generated.  It has built-in mechanisms for FD, signal, and
> AIO-based events, and it's trivial to add another event-polling thread
> for GTK/Qt/etc.
> 
> I'm still only halfway through laying out the framework for this
> library, but once it's done I'll make sure to post it somewhere for
> those who are interested.
> 

That would be interesting to look at. It would indeed be very efficient
at the reader site, because no barriers would be required. However, it
might not be appropriate for use-cases like userspace tracing, where we
ideally want to add tracing functionnality to applications as a library,
without requiring to modify the application behavior (e.g. adding a
"quiescent state" call into the application loop). I also think Paul
already has such application quiescent state notification implementation
in the links he gave us, we might want to compare those two.

Mathieu

> Cheers,
> Kyle Moffett
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-07 22:56   ` Kyle Moffett
  2009-02-07 23:50     ` Mathieu Desnoyers
@ 2009-02-08  0:13     ` Paul E. McKenney
  1 sibling, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-08  0:13 UTC (permalink / raw)
  To: Kyle Moffett; +Cc: Mathieu Desnoyers, ltt-dev, linux-kernel, Robert Wisniewski

On Sat, Feb 07, 2009 at 05:56:31PM -0500, Kyle Moffett wrote:
> On Thu, Feb 5, 2009 at 11:58 PM, Mathieu Desnoyers
> <compudj@krystal.dyndns.org> wrote:
> > I figured out I needed some userspace RCU for the userspace tracing part
> > of LTTng (for quick read access to the control variables) to trace
> > userspace pthread applications. So I've done a quick-and-dirty userspace
> > RCU implementation.
> >
> > It works so far, but I have not gone through any formal verification
> > phase. It seems to work on paper, and the tests are also OK (so far),
> > but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> > want to comment on it, it would be welcome. It's a userland-only
> > library. It's also currently x86-only, but only a few basic definitions
> > must be adapted in urcu.h to port it.
> 
> I have actually been fiddling with an RCU-esque design for a
> multithreaded event-driven userspace server process.  Essentially all
> threads using RCU-protected data run through a central event loop
> which drives my entirely-userspace RCU state machine.  I actually have
> a cooperative scheduler for groups of events to allow me to
> load-balance a large number of clients without the full overhead of a
> kernel thread per client.  This does rely on
> clock_gettime(CLOCK_THREAD_CPUTIME_ID) returning a useful monotonic
> value, however.
> 
> By building the whole internal system as an
> event-driven-state-machine, I don't need to keep a stack for blocked
> events.  The events which do large amounts of work call a
> "need_resched()"-ish function every so often, and if it returns true
> they return up the stack.  Relatively few threads (1 per physical CPU,
> plus a few for blocking event polling) are needed to completely
> saturate the system.
> 
> For RCU I simply treat event-handler threads the way the kernel treats
> CPUs, I report a Quiescent State every so often in-between processing
> events.
> 
> The event-handling mechanism is entirely agnostic to the way that
> events are generated.  It has built-in mechanisms for FD, signal, and
> AIO-based events, and it's trivial to add another event-polling thread
> for GTK/Qt/etc.
> 
> I'm still only halfway through laying out the framework for this
> library, but once it's done I'll make sure to post it somewhere for
> those who are interested.

I look forward to seeing it!  Perhaps user-level RCU is an idea whose
time has come?  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-07 22:16         ` Paul E. McKenney
@ 2009-02-08  0:19           ` Mathieu Desnoyers
  0 siblings, 0 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-08  0:19 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel, Robert Wisniewski

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sat, Feb 07, 2009 at 07:10:28AM -0800, Paul E. McKenney wrote:
> > So, how to fix?  Here are some approaches:
> > 
> > o	Make urcu_publish_content() do two parity flips rather than one.
> > 	I use this approach in my rcu_rcpg, rcu_rcpl, and rcu_rcpls
> > 	algorithms in CodeSamples/defer.
> > 
> > o	Use a single free-running counter, in a manner similar to rcu_nest,
> > 	as suggested earlier.  This one is interesting, as I rely on a
> > 	read-side memory barrier to handle the long-preemption case.
> > 	However, if you believe that any thread that waits several minutes
> > 	between executing adjacent instructions must have been preempted
> > 	(which the memory barriers that are required to do a context
> > 	switch), then a compiler barrier suffices.  ;-)
> > 
> > Of course, the probability of seeing this failure during test is quite
> > low, since it is unlikely that thread 0 would run just long enough to
> > execute its signal handler.  However, it could happen.  And if you were
> > to adapt this algorithm for use in a real-time application, then priority
> > boosting could cause this to happen naturally.
> 
> And here is a patch, taking the first approach.  It also exposes a
> synchronize_rcu() API that is used by the existing urcu_publish_content()
> API.  This allows easier handling of structures that are referenced by
> more than one pointer.  And should also allow to be more easily plugged
> into my rcutorture test.  ;-)
> 

Merged, thanks !

Mathieu

> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
> 
>  urcu.c |   39 ++++++++++++++++++++++++++-------------
>  1 file changed, 26 insertions(+), 13 deletions(-)
> 
> diff --git a/urcu.c b/urcu.c
> index e401d8d..1a276ce 100644
> --- a/urcu.c
> +++ b/urcu.c
> @@ -113,13 +113,35 @@ void wait_for_quiescent_state(int parity)
>  	force_mb_all_threads();
>  }
>  
> +static void switch_qparity(void)
> +{
> +	int prev_parity;
> +
> +	/* All threads should read qparity before accessing data structure. */
> +	/* Write ptr before changing the qparity */
> +	force_mb_all_threads();
> +	prev_parity = switch_next_urcu_qparity();
> +
> +	/*
> +	 * Wait for previous parity to be empty of readers.
> +	 */
> +	wait_for_quiescent_state(prev_parity);
> +}
> +
> +void synchronize_rcu(void)
> +{
> +	rcu_write_lock();
> +	switch_qparity();
> +	switch_qparity();
> +	rcu_write_unlock();
> +}
> +
>  /*
>   * Return old pointer, OK to free, no more reference exist.
>   * Called under rcu_write_lock.
>   */
>  void *urcu_publish_content(void **ptr, void *new)
>  {
> -	int prev_parity;
>  	void *oldptr;
>  
>  	/*
> @@ -134,19 +156,10 @@ void *urcu_publish_content(void **ptr, void *new)
>  	 */
>  	oldptr = *ptr;
>  	*ptr = new;
> -	/* All threads should read qparity before ptr */
> -	/* Write ptr before changing the qparity */
> -	force_mb_all_threads();
> -	prev_parity = switch_next_urcu_qparity();
>  
> -	/*
> -	 * Wait for previous parity to be empty of readers.
> -	 */
> -	wait_for_quiescent_state(prev_parity);
> -	/*
> -	 * Deleting old data is ok !
> -	 */
> -	
> +	switch_qparity();
> +	switch_qparity();
> +
>  	return oldptr;
>  }
>  
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-07 23:38         ` Mathieu Desnoyers
@ 2009-02-08  0:44           ` Paul E. McKenney
  2009-02-08 21:46             ` Mathieu Desnoyers
  0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-08  0:44 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel, Robert Wisniewski

On Sat, Feb 07, 2009 at 06:38:27PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Fri, Feb 06, 2009 at 08:34:32AM -0800, Paul E. McKenney wrote:
> > > On Fri, Feb 06, 2009 at 05:06:40AM -0800, Paul E. McKenney wrote:
> > > > On Thu, Feb 05, 2009 at 11:58:41PM -0500, Mathieu Desnoyers wrote:
> > > > > (sorry for repost, I got the ltt-dev email wrong in the previous one)
> > > > > 
> > > > > Hi Paul,
> > > > > 
> > > > > I figured out I needed some userspace RCU for the userspace tracing part
> > > > > of LTTng (for quick read access to the control variables) to trace
> > > > > userspace pthread applications. So I've done a quick-and-dirty userspace
> > > > > RCU implementation.
> > > > > 
> > > > > It works so far, but I have not gone through any formal verification
> > > > > phase. It seems to work on paper, and the tests are also OK (so far),
> > > > > but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> > > > > want to comment on it, it would be welcome. It's a userland-only
> > > > > library. It's also currently x86-only, but only a few basic definitions
> > > > > must be adapted in urcu.h to port it.
> > > > > 
> > > > > Here is the link to my git tree :
> > > > > 
> > > > > git://lttng.org/userspace-rcu.git
> > > > > 
> > > > > http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary
> > > > 
> > > > Very cool!!!  I will take a look!
> > > > 
> > > > I will also point you at a few that I have put together:
> > > > 
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > > 
> > > > (In the CodeSamples/defer directory.)
> > > 
> > > Interesting approach, using the signal to force memory-barrier execution!
> > > 
> > > o	One possible optimization would be to avoid sending a signal to
> > > 	a blocked thread, as the context switch leading to blocking
> > > 	will have implied a memory barrier -- otherwise it would not
> > > 	be safe to resume the thread on some other CPU.  That said,
> > > 	not sure whether checking to see whether a thread is blocked is
> > > 	any faster than sending it a signal and forcing it to wake up.
> > > 
> > > 	Of course, this approach does require that the enclosing
> > > 	application be willing to give up a signal.  I suspect that most
> > > 	applications would be OK with this, though some might not.
> > > 
> > > 	Of course, I cannot resist pointing to an old LKML thread:
> > > 
> > > 		http://lkml.org/lkml/2001/10/8/189
> > > 
> > > 	But I think that the time is now right.  ;-)
> > > 
> > > o	I don't understand the purpose of rcu_write_lock() and
> > > 	rcu_write_unlock().  I am concerned that it will lead people
> > > 	to decide that a single global lock must protect RCU updates,
> > > 	which is of course absolutely not the case.  I strongly
> > > 	suggest making these internal to the urcu.c file.  Yes,
> > > 	uses of urcu_publish_content() would then hit two locks (the
> > > 	internal-to-urcu.c one and whatever they are using to protect
> > > 	their data structure), but let's face it, if you are sending a
> > > 	signal to each and every thread, the additional overhead of the
> > > 	extra lock is the least of your worries.
> > > 
> > > 	If you really want to heavily optimize this, I would suggest
> > > 	setting up a state machine that permits multiple concurrent
> > > 	calls to urcu_publish_content() to share the same set of signal
> > > 	invocations.  That way, if the caller has partitioned the
> > > 	data structure, global locking might be avoided completely
> > > 	(or at least greatly restricted in scope).
> > > 
> > > 	Of course, if updates are rare, the optimization would not
> > > 	help, but in that case, acquiring two locks would be even less
> > > 	of a problem.
> > > 
> > > o	Is urcu_qparity relying on initialization to zero?  Or on the
> > > 	fact that, for all x, 1-x!=x mod 2^32?  Ah, given that this is
> > > 	used to index urcu_active_readers[], you must be relying on
> > > 	initialization to zero.
> > > 
> > > o	In rcu_read_lock(), why is a non-atomic increment of the
> > > 	urcu_active_readers[urcu_parity] element safe?  Are you
> > > 	relying on the compiler generating an x86 add-to-memory
> > > 	instruction?
> > > 
> > > 	Ditto for rcu_read_unlock().
> > > 
> > > 	Ah, never mind!!!  I now see the __thread specification,
> > > 	and the keeping of references to it in the reader_data list.
> > > 
> > > o	Combining the equivalent of rcu_assign_pointer() and
> > > 	synchronize_rcu() into urcu_publish_content() is an interesting
> > > 	approach.  Not yet sure whether or not it is a good idea.  I
> > > 	guess trying it out on several applications would be the way
> > > 	to find out.  ;-)
> > > 
> > > 	That said, I suspect that it would be very convenient in a
> > > 	number of situations.
> > > 
> > > o	It would be good to avoid having to pass the return value
> > > 	of rcu_read_lock() into rcu_read_unlock().  It should be
> > > 	possible to avoid this via counter value tricks, though this
> > > 	would add a bit more code in rcu_read_lock() on 32-bit machines.
> > > 	(64-bit machines don't have to worry about counter overflow.)
> > > 
> > > 	See the recently updated version of CodeSamples/defer/rcu_nest.[ch]
> > > 	in the aforementioned git archive for a way to do this.
> > > 	(And perhaps I should apply this change to SRCU...)
> > > 
> > > o	Your test looks a bit strange, not sure why you test all the
> > > 	different variables.  It would be nice to take a test duration
> > > 	as an argument and run the test for that time.
> > > 
> > > 	I killed the test after better part of an hour on my laptop,
> > > 	will retry on a larger machine (after noting the 18 threads
> > > 	created!).  (And yes, I first tried Power, which objected
> > > 	strenously to the "mfence" and "lock; incl" instructions,
> > > 	so getting an x86 machine to try on.)
> > > 
> > > Again, looks interesting!  Looks plausible, although I have not 100%
> > > convinced myself that it is perfectly bug-free.  But I do maintain
> > > a healthy skepticism of purported RCU algorithms, especially ones that
> > > I have written.  ;-)
> > 
> > OK, here is one sequence of concern...
> > 
> 
> Let's see..
> 
> > o	Thread 0 starts rcu_read_lock(), picking up the current
> > 	get_urcu_qparity() into the local variable urcu_parity().
> > 	Assume that the value returned is zero.
> > 
> > o	Thread 0 is now preempted.
> > 
> > o	Thread 1 invokes urcu_publish_content():
> > 
> > 	o	It substitutes the pointer.
> > 
> > 	o	It forces all threads to execute a memory barrier
> > 		(thread 0 runs just long enough to process its signal
> > 		and then is immediately preempted again).
> > 
> > 	o	It switches the parity, which is now one.
> > 
> > 	o	It waits for all readers on parity zero, and there are
> > 		none, because thread 0 has not yet registered itself.
> > 
> > 	o	It therefore returns the old pointer.  So far, so good.
> > 
> > o	Thread 0 now resumes:
> > 
> > 	o	It increments its urcu_active_readers[0].
> > 
> > 	o	It forces a compiler barrier.
> > 
> > 	o	It returns zero (why not store this in thread-local
> > 		storage rather than returning?).
> > 
> 
> To support nested rcu_read_locks. (that's the only reason)

A patch below to allow nested rcu_read_lock() while keeping to the Linux
kernel API, just FYI.  One can argue that the overhead of accessing the
extra per-thread variables is offset by the fact that there no longer
needs to be a return value from rcu_read_lock() nor an argument to
rcu_read_unlock(), but hard to say.

> > 	o	It enters its critical section, obtaining a reference
> > 		to the new pointer that thread 1 just published.
> > 
> > o	Thread 1 now again invokes urcu_publish_content():
> >  
> > 	o	It substitutes the pointer.
> > 
> > 	o	It forces all threads to execute a memory barrier,
> > 		including thread 0.
> > 
> > 	o	It switches the parity, which is now zero.
> > 
> > 	o	It waits for all readers on parity one, and there are
> > 		none, because thread 0 has registered itself on parity
> > 		zero!!!
> > 
> > 	o	Thread 1 therefore returns the old pointer.
> > 
> > 	o	Thread 1 frees the old pointer, which thread 0 is still
> > 		using!!!
> > 
> 
> Ah, yes, you are right.
> 
> > So, how to fix?  Here are some approaches:
> > 
> > o	Make urcu_publish_content() do two parity flips rather than one.
> > 	I use this approach in my rcu_rcpg, rcu_rcpl, and rcu_rcpls
> > 	algorithms in CodeSamples/defer.
> 
> This approach seems very interesting.

Patch in earlier email.  ;-)

> > o	Use a single free-running counter, in a manner similar to rcu_nest,
> > 	as suggested earlier.  This one is interesting, as I rely on a
> > 	read-side memory barrier to handle the long-preemption case.
> > 	However, if you believe that any thread that waits several minutes
> > 	between executing adjacent instructions must have been preempted
> > 	(which the memory barriers that are required to do a context
> > 	switch), then a compiler barrier suffices.  ;-)
> 
> Hrm, I'm trying to figure out what kind of memory backend you need to
> put your counters for each quiescent state period. Is this free-running
> counter indexing a very large array ? I doubt it does. Then how does it
> make sure we don't roll back to the old array entries ?

There is no array, just a global counter that is incremented by a modest
power of two for each grace period.  Then the outermost rcu_read_lock()
records the one greater than current value of the global counter in its
per-thread variable.

Now, rcu_read_lock() can tell that it is outermost by examining the
low-order bits of its per-thread variable -- if these bits are zero,
then this is the outermost rcu_read_lock().  So if rcu_read_lock() sees
that it is nested, it simply increments its per-thread counter.

Then rcu_read_unlock() simply decrements its per-thread variable.

If the counter is only 32 bits, it is subject to overflow.  In that case,
it is necessary to check for the counter having been incremented a huge
number of times between the time the outermost rcu_read_lock() fetched
the counter value and the time that it stored into its per-thread
variable.

An admittedly crude implementation of this approach may be found in
CodeSamples/defer/rcu_nest.[hc] in:

	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git

Of course, if the counter is 64 bits, overflow can safely be ignored.
If you have a grace period every microsecond and allow RCU read-side
critical sections to be nested 255 deep, it would take more than 2,000
years to overflow.  ;-)

> This latter solution could break jump-based probing of programs
> soon-to-be available in gcc. The probes are meant to be of short
> duration, but the fact is that this design lets the debugger inject code
> without resorting to a breakpoint, which might therefore break your
> "short time between instructions" assumption. It's very unlikely, but
> possible.

But would the debugger's code injection take more than a minute without
doing a context switch?  Ah -- you are thinking of a probe that spins
for several minutes.  Yes, this would be strange, but not impossible.

OK, so for this usage, solution 1 it is!

> > Of course, the probability of seeing this failure during test is quite
> > low, since it is unlikely that thread 0 would run just long enough to
> > execute its signal handler.  However, it could happen.  And if you were
> > to adapt this algorithm for use in a real-time application, then priority
> > boosting could cause this to happen naturally.
> 
> Yes. It's not because we are not able to create the faulty condition
> that it will _never_ happen. It must therefore be taken care of.

No argument here!!!  ;-)  See the earlier patch for one way to fix.

The following patch makes rcu_read_lock() back into a void function
while still permitting nesting, for whatever it is worth.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 test_urcu.c |    6 +++---
 urcu.c      |    2 ++
 urcu.h      |   40 ++++++++++++++++++++++++----------------
 3 files changed, 29 insertions(+), 19 deletions(-)

diff --git a/test_urcu.c b/test_urcu.c
index db0b68c..16b212b 100644
--- a/test_urcu.c
+++ b/test_urcu.c
@@ -33,7 +33,7 @@ static struct test_array *test_rcu_pointer;
 
 void *thr_reader(void *arg)
 {
-	int qparity, i, j;
+	int i, j;
 	struct test_array *local_ptr;
 
 	printf("thread %s, thread id : %lu, pid %lu\n",
@@ -44,14 +44,14 @@ void *thr_reader(void *arg)
 
 	for (i = 0; i < 100000; i++) {
 		for (j = 0; j < 100000000; j++) {
-			qparity = rcu_read_lock();
+			rcu_read_lock();
 			local_ptr = rcu_dereference(test_rcu_pointer);
 			if (local_ptr) {
 				assert(local_ptr->a == 8);
 				assert(local_ptr->b == 12);
 				assert(local_ptr->c[55] == 2);
 			}
-			rcu_read_unlock(qparity);
+			rcu_read_unlock();
 		}
 	}
 
diff --git a/urcu.c b/urcu.c
index 1a276ce..95eea4e 100644
--- a/urcu.c
+++ b/urcu.c
@@ -23,6 +23,8 @@ pthread_mutex_t urcu_mutex = PTHREAD_MUTEX_INITIALIZER;
 int urcu_qparity;
 
 int __thread urcu_active_readers[2];
+int __thread urcu_reader_nesting;
+int __thread urcu_reader_parity;
 
 /* Thread IDs of registered readers */
 #define INIT_NUM_THREADS 4
diff --git a/urcu.h b/urcu.h
index 9431da5..6d28ea2 100644
--- a/urcu.h
+++ b/urcu.h
@@ -70,6 +70,8 @@ static inline void atomic_inc(int *v)
 extern int urcu_qparity;
 
 extern int __thread urcu_active_readers[2];
+extern int __thread urcu_reader_nesting;
+extern int __thread urcu_reader_parity;
 
 static inline int get_urcu_qparity(void)
 {
@@ -79,26 +81,32 @@ static inline int get_urcu_qparity(void)
 /*
  * returns urcu_parity.
  */
-static inline int rcu_read_lock(void)
+static inline void rcu_read_lock(void)
 {
-	int urcu_parity = get_urcu_qparity();
-	urcu_active_readers[urcu_parity]++;
-	/*
-	 * Increment active readers count before accessing the pointer.
-	 * See force_mb_all_threads().
-	 */
-	barrier();
-	return urcu_parity;
+	int urcu_parity;
+
+	if (urcu_reader_nesting++ == 0) {
+		urcu_parity = get_urcu_qparity();
+		urcu_active_readers[urcu_parity]++;
+		urcu_reader_parity = urcu_parity;
+		/*
+		 * Increment active readers count before accessing the pointer.
+		 * See force_mb_all_threads().
+		 */
+		barrier();
+	}
 }
 
-static inline void rcu_read_unlock(int urcu_parity)
+static inline void rcu_read_unlock(void)
 {
-	barrier();
-	/*
-	 * Finish using rcu before decrementing the pointer.
-	 * See force_mb_all_threads().
-	 */
-	urcu_active_readers[urcu_parity]--;
+	if (--urcu_reader_nesting == 0) {
+		barrier();
+		/*
+		 * Finish using rcu before decrementing the pointer.
+		 * See force_mb_all_threads().
+		 */
+		urcu_active_readers[urcu_reader_parity]--;
+	}
 }
 
 extern void rcu_write_lock(void);

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-08  0:44           ` Paul E. McKenney
@ 2009-02-08 21:46             ` Mathieu Desnoyers
  2009-02-08 22:36               ` Paul E. McKenney
  0 siblings, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-08 21:46 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel, Robert Wisniewski

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sat, Feb 07, 2009 at 06:38:27PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Fri, Feb 06, 2009 at 08:34:32AM -0800, Paul E. McKenney wrote:
> > > > On Fri, Feb 06, 2009 at 05:06:40AM -0800, Paul E. McKenney wrote:
> > > > > On Thu, Feb 05, 2009 at 11:58:41PM -0500, Mathieu Desnoyers wrote:
> > > > > > (sorry for repost, I got the ltt-dev email wrong in the previous one)
> > > > > > 
> > > > > > Hi Paul,
> > > > > > 
> > > > > > I figured out I needed some userspace RCU for the userspace tracing part
> > > > > > of LTTng (for quick read access to the control variables) to trace
> > > > > > userspace pthread applications. So I've done a quick-and-dirty userspace
> > > > > > RCU implementation.
> > > > > > 
> > > > > > It works so far, but I have not gone through any formal verification
> > > > > > phase. It seems to work on paper, and the tests are also OK (so far),
> > > > > > but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> > > > > > want to comment on it, it would be welcome. It's a userland-only
> > > > > > library. It's also currently x86-only, but only a few basic definitions
> > > > > > must be adapted in urcu.h to port it.
> > > > > > 
> > > > > > Here is the link to my git tree :
> > > > > > 
> > > > > > git://lttng.org/userspace-rcu.git
> > > > > > 
> > > > > > http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary
> > > > > 
> > > > > Very cool!!!  I will take a look!
> > > > > 
> > > > > I will also point you at a few that I have put together:
> > > > > 
> > > > > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > > > 
> > > > > (In the CodeSamples/defer directory.)
> > > > 
> > > > Interesting approach, using the signal to force memory-barrier execution!
> > > > 
> > > > o	One possible optimization would be to avoid sending a signal to
> > > > 	a blocked thread, as the context switch leading to blocking
> > > > 	will have implied a memory barrier -- otherwise it would not
> > > > 	be safe to resume the thread on some other CPU.  That said,
> > > > 	not sure whether checking to see whether a thread is blocked is
> > > > 	any faster than sending it a signal and forcing it to wake up.
> > > > 
> > > > 	Of course, this approach does require that the enclosing
> > > > 	application be willing to give up a signal.  I suspect that most
> > > > 	applications would be OK with this, though some might not.
> > > > 
> > > > 	Of course, I cannot resist pointing to an old LKML thread:
> > > > 
> > > > 		http://lkml.org/lkml/2001/10/8/189
> > > > 
> > > > 	But I think that the time is now right.  ;-)
> > > > 
> > > > o	I don't understand the purpose of rcu_write_lock() and
> > > > 	rcu_write_unlock().  I am concerned that it will lead people
> > > > 	to decide that a single global lock must protect RCU updates,
> > > > 	which is of course absolutely not the case.  I strongly
> > > > 	suggest making these internal to the urcu.c file.  Yes,
> > > > 	uses of urcu_publish_content() would then hit two locks (the
> > > > 	internal-to-urcu.c one and whatever they are using to protect
> > > > 	their data structure), but let's face it, if you are sending a
> > > > 	signal to each and every thread, the additional overhead of the
> > > > 	extra lock is the least of your worries.
> > > > 
> > > > 	If you really want to heavily optimize this, I would suggest
> > > > 	setting up a state machine that permits multiple concurrent
> > > > 	calls to urcu_publish_content() to share the same set of signal
> > > > 	invocations.  That way, if the caller has partitioned the
> > > > 	data structure, global locking might be avoided completely
> > > > 	(or at least greatly restricted in scope).
> > > > 
> > > > 	Of course, if updates are rare, the optimization would not
> > > > 	help, but in that case, acquiring two locks would be even less
> > > > 	of a problem.
> > > > 
> > > > o	Is urcu_qparity relying on initialization to zero?  Or on the
> > > > 	fact that, for all x, 1-x!=x mod 2^32?  Ah, given that this is
> > > > 	used to index urcu_active_readers[], you must be relying on
> > > > 	initialization to zero.
> > > > 
> > > > o	In rcu_read_lock(), why is a non-atomic increment of the
> > > > 	urcu_active_readers[urcu_parity] element safe?  Are you
> > > > 	relying on the compiler generating an x86 add-to-memory
> > > > 	instruction?
> > > > 
> > > > 	Ditto for rcu_read_unlock().
> > > > 
> > > > 	Ah, never mind!!!  I now see the __thread specification,
> > > > 	and the keeping of references to it in the reader_data list.
> > > > 
> > > > o	Combining the equivalent of rcu_assign_pointer() and
> > > > 	synchronize_rcu() into urcu_publish_content() is an interesting
> > > > 	approach.  Not yet sure whether or not it is a good idea.  I
> > > > 	guess trying it out on several applications would be the way
> > > > 	to find out.  ;-)
> > > > 
> > > > 	That said, I suspect that it would be very convenient in a
> > > > 	number of situations.
> > > > 
> > > > o	It would be good to avoid having to pass the return value
> > > > 	of rcu_read_lock() into rcu_read_unlock().  It should be
> > > > 	possible to avoid this via counter value tricks, though this
> > > > 	would add a bit more code in rcu_read_lock() on 32-bit machines.
> > > > 	(64-bit machines don't have to worry about counter overflow.)
> > > > 
> > > > 	See the recently updated version of CodeSamples/defer/rcu_nest.[ch]
> > > > 	in the aforementioned git archive for a way to do this.
> > > > 	(And perhaps I should apply this change to SRCU...)
> > > > 
> > > > o	Your test looks a bit strange, not sure why you test all the
> > > > 	different variables.  It would be nice to take a test duration
> > > > 	as an argument and run the test for that time.
> > > > 
> > > > 	I killed the test after better part of an hour on my laptop,
> > > > 	will retry on a larger machine (after noting the 18 threads
> > > > 	created!).  (And yes, I first tried Power, which objected
> > > > 	strenously to the "mfence" and "lock; incl" instructions,
> > > > 	so getting an x86 machine to try on.)
> > > > 
> > > > Again, looks interesting!  Looks plausible, although I have not 100%
> > > > convinced myself that it is perfectly bug-free.  But I do maintain
> > > > a healthy skepticism of purported RCU algorithms, especially ones that
> > > > I have written.  ;-)
> > > 
> > > OK, here is one sequence of concern...
> > > 
> > 
> > Let's see..
> > 
> > > o	Thread 0 starts rcu_read_lock(), picking up the current
> > > 	get_urcu_qparity() into the local variable urcu_parity().
> > > 	Assume that the value returned is zero.
> > > 
> > > o	Thread 0 is now preempted.
> > > 
> > > o	Thread 1 invokes urcu_publish_content():
> > > 
> > > 	o	It substitutes the pointer.
> > > 
> > > 	o	It forces all threads to execute a memory barrier
> > > 		(thread 0 runs just long enough to process its signal
> > > 		and then is immediately preempted again).
> > > 
> > > 	o	It switches the parity, which is now one.
> > > 
> > > 	o	It waits for all readers on parity zero, and there are
> > > 		none, because thread 0 has not yet registered itself.
> > > 
> > > 	o	It therefore returns the old pointer.  So far, so good.
> > > 
> > > o	Thread 0 now resumes:
> > > 
> > > 	o	It increments its urcu_active_readers[0].
> > > 
> > > 	o	It forces a compiler barrier.
> > > 
> > > 	o	It returns zero (why not store this in thread-local
> > > 		storage rather than returning?).
> > > 
> > 
> > To support nested rcu_read_locks. (that's the only reason)
> 
> A patch below to allow nested rcu_read_lock() while keeping to the Linux
> kernel API, just FYI.  One can argue that the overhead of accessing the
> extra per-thread variables is offset by the fact that there no longer
> needs to be a return value from rcu_read_lock() nor an argument to
> rcu_read_unlock(), but hard to say.
> 

I ran your modified version within my benchmarks :

with return value : 14.164 cycles per read
without return value : 16.4017 cycles per read

So we have a 14% performance decrease due to this. We also pollute the
branch prediction buffer and we add a cache access due to the added
variables in the TLS. Returning the value has the clear advantage of
letting the compiler keep it around in registers or on the stack, which
clearly costs less.

So I think the speed factor outweights the visual considerations. Maybe
we could switch to something like :

unsigned int qparity;

urcu_read_lock(&qparity);
...
urcu_read_unlock(&qparity);

That would be a bit like local_irq_save() in the kernel, except that we
could do it in a static inline because we pass the address. I
personnally dislike the local_irq_save() way of hiding the fact that it
writes to the variable in a "clever" macro. I'd really prefer to leave
the " & ".

What is your opinion ?

> > > 	o	It enters its critical section, obtaining a reference
> > > 		to the new pointer that thread 1 just published.
> > > 
> > > o	Thread 1 now again invokes urcu_publish_content():
> > >  
> > > 	o	It substitutes the pointer.
> > > 
> > > 	o	It forces all threads to execute a memory barrier,
> > > 		including thread 0.
> > > 
> > > 	o	It switches the parity, which is now zero.
> > > 
> > > 	o	It waits for all readers on parity one, and there are
> > > 		none, because thread 0 has registered itself on parity
> > > 		zero!!!
> > > 
> > > 	o	Thread 1 therefore returns the old pointer.
> > > 
> > > 	o	Thread 1 frees the old pointer, which thread 0 is still
> > > 		using!!!
> > > 
> > 
> > Ah, yes, you are right.
> > 
> > > So, how to fix?  Here are some approaches:
> > > 
> > > o	Make urcu_publish_content() do two parity flips rather than one.
> > > 	I use this approach in my rcu_rcpg, rcu_rcpl, and rcu_rcpls
> > > 	algorithms in CodeSamples/defer.
> > 
> > This approach seems very interesting.
> 
> Patch in earlier email.  ;-)
> 
> > > o	Use a single free-running counter, in a manner similar to rcu_nest,
> > > 	as suggested earlier.  This one is interesting, as I rely on a
> > > 	read-side memory barrier to handle the long-preemption case.
> > > 	However, if you believe that any thread that waits several minutes
> > > 	between executing adjacent instructions must have been preempted
> > > 	(which the memory barriers that are required to do a context
> > > 	switch), then a compiler barrier suffices.  ;-)
> > 
> > Hrm, I'm trying to figure out what kind of memory backend you need to
> > put your counters for each quiescent state period. Is this free-running
> > counter indexing a very large array ? I doubt it does. Then how does it
> > make sure we don't roll back to the old array entries ?
> 
> There is no array, just a global counter that is incremented by a modest
> power of two for each grace period.  Then the outermost rcu_read_lock()
> records the one greater than current value of the global counter in its
> per-thread variable.
> 
> Now, rcu_read_lock() can tell that it is outermost by examining the
> low-order bits of its per-thread variable -- if these bits are zero,
> then this is the outermost rcu_read_lock().  So if rcu_read_lock() sees
> that it is nested, it simply increments its per-thread counter.
> 
> Then rcu_read_unlock() simply decrements its per-thread variable.
> 
> If the counter is only 32 bits, it is subject to overflow.  In that case,
> it is necessary to check for the counter having been incremented a huge
> number of times between the time the outermost rcu_read_lock() fetched
> the counter value and the time that it stored into its per-thread
> variable.
> 
> An admittedly crude implementation of this approach may be found in
> CodeSamples/defer/rcu_nest.[hc] in:
> 
> 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> 
> Of course, if the counter is 64 bits, overflow can safely be ignored.
> If you have a grace period every microsecond and allow RCU read-side
> critical sections to be nested 255 deep, it would take more than 2,000
> years to overflow.  ;-)
> 

Looking at the code, my first thought is : if we find out that the
array-based solution and the counter-based solution have the same
performance, I would definitely prefer the array-based version because
there are far less overflow considerations. It's therefore more solid
algorithmically and can be proven formally.

Also, I'm not sure I fully understand where your overflow test is going.
So let's pretend we are a reader, nested inside other rcu read locks,
and we arrive much later after the outermost reader has read the
rcu_gp_ctr. After 255 increments actually :

static void rcu_read_lock(void)
{
        long tmp;
        long *rrgp;

        /*
         * If this is the outermost RCU read-side critical section,
         * copy the global grace-period counter.  In either case,
         * increment the nesting count held in the low-order bits.
         */

        rrgp = &__get_thread_var(rcu_reader_gp);
retry:
        tmp = *rrgp;
# we read the local rrgp
        if ((tmp & RCU_GP_CTR_NEST_MASK) == 0)
                tmp = rcu_gp_ctr;
# not executed, innermost and nested.
        tmp++;
        *rrgp = tmp;
# increment the local count and write it to the local rrgp
        smp_mb();
        if (((tmp & RCU_GP_CTR_NEST_MASK) == 1) &&
            ((rcu_gp_ctr - tmp) > (RCU_GP_CTR_NEST_MASK << 8)) != 0) {
                (*rrgp)--;
                goto retry;
# If we are more than 255 increments away from rcu_gp_ctr, decrement
# rrgp and loop
        }
}

The problem is : rcu_gp_ctr is advancing. So if we have tmp stucked at a
given value, and we are nested over the outermost read lock (therefore
we are making it impossible to go end its execution), then when the
rcu_gp_crt will advance (which is the only way things can eventually go
forward, because the local rrgp is set back to its original value), we
are just going to be _farther_ away from it (not closer). So we'll have
to wait for a complete type overflow (will take a while on 32-bits, and
a very long while on 64-bits) to have the test returning false and then
going forward.

Or there might be something I misunderstood ?

> > This latter solution could break jump-based probing of programs
> > soon-to-be available in gcc. The probes are meant to be of short
> > duration, but the fact is that this design lets the debugger inject code
> > without resorting to a breakpoint, which might therefore break your
> > "short time between instructions" assumption. It's very unlikely, but
> > possible.
> 
> But would the debugger's code injection take more than a minute without
> doing a context switch?  Ah -- you are thinking of a probe that spins
> for several minutes.  Yes, this would be strange, but not impossible.
> 
> OK, so for this usage, solution 1 it is!
> 

Yes, it's unlikely, but possible.. and I like to design things assuming
the worse case scenario, even if it's almost impossible.

Mathieu

> > > Of course, the probability of seeing this failure during test is quite
> > > low, since it is unlikely that thread 0 would run just long enough to
> > > execute its signal handler.  However, it could happen.  And if you were
> > > to adapt this algorithm for use in a real-time application, then priority
> > > boosting could cause this to happen naturally.
> > 
> > Yes. It's not because we are not able to create the faulty condition
> > that it will _never_ happen. It must therefore be taken care of.
> 
> No argument here!!!  ;-)  See the earlier patch for one way to fix.
> 
> The following patch makes rcu_read_lock() back into a void function
> while still permitting nesting, for whatever it is worth.
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
> 
>  test_urcu.c |    6 +++---
>  urcu.c      |    2 ++
>  urcu.h      |   40 ++++++++++++++++++++++++----------------
>  3 files changed, 29 insertions(+), 19 deletions(-)
> 
> diff --git a/test_urcu.c b/test_urcu.c
> index db0b68c..16b212b 100644
> --- a/test_urcu.c
> +++ b/test_urcu.c
> @@ -33,7 +33,7 @@ static struct test_array *test_rcu_pointer;
>  
>  void *thr_reader(void *arg)
>  {
> -	int qparity, i, j;
> +	int i, j;
>  	struct test_array *local_ptr;
>  
>  	printf("thread %s, thread id : %lu, pid %lu\n",
> @@ -44,14 +44,14 @@ void *thr_reader(void *arg)
>  
>  	for (i = 0; i < 100000; i++) {
>  		for (j = 0; j < 100000000; j++) {
> -			qparity = rcu_read_lock();
> +			rcu_read_lock();
>  			local_ptr = rcu_dereference(test_rcu_pointer);
>  			if (local_ptr) {
>  				assert(local_ptr->a == 8);
>  				assert(local_ptr->b == 12);
>  				assert(local_ptr->c[55] == 2);
>  			}
> -			rcu_read_unlock(qparity);
> +			rcu_read_unlock();
>  		}
>  	}
>  
> diff --git a/urcu.c b/urcu.c
> index 1a276ce..95eea4e 100644
> --- a/urcu.c
> +++ b/urcu.c
> @@ -23,6 +23,8 @@ pthread_mutex_t urcu_mutex = PTHREAD_MUTEX_INITIALIZER;
>  int urcu_qparity;
>  
>  int __thread urcu_active_readers[2];
> +int __thread urcu_reader_nesting;
> +int __thread urcu_reader_parity;
>  
>  /* Thread IDs of registered readers */
>  #define INIT_NUM_THREADS 4
> diff --git a/urcu.h b/urcu.h
> index 9431da5..6d28ea2 100644
> --- a/urcu.h
> +++ b/urcu.h
> @@ -70,6 +70,8 @@ static inline void atomic_inc(int *v)
>  extern int urcu_qparity;
>  
>  extern int __thread urcu_active_readers[2];
> +extern int __thread urcu_reader_nesting;
> +extern int __thread urcu_reader_parity;
>  
>  static inline int get_urcu_qparity(void)
>  {
> @@ -79,26 +81,32 @@ static inline int get_urcu_qparity(void)
>  /*
>   * returns urcu_parity.
>   */
> -static inline int rcu_read_lock(void)
> +static inline void rcu_read_lock(void)
>  {
> -	int urcu_parity = get_urcu_qparity();
> -	urcu_active_readers[urcu_parity]++;
> -	/*
> -	 * Increment active readers count before accessing the pointer.
> -	 * See force_mb_all_threads().
> -	 */
> -	barrier();
> -	return urcu_parity;
> +	int urcu_parity;
> +
> +	if (urcu_reader_nesting++ == 0) {
> +		urcu_parity = get_urcu_qparity();
> +		urcu_active_readers[urcu_parity]++;
> +		urcu_reader_parity = urcu_parity;
> +		/*
> +		 * Increment active readers count before accessing the pointer.
> +		 * See force_mb_all_threads().
> +		 */
> +		barrier();
> +	}
>  }
>  
> -static inline void rcu_read_unlock(int urcu_parity)
> +static inline void rcu_read_unlock(void)
>  {
> -	barrier();
> -	/*
> -	 * Finish using rcu before decrementing the pointer.
> -	 * See force_mb_all_threads().
> -	 */
> -	urcu_active_readers[urcu_parity]--;
> +	if (--urcu_reader_nesting == 0) {
> +		barrier();
> +		/*
> +		 * Finish using rcu before decrementing the pointer.
> +		 * See force_mb_all_threads().
> +		 */
> +		urcu_active_readers[urcu_reader_parity]--;
> +	}
>  }
>  
>  extern void rcu_write_lock(void);
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-08 21:46             ` Mathieu Desnoyers
@ 2009-02-08 22:36               ` Paul E. McKenney
  2009-02-09  0:24                 ` Paul E. McKenney
  2009-02-09  0:40                 ` [ltt-dev] " Mathieu Desnoyers
  0 siblings, 2 replies; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-08 22:36 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel, Robert Wisniewski

On Sun, Feb 08, 2009 at 04:46:10PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Sat, Feb 07, 2009 at 06:38:27PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Fri, Feb 06, 2009 at 08:34:32AM -0800, Paul E. McKenney wrote:
> > > > > On Fri, Feb 06, 2009 at 05:06:40AM -0800, Paul E. McKenney wrote:
> > > > > > On Thu, Feb 05, 2009 at 11:58:41PM -0500, Mathieu Desnoyers wrote:
> > > > > > > (sorry for repost, I got the ltt-dev email wrong in the previous one)
> > > > > > > 
> > > > > > > Hi Paul,
> > > > > > > 
> > > > > > > I figured out I needed some userspace RCU for the userspace tracing part
> > > > > > > of LTTng (for quick read access to the control variables) to trace
> > > > > > > userspace pthread applications. So I've done a quick-and-dirty userspace
> > > > > > > RCU implementation.
> > > > > > > 
> > > > > > > It works so far, but I have not gone through any formal verification
> > > > > > > phase. It seems to work on paper, and the tests are also OK (so far),
> > > > > > > but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> > > > > > > want to comment on it, it would be welcome. It's a userland-only
> > > > > > > library. It's also currently x86-only, but only a few basic definitions
> > > > > > > must be adapted in urcu.h to port it.
> > > > > > > 
> > > > > > > Here is the link to my git tree :
> > > > > > > 
> > > > > > > git://lttng.org/userspace-rcu.git
> > > > > > > 
> > > > > > > http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary
> > > > > > 
> > > > > > Very cool!!!  I will take a look!
> > > > > > 
> > > > > > I will also point you at a few that I have put together:
> > > > > > 
> > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > > > > 
> > > > > > (In the CodeSamples/defer directory.)
> > > > > 
> > > > > Interesting approach, using the signal to force memory-barrier execution!
> > > > > 
> > > > > o	One possible optimization would be to avoid sending a signal to
> > > > > 	a blocked thread, as the context switch leading to blocking
> > > > > 	will have implied a memory barrier -- otherwise it would not
> > > > > 	be safe to resume the thread on some other CPU.  That said,
> > > > > 	not sure whether checking to see whether a thread is blocked is
> > > > > 	any faster than sending it a signal and forcing it to wake up.
> > > > > 
> > > > > 	Of course, this approach does require that the enclosing
> > > > > 	application be willing to give up a signal.  I suspect that most
> > > > > 	applications would be OK with this, though some might not.
> > > > > 
> > > > > 	Of course, I cannot resist pointing to an old LKML thread:
> > > > > 
> > > > > 		http://lkml.org/lkml/2001/10/8/189
> > > > > 
> > > > > 	But I think that the time is now right.  ;-)
> > > > > 
> > > > > o	I don't understand the purpose of rcu_write_lock() and
> > > > > 	rcu_write_unlock().  I am concerned that it will lead people
> > > > > 	to decide that a single global lock must protect RCU updates,
> > > > > 	which is of course absolutely not the case.  I strongly
> > > > > 	suggest making these internal to the urcu.c file.  Yes,
> > > > > 	uses of urcu_publish_content() would then hit two locks (the
> > > > > 	internal-to-urcu.c one and whatever they are using to protect
> > > > > 	their data structure), but let's face it, if you are sending a
> > > > > 	signal to each and every thread, the additional overhead of the
> > > > > 	extra lock is the least of your worries.
> > > > > 
> > > > > 	If you really want to heavily optimize this, I would suggest
> > > > > 	setting up a state machine that permits multiple concurrent
> > > > > 	calls to urcu_publish_content() to share the same set of signal
> > > > > 	invocations.  That way, if the caller has partitioned the
> > > > > 	data structure, global locking might be avoided completely
> > > > > 	(or at least greatly restricted in scope).
> > > > > 
> > > > > 	Of course, if updates are rare, the optimization would not
> > > > > 	help, but in that case, acquiring two locks would be even less
> > > > > 	of a problem.
> > > > > 
> > > > > o	Is urcu_qparity relying on initialization to zero?  Or on the
> > > > > 	fact that, for all x, 1-x!=x mod 2^32?  Ah, given that this is
> > > > > 	used to index urcu_active_readers[], you must be relying on
> > > > > 	initialization to zero.
> > > > > 
> > > > > o	In rcu_read_lock(), why is a non-atomic increment of the
> > > > > 	urcu_active_readers[urcu_parity] element safe?  Are you
> > > > > 	relying on the compiler generating an x86 add-to-memory
> > > > > 	instruction?
> > > > > 
> > > > > 	Ditto for rcu_read_unlock().
> > > > > 
> > > > > 	Ah, never mind!!!  I now see the __thread specification,
> > > > > 	and the keeping of references to it in the reader_data list.
> > > > > 
> > > > > o	Combining the equivalent of rcu_assign_pointer() and
> > > > > 	synchronize_rcu() into urcu_publish_content() is an interesting
> > > > > 	approach.  Not yet sure whether or not it is a good idea.  I
> > > > > 	guess trying it out on several applications would be the way
> > > > > 	to find out.  ;-)
> > > > > 
> > > > > 	That said, I suspect that it would be very convenient in a
> > > > > 	number of situations.
> > > > > 
> > > > > o	It would be good to avoid having to pass the return value
> > > > > 	of rcu_read_lock() into rcu_read_unlock().  It should be
> > > > > 	possible to avoid this via counter value tricks, though this
> > > > > 	would add a bit more code in rcu_read_lock() on 32-bit machines.
> > > > > 	(64-bit machines don't have to worry about counter overflow.)
> > > > > 
> > > > > 	See the recently updated version of CodeSamples/defer/rcu_nest.[ch]
> > > > > 	in the aforementioned git archive for a way to do this.
> > > > > 	(And perhaps I should apply this change to SRCU...)
> > > > > 
> > > > > o	Your test looks a bit strange, not sure why you test all the
> > > > > 	different variables.  It would be nice to take a test duration
> > > > > 	as an argument and run the test for that time.
> > > > > 
> > > > > 	I killed the test after better part of an hour on my laptop,
> > > > > 	will retry on a larger machine (after noting the 18 threads
> > > > > 	created!).  (And yes, I first tried Power, which objected
> > > > > 	strenously to the "mfence" and "lock; incl" instructions,
> > > > > 	so getting an x86 machine to try on.)
> > > > > 
> > > > > Again, looks interesting!  Looks plausible, although I have not 100%
> > > > > convinced myself that it is perfectly bug-free.  But I do maintain
> > > > > a healthy skepticism of purported RCU algorithms, especially ones that
> > > > > I have written.  ;-)
> > > > 
> > > > OK, here is one sequence of concern...
> > > > 
> > > 
> > > Let's see..
> > > 
> > > > o	Thread 0 starts rcu_read_lock(), picking up the current
> > > > 	get_urcu_qparity() into the local variable urcu_parity().
> > > > 	Assume that the value returned is zero.
> > > > 
> > > > o	Thread 0 is now preempted.
> > > > 
> > > > o	Thread 1 invokes urcu_publish_content():
> > > > 
> > > > 	o	It substitutes the pointer.
> > > > 
> > > > 	o	It forces all threads to execute a memory barrier
> > > > 		(thread 0 runs just long enough to process its signal
> > > > 		and then is immediately preempted again).
> > > > 
> > > > 	o	It switches the parity, which is now one.
> > > > 
> > > > 	o	It waits for all readers on parity zero, and there are
> > > > 		none, because thread 0 has not yet registered itself.
> > > > 
> > > > 	o	It therefore returns the old pointer.  So far, so good.
> > > > 
> > > > o	Thread 0 now resumes:
> > > > 
> > > > 	o	It increments its urcu_active_readers[0].
> > > > 
> > > > 	o	It forces a compiler barrier.
> > > > 
> > > > 	o	It returns zero (why not store this in thread-local
> > > > 		storage rather than returning?).
> > > > 
> > > 
> > > To support nested rcu_read_locks. (that's the only reason)
> > 
> > A patch below to allow nested rcu_read_lock() while keeping to the Linux
> > kernel API, just FYI.  One can argue that the overhead of accessing the
> > extra per-thread variables is offset by the fact that there no longer
> > needs to be a return value from rcu_read_lock() nor an argument to
> > rcu_read_unlock(), but hard to say.
> > 
> 
> I ran your modified version within my benchmarks :
> 
> with return value : 14.164 cycles per read
> without return value : 16.4017 cycles per read
> 
> So we have a 14% performance decrease due to this. We also pollute the
> branch prediction buffer and we add a cache access due to the added
> variables in the TLS. Returning the value has the clear advantage of
> letting the compiler keep it around in registers or on the stack, which
> clearly costs less.
> 
> So I think the speed factor outweights the visual considerations. Maybe
> we could switch to something like :
> 
> unsigned int qparity;
> 
> urcu_read_lock(&qparity);
> ...
> urcu_read_unlock(&qparity);
> 
> That would be a bit like local_irq_save() in the kernel, except that we
> could do it in a static inline because we pass the address. I
> personnally dislike the local_irq_save() way of hiding the fact that it
> writes to the variable in a "clever" macro. I'd really prefer to leave
> the " & ".
> 
> What is your opinion ?

My current opinion is that I can avoid the overflow problem and the
need to recheck, which might get rid of the need for both arguments
and return values while still maintaining good performance.  The trick
is to use only the topmost bit for the grace-period counter, and all
the rest of the bits for nesting.  That way, no matter what value of
global counter one picks up, it will be waited for (since there are but
two values that the global counter takes on).

But just now coding it, so will see if it actually works.

> > > > 	o	It enters its critical section, obtaining a reference
> > > > 		to the new pointer that thread 1 just published.
> > > > 
> > > > o	Thread 1 now again invokes urcu_publish_content():
> > > >  
> > > > 	o	It substitutes the pointer.
> > > > 
> > > > 	o	It forces all threads to execute a memory barrier,
> > > > 		including thread 0.
> > > > 
> > > > 	o	It switches the parity, which is now zero.
> > > > 
> > > > 	o	It waits for all readers on parity one, and there are
> > > > 		none, because thread 0 has registered itself on parity
> > > > 		zero!!!
> > > > 
> > > > 	o	Thread 1 therefore returns the old pointer.
> > > > 
> > > > 	o	Thread 1 frees the old pointer, which thread 0 is still
> > > > 		using!!!
> > > > 
> > > 
> > > Ah, yes, you are right.
> > > 
> > > > So, how to fix?  Here are some approaches:
> > > > 
> > > > o	Make urcu_publish_content() do two parity flips rather than one.
> > > > 	I use this approach in my rcu_rcpg, rcu_rcpl, and rcu_rcpls
> > > > 	algorithms in CodeSamples/defer.
> > > 
> > > This approach seems very interesting.
> > 
> > Patch in earlier email.  ;-)
> > 
> > > > o	Use a single free-running counter, in a manner similar to rcu_nest,
> > > > 	as suggested earlier.  This one is interesting, as I rely on a
> > > > 	read-side memory barrier to handle the long-preemption case.
> > > > 	However, if you believe that any thread that waits several minutes
> > > > 	between executing adjacent instructions must have been preempted
> > > > 	(which the memory barriers that are required to do a context
> > > > 	switch), then a compiler barrier suffices.  ;-)
> > > 
> > > Hrm, I'm trying to figure out what kind of memory backend you need to
> > > put your counters for each quiescent state period. Is this free-running
> > > counter indexing a very large array ? I doubt it does. Then how does it
> > > make sure we don't roll back to the old array entries ?
> > 
> > There is no array, just a global counter that is incremented by a modest
> > power of two for each grace period.  Then the outermost rcu_read_lock()
> > records the one greater than current value of the global counter in its
> > per-thread variable.
> > 
> > Now, rcu_read_lock() can tell that it is outermost by examining the
> > low-order bits of its per-thread variable -- if these bits are zero,
> > then this is the outermost rcu_read_lock().  So if rcu_read_lock() sees
> > that it is nested, it simply increments its per-thread counter.
> > 
> > Then rcu_read_unlock() simply decrements its per-thread variable.
> > 
> > If the counter is only 32 bits, it is subject to overflow.  In that case,
> > it is necessary to check for the counter having been incremented a huge
> > number of times between the time the outermost rcu_read_lock() fetched
> > the counter value and the time that it stored into its per-thread
> > variable.
> > 
> > An admittedly crude implementation of this approach may be found in
> > CodeSamples/defer/rcu_nest.[hc] in:
> > 
> > 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > 
> > Of course, if the counter is 64 bits, overflow can safely be ignored.
> > If you have a grace period every microsecond and allow RCU read-side
> > critical sections to be nested 255 deep, it would take more than 2,000
> > years to overflow.  ;-)
> > 
> 
> Looking at the code, my first thought is : if we find out that the
> array-based solution and the counter-based solution have the same
> performance, I would definitely prefer the array-based version because
> there are far less overflow considerations. It's therefore more solid
> algorithmically and can be proven formally.
> 
> Also, I'm not sure I fully understand where your overflow test is going.
> So let's pretend we are a reader, nested inside other rcu read locks,
> and we arrive much later after the outermost reader has read the
> rcu_gp_ctr. After 255 increments actually :
> 
> static void rcu_read_lock(void)
> {
>         long tmp;
>         long *rrgp;
> 
>         /*
>          * If this is the outermost RCU read-side critical section,
>          * copy the global grace-period counter.  In either case,
>          * increment the nesting count held in the low-order bits.
>          */
> 
>         rrgp = &__get_thread_var(rcu_reader_gp);
> retry:
>         tmp = *rrgp;
> # we read the local rrgp
>         if ((tmp & RCU_GP_CTR_NEST_MASK) == 0)
>                 tmp = rcu_gp_ctr;
> # not executed, innermost and nested.
>         tmp++;
>         *rrgp = tmp;
> # increment the local count and write it to the local rrgp
>         smp_mb();
>         if (((tmp & RCU_GP_CTR_NEST_MASK) == 1) &&
>             ((rcu_gp_ctr - tmp) > (RCU_GP_CTR_NEST_MASK << 8)) != 0) {
>                 (*rrgp)--;
>                 goto retry;
> # If we are more than 255 increments away from rcu_gp_ctr, decrement
> # rrgp and loop
>         }
> }
> 
> The problem is : rcu_gp_ctr is advancing. So if we have tmp stucked at a
> given value, and we are nested over the outermost read lock (therefore
> we are making it impossible to go end its execution), then when the
> rcu_gp_crt will advance (which is the only way things can eventually go
> forward, because the local rrgp is set back to its original value), we
> are just going to be _farther_ away from it (not closer). So we'll have
> to wait for a complete type overflow (will take a while on 32-bits, and
> a very long while on 64-bits) to have the test returning false and then
> going forward.
> 
> Or there might be something I misunderstood ?

The first clause of the "if" statement should prevent this -- if we are
not the outermost rcu_read_lock(), then we never retry.  (If I understand
your scenario.)

> > > This latter solution could break jump-based probing of programs
> > > soon-to-be available in gcc. The probes are meant to be of short
> > > duration, but the fact is that this design lets the debugger inject code
> > > without resorting to a breakpoint, which might therefore break your
> > > "short time between instructions" assumption. It's very unlikely, but
> > > possible.
> > 
> > But would the debugger's code injection take more than a minute without
> > doing a context switch?  Ah -- you are thinking of a probe that spins
> > for several minutes.  Yes, this would be strange, but not impossible.
> > 
> > OK, so for this usage, solution 1 it is!
> 
> Yes, it's unlikely, but possible.. and I like to design things assuming
> the worse case scenario, even if it's almost impossible.

That is indeed the only way to get even semi-reliable software!

							Thanx, Paul

> Mathieu
> 
> > > > Of course, the probability of seeing this failure during test is quite
> > > > low, since it is unlikely that thread 0 would run just long enough to
> > > > execute its signal handler.  However, it could happen.  And if you were
> > > > to adapt this algorithm for use in a real-time application, then priority
> > > > boosting could cause this to happen naturally.
> > > 
> > > Yes. It's not because we are not able to create the faulty condition
> > > that it will _never_ happen. It must therefore be taken care of.
> > 
> > No argument here!!!  ;-)  See the earlier patch for one way to fix.
> > 
> > The following patch makes rcu_read_lock() back into a void function
> > while still permitting nesting, for whatever it is worth.
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > ---
> > 
> >  test_urcu.c |    6 +++---
> >  urcu.c      |    2 ++
> >  urcu.h      |   40 ++++++++++++++++++++++++----------------
> >  3 files changed, 29 insertions(+), 19 deletions(-)
> > 
> > diff --git a/test_urcu.c b/test_urcu.c
> > index db0b68c..16b212b 100644
> > --- a/test_urcu.c
> > +++ b/test_urcu.c
> > @@ -33,7 +33,7 @@ static struct test_array *test_rcu_pointer;
> >  
> >  void *thr_reader(void *arg)
> >  {
> > -	int qparity, i, j;
> > +	int i, j;
> >  	struct test_array *local_ptr;
> >  
> >  	printf("thread %s, thread id : %lu, pid %lu\n",
> > @@ -44,14 +44,14 @@ void *thr_reader(void *arg)
> >  
> >  	for (i = 0; i < 100000; i++) {
> >  		for (j = 0; j < 100000000; j++) {
> > -			qparity = rcu_read_lock();
> > +			rcu_read_lock();
> >  			local_ptr = rcu_dereference(test_rcu_pointer);
> >  			if (local_ptr) {
> >  				assert(local_ptr->a == 8);
> >  				assert(local_ptr->b == 12);
> >  				assert(local_ptr->c[55] == 2);
> >  			}
> > -			rcu_read_unlock(qparity);
> > +			rcu_read_unlock();
> >  		}
> >  	}
> >  
> > diff --git a/urcu.c b/urcu.c
> > index 1a276ce..95eea4e 100644
> > --- a/urcu.c
> > +++ b/urcu.c
> > @@ -23,6 +23,8 @@ pthread_mutex_t urcu_mutex = PTHREAD_MUTEX_INITIALIZER;
> >  int urcu_qparity;
> >  
> >  int __thread urcu_active_readers[2];
> > +int __thread urcu_reader_nesting;
> > +int __thread urcu_reader_parity;
> >  
> >  /* Thread IDs of registered readers */
> >  #define INIT_NUM_THREADS 4
> > diff --git a/urcu.h b/urcu.h
> > index 9431da5..6d28ea2 100644
> > --- a/urcu.h
> > +++ b/urcu.h
> > @@ -70,6 +70,8 @@ static inline void atomic_inc(int *v)
> >  extern int urcu_qparity;
> >  
> >  extern int __thread urcu_active_readers[2];
> > +extern int __thread urcu_reader_nesting;
> > +extern int __thread urcu_reader_parity;
> >  
> >  static inline int get_urcu_qparity(void)
> >  {
> > @@ -79,26 +81,32 @@ static inline int get_urcu_qparity(void)
> >  /*
> >   * returns urcu_parity.
> >   */
> > -static inline int rcu_read_lock(void)
> > +static inline void rcu_read_lock(void)
> >  {
> > -	int urcu_parity = get_urcu_qparity();
> > -	urcu_active_readers[urcu_parity]++;
> > -	/*
> > -	 * Increment active readers count before accessing the pointer.
> > -	 * See force_mb_all_threads().
> > -	 */
> > -	barrier();
> > -	return urcu_parity;
> > +	int urcu_parity;
> > +
> > +	if (urcu_reader_nesting++ == 0) {
> > +		urcu_parity = get_urcu_qparity();
> > +		urcu_active_readers[urcu_parity]++;
> > +		urcu_reader_parity = urcu_parity;
> > +		/*
> > +		 * Increment active readers count before accessing the pointer.
> > +		 * See force_mb_all_threads().
> > +		 */
> > +		barrier();
> > +	}
> >  }
> >  
> > -static inline void rcu_read_unlock(int urcu_parity)
> > +static inline void rcu_read_unlock(void)
> >  {
> > -	barrier();
> > -	/*
> > -	 * Finish using rcu before decrementing the pointer.
> > -	 * See force_mb_all_threads().
> > -	 */
> > -	urcu_active_readers[urcu_parity]--;
> > +	if (--urcu_reader_nesting == 0) {
> > +		barrier();
> > +		/*
> > +		 * Finish using rcu before decrementing the pointer.
> > +		 * See force_mb_all_threads().
> > +		 */
> > +		urcu_active_readers[urcu_reader_parity]--;
> > +	}
> >  }
> >  
> >  extern void rcu_write_lock(void);
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-06 16:34     ` Paul E. McKenney
  2009-02-07 15:10       ` Paul E. McKenney
@ 2009-02-08 22:44       ` Mathieu Desnoyers
  2009-02-09  4:11         ` Paul E. McKenney
  1 sibling, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-08 22:44 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel, Robert Wisniewski

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Fri, Feb 06, 2009 at 05:06:40AM -0800, Paul E. McKenney wrote:
> > On Thu, Feb 05, 2009 at 11:58:41PM -0500, Mathieu Desnoyers wrote:
> > > (sorry for repost, I got the ltt-dev email wrong in the previous one)
> > > 
> > > Hi Paul,
> > > 
> > > I figured out I needed some userspace RCU for the userspace tracing part
> > > of LTTng (for quick read access to the control variables) to trace
> > > userspace pthread applications. So I've done a quick-and-dirty userspace
> > > RCU implementation.
> > > 
> > > It works so far, but I have not gone through any formal verification
> > > phase. It seems to work on paper, and the tests are also OK (so far),
> > > but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> > > want to comment on it, it would be welcome. It's a userland-only
> > > library. It's also currently x86-only, but only a few basic definitions
> > > must be adapted in urcu.h to port it.
> > > 
> > > Here is the link to my git tree :
> > > 
> > > git://lttng.org/userspace-rcu.git
> > > 
> > > http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary
> > 
> > Very cool!!!  I will take a look!
> > 
> > I will also point you at a few that I have put together:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > 
> > (In the CodeSamples/defer directory.)
> 
> Interesting approach, using the signal to force memory-barrier execution!
> 
> o	One possible optimization would be to avoid sending a signal to
> 	a blocked thread, as the context switch leading to blocking
> 	will have implied a memory barrier -- otherwise it would not
> 	be safe to resume the thread on some other CPU.  That said,
> 	not sure whether checking to see whether a thread is blocked is
> 	any faster than sending it a signal and forcing it to wake up.
> 

I'm not sure it will be any faster, and it could be racy too. How would
you envision querying the execution state of another thread ?

> 	Of course, this approach does require that the enclosing
> 	application be willing to give up a signal.  I suspect that most
> 	applications would be OK with this, though some might not.
> 

If we want to make this transparent to the application, we'll have to
investigate further in sigaction() and signal() library override I
guess.

> 	Of course, I cannot resist pointing to an old LKML thread:
> 
> 		http://lkml.org/lkml/2001/10/8/189
> 
> 	But I think that the time is now right.  ;-)
> 
> o	I don't understand the purpose of rcu_write_lock() and
> 	rcu_write_unlock().  I am concerned that it will lead people
> 	to decide that a single global lock must protect RCU updates,
> 	which is of course absolutely not the case.  I strongly
> 	suggest making these internal to the urcu.c file.  Yes,
> 	uses of urcu_publish_content() would then hit two locks (the
> 	internal-to-urcu.c one and whatever they are using to protect
> 	their data structure), but let's face it, if you are sending a
> 	signal to each and every thread, the additional overhead of the
> 	extra lock is the least of your worries.
> 

Ok, just changed it.

> 	If you really want to heavily optimize this, I would suggest
> 	setting up a state machine that permits multiple concurrent
> 	calls to urcu_publish_content() to share the same set of signal
> 	invocations.  That way, if the caller has partitioned the
> 	data structure, global locking might be avoided completely
> 	(or at least greatly restricted in scope).
> 

That brings an interesting question about urcu_publish_content :


void *urcu_publish_content(void **ptr, void *new)
{
        void *oldptr;

        internal_urcu_lock();
        oldptr = *ptr;
        *ptr = new;

        switch_qparity();
        switch_qparity();
        internal_urcu_unlock();

        return oldptr;
}

Given that we take a global lock around the pointer assignment, we can
safely assume, from the caller's perspective, that the update will
happen as an "xchg" operation. So if the caller does not have to copy
the old data, it can simply publish the new data without taking any
lock itself.

So the question that arises if we want to remove global locking is :
should we change this 

        oldptr = *ptr;
        *ptr = new;

for an atomic xchg ?


> 	Of course, if updates are rare, the optimization would not
> 	help, but in that case, acquiring two locks would be even less
> 	of a problem.
> 

I plan updates to be quite rare, but it's always good to foresee how
that kind of infrastructure could be misused. :-)

> o	Is urcu_qparity relying on initialization to zero?  Or on the
> 	fact that, for all x, 1-x!=x mod 2^32?  Ah, given that this is
> 	used to index urcu_active_readers[], you must be relying on
> 	initialization to zero.

Yes, starts at 0.

> 
> o	In rcu_read_lock(), why is a non-atomic increment of the
> 	urcu_active_readers[urcu_parity] element safe?  Are you
> 	relying on the compiler generating an x86 add-to-memory
> 	instruction?
> 
> 	Ditto for rcu_read_unlock().
> 
> 	Ah, never mind!!!  I now see the __thread specification,
> 	and the keeping of references to it in the reader_data list.
> 

Exactly :)

> o	Combining the equivalent of rcu_assign_pointer() and
> 	synchronize_rcu() into urcu_publish_content() is an interesting
> 	approach.  Not yet sure whether or not it is a good idea.  I
> 	guess trying it out on several applications would be the way
> 	to find out.  ;-)
> 
> 	That said, I suspect that it would be very convenient in a
> 	number of situations.
> 

I thought so. It seemed to be a natural way to express it to me. Usage
will tell.

> o	It would be good to avoid having to pass the return value
> 	of rcu_read_lock() into rcu_read_unlock().  It should be
> 	possible to avoid this via counter value tricks, though this
> 	would add a bit more code in rcu_read_lock() on 32-bit machines.
> 	(64-bit machines don't have to worry about counter overflow.)
> 
> 	See the recently updated version of CodeSamples/defer/rcu_nest.[ch]
> 	in the aforementioned git archive for a way to do this.
> 	(And perhaps I should apply this change to SRCU...)
> 

See my other mail about this.

> o	Your test looks a bit strange, not sure why you test all the
> 	different variables.  It would be nice to take a test duration
> 	as an argument and run the test for that time.
> 

I made a smaller version which only reads a single variable. I agree
that the initial test was a bit strange on that aspect.

I'll do a version which takes a duration as parameter.

> 	I killed the test after better part of an hour on my laptop,
> 	will retry on a larger machine (after noting the 18 threads
> 	created!).  (And yes, I first tried Power, which objected
> 	strenously to the "mfence" and "lock; incl" instructions,
> 	so getting an x86 machine to try on.)
> 

That should be easy enough to fix. A bit of primitive cut'n'paste would
do.

> Again, looks interesting!  Looks plausible, although I have not 100%
> convinced myself that it is perfectly bug-free.  But I do maintain
> a healthy skepticism of purported RCU algorithms, especially ones that
> I have written.  ;-)
> 

That's always good. I also tend to always be very skeptical about what I
write and review.

Thanks for the thorough review.

Mathieu

> 							Thanx, Paul
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-08 22:36               ` Paul E. McKenney
@ 2009-02-09  0:24                 ` Paul E. McKenney
  2009-02-09  0:54                   ` Mathieu Desnoyers
  2009-02-09  0:40                 ` [ltt-dev] " Mathieu Desnoyers
  1 sibling, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-09  0:24 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel, Robert Wisniewski

On Sun, Feb 08, 2009 at 02:36:06PM -0800, Paul E. McKenney wrote:
> On Sun, Feb 08, 2009 at 04:46:10PM -0500, Mathieu Desnoyers wrote:

[ . . . ]

> > I ran your modified version within my benchmarks :
> > 
> > with return value : 14.164 cycles per read
> > without return value : 16.4017 cycles per read
> > 
> > So we have a 14% performance decrease due to this. We also pollute the
> > branch prediction buffer and we add a cache access due to the added
> > variables in the TLS. Returning the value has the clear advantage of
> > letting the compiler keep it around in registers or on the stack, which
> > clearly costs less.
> > 
> > So I think the speed factor outweights the visual considerations. Maybe
> > we could switch to something like :
> > 
> > unsigned int qparity;
> > 
> > urcu_read_lock(&qparity);
> > ...
> > urcu_read_unlock(&qparity);
> > 
> > That would be a bit like local_irq_save() in the kernel, except that we
> > could do it in a static inline because we pass the address. I
> > personnally dislike the local_irq_save() way of hiding the fact that it
> > writes to the variable in a "clever" macro. I'd really prefer to leave
> > the " & ".
> > 
> > What is your opinion ?
> 
> My current opinion is that I can avoid the overflow problem and the
> need to recheck, which might get rid of the need for both arguments
> and return values while still maintaining good performance.  The trick
> is to use only the topmost bit for the grace-period counter, and all
> the rest of the bits for nesting.  That way, no matter what value of
> global counter one picks up, it will be waited for (since there are but
> two values that the global counter takes on).
> 
> But just now coding it, so will see if it actually works.

Seems to work, and seems to be pretty fast on my machine, anyway.
This one adapts itself to 32- and 64-bit machines, though almost
all of the code is common.  It does do a check, but avoids array
indexing, arguments, and return values.

How does it do on your hardware?

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 test_urcu.c        |    6 +++---
 test_urcu_timing.c |    6 +++---
 urcu.c             |   23 ++++++++++-------------
 urcu.h             |   42 +++++++++++++++++++++++++++++-------------
 4 files changed, 45 insertions(+), 32 deletions(-)

diff --git a/test_urcu.c b/test_urcu.c
index f6be45b..f115a4a 100644
--- a/test_urcu.c
+++ b/test_urcu.c
@@ -72,7 +72,7 @@ void rcu_copy_mutex_unlock(void)
 
 void *thr_reader(void *arg)
 {
-	int qparity, i, j;
+	int i, j;
 	struct test_array *local_ptr;
 
 	printf("thread %s, thread id : %lx, tid %lu\n",
@@ -83,14 +83,14 @@ void *thr_reader(void *arg)
 
 	for (i = 0; i < 100000; i++) {
 		for (j = 0; j < 100000000; j++) {
-			rcu_read_lock(&qparity);
+			rcu_read_lock();
 			local_ptr = rcu_dereference(test_rcu_pointer);
 			if (local_ptr) {
 				assert(local_ptr->a == 8);
 				assert(local_ptr->b == 12);
 				assert(local_ptr->c[55] == 2);
 			}
-			rcu_read_unlock(&qparity);
+			rcu_read_unlock();
 		}
 	}
 
diff --git a/test_urcu_timing.c b/test_urcu_timing.c
index 57fda4f..9903705 100644
--- a/test_urcu_timing.c
+++ b/test_urcu_timing.c
@@ -94,7 +94,7 @@ static cycles_t reader_time[NR_READ] __attribute__((aligned(128)));
 
 void *thr_reader(void *arg)
 {
-	int qparity, i, j;
+	int i, j;
 	struct test_array *local_ptr;
 	cycles_t time1, time2;
 
@@ -107,12 +107,12 @@ void *thr_reader(void *arg)
 	time1 = get_cycles();
 	for (i = 0; i < OUTER_READ_LOOP; i++) {
 		for (j = 0; j < INNER_READ_LOOP; j++) {
-			rcu_read_lock(&qparity);
+			rcu_read_lock();
 			local_ptr = rcu_dereference(test_rcu_pointer);
 			if (local_ptr) {
 				assert(local_ptr->a == 8);
 			}
-			rcu_read_unlock(&qparity);
+			rcu_read_unlock();
 		}
 	}
 	time2 = get_cycles();
diff --git a/urcu.c b/urcu.c
index 08fb75d..2914b66 100644
--- a/urcu.c
+++ b/urcu.c
@@ -19,17 +19,17 @@
 
 pthread_mutex_t urcu_mutex = PTHREAD_MUTEX_INITIALIZER;
 
-/* Global quiescent period parity */
-int urcu_qparity;
+/* Global grace period counter */
+long urcu_gp_ctr;
 
-int __thread urcu_active_readers[2];
+long __thread urcu_active_readers;
 
 /* Thread IDs of registered readers */
 #define INIT_NUM_THREADS 4
 
 struct reader_data {
 	pthread_t tid;
-	int *urcu_active_readers;
+	long *urcu_active_readers;
 };
 
 static struct reader_data *reader_data;
@@ -60,11 +60,9 @@ void internal_urcu_unlock(void)
 /*
  * called with urcu_mutex held.
  */
-static int switch_next_urcu_qparity(void)
+static void switch_next_urcu_qparity(void)
 {
-	int old_parity = urcu_qparity;
-	urcu_qparity = 1 - old_parity;
-	return old_parity;
+	urcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT;
 }
 
 static void force_mb_all_threads(void)
@@ -89,7 +87,7 @@ static void force_mb_all_threads(void)
 	mb();	/* read sig_done before ending the barrier */
 }
 
-void wait_for_quiescent_state(int parity)
+void wait_for_quiescent_state(void)
 {
 	struct reader_data *index;
 
@@ -101,7 +99,7 @@ void wait_for_quiescent_state(int parity)
 		/*
 		 * BUSY-LOOP.
 		 */
-		while (index->urcu_active_readers[parity] != 0)
+		while (rcu_old_gp_ongoing(index->urcu_active_readers))
 			barrier();
 	}
 	/*
@@ -115,17 +113,16 @@ void wait_for_quiescent_state(int parity)
 
 static void switch_qparity(void)
 {
-	int prev_parity;
 
 	/* All threads should read qparity before accessing data structure. */
 	/* Write ptr before changing the qparity */
 	force_mb_all_threads();
-	prev_parity = switch_next_urcu_qparity();
+	switch_next_urcu_qparity();
 
 	/*
 	 * Wait for previous parity to be empty of readers.
 	 */
-	wait_for_quiescent_state(prev_parity);
+	wait_for_quiescent_state();
 }
 
 void synchronize_rcu(void)
diff --git a/urcu.h b/urcu.h
index b6b5c7b..e83c69f 100644
--- a/urcu.h
+++ b/urcu.h
@@ -66,23 +66,39 @@ static inline void atomic_inc(int *v)
 
 #define SIGURCU SIGUSR1
 
-/* Global quiescent period parity */
-extern int urcu_qparity;
+#define RCU_GP_CTR_BOTTOM_BIT (sizeof(long) == 4 ? 0x80000000 : 0x100L)
+#define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BOTTOM_BIT - 1)
 
-extern int __thread urcu_active_readers[2];
+/* Global quiescent period counter with low-order bits unused. */
+extern long urcu_gp_ctr;
 
-static inline int get_urcu_qparity(void)
+extern long __thread urcu_active_readers;
+
+static inline int rcu_old_gp_ongoing(long *value)
 {
-	return urcu_qparity;
+	long v;
+
+	if (value == NULL)
+		return 0;
+	v = ACCESS_ONCE(*value);
+	if (sizeof(long) == 4) {
+		return (v & RCU_GP_CTR_NEST_MASK) &&
+		       ((v ^ ACCESS_ONCE(urcu_gp_ctr)) & ~RCU_GP_CTR_NEST_MASK);
+	} else {
+		return (v & RCU_GP_CTR_NEST_MASK) &&
+		       (v - ACCESS_ONCE(urcu_gp_ctr) < 0);
+	}
 }
 
-/*
- * urcu_parity should be declared on the caller's stack.
- */
-static inline void rcu_read_lock(int *urcu_parity)
+static inline void rcu_read_lock(void)
 {
-	*urcu_parity = get_urcu_qparity();
-	urcu_active_readers[*urcu_parity]++;
+	long tmp;
+
+	tmp = urcu_active_readers;
+	if ((tmp & RCU_GP_CTR_NEST_MASK) == 0)
+		urcu_active_readers = urcu_gp_ctr + 1;
+	else
+		urcu_active_readers = tmp + 1;
 	/*
 	 * Increment active readers count before accessing the pointer.
 	 * See force_mb_all_threads().
@@ -90,14 +106,14 @@ static inline void rcu_read_lock(int *urcu_parity)
 	barrier();
 }
 
-static inline void rcu_read_unlock(int *urcu_parity)
+static inline void rcu_read_unlock(void)
 {
 	barrier();
 	/*
 	 * Finish using rcu before decrementing the pointer.
 	 * See force_mb_all_threads().
 	 */
-	urcu_active_readers[*urcu_parity]--;
+	urcu_active_readers--;
 }
 
 extern void *urcu_publish_content(void **ptr, void *new);

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-08 22:36               ` Paul E. McKenney
  2009-02-09  0:24                 ` Paul E. McKenney
@ 2009-02-09  0:40                 ` Mathieu Desnoyers
  1 sibling, 0 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-09  0:40 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sun, Feb 08, 2009 at 04:46:10PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Sat, Feb 07, 2009 at 06:38:27PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Fri, Feb 06, 2009 at 08:34:32AM -0800, Paul E. McKenney wrote:
> > > > > > On Fri, Feb 06, 2009 at 05:06:40AM -0800, Paul E. McKenney wrote:
> > > > > > > On Thu, Feb 05, 2009 at 11:58:41PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > (sorry for repost, I got the ltt-dev email wrong in the previous one)
> > > > > > > > 
> > > > > > > > Hi Paul,
> > > > > > > > 
> > > > > > > > I figured out I needed some userspace RCU for the userspace tracing part
> > > > > > > > of LTTng (for quick read access to the control variables) to trace
> > > > > > > > userspace pthread applications. So I've done a quick-and-dirty userspace
> > > > > > > > RCU implementation.
> > > > > > > > 
> > > > > > > > It works so far, but I have not gone through any formal verification
> > > > > > > > phase. It seems to work on paper, and the tests are also OK (so far),
> > > > > > > > but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> > > > > > > > want to comment on it, it would be welcome. It's a userland-only
> > > > > > > > library. It's also currently x86-only, but only a few basic definitions
> > > > > > > > must be adapted in urcu.h to port it.
> > > > > > > > 
> > > > > > > > Here is the link to my git tree :
> > > > > > > > 
> > > > > > > > git://lttng.org/userspace-rcu.git
> > > > > > > > 
> > > > > > > > http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary
> > > > > > > 
> > > > > > > Very cool!!!  I will take a look!
> > > > > > > 
> > > > > > > I will also point you at a few that I have put together:
> > > > > > > 
> > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > > > > > 
> > > > > > > (In the CodeSamples/defer directory.)
> > > > > > 
> > > > > > Interesting approach, using the signal to force memory-barrier execution!
> > > > > > 
> > > > > > o	One possible optimization would be to avoid sending a signal to
> > > > > > 	a blocked thread, as the context switch leading to blocking
> > > > > > 	will have implied a memory barrier -- otherwise it would not
> > > > > > 	be safe to resume the thread on some other CPU.  That said,
> > > > > > 	not sure whether checking to see whether a thread is blocked is
> > > > > > 	any faster than sending it a signal and forcing it to wake up.
> > > > > > 
> > > > > > 	Of course, this approach does require that the enclosing
> > > > > > 	application be willing to give up a signal.  I suspect that most
> > > > > > 	applications would be OK with this, though some might not.
> > > > > > 
> > > > > > 	Of course, I cannot resist pointing to an old LKML thread:
> > > > > > 
> > > > > > 		http://lkml.org/lkml/2001/10/8/189
> > > > > > 
> > > > > > 	But I think that the time is now right.  ;-)
> > > > > > 
> > > > > > o	I don't understand the purpose of rcu_write_lock() and
> > > > > > 	rcu_write_unlock().  I am concerned that it will lead people
> > > > > > 	to decide that a single global lock must protect RCU updates,
> > > > > > 	which is of course absolutely not the case.  I strongly
> > > > > > 	suggest making these internal to the urcu.c file.  Yes,
> > > > > > 	uses of urcu_publish_content() would then hit two locks (the
> > > > > > 	internal-to-urcu.c one and whatever they are using to protect
> > > > > > 	their data structure), but let's face it, if you are sending a
> > > > > > 	signal to each and every thread, the additional overhead of the
> > > > > > 	extra lock is the least of your worries.
> > > > > > 
> > > > > > 	If you really want to heavily optimize this, I would suggest
> > > > > > 	setting up a state machine that permits multiple concurrent
> > > > > > 	calls to urcu_publish_content() to share the same set of signal
> > > > > > 	invocations.  That way, if the caller has partitioned the
> > > > > > 	data structure, global locking might be avoided completely
> > > > > > 	(or at least greatly restricted in scope).
> > > > > > 
> > > > > > 	Of course, if updates are rare, the optimization would not
> > > > > > 	help, but in that case, acquiring two locks would be even less
> > > > > > 	of a problem.
> > > > > > 
> > > > > > o	Is urcu_qparity relying on initialization to zero?  Or on the
> > > > > > 	fact that, for all x, 1-x!=x mod 2^32?  Ah, given that this is
> > > > > > 	used to index urcu_active_readers[], you must be relying on
> > > > > > 	initialization to zero.
> > > > > > 
> > > > > > o	In rcu_read_lock(), why is a non-atomic increment of the
> > > > > > 	urcu_active_readers[urcu_parity] element safe?  Are you
> > > > > > 	relying on the compiler generating an x86 add-to-memory
> > > > > > 	instruction?
> > > > > > 
> > > > > > 	Ditto for rcu_read_unlock().
> > > > > > 
> > > > > > 	Ah, never mind!!!  I now see the __thread specification,
> > > > > > 	and the keeping of references to it in the reader_data list.
> > > > > > 
> > > > > > o	Combining the equivalent of rcu_assign_pointer() and
> > > > > > 	synchronize_rcu() into urcu_publish_content() is an interesting
> > > > > > 	approach.  Not yet sure whether or not it is a good idea.  I
> > > > > > 	guess trying it out on several applications would be the way
> > > > > > 	to find out.  ;-)
> > > > > > 
> > > > > > 	That said, I suspect that it would be very convenient in a
> > > > > > 	number of situations.
> > > > > > 
> > > > > > o	It would be good to avoid having to pass the return value
> > > > > > 	of rcu_read_lock() into rcu_read_unlock().  It should be
> > > > > > 	possible to avoid this via counter value tricks, though this
> > > > > > 	would add a bit more code in rcu_read_lock() on 32-bit machines.
> > > > > > 	(64-bit machines don't have to worry about counter overflow.)
> > > > > > 
> > > > > > 	See the recently updated version of CodeSamples/defer/rcu_nest.[ch]
> > > > > > 	in the aforementioned git archive for a way to do this.
> > > > > > 	(And perhaps I should apply this change to SRCU...)
> > > > > > 
> > > > > > o	Your test looks a bit strange, not sure why you test all the
> > > > > > 	different variables.  It would be nice to take a test duration
> > > > > > 	as an argument and run the test for that time.
> > > > > > 
> > > > > > 	I killed the test after better part of an hour on my laptop,
> > > > > > 	will retry on a larger machine (after noting the 18 threads
> > > > > > 	created!).  (And yes, I first tried Power, which objected
> > > > > > 	strenously to the "mfence" and "lock; incl" instructions,
> > > > > > 	so getting an x86 machine to try on.)
> > > > > > 
> > > > > > Again, looks interesting!  Looks plausible, although I have not 100%
> > > > > > convinced myself that it is perfectly bug-free.  But I do maintain
> > > > > > a healthy skepticism of purported RCU algorithms, especially ones that
> > > > > > I have written.  ;-)
> > > > > 
> > > > > OK, here is one sequence of concern...
> > > > > 
> > > > 
> > > > Let's see..
> > > > 
> > > > > o	Thread 0 starts rcu_read_lock(), picking up the current
> > > > > 	get_urcu_qparity() into the local variable urcu_parity().
> > > > > 	Assume that the value returned is zero.
> > > > > 
> > > > > o	Thread 0 is now preempted.
> > > > > 
> > > > > o	Thread 1 invokes urcu_publish_content():
> > > > > 
> > > > > 	o	It substitutes the pointer.
> > > > > 
> > > > > 	o	It forces all threads to execute a memory barrier
> > > > > 		(thread 0 runs just long enough to process its signal
> > > > > 		and then is immediately preempted again).
> > > > > 
> > > > > 	o	It switches the parity, which is now one.
> > > > > 
> > > > > 	o	It waits for all readers on parity zero, and there are
> > > > > 		none, because thread 0 has not yet registered itself.
> > > > > 
> > > > > 	o	It therefore returns the old pointer.  So far, so good.
> > > > > 
> > > > > o	Thread 0 now resumes:
> > > > > 
> > > > > 	o	It increments its urcu_active_readers[0].
> > > > > 
> > > > > 	o	It forces a compiler barrier.
> > > > > 
> > > > > 	o	It returns zero (why not store this in thread-local
> > > > > 		storage rather than returning?).
> > > > > 
> > > > 
> > > > To support nested rcu_read_locks. (that's the only reason)
> > > 
> > > A patch below to allow nested rcu_read_lock() while keeping to the Linux
> > > kernel API, just FYI.  One can argue that the overhead of accessing the
> > > extra per-thread variables is offset by the fact that there no longer
> > > needs to be a return value from rcu_read_lock() nor an argument to
> > > rcu_read_unlock(), but hard to say.
> > > 
> > 
> > I ran your modified version within my benchmarks :
> > 
> > with return value : 14.164 cycles per read
> > without return value : 16.4017 cycles per read
> > 
> > So we have a 14% performance decrease due to this. We also pollute the
> > branch prediction buffer and we add a cache access due to the added
> > variables in the TLS. Returning the value has the clear advantage of
> > letting the compiler keep it around in registers or on the stack, which
> > clearly costs less.
> > 
> > So I think the speed factor outweights the visual considerations. Maybe
> > we could switch to something like :
> > 
> > unsigned int qparity;
> > 
> > urcu_read_lock(&qparity);
> > ...
> > urcu_read_unlock(&qparity);
> > 
> > That would be a bit like local_irq_save() in the kernel, except that we
> > could do it in a static inline because we pass the address. I
> > personnally dislike the local_irq_save() way of hiding the fact that it
> > writes to the variable in a "clever" macro. I'd really prefer to leave
> > the " & ".
> > 
> > What is your opinion ?
> 
> My current opinion is that I can avoid the overflow problem and the
> need to recheck, which might get rid of the need for both arguments
> and return values while still maintaining good performance.  The trick
> is to use only the topmost bit for the grace-period counter, and all
> the rest of the bits for nesting.  That way, no matter what value of
> global counter one picks up, it will be waited for (since there are but
> two values that the global counter takes on).
> 
> But just now coding it, so will see if it actually works.
> 

I look forward to see and test it.

> > > > > 	o	It enters its critical section, obtaining a reference
> > > > > 		to the new pointer that thread 1 just published.
> > > > > 
> > > > > o	Thread 1 now again invokes urcu_publish_content():
> > > > >  
> > > > > 	o	It substitutes the pointer.
> > > > > 
> > > > > 	o	It forces all threads to execute a memory barrier,
> > > > > 		including thread 0.
> > > > > 
> > > > > 	o	It switches the parity, which is now zero.
> > > > > 
> > > > > 	o	It waits for all readers on parity one, and there are
> > > > > 		none, because thread 0 has registered itself on parity
> > > > > 		zero!!!
> > > > > 
> > > > > 	o	Thread 1 therefore returns the old pointer.
> > > > > 
> > > > > 	o	Thread 1 frees the old pointer, which thread 0 is still
> > > > > 		using!!!
> > > > > 
> > > > 
> > > > Ah, yes, you are right.
> > > > 
> > > > > So, how to fix?  Here are some approaches:
> > > > > 
> > > > > o	Make urcu_publish_content() do two parity flips rather than one.
> > > > > 	I use this approach in my rcu_rcpg, rcu_rcpl, and rcu_rcpls
> > > > > 	algorithms in CodeSamples/defer.
> > > > 
> > > > This approach seems very interesting.
> > > 
> > > Patch in earlier email.  ;-)
> > > 
> > > > > o	Use a single free-running counter, in a manner similar to rcu_nest,
> > > > > 	as suggested earlier.  This one is interesting, as I rely on a
> > > > > 	read-side memory barrier to handle the long-preemption case.
> > > > > 	However, if you believe that any thread that waits several minutes
> > > > > 	between executing adjacent instructions must have been preempted
> > > > > 	(which the memory barriers that are required to do a context
> > > > > 	switch), then a compiler barrier suffices.  ;-)
> > > > 
> > > > Hrm, I'm trying to figure out what kind of memory backend you need to
> > > > put your counters for each quiescent state period. Is this free-running
> > > > counter indexing a very large array ? I doubt it does. Then how does it
> > > > make sure we don't roll back to the old array entries ?
> > > 
> > > There is no array, just a global counter that is incremented by a modest
> > > power of two for each grace period.  Then the outermost rcu_read_lock()
> > > records the one greater than current value of the global counter in its
> > > per-thread variable.
> > > 
> > > Now, rcu_read_lock() can tell that it is outermost by examining the
> > > low-order bits of its per-thread variable -- if these bits are zero,
> > > then this is the outermost rcu_read_lock().  So if rcu_read_lock() sees
> > > that it is nested, it simply increments its per-thread counter.
> > > 
> > > Then rcu_read_unlock() simply decrements its per-thread variable.
> > > 
> > > If the counter is only 32 bits, it is subject to overflow.  In that case,
> > > it is necessary to check for the counter having been incremented a huge
> > > number of times between the time the outermost rcu_read_lock() fetched
> > > the counter value and the time that it stored into its per-thread
> > > variable.
> > > 
> > > An admittedly crude implementation of this approach may be found in
> > > CodeSamples/defer/rcu_nest.[hc] in:
> > > 
> > > 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > 
> > > Of course, if the counter is 64 bits, overflow can safely be ignored.
> > > If you have a grace period every microsecond and allow RCU read-side
> > > critical sections to be nested 255 deep, it would take more than 2,000
> > > years to overflow.  ;-)
> > > 
> > 
> > Looking at the code, my first thought is : if we find out that the
> > array-based solution and the counter-based solution have the same
> > performance, I would definitely prefer the array-based version because
> > there are far less overflow considerations. It's therefore more solid
> > algorithmically and can be proven formally.
> > 
> > Also, I'm not sure I fully understand where your overflow test is going.
> > So let's pretend we are a reader, nested inside other rcu read locks,
> > and we arrive much later after the outermost reader has read the
> > rcu_gp_ctr. After 255 increments actually :
> > 
> > static void rcu_read_lock(void)
> > {
> >         long tmp;
> >         long *rrgp;
> > 
> >         /*
> >          * If this is the outermost RCU read-side critical section,
> >          * copy the global grace-period counter.  In either case,
> >          * increment the nesting count held in the low-order bits.
> >          */
> > 
> >         rrgp = &__get_thread_var(rcu_reader_gp);
> > retry:
> >         tmp = *rrgp;
> > # we read the local rrgp
> >         if ((tmp & RCU_GP_CTR_NEST_MASK) == 0)
> >                 tmp = rcu_gp_ctr;
> > # not executed, innermost and nested.
> >         tmp++;
> >         *rrgp = tmp;
> > # increment the local count and write it to the local rrgp
> >         smp_mb();
> >         if (((tmp & RCU_GP_CTR_NEST_MASK) == 1) &&
> >             ((rcu_gp_ctr - tmp) > (RCU_GP_CTR_NEST_MASK << 8)) != 0) {
> >                 (*rrgp)--;
> >                 goto retry;
> > # If we are more than 255 increments away from rcu_gp_ctr, decrement
> > # rrgp and loop
> >         }
> > }
> > 
> > The problem is : rcu_gp_ctr is advancing. So if we have tmp stucked at a
> > given value, and we are nested over the outermost read lock (therefore
> > we are making it impossible to go end its execution), then when the
> > rcu_gp_crt will advance (which is the only way things can eventually go
> > forward, because the local rrgp is set back to its original value), we
> > are just going to be _farther_ away from it (not closer). So we'll have
> > to wait for a complete type overflow (will take a while on 32-bits, and
> > a very long while on 64-bits) to have the test returning false and then
> > going forward.
> > 
> > Or there might be something I misunderstood ?
> 
> The first clause of the "if" statement should prevent this -- if we are
> not the outermost rcu_read_lock(), then we never retry.  (If I understand
> your scenario.)
> 

Ah, yes. The if (((tmp & RCU_GP_CTR_NEST_MASK) == 1) &&  tests if this
is the outermost read lock due to the tmp++. My mistake.

> > > > This latter solution could break jump-based probing of programs
> > > > soon-to-be available in gcc. The probes are meant to be of short
> > > > duration, but the fact is that this design lets the debugger inject code
> > > > without resorting to a breakpoint, which might therefore break your
> > > > "short time between instructions" assumption. It's very unlikely, but
> > > > possible.
> > > 
> > > But would the debugger's code injection take more than a minute without
> > > doing a context switch?  Ah -- you are thinking of a probe that spins
> > > for several minutes.  Yes, this would be strange, but not impossible.
> > > 
> > > OK, so for this usage, solution 1 it is!
> > 
> > Yes, it's unlikely, but possible.. and I like to design things assuming
> > the worse case scenario, even if it's almost impossible.
> 
> That is indeed the only way to get even semi-reliable software!
> 

Yes. By the way, I just committed the "duration" modification to
rcu_test.c. I also added some debugging which calls sched_yield() either
for the reader, the writer, or both. I also integrated some randomness
to leave some going quickly and others slowly.

Mathieu

> 							Thanx, Paul
> 
> > Mathieu
> > 
> > > > > Of course, the probability of seeing this failure during test is quite
> > > > > low, since it is unlikely that thread 0 would run just long enough to
> > > > > execute its signal handler.  However, it could happen.  And if you were
> > > > > to adapt this algorithm for use in a real-time application, then priority
> > > > > boosting could cause this to happen naturally.
> > > > 
> > > > Yes. It's not because we are not able to create the faulty condition
> > > > that it will _never_ happen. It must therefore be taken care of.
> > > 
> > > No argument here!!!  ;-)  See the earlier patch for one way to fix.
> > > 
> > > The following patch makes rcu_read_lock() back into a void function
> > > while still permitting nesting, for whatever it is worth.
> > > 
> > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > ---
> > > 
> > >  test_urcu.c |    6 +++---
> > >  urcu.c      |    2 ++
> > >  urcu.h      |   40 ++++++++++++++++++++++++----------------
> > >  3 files changed, 29 insertions(+), 19 deletions(-)
> > > 
> > > diff --git a/test_urcu.c b/test_urcu.c
> > > index db0b68c..16b212b 100644
> > > --- a/test_urcu.c
> > > +++ b/test_urcu.c
> > > @@ -33,7 +33,7 @@ static struct test_array *test_rcu_pointer;
> > >  
> > >  void *thr_reader(void *arg)
> > >  {
> > > -	int qparity, i, j;
> > > +	int i, j;
> > >  	struct test_array *local_ptr;
> > >  
> > >  	printf("thread %s, thread id : %lu, pid %lu\n",
> > > @@ -44,14 +44,14 @@ void *thr_reader(void *arg)
> > >  
> > >  	for (i = 0; i < 100000; i++) {
> > >  		for (j = 0; j < 100000000; j++) {
> > > -			qparity = rcu_read_lock();
> > > +			rcu_read_lock();
> > >  			local_ptr = rcu_dereference(test_rcu_pointer);
> > >  			if (local_ptr) {
> > >  				assert(local_ptr->a == 8);
> > >  				assert(local_ptr->b == 12);
> > >  				assert(local_ptr->c[55] == 2);
> > >  			}
> > > -			rcu_read_unlock(qparity);
> > > +			rcu_read_unlock();
> > >  		}
> > >  	}
> > >  
> > > diff --git a/urcu.c b/urcu.c
> > > index 1a276ce..95eea4e 100644
> > > --- a/urcu.c
> > > +++ b/urcu.c
> > > @@ -23,6 +23,8 @@ pthread_mutex_t urcu_mutex = PTHREAD_MUTEX_INITIALIZER;
> > >  int urcu_qparity;
> > >  
> > >  int __thread urcu_active_readers[2];
> > > +int __thread urcu_reader_nesting;
> > > +int __thread urcu_reader_parity;
> > >  
> > >  /* Thread IDs of registered readers */
> > >  #define INIT_NUM_THREADS 4
> > > diff --git a/urcu.h b/urcu.h
> > > index 9431da5..6d28ea2 100644
> > > --- a/urcu.h
> > > +++ b/urcu.h
> > > @@ -70,6 +70,8 @@ static inline void atomic_inc(int *v)
> > >  extern int urcu_qparity;
> > >  
> > >  extern int __thread urcu_active_readers[2];
> > > +extern int __thread urcu_reader_nesting;
> > > +extern int __thread urcu_reader_parity;
> > >  
> > >  static inline int get_urcu_qparity(void)
> > >  {
> > > @@ -79,26 +81,32 @@ static inline int get_urcu_qparity(void)
> > >  /*
> > >   * returns urcu_parity.
> > >   */
> > > -static inline int rcu_read_lock(void)
> > > +static inline void rcu_read_lock(void)
> > >  {
> > > -	int urcu_parity = get_urcu_qparity();
> > > -	urcu_active_readers[urcu_parity]++;
> > > -	/*
> > > -	 * Increment active readers count before accessing the pointer.
> > > -	 * See force_mb_all_threads().
> > > -	 */
> > > -	barrier();
> > > -	return urcu_parity;
> > > +	int urcu_parity;
> > > +
> > > +	if (urcu_reader_nesting++ == 0) {
> > > +		urcu_parity = get_urcu_qparity();
> > > +		urcu_active_readers[urcu_parity]++;
> > > +		urcu_reader_parity = urcu_parity;
> > > +		/*
> > > +		 * Increment active readers count before accessing the pointer.
> > > +		 * See force_mb_all_threads().
> > > +		 */
> > > +		barrier();
> > > +	}
> > >  }
> > >  
> > > -static inline void rcu_read_unlock(int urcu_parity)
> > > +static inline void rcu_read_unlock(void)
> > >  {
> > > -	barrier();
> > > -	/*
> > > -	 * Finish using rcu before decrementing the pointer.
> > > -	 * See force_mb_all_threads().
> > > -	 */
> > > -	urcu_active_readers[urcu_parity]--;
> > > +	if (--urcu_reader_nesting == 0) {
> > > +		barrier();
> > > +		/*
> > > +		 * Finish using rcu before decrementing the pointer.
> > > +		 * See force_mb_all_threads().
> > > +		 */
> > > +		urcu_active_readers[urcu_reader_parity]--;
> > > +	}
> > >  }
> > >  
> > >  extern void rcu_write_lock(void);
> > > 
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09  0:24                 ` Paul E. McKenney
@ 2009-02-09  0:54                   ` Mathieu Desnoyers
  2009-02-09  1:08                     ` [ltt-dev] " Mathieu Desnoyers
  2009-02-09  3:42                     ` Paul E. McKenney
  0 siblings, 2 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-09  0:54 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel, Robert Wisniewski

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sun, Feb 08, 2009 at 02:36:06PM -0800, Paul E. McKenney wrote:
> > On Sun, Feb 08, 2009 at 04:46:10PM -0500, Mathieu Desnoyers wrote:
> 
> [ . . . ]
> 
> > > I ran your modified version within my benchmarks :
> > > 
> > > with return value : 14.164 cycles per read
> > > without return value : 16.4017 cycles per read
> > > 
> > > So we have a 14% performance decrease due to this. We also pollute the
> > > branch prediction buffer and we add a cache access due to the added
> > > variables in the TLS. Returning the value has the clear advantage of
> > > letting the compiler keep it around in registers or on the stack, which
> > > clearly costs less.
> > > 
> > > So I think the speed factor outweights the visual considerations. Maybe
> > > we could switch to something like :
> > > 
> > > unsigned int qparity;
> > > 
> > > urcu_read_lock(&qparity);
> > > ...
> > > urcu_read_unlock(&qparity);
> > > 
> > > That would be a bit like local_irq_save() in the kernel, except that we
> > > could do it in a static inline because we pass the address. I
> > > personnally dislike the local_irq_save() way of hiding the fact that it
> > > writes to the variable in a "clever" macro. I'd really prefer to leave
> > > the " & ".
> > > 
> > > What is your opinion ?
> > 
> > My current opinion is that I can avoid the overflow problem and the
> > need to recheck, which might get rid of the need for both arguments
> > and return values while still maintaining good performance.  The trick
> > is to use only the topmost bit for the grace-period counter, and all
> > the rest of the bits for nesting.  That way, no matter what value of
> > global counter one picks up, it will be waited for (since there are but
> > two values that the global counter takes on).
> > 
> > But just now coding it, so will see if it actually works.
> 
> Seems to work, and seems to be pretty fast on my machine, anyway.
> This one adapts itself to 32- and 64-bit machines, though almost
> all of the code is common.  It does do a check, but avoids array
> indexing, arguments, and return values.
> 
> How does it do on your hardware?
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Wow...

Patch updated against HEAD.

Time per read : 7.53622 cycles

Half of what we had previously.. I'll have to look at the assembly. :)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---

 test_urcu.c        |    6 +++---
 test_urcu_timing.c |    6 +++---
 urcu.c             |   23 ++++++++++-------------
 urcu.h             |   42 +++++++++++++++++++++++++++++-------------
 4 files changed, 45 insertions(+), 32 deletions(-)

diff --git a/test_urcu.c b/test_urcu.c
index f6be45b..f115a4a 100644
--- a/test_urcu.c
+++ b/test_urcu.c
@@ -72,7 +72,7 @@ void rcu_copy_mutex_unlock(void)
 
 void *thr_reader(void *arg)
 {
-	int qparity, i, j;
+	int i, j;
 	struct test_array *local_ptr;
 
 	printf("thread %s, thread id : %lx, tid %lu\n",
@@ -83,14 +83,14 @@ void *thr_reader(void *arg)
 
 	for (i = 0; i < 100000; i++) {
 		for (j = 0; j < 100000000; j++) {
-			rcu_read_lock(&qparity);
+			rcu_read_lock();
 			local_ptr = rcu_dereference(test_rcu_pointer);
 			if (local_ptr) {
 				assert(local_ptr->a == 8);
 				assert(local_ptr->b == 12);
 				assert(local_ptr->c[55] == 2);
 			}
-			rcu_read_unlock(&qparity);
+			rcu_read_unlock();
 		}
 	}
 
diff --git a/test_urcu_timing.c b/test_urcu_timing.c
index 57fda4f..9903705 100644
--- a/test_urcu_timing.c
+++ b/test_urcu_timing.c
@@ -94,7 +94,7 @@ static cycles_t reader_time[NR_READ] __attribute__((aligned(128)));
 
 void *thr_reader(void *arg)
 {
-	int qparity, i, j;
+	int i, j;
 	struct test_array *local_ptr;
 	cycles_t time1, time2;
 
@@ -107,12 +107,12 @@ void *thr_reader(void *arg)
 	time1 = get_cycles();
 	for (i = 0; i < OUTER_READ_LOOP; i++) {
 		for (j = 0; j < INNER_READ_LOOP; j++) {
-			rcu_read_lock(&qparity);
+			rcu_read_lock();
 			local_ptr = rcu_dereference(test_rcu_pointer);
 			if (local_ptr) {
 				assert(local_ptr->a == 8);
 			}
-			rcu_read_unlock(&qparity);
+			rcu_read_unlock();
 		}
 	}
 	time2 = get_cycles();
diff --git a/urcu.c b/urcu.c
index 08fb75d..2914b66 100644
--- a/urcu.c
+++ b/urcu.c
@@ -19,17 +19,17 @@
 
 pthread_mutex_t urcu_mutex = PTHREAD_MUTEX_INITIALIZER;
 
-/* Global quiescent period parity */
-int urcu_qparity;
+/* Global grace period counter */
+long urcu_gp_ctr;
 
-int __thread urcu_active_readers[2];
+long __thread urcu_active_readers;
 
 /* Thread IDs of registered readers */
 #define INIT_NUM_THREADS 4
 
 struct reader_data {
 	pthread_t tid;
-	int *urcu_active_readers;
+	long *urcu_active_readers;
 };
 
 static struct reader_data *reader_data;
@@ -60,11 +60,9 @@ void internal_urcu_unlock(void)
 /*
  * called with urcu_mutex held.
  */
-static int switch_next_urcu_qparity(void)
+static void switch_next_urcu_qparity(void)
 {
-	int old_parity = urcu_qparity;
-	urcu_qparity = 1 - old_parity;
-	return old_parity;
+	urcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT;
 }
 
 static void force_mb_all_threads(void)
@@ -89,7 +87,7 @@ static void force_mb_all_threads(void)
 	mb();	/* read sig_done before ending the barrier */
 }
 
-void wait_for_quiescent_state(int parity)
+void wait_for_quiescent_state(void)
 {
 	struct reader_data *index;
 
@@ -101,7 +99,7 @@ void wait_for_quiescent_state(int parity)
 		/*
 		 * BUSY-LOOP.
 		 */
-		while (index->urcu_active_readers[parity] != 0)
+		while (rcu_old_gp_ongoing(index->urcu_active_readers))
 			barrier();
 	}
 	/*
@@ -115,17 +113,16 @@ void wait_for_quiescent_state(int parity)
 
 static void switch_qparity(void)
 {
-	int prev_parity;
 
 	/* All threads should read qparity before accessing data structure. */
 	/* Write ptr before changing the qparity */
 	force_mb_all_threads();
-	prev_parity = switch_next_urcu_qparity();
+	switch_next_urcu_qparity();
 
 	/*
 	 * Wait for previous parity to be empty of readers.
 	 */
-	wait_for_quiescent_state(prev_parity);
+	wait_for_quiescent_state();
 }
 
 void synchronize_rcu(void)
diff --git a/urcu.h b/urcu.h
index b6b5c7b..e83c69f 100644
--- a/urcu.h
+++ b/urcu.h
@@ -66,23 +66,39 @@ static inline void atomic_inc(int *v)
 
 #define SIGURCU SIGUSR1
 
-/* Global quiescent period parity */
-extern int urcu_qparity;
+#define RCU_GP_CTR_BOTTOM_BIT (sizeof(long) == 4 ? 0x80000000 : 0x100L)
+#define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BOTTOM_BIT - 1)
 
-extern int __thread urcu_active_readers[2];
+/* Global quiescent period counter with low-order bits unused. */
+extern long urcu_gp_ctr;
 
-static inline int get_urcu_qparity(void)
+extern long __thread urcu_active_readers;
+
+static inline int rcu_old_gp_ongoing(long *value)
 {
-	return urcu_qparity;
+	long v;
+
+	if (value == NULL)
+		return 0;
+	v = ACCESS_ONCE(*value);
+	if (sizeof(long) == 4) {
+		return (v & RCU_GP_CTR_NEST_MASK) &&
+		       ((v ^ ACCESS_ONCE(urcu_gp_ctr)) & ~RCU_GP_CTR_NEST_MASK);
+	} else {
+		return (v & RCU_GP_CTR_NEST_MASK) &&
+		       (v - ACCESS_ONCE(urcu_gp_ctr) < 0);
+	}
 }
 
-/*
- * urcu_parity should be declared on the caller's stack.
- */
-static inline void rcu_read_lock(int *urcu_parity)
+static inline void rcu_read_lock(void)
 {
-	*urcu_parity = get_urcu_qparity();
-	urcu_active_readers[*urcu_parity]++;
+	long tmp;
+
+	tmp = urcu_active_readers;
+	if ((tmp & RCU_GP_CTR_NEST_MASK) == 0)
+		urcu_active_readers = urcu_gp_ctr + 1;
+	else
+		urcu_active_readers = tmp + 1;
 	/*
 	 * Increment active readers count before accessing the pointer.
 	 * See force_mb_all_threads().
@@ -90,14 +106,14 @@ static inline void rcu_read_lock(int *urcu_parity)
 	barrier();
 }
 
-static inline void rcu_read_unlock(int *urcu_parity)
+static inline void rcu_read_unlock(void)
 {
 	barrier();
 	/*
 	 * Finish using rcu before decrementing the pointer.
 	 * See force_mb_all_threads().
 	 */
-	urcu_active_readers[*urcu_parity]--;
+	urcu_active_readers--;
 }
 
 extern void *urcu_publish_content(void **ptr, void *new);



-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09  0:54                   ` Mathieu Desnoyers
@ 2009-02-09  1:08                     ` Mathieu Desnoyers
  2009-02-09  3:47                       ` Paul E. McKenney
  2009-02-09  3:42                     ` Paul E. McKenney
  1 sibling, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-09  1:08 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Mathieu Desnoyers (compudj@krystal.dyndns.org) wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Sun, Feb 08, 2009 at 02:36:06PM -0800, Paul E. McKenney wrote:
> > > On Sun, Feb 08, 2009 at 04:46:10PM -0500, Mathieu Desnoyers wrote:
> > 
> > [ . . . ]
> > 
> > > > I ran your modified version within my benchmarks :
> > > > 
> > > > with return value : 14.164 cycles per read
> > > > without return value : 16.4017 cycles per read
> > > > 
> > > > So we have a 14% performance decrease due to this. We also pollute the
> > > > branch prediction buffer and we add a cache access due to the added
> > > > variables in the TLS. Returning the value has the clear advantage of
> > > > letting the compiler keep it around in registers or on the stack, which
> > > > clearly costs less.
> > > > 
> > > > So I think the speed factor outweights the visual considerations. Maybe
> > > > we could switch to something like :
> > > > 
> > > > unsigned int qparity;
> > > > 
> > > > urcu_read_lock(&qparity);
> > > > ...
> > > > urcu_read_unlock(&qparity);
> > > > 
> > > > That would be a bit like local_irq_save() in the kernel, except that we
> > > > could do it in a static inline because we pass the address. I
> > > > personnally dislike the local_irq_save() way of hiding the fact that it
> > > > writes to the variable in a "clever" macro. I'd really prefer to leave
> > > > the " & ".
> > > > 
> > > > What is your opinion ?
> > > 
> > > My current opinion is that I can avoid the overflow problem and the
> > > need to recheck, which might get rid of the need for both arguments
> > > and return values while still maintaining good performance.  The trick
> > > is to use only the topmost bit for the grace-period counter, and all
> > > the rest of the bits for nesting.  That way, no matter what value of
> > > global counter one picks up, it will be waited for (since there are but
> > > two values that the global counter takes on).
> > > 
> > > But just now coding it, so will see if it actually works.
> > 
> > Seems to work, and seems to be pretty fast on my machine, anyway.
> > This one adapts itself to 32- and 64-bit machines, though almost
> > all of the code is common.  It does do a check, but avoids array
> > indexing, arguments, and return values.
> > 
> > How does it do on your hardware?
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> Wow...
> 
> Patch updated against HEAD.
> 
> Time per read : 7.53622 cycles
> 
> Half of what we had previously.. I'll have to look at the assembly. :)
> 
> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> ---
> 
>  test_urcu.c        |    6 +++---
>  test_urcu_timing.c |    6 +++---
>  urcu.c             |   23 ++++++++++-------------
>  urcu.h             |   42 +++++++++++++++++++++++++++++-------------
>  4 files changed, 45 insertions(+), 32 deletions(-)
> 
> diff --git a/test_urcu.c b/test_urcu.c
> index f6be45b..f115a4a 100644
> --- a/test_urcu.c
> +++ b/test_urcu.c
> @@ -72,7 +72,7 @@ void rcu_copy_mutex_unlock(void)
>  
>  void *thr_reader(void *arg)
>  {
> -	int qparity, i, j;
> +	int i, j;
>  	struct test_array *local_ptr;
>  
>  	printf("thread %s, thread id : %lx, tid %lu\n",
> @@ -83,14 +83,14 @@ void *thr_reader(void *arg)
>  
>  	for (i = 0; i < 100000; i++) {
>  		for (j = 0; j < 100000000; j++) {
> -			rcu_read_lock(&qparity);
> +			rcu_read_lock();
>  			local_ptr = rcu_dereference(test_rcu_pointer);
>  			if (local_ptr) {
>  				assert(local_ptr->a == 8);
>  				assert(local_ptr->b == 12);
>  				assert(local_ptr->c[55] == 2);
>  			}
> -			rcu_read_unlock(&qparity);
> +			rcu_read_unlock();
>  		}
>  	}
>  
> diff --git a/test_urcu_timing.c b/test_urcu_timing.c
> index 57fda4f..9903705 100644
> --- a/test_urcu_timing.c
> +++ b/test_urcu_timing.c
> @@ -94,7 +94,7 @@ static cycles_t reader_time[NR_READ] __attribute__((aligned(128)));
>  
>  void *thr_reader(void *arg)
>  {
> -	int qparity, i, j;
> +	int i, j;
>  	struct test_array *local_ptr;
>  	cycles_t time1, time2;
>  
> @@ -107,12 +107,12 @@ void *thr_reader(void *arg)
>  	time1 = get_cycles();
>  	for (i = 0; i < OUTER_READ_LOOP; i++) {
>  		for (j = 0; j < INNER_READ_LOOP; j++) {
> -			rcu_read_lock(&qparity);
> +			rcu_read_lock();
>  			local_ptr = rcu_dereference(test_rcu_pointer);
>  			if (local_ptr) {
>  				assert(local_ptr->a == 8);
>  			}
> -			rcu_read_unlock(&qparity);
> +			rcu_read_unlock();
>  		}
>  	}
>  	time2 = get_cycles();
> diff --git a/urcu.c b/urcu.c
> index 08fb75d..2914b66 100644
> --- a/urcu.c
> +++ b/urcu.c
> @@ -19,17 +19,17 @@
>  
>  pthread_mutex_t urcu_mutex = PTHREAD_MUTEX_INITIALIZER;
>  
> -/* Global quiescent period parity */
> -int urcu_qparity;
> +/* Global grace period counter */
> +long urcu_gp_ctr;
>  
> -int __thread urcu_active_readers[2];
> +long __thread urcu_active_readers;
>  
>  /* Thread IDs of registered readers */
>  #define INIT_NUM_THREADS 4
>  
>  struct reader_data {
>  	pthread_t tid;
> -	int *urcu_active_readers;
> +	long *urcu_active_readers;
>  };
>  
>  static struct reader_data *reader_data;
> @@ -60,11 +60,9 @@ void internal_urcu_unlock(void)
>  /*
>   * called with urcu_mutex held.
>   */
> -static int switch_next_urcu_qparity(void)
> +static void switch_next_urcu_qparity(void)
>  {
> -	int old_parity = urcu_qparity;
> -	urcu_qparity = 1 - old_parity;
> -	return old_parity;
> +	urcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT;
>  }
>  
>  static void force_mb_all_threads(void)
> @@ -89,7 +87,7 @@ static void force_mb_all_threads(void)
>  	mb();	/* read sig_done before ending the barrier */
>  }
>  
> -void wait_for_quiescent_state(int parity)
> +void wait_for_quiescent_state(void)
>  {
>  	struct reader_data *index;
>  
> @@ -101,7 +99,7 @@ void wait_for_quiescent_state(int parity)
>  		/*
>  		 * BUSY-LOOP.
>  		 */
> -		while (index->urcu_active_readers[parity] != 0)
> +		while (rcu_old_gp_ongoing(index->urcu_active_readers))
>  			barrier();
>  	}
>  	/*
> @@ -115,17 +113,16 @@ void wait_for_quiescent_state(int parity)
>  
>  static void switch_qparity(void)
>  {
> -	int prev_parity;
>  
>  	/* All threads should read qparity before accessing data structure. */
>  	/* Write ptr before changing the qparity */
>  	force_mb_all_threads();
> -	prev_parity = switch_next_urcu_qparity();
> +	switch_next_urcu_qparity();
>  
>  	/*
>  	 * Wait for previous parity to be empty of readers.
>  	 */
> -	wait_for_quiescent_state(prev_parity);
> +	wait_for_quiescent_state();
>  }
>  
>  void synchronize_rcu(void)
> diff --git a/urcu.h b/urcu.h
> index b6b5c7b..e83c69f 100644
> --- a/urcu.h
> +++ b/urcu.h
> @@ -66,23 +66,39 @@ static inline void atomic_inc(int *v)
>  
>  #define SIGURCU SIGUSR1
>  
> -/* Global quiescent period parity */
> -extern int urcu_qparity;
> +#define RCU_GP_CTR_BOTTOM_BIT (sizeof(long) == 4 ? 0x80000000 : 0x100L)

Shouldn't it be the opposite ?

e.g.

#define RCU_GP_CTR_BOTTOM_BIT (sizeof(long) == 4 ? 0x100L : 0x80000000L)

> +#define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BOTTOM_BIT - 1)
>  
> -extern int __thread urcu_active_readers[2];
> +/* Global quiescent period counter with low-order bits unused. */
> +extern long urcu_gp_ctr;
>  
> -static inline int get_urcu_qparity(void)
> +extern long __thread urcu_active_readers;
> +
> +static inline int rcu_old_gp_ongoing(long *value)
>  {
> -	return urcu_qparity;
> +	long v;
> +
> +	if (value == NULL)
> +		return 0;
> +	v = ACCESS_ONCE(*value);
> +	if (sizeof(long) == 4) {
> +		return (v & RCU_GP_CTR_NEST_MASK) &&
> +		       ((v ^ ACCESS_ONCE(urcu_gp_ctr)) & ~RCU_GP_CTR_NEST_MASK);

There must be something about the ^ I am missing ? Compared to it, the
64-bits test is a - , with < 0...

Mathieu

> +	} else {
> +		return (v & RCU_GP_CTR_NEST_MASK) &&
> +		       (v - ACCESS_ONCE(urcu_gp_ctr) < 0);
> +	}
>  }
>  
> -/*
> - * urcu_parity should be declared on the caller's stack.
> - */
> -static inline void rcu_read_lock(int *urcu_parity)
> +static inline void rcu_read_lock(void)
>  {
> -	*urcu_parity = get_urcu_qparity();
> -	urcu_active_readers[*urcu_parity]++;
> +	long tmp;
> +
> +	tmp = urcu_active_readers;
> +	if ((tmp & RCU_GP_CTR_NEST_MASK) == 0)
> +		urcu_active_readers = urcu_gp_ctr + 1;
> +	else
> +		urcu_active_readers = tmp + 1;
>  	/*
>  	 * Increment active readers count before accessing the pointer.
>  	 * See force_mb_all_threads().
> @@ -90,14 +106,14 @@ static inline void rcu_read_lock(int *urcu_parity)
>  	barrier();
>  }
>  
> -static inline void rcu_read_unlock(int *urcu_parity)
> +static inline void rcu_read_unlock(void)
>  {
>  	barrier();
>  	/*
>  	 * Finish using rcu before decrementing the pointer.
>  	 * See force_mb_all_threads().
>  	 */
> -	urcu_active_readers[*urcu_parity]--;
> +	urcu_active_readers--;
>  }
>  
>  extern void *urcu_publish_content(void **ptr, void *new);
> 
> 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09  0:54                   ` Mathieu Desnoyers
  2009-02-09  1:08                     ` [ltt-dev] " Mathieu Desnoyers
@ 2009-02-09  3:42                     ` Paul E. McKenney
  1 sibling, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-09  3:42 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel, Robert Wisniewski

On Sun, Feb 08, 2009 at 07:54:50PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Sun, Feb 08, 2009 at 02:36:06PM -0800, Paul E. McKenney wrote:
> > > On Sun, Feb 08, 2009 at 04:46:10PM -0500, Mathieu Desnoyers wrote:
> > 
> > [ . . . ]
> > 
> > > > I ran your modified version within my benchmarks :
> > > > 
> > > > with return value : 14.164 cycles per read
> > > > without return value : 16.4017 cycles per read
> > > > 
> > > > So we have a 14% performance decrease due to this. We also pollute the
> > > > branch prediction buffer and we add a cache access due to the added
> > > > variables in the TLS. Returning the value has the clear advantage of
> > > > letting the compiler keep it around in registers or on the stack, which
> > > > clearly costs less.
> > > > 
> > > > So I think the speed factor outweights the visual considerations. Maybe
> > > > we could switch to something like :
> > > > 
> > > > unsigned int qparity;
> > > > 
> > > > urcu_read_lock(&qparity);
> > > > ...
> > > > urcu_read_unlock(&qparity);
> > > > 
> > > > That would be a bit like local_irq_save() in the kernel, except that we
> > > > could do it in a static inline because we pass the address. I
> > > > personnally dislike the local_irq_save() way of hiding the fact that it
> > > > writes to the variable in a "clever" macro. I'd really prefer to leave
> > > > the " & ".
> > > > 
> > > > What is your opinion ?
> > > 
> > > My current opinion is that I can avoid the overflow problem and the
> > > need to recheck, which might get rid of the need for both arguments
> > > and return values while still maintaining good performance.  The trick
> > > is to use only the topmost bit for the grace-period counter, and all
> > > the rest of the bits for nesting.  That way, no matter what value of
> > > global counter one picks up, it will be waited for (since there are but
> > > two values that the global counter takes on).
> > > 
> > > But just now coding it, so will see if it actually works.
> > 
> > Seems to work, and seems to be pretty fast on my machine, anyway.
> > This one adapts itself to 32- and 64-bit machines, though almost
> > all of the code is common.  It does do a check, but avoids array
> > indexing, arguments, and return values.
> > 
> > How does it do on your hardware?
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> Wow...
> 
> Patch updated against HEAD.
> 
> Time per read : 7.53622 cycles
> 
> Half of what we had previously.. I'll have to look at the assembly. :)

My guess is that CMOV is your friend here...  ;-)

							Thanx, Paul

> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> ---
> 
>  test_urcu.c        |    6 +++---
>  test_urcu_timing.c |    6 +++---
>  urcu.c             |   23 ++++++++++-------------
>  urcu.h             |   42 +++++++++++++++++++++++++++++-------------
>  4 files changed, 45 insertions(+), 32 deletions(-)
> 
> diff --git a/test_urcu.c b/test_urcu.c
> index f6be45b..f115a4a 100644
> --- a/test_urcu.c
> +++ b/test_urcu.c
> @@ -72,7 +72,7 @@ void rcu_copy_mutex_unlock(void)
> 
>  void *thr_reader(void *arg)
>  {
> -	int qparity, i, j;
> +	int i, j;
>  	struct test_array *local_ptr;
> 
>  	printf("thread %s, thread id : %lx, tid %lu\n",
> @@ -83,14 +83,14 @@ void *thr_reader(void *arg)
> 
>  	for (i = 0; i < 100000; i++) {
>  		for (j = 0; j < 100000000; j++) {
> -			rcu_read_lock(&qparity);
> +			rcu_read_lock();
>  			local_ptr = rcu_dereference(test_rcu_pointer);
>  			if (local_ptr) {
>  				assert(local_ptr->a == 8);
>  				assert(local_ptr->b == 12);
>  				assert(local_ptr->c[55] == 2);
>  			}
> -			rcu_read_unlock(&qparity);
> +			rcu_read_unlock();
>  		}
>  	}
> 
> diff --git a/test_urcu_timing.c b/test_urcu_timing.c
> index 57fda4f..9903705 100644
> --- a/test_urcu_timing.c
> +++ b/test_urcu_timing.c
> @@ -94,7 +94,7 @@ static cycles_t reader_time[NR_READ] __attribute__((aligned(128)));
> 
>  void *thr_reader(void *arg)
>  {
> -	int qparity, i, j;
> +	int i, j;
>  	struct test_array *local_ptr;
>  	cycles_t time1, time2;
> 
> @@ -107,12 +107,12 @@ void *thr_reader(void *arg)
>  	time1 = get_cycles();
>  	for (i = 0; i < OUTER_READ_LOOP; i++) {
>  		for (j = 0; j < INNER_READ_LOOP; j++) {
> -			rcu_read_lock(&qparity);
> +			rcu_read_lock();
>  			local_ptr = rcu_dereference(test_rcu_pointer);
>  			if (local_ptr) {
>  				assert(local_ptr->a == 8);
>  			}
> -			rcu_read_unlock(&qparity);
> +			rcu_read_unlock();
>  		}
>  	}
>  	time2 = get_cycles();
> diff --git a/urcu.c b/urcu.c
> index 08fb75d..2914b66 100644
> --- a/urcu.c
> +++ b/urcu.c
> @@ -19,17 +19,17 @@
> 
>  pthread_mutex_t urcu_mutex = PTHREAD_MUTEX_INITIALIZER;
> 
> -/* Global quiescent period parity */
> -int urcu_qparity;
> +/* Global grace period counter */
> +long urcu_gp_ctr;
> 
> -int __thread urcu_active_readers[2];
> +long __thread urcu_active_readers;
> 
>  /* Thread IDs of registered readers */
>  #define INIT_NUM_THREADS 4
> 
>  struct reader_data {
>  	pthread_t tid;
> -	int *urcu_active_readers;
> +	long *urcu_active_readers;
>  };
> 
>  static struct reader_data *reader_data;
> @@ -60,11 +60,9 @@ void internal_urcu_unlock(void)
>  /*
>   * called with urcu_mutex held.
>   */
> -static int switch_next_urcu_qparity(void)
> +static void switch_next_urcu_qparity(void)
>  {
> -	int old_parity = urcu_qparity;
> -	urcu_qparity = 1 - old_parity;
> -	return old_parity;
> +	urcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT;
>  }
> 
>  static void force_mb_all_threads(void)
> @@ -89,7 +87,7 @@ static void force_mb_all_threads(void)
>  	mb();	/* read sig_done before ending the barrier */
>  }
> 
> -void wait_for_quiescent_state(int parity)
> +void wait_for_quiescent_state(void)
>  {
>  	struct reader_data *index;
> 
> @@ -101,7 +99,7 @@ void wait_for_quiescent_state(int parity)
>  		/*
>  		 * BUSY-LOOP.
>  		 */
> -		while (index->urcu_active_readers[parity] != 0)
> +		while (rcu_old_gp_ongoing(index->urcu_active_readers))
>  			barrier();
>  	}
>  	/*
> @@ -115,17 +113,16 @@ void wait_for_quiescent_state(int parity)
> 
>  static void switch_qparity(void)
>  {
> -	int prev_parity;
> 
>  	/* All threads should read qparity before accessing data structure. */
>  	/* Write ptr before changing the qparity */
>  	force_mb_all_threads();
> -	prev_parity = switch_next_urcu_qparity();
> +	switch_next_urcu_qparity();
> 
>  	/*
>  	 * Wait for previous parity to be empty of readers.
>  	 */
> -	wait_for_quiescent_state(prev_parity);
> +	wait_for_quiescent_state();
>  }
> 
>  void synchronize_rcu(void)
> diff --git a/urcu.h b/urcu.h
> index b6b5c7b..e83c69f 100644
> --- a/urcu.h
> +++ b/urcu.h
> @@ -66,23 +66,39 @@ static inline void atomic_inc(int *v)
> 
>  #define SIGURCU SIGUSR1
> 
> -/* Global quiescent period parity */
> -extern int urcu_qparity;
> +#define RCU_GP_CTR_BOTTOM_BIT (sizeof(long) == 4 ? 0x80000000 : 0x100L)
> +#define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BOTTOM_BIT - 1)
> 
> -extern int __thread urcu_active_readers[2];
> +/* Global quiescent period counter with low-order bits unused. */
> +extern long urcu_gp_ctr;
> 
> -static inline int get_urcu_qparity(void)
> +extern long __thread urcu_active_readers;
> +
> +static inline int rcu_old_gp_ongoing(long *value)
>  {
> -	return urcu_qparity;
> +	long v;
> +
> +	if (value == NULL)
> +		return 0;
> +	v = ACCESS_ONCE(*value);
> +	if (sizeof(long) == 4) {
> +		return (v & RCU_GP_CTR_NEST_MASK) &&
> +		       ((v ^ ACCESS_ONCE(urcu_gp_ctr)) & ~RCU_GP_CTR_NEST_MASK);
> +	} else {
> +		return (v & RCU_GP_CTR_NEST_MASK) &&
> +		       (v - ACCESS_ONCE(urcu_gp_ctr) < 0);
> +	}
>  }
> 
> -/*
> - * urcu_parity should be declared on the caller's stack.
> - */
> -static inline void rcu_read_lock(int *urcu_parity)
> +static inline void rcu_read_lock(void)
>  {
> -	*urcu_parity = get_urcu_qparity();
> -	urcu_active_readers[*urcu_parity]++;
> +	long tmp;
> +
> +	tmp = urcu_active_readers;
> +	if ((tmp & RCU_GP_CTR_NEST_MASK) == 0)
> +		urcu_active_readers = urcu_gp_ctr + 1;
> +	else
> +		urcu_active_readers = tmp + 1;
>  	/*
>  	 * Increment active readers count before accessing the pointer.
>  	 * See force_mb_all_threads().
> @@ -90,14 +106,14 @@ static inline void rcu_read_lock(int *urcu_parity)
>  	barrier();
>  }
> 
> -static inline void rcu_read_unlock(int *urcu_parity)
> +static inline void rcu_read_unlock(void)
>  {
>  	barrier();
>  	/*
>  	 * Finish using rcu before decrementing the pointer.
>  	 * See force_mb_all_threads().
>  	 */
> -	urcu_active_readers[*urcu_parity]--;
> +	urcu_active_readers--;
>  }
> 
>  extern void *urcu_publish_content(void **ptr, void *new);
> 
> 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09  1:08                     ` [ltt-dev] " Mathieu Desnoyers
@ 2009-02-09  3:47                       ` Paul E. McKenney
  0 siblings, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-09  3:47 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

On Sun, Feb 08, 2009 at 08:08:25PM -0500, Mathieu Desnoyers wrote:
> * Mathieu Desnoyers (compudj@krystal.dyndns.org) wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Sun, Feb 08, 2009 at 02:36:06PM -0800, Paul E. McKenney wrote:
> > > > On Sun, Feb 08, 2009 at 04:46:10PM -0500, Mathieu Desnoyers wrote:
> > > 
> > > [ . . . ]
> > > 
> > > > > I ran your modified version within my benchmarks :
> > > > > 
> > > > > with return value : 14.164 cycles per read
> > > > > without return value : 16.4017 cycles per read
> > > > > 
> > > > > So we have a 14% performance decrease due to this. We also pollute the
> > > > > branch prediction buffer and we add a cache access due to the added
> > > > > variables in the TLS. Returning the value has the clear advantage of
> > > > > letting the compiler keep it around in registers or on the stack, which
> > > > > clearly costs less.
> > > > > 
> > > > > So I think the speed factor outweights the visual considerations. Maybe
> > > > > we could switch to something like :
> > > > > 
> > > > > unsigned int qparity;
> > > > > 
> > > > > urcu_read_lock(&qparity);
> > > > > ...
> > > > > urcu_read_unlock(&qparity);
> > > > > 
> > > > > That would be a bit like local_irq_save() in the kernel, except that we
> > > > > could do it in a static inline because we pass the address. I
> > > > > personnally dislike the local_irq_save() way of hiding the fact that it
> > > > > writes to the variable in a "clever" macro. I'd really prefer to leave
> > > > > the " & ".
> > > > > 
> > > > > What is your opinion ?
> > > > 
> > > > My current opinion is that I can avoid the overflow problem and the
> > > > need to recheck, which might get rid of the need for both arguments
> > > > and return values while still maintaining good performance.  The trick
> > > > is to use only the topmost bit for the grace-period counter, and all
> > > > the rest of the bits for nesting.  That way, no matter what value of
> > > > global counter one picks up, it will be waited for (since there are but
> > > > two values that the global counter takes on).
> > > > 
> > > > But just now coding it, so will see if it actually works.
> > > 
> > > Seems to work, and seems to be pretty fast on my machine, anyway.
> > > This one adapts itself to 32- and 64-bit machines, though almost
> > > all of the code is common.  It does do a check, but avoids array
> > > indexing, arguments, and return values.
> > > 
> > > How does it do on your hardware?
> > > 
> > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > 
> > Wow...
> > 
> > Patch updated against HEAD.
> > 
> > Time per read : 7.53622 cycles
> > 
> > Half of what we had previously.. I'll have to look at the assembly. :)
> > 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
> > ---
> > 
> >  test_urcu.c        |    6 +++---
> >  test_urcu_timing.c |    6 +++---
> >  urcu.c             |   23 ++++++++++-------------
> >  urcu.h             |   42 +++++++++++++++++++++++++++++-------------
> >  4 files changed, 45 insertions(+), 32 deletions(-)
> > 
> > diff --git a/test_urcu.c b/test_urcu.c
> > index f6be45b..f115a4a 100644
> > --- a/test_urcu.c
> > +++ b/test_urcu.c
> > @@ -72,7 +72,7 @@ void rcu_copy_mutex_unlock(void)
> >  
> >  void *thr_reader(void *arg)
> >  {
> > -	int qparity, i, j;
> > +	int i, j;
> >  	struct test_array *local_ptr;
> >  
> >  	printf("thread %s, thread id : %lx, tid %lu\n",
> > @@ -83,14 +83,14 @@ void *thr_reader(void *arg)
> >  
> >  	for (i = 0; i < 100000; i++) {
> >  		for (j = 0; j < 100000000; j++) {
> > -			rcu_read_lock(&qparity);
> > +			rcu_read_lock();
> >  			local_ptr = rcu_dereference(test_rcu_pointer);
> >  			if (local_ptr) {
> >  				assert(local_ptr->a == 8);
> >  				assert(local_ptr->b == 12);
> >  				assert(local_ptr->c[55] == 2);
> >  			}
> > -			rcu_read_unlock(&qparity);
> > +			rcu_read_unlock();
> >  		}
> >  	}
> >  
> > diff --git a/test_urcu_timing.c b/test_urcu_timing.c
> > index 57fda4f..9903705 100644
> > --- a/test_urcu_timing.c
> > +++ b/test_urcu_timing.c
> > @@ -94,7 +94,7 @@ static cycles_t reader_time[NR_READ] __attribute__((aligned(128)));
> >  
> >  void *thr_reader(void *arg)
> >  {
> > -	int qparity, i, j;
> > +	int i, j;
> >  	struct test_array *local_ptr;
> >  	cycles_t time1, time2;
> >  
> > @@ -107,12 +107,12 @@ void *thr_reader(void *arg)
> >  	time1 = get_cycles();
> >  	for (i = 0; i < OUTER_READ_LOOP; i++) {
> >  		for (j = 0; j < INNER_READ_LOOP; j++) {
> > -			rcu_read_lock(&qparity);
> > +			rcu_read_lock();
> >  			local_ptr = rcu_dereference(test_rcu_pointer);
> >  			if (local_ptr) {
> >  				assert(local_ptr->a == 8);
> >  			}
> > -			rcu_read_unlock(&qparity);
> > +			rcu_read_unlock();
> >  		}
> >  	}
> >  	time2 = get_cycles();
> > diff --git a/urcu.c b/urcu.c
> > index 08fb75d..2914b66 100644
> > --- a/urcu.c
> > +++ b/urcu.c
> > @@ -19,17 +19,17 @@
> >  
> >  pthread_mutex_t urcu_mutex = PTHREAD_MUTEX_INITIALIZER;
> >  
> > -/* Global quiescent period parity */
> > -int urcu_qparity;
> > +/* Global grace period counter */
> > +long urcu_gp_ctr;
> >  
> > -int __thread urcu_active_readers[2];
> > +long __thread urcu_active_readers;
> >  
> >  /* Thread IDs of registered readers */
> >  #define INIT_NUM_THREADS 4
> >  
> >  struct reader_data {
> >  	pthread_t tid;
> > -	int *urcu_active_readers;
> > +	long *urcu_active_readers;
> >  };
> >  
> >  static struct reader_data *reader_data;
> > @@ -60,11 +60,9 @@ void internal_urcu_unlock(void)
> >  /*
> >   * called with urcu_mutex held.
> >   */
> > -static int switch_next_urcu_qparity(void)
> > +static void switch_next_urcu_qparity(void)
> >  {
> > -	int old_parity = urcu_qparity;
> > -	urcu_qparity = 1 - old_parity;
> > -	return old_parity;
> > +	urcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT;
> >  }
> >  
> >  static void force_mb_all_threads(void)
> > @@ -89,7 +87,7 @@ static void force_mb_all_threads(void)
> >  	mb();	/* read sig_done before ending the barrier */
> >  }
> >  
> > -void wait_for_quiescent_state(int parity)
> > +void wait_for_quiescent_state(void)
> >  {
> >  	struct reader_data *index;
> >  
> > @@ -101,7 +99,7 @@ void wait_for_quiescent_state(int parity)
> >  		/*
> >  		 * BUSY-LOOP.
> >  		 */
> > -		while (index->urcu_active_readers[parity] != 0)
> > +		while (rcu_old_gp_ongoing(index->urcu_active_readers))
> >  			barrier();
> >  	}
> >  	/*
> > @@ -115,17 +113,16 @@ void wait_for_quiescent_state(int parity)
> >  
> >  static void switch_qparity(void)
> >  {
> > -	int prev_parity;
> >  
> >  	/* All threads should read qparity before accessing data structure. */
> >  	/* Write ptr before changing the qparity */
> >  	force_mb_all_threads();
> > -	prev_parity = switch_next_urcu_qparity();
> > +	switch_next_urcu_qparity();
> >  
> >  	/*
> >  	 * Wait for previous parity to be empty of readers.
> >  	 */
> > -	wait_for_quiescent_state(prev_parity);
> > +	wait_for_quiescent_state();
> >  }
> >  
> >  void synchronize_rcu(void)
> > diff --git a/urcu.h b/urcu.h
> > index b6b5c7b..e83c69f 100644
> > --- a/urcu.h
> > +++ b/urcu.h
> > @@ -66,23 +66,39 @@ static inline void atomic_inc(int *v)
> >  
> >  #define SIGURCU SIGUSR1
> >  
> > -/* Global quiescent period parity */
> > -extern int urcu_qparity;
> > +#define RCU_GP_CTR_BOTTOM_BIT (sizeof(long) == 4 ? 0x80000000 : 0x100L)
> 
> Shouldn't it be the opposite ?
> 
> e.g.
> 
> #define RCU_GP_CTR_BOTTOM_BIT (sizeof(long) == 4 ? 0x100L : 0x80000000L)

Absolutely not!!!  For 32-bit systems, the GP count is only the upper
bit.  That is exactly what allows the overflow check to be omitted.
For 64-bit systems, I rely on the upper 56 bits taking a couple of
millenia to overflow.

For 64-bit systems, one could also use only the upper bit
(0x8000000000000000), and that might actually make for better code.

> > +#define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BOTTOM_BIT - 1)
> >  
> > -extern int __thread urcu_active_readers[2];
> > +/* Global quiescent period counter with low-order bits unused. */
> > +extern long urcu_gp_ctr;
> >  
> > -static inline int get_urcu_qparity(void)
> > +extern long __thread urcu_active_readers;
> > +
> > +static inline int rcu_old_gp_ongoing(long *value)
> >  {
> > -	return urcu_qparity;
> > +	long v;
> > +
> > +	if (value == NULL)
> > +		return 0;
> > +	v = ACCESS_ONCE(*value);
> > +	if (sizeof(long) == 4) {
> > +		return (v & RCU_GP_CTR_NEST_MASK) &&
> > +		       ((v ^ ACCESS_ONCE(urcu_gp_ctr)) & ~RCU_GP_CTR_NEST_MASK);
> 
> There must be something about the ^ I am missing ? Compared to it, the
> 64-bits test is a - , with < 0...

Yep.  For 32 bits, if the top bit is the same as that of the current value
of the counter, we must wait.  I could have written:

	       (v & ~RCU_GP_CTR_NEST_MASK) !=
	       (ACCESS_ONCE(urcu_gp_ctr) & ~RCU_GP_CTR_NEST_MASK)

but doing so would require two "&" operations.  Though perhaps the
compiler would have figured it out...

							Thanx, Paul

> Mathieu
> 
> > +	} else {
> > +		return (v & RCU_GP_CTR_NEST_MASK) &&
> > +		       (v - ACCESS_ONCE(urcu_gp_ctr) < 0);
> > +	}
> >  }
> >  
> > -/*
> > - * urcu_parity should be declared on the caller's stack.
> > - */
> > -static inline void rcu_read_lock(int *urcu_parity)
> > +static inline void rcu_read_lock(void)
> >  {
> > -	*urcu_parity = get_urcu_qparity();
> > -	urcu_active_readers[*urcu_parity]++;
> > +	long tmp;
> > +
> > +	tmp = urcu_active_readers;
> > +	if ((tmp & RCU_GP_CTR_NEST_MASK) == 0)
> > +		urcu_active_readers = urcu_gp_ctr + 1;
> > +	else
> > +		urcu_active_readers = tmp + 1;
> >  	/*
> >  	 * Increment active readers count before accessing the pointer.
> >  	 * See force_mb_all_threads().
> > @@ -90,14 +106,14 @@ static inline void rcu_read_lock(int *urcu_parity)
> >  	barrier();
> >  }
> >  
> > -static inline void rcu_read_unlock(int *urcu_parity)
> > +static inline void rcu_read_unlock(void)
> >  {
> >  	barrier();
> >  	/*
> >  	 * Finish using rcu before decrementing the pointer.
> >  	 * See force_mb_all_threads().
> >  	 */
> > -	urcu_active_readers[*urcu_parity]--;
> > +	urcu_active_readers--;
> >  }
> >  
> >  extern void *urcu_publish_content(void **ptr, void *new);
> > 
> > 
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > 
> > _______________________________________________
> > ltt-dev mailing list
> > ltt-dev@lists.casi.polymtl.ca
> > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-08 22:44       ` Mathieu Desnoyers
@ 2009-02-09  4:11         ` Paul E. McKenney
  2009-02-09  4:53           ` Mathieu Desnoyers
  0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-09  4:11 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel, Robert Wisniewski

On Sun, Feb 08, 2009 at 05:44:19PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Fri, Feb 06, 2009 at 05:06:40AM -0800, Paul E. McKenney wrote:
> > > On Thu, Feb 05, 2009 at 11:58:41PM -0500, Mathieu Desnoyers wrote:
> > > > (sorry for repost, I got the ltt-dev email wrong in the previous one)
> > > > 
> > > > Hi Paul,
> > > > 
> > > > I figured out I needed some userspace RCU for the userspace tracing part
> > > > of LTTng (for quick read access to the control variables) to trace
> > > > userspace pthread applications. So I've done a quick-and-dirty userspace
> > > > RCU implementation.
> > > > 
> > > > It works so far, but I have not gone through any formal verification
> > > > phase. It seems to work on paper, and the tests are also OK (so far),
> > > > but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> > > > want to comment on it, it would be welcome. It's a userland-only
> > > > library. It's also currently x86-only, but only a few basic definitions
> > > > must be adapted in urcu.h to port it.
> > > > 
> > > > Here is the link to my git tree :
> > > > 
> > > > git://lttng.org/userspace-rcu.git
> > > > 
> > > > http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary
> > > 
> > > Very cool!!!  I will take a look!
> > > 
> > > I will also point you at a few that I have put together:
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > 
> > > (In the CodeSamples/defer directory.)
> > 
> > Interesting approach, using the signal to force memory-barrier execution!
> > 
> > o	One possible optimization would be to avoid sending a signal to
> > 	a blocked thread, as the context switch leading to blocking
> > 	will have implied a memory barrier -- otherwise it would not
> > 	be safe to resume the thread on some other CPU.  That said,
> > 	not sure whether checking to see whether a thread is blocked is
> > 	any faster than sending it a signal and forcing it to wake up.
> 
> I'm not sure it will be any faster, and it could be racy too. How would
> you envision querying the execution state of another thread ?

For my 64-bit implementation (or the old slow 32-bit version), the trick
would be to observe that the thread didn't do an RCU read-side critical
section during the past grace period.  This observation would be by
comparing counters.

For the new 32-bit implementation, the only way I know of is to grovel
through /proc, which would probably be slower than just sending the
signal.

> > 	Of course, this approach does require that the enclosing
> > 	application be willing to give up a signal.  I suspect that most
> > 	applications would be OK with this, though some might not.
> 
> If we want to make this transparent to the application, we'll have to
> investigate further in sigaction() and signal() library override I
> guess.

Certainly seems like it is worth a try!

> > 	Of course, I cannot resist pointing to an old LKML thread:
> > 
> > 		http://lkml.org/lkml/2001/10/8/189
> > 
> > 	But I think that the time is now right.  ;-)
> > 
> > o	I don't understand the purpose of rcu_write_lock() and
> > 	rcu_write_unlock().  I am concerned that it will lead people
> > 	to decide that a single global lock must protect RCU updates,
> > 	which is of course absolutely not the case.  I strongly
> > 	suggest making these internal to the urcu.c file.  Yes,
> > 	uses of urcu_publish_content() would then hit two locks (the
> > 	internal-to-urcu.c one and whatever they are using to protect
> > 	their data structure), but let's face it, if you are sending a
> > 	signal to each and every thread, the additional overhead of the
> > 	extra lock is the least of your worries.
> > 
> 
> Ok, just changed it.

Thank you!!!

> > 	If you really want to heavily optimize this, I would suggest
> > 	setting up a state machine that permits multiple concurrent
> > 	calls to urcu_publish_content() to share the same set of signal
> > 	invocations.  That way, if the caller has partitioned the
> > 	data structure, global locking might be avoided completely
> > 	(or at least greatly restricted in scope).
> > 
> 
> That brings an interesting question about urcu_publish_content :
> 
> void *urcu_publish_content(void **ptr, void *new)
> {
>         void *oldptr;
> 
>         internal_urcu_lock();
>         oldptr = *ptr;
>         *ptr = new;
> 
>         switch_qparity();
>         switch_qparity();
>         internal_urcu_unlock();
> 
>         return oldptr;
> }
> 
> Given that we take a global lock around the pointer assignment, we can
> safely assume, from the caller's perspective, that the update will
> happen as an "xchg" operation. So if the caller does not have to copy
> the old data, it can simply publish the new data without taking any
> lock itself.
> 
> So the question that arises if we want to remove global locking is :
> should we change this 
> 
>         oldptr = *ptr;
>         *ptr = new;
> 
> for an atomic xchg ?

Makes sense to me!

> > 	Of course, if updates are rare, the optimization would not
> > 	help, but in that case, acquiring two locks would be even less
> > 	of a problem.
> 
> I plan updates to be quite rare, but it's always good to foresee how
> that kind of infrastructure could be misused. :-)

;-)  ;-)  ;-)

> > o	Is urcu_qparity relying on initialization to zero?  Or on the
> > 	fact that, for all x, 1-x!=x mod 2^32?  Ah, given that this is
> > 	used to index urcu_active_readers[], you must be relying on
> > 	initialization to zero.
> 
> Yes, starts at 0.

Whew!  ;-)

> > o	In rcu_read_lock(), why is a non-atomic increment of the
> > 	urcu_active_readers[urcu_parity] element safe?  Are you
> > 	relying on the compiler generating an x86 add-to-memory
> > 	instruction?
> > 
> > 	Ditto for rcu_read_unlock().
> > 
> > 	Ah, never mind!!!  I now see the __thread specification,
> > 	and the keeping of references to it in the reader_data list.
> 
> Exactly :)

Getting old and blind, what can I say?

> > o	Combining the equivalent of rcu_assign_pointer() and
> > 	synchronize_rcu() into urcu_publish_content() is an interesting
> > 	approach.  Not yet sure whether or not it is a good idea.  I
> > 	guess trying it out on several applications would be the way
> > 	to find out.  ;-)
> > 
> > 	That said, I suspect that it would be very convenient in a
> > 	number of situations.
> 
> I thought so. It seemed to be a natural way to express it to me. Usage
> will tell.

;-)

> > o	It would be good to avoid having to pass the return value
> > 	of rcu_read_lock() into rcu_read_unlock().  It should be
> > 	possible to avoid this via counter value tricks, though this
> > 	would add a bit more code in rcu_read_lock() on 32-bit machines.
> > 	(64-bit machines don't have to worry about counter overflow.)
> > 
> > 	See the recently updated version of CodeSamples/defer/rcu_nest.[ch]
> > 	in the aforementioned git archive for a way to do this.
> > 	(And perhaps I should apply this change to SRCU...)
> 
> See my other mail about this.

And likewise!

> > o	Your test looks a bit strange, not sure why you test all the
> > 	different variables.  It would be nice to take a test duration
> > 	as an argument and run the test for that time.
> 
> I made a smaller version which only reads a single variable. I agree
> that the initial test was a bit strange on that aspect.
> 
> I'll do a version which takes a duration as parameter.

I strongly recommend taking a look at my CodeSamples/defer/rcutorture.h
file in my git archive:

	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git

This torture test detects the missing second flip 15 times during a
10-second test on a two-processor machine.

The first part of the rcutorture.h file is performance tests -- search
for the string "Stress test" to find the torture test.

> > 	I killed the test after better part of an hour on my laptop,
> > 	will retry on a larger machine (after noting the 18 threads
> > 	created!).  (And yes, I first tried Power, which objected
> > 	strenously to the "mfence" and "lock; incl" instructions,
> > 	so getting an x86 machine to try on.)
> 
> That should be easy enough to fix. A bit of primitive cut'n'paste would
> do.

Yep.  Actually, I was considering porting your code into my environment,
which already has the Power primitives.  Any objections?  (This would
have the side effect of making a version available via perfbook.git.
I would of course add comments referencing your git archive as the
official version.)

> > Again, looks interesting!  Looks plausible, although I have not 100%
> > convinced myself that it is perfectly bug-free.  But I do maintain
> > a healthy skepticism of purported RCU algorithms, especially ones that
> > I have written.  ;-)
> > 
> 
> That's always good. I also tend to always be very skeptical about what I
> write and review.
> 
> Thanks for the thorough review.

No problem -- it has been quite fun!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09  4:11         ` Paul E. McKenney
@ 2009-02-09  4:53           ` Mathieu Desnoyers
  2009-02-09  5:17             ` [ltt-dev] " Mathieu Desnoyers
  2009-02-09 13:16             ` Paul E. McKenney
  0 siblings, 2 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-09  4:53 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel, Robert Wisniewski

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sun, Feb 08, 2009 at 05:44:19PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Fri, Feb 06, 2009 at 05:06:40AM -0800, Paul E. McKenney wrote:
> > > > On Thu, Feb 05, 2009 at 11:58:41PM -0500, Mathieu Desnoyers wrote:
> > > > > (sorry for repost, I got the ltt-dev email wrong in the previous one)
> > > > > 
> > > > > Hi Paul,
> > > > > 
> > > > > I figured out I needed some userspace RCU for the userspace tracing part
> > > > > of LTTng (for quick read access to the control variables) to trace
> > > > > userspace pthread applications. So I've done a quick-and-dirty userspace
> > > > > RCU implementation.
> > > > > 
> > > > > It works so far, but I have not gone through any formal verification
> > > > > phase. It seems to work on paper, and the tests are also OK (so far),
> > > > > but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> > > > > want to comment on it, it would be welcome. It's a userland-only
> > > > > library. It's also currently x86-only, but only a few basic definitions
> > > > > must be adapted in urcu.h to port it.
> > > > > 
> > > > > Here is the link to my git tree :
> > > > > 
> > > > > git://lttng.org/userspace-rcu.git
> > > > > 
> > > > > http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary
> > > > 
> > > > Very cool!!!  I will take a look!
> > > > 
> > > > I will also point you at a few that I have put together:
> > > > 
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > > 
> > > > (In the CodeSamples/defer directory.)
> > > 
> > > Interesting approach, using the signal to force memory-barrier execution!
> > > 
> > > o	One possible optimization would be to avoid sending a signal to
> > > 	a blocked thread, as the context switch leading to blocking
> > > 	will have implied a memory barrier -- otherwise it would not
> > > 	be safe to resume the thread on some other CPU.  That said,
> > > 	not sure whether checking to see whether a thread is blocked is
> > > 	any faster than sending it a signal and forcing it to wake up.
> > 
> > I'm not sure it will be any faster, and it could be racy too. How would
> > you envision querying the execution state of another thread ?
> 
> For my 64-bit implementation (or the old slow 32-bit version), the trick
> would be to observe that the thread didn't do an RCU read-side critical
> section during the past grace period.  This observation would be by
> comparing counters.
> 
> For the new 32-bit implementation, the only way I know of is to grovel
> through /proc, which would probably be slower than just sending the
> signal.
> 

Yes, I guess the signal is not so bad.

> > > 	Of course, this approach does require that the enclosing
> > > 	application be willing to give up a signal.  I suspect that most
> > > 	applications would be OK with this, though some might not.
> > 
> > If we want to make this transparent to the application, we'll have to
> > investigate further in sigaction() and signal() library override I
> > guess.
> 
> Certainly seems like it is worth a try!
> 
> > > 	Of course, I cannot resist pointing to an old LKML thread:
> > > 
> > > 		http://lkml.org/lkml/2001/10/8/189
> > > 
> > > 	But I think that the time is now right.  ;-)
> > > 
> > > o	I don't understand the purpose of rcu_write_lock() and
> > > 	rcu_write_unlock().  I am concerned that it will lead people
> > > 	to decide that a single global lock must protect RCU updates,
> > > 	which is of course absolutely not the case.  I strongly
> > > 	suggest making these internal to the urcu.c file.  Yes,
> > > 	uses of urcu_publish_content() would then hit two locks (the
> > > 	internal-to-urcu.c one and whatever they are using to protect
> > > 	their data structure), but let's face it, if you are sending a
> > > 	signal to each and every thread, the additional overhead of the
> > > 	extra lock is the least of your worries.
> > > 
> > 
> > Ok, just changed it.
> 
> Thank you!!!
> 
> > > 	If you really want to heavily optimize this, I would suggest
> > > 	setting up a state machine that permits multiple concurrent
> > > 	calls to urcu_publish_content() to share the same set of signal
> > > 	invocations.  That way, if the caller has partitioned the
> > > 	data structure, global locking might be avoided completely
> > > 	(or at least greatly restricted in scope).
> > > 
> > 
> > That brings an interesting question about urcu_publish_content :
> > 
> > void *urcu_publish_content(void **ptr, void *new)
> > {
> >         void *oldptr;
> > 
> >         internal_urcu_lock();
> >         oldptr = *ptr;
> >         *ptr = new;
> > 
> >         switch_qparity();
> >         switch_qparity();
> >         internal_urcu_unlock();
> > 
> >         return oldptr;
> > }
> > 
> > Given that we take a global lock around the pointer assignment, we can
> > safely assume, from the caller's perspective, that the update will
> > happen as an "xchg" operation. So if the caller does not have to copy
> > the old data, it can simply publish the new data without taking any
> > lock itself.
> > 
> > So the question that arises if we want to remove global locking is :
> > should we change this 
> > 
> >         oldptr = *ptr;
> >         *ptr = new;
> > 
> > for an atomic xchg ?
> 
> Makes sense to me!
> 
> > > 	Of course, if updates are rare, the optimization would not
> > > 	help, but in that case, acquiring two locks would be even less
> > > 	of a problem.
> > 
> > I plan updates to be quite rare, but it's always good to foresee how
> > that kind of infrastructure could be misused. :-)
> 
> ;-)  ;-)  ;-)
> 
> > > o	Is urcu_qparity relying on initialization to zero?  Or on the
> > > 	fact that, for all x, 1-x!=x mod 2^32?  Ah, given that this is
> > > 	used to index urcu_active_readers[], you must be relying on
> > > 	initialization to zero.
> > 
> > Yes, starts at 0.
> 
> Whew!  ;-)
> 
> > > o	In rcu_read_lock(), why is a non-atomic increment of the
> > > 	urcu_active_readers[urcu_parity] element safe?  Are you
> > > 	relying on the compiler generating an x86 add-to-memory
> > > 	instruction?
> > > 
> > > 	Ditto for rcu_read_unlock().
> > > 
> > > 	Ah, never mind!!!  I now see the __thread specification,
> > > 	and the keeping of references to it in the reader_data list.
> > 
> > Exactly :)
> 
> Getting old and blind, what can I say?
> 
> > > o	Combining the equivalent of rcu_assign_pointer() and
> > > 	synchronize_rcu() into urcu_publish_content() is an interesting
> > > 	approach.  Not yet sure whether or not it is a good idea.  I
> > > 	guess trying it out on several applications would be the way
> > > 	to find out.  ;-)
> > > 
> > > 	That said, I suspect that it would be very convenient in a
> > > 	number of situations.
> > 
> > I thought so. It seemed to be a natural way to express it to me. Usage
> > will tell.
> 
> ;-)
> 
> > > o	It would be good to avoid having to pass the return value
> > > 	of rcu_read_lock() into rcu_read_unlock().  It should be
> > > 	possible to avoid this via counter value tricks, though this
> > > 	would add a bit more code in rcu_read_lock() on 32-bit machines.
> > > 	(64-bit machines don't have to worry about counter overflow.)
> > > 
> > > 	See the recently updated version of CodeSamples/defer/rcu_nest.[ch]
> > > 	in the aforementioned git archive for a way to do this.
> > > 	(And perhaps I should apply this change to SRCU...)
> > 
> > See my other mail about this.
> 
> And likewise!
> 
> > > o	Your test looks a bit strange, not sure why you test all the
> > > 	different variables.  It would be nice to take a test duration
> > > 	as an argument and run the test for that time.
> > 
> > I made a smaller version which only reads a single variable. I agree
> > that the initial test was a bit strange on that aspect.
> > 
> > I'll do a version which takes a duration as parameter.
> 
> I strongly recommend taking a look at my CodeSamples/defer/rcutorture.h
> file in my git archive:
> 
> 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> 
> This torture test detects the missing second flip 15 times during a
> 10-second test on a two-processor machine.
> 
> The first part of the rcutorture.h file is performance tests -- search
> for the string "Stress test" to find the torture test.
> 

I will.

> > > 	I killed the test after better part of an hour on my laptop,
> > > 	will retry on a larger machine (after noting the 18 threads
> > > 	created!).  (And yes, I first tried Power, which objected
> > > 	strenously to the "mfence" and "lock; incl" instructions,
> > > 	so getting an x86 machine to try on.)
> > 
> > That should be easy enough to fix. A bit of primitive cut'n'paste would
> > do.
> 
> Yep.  Actually, I was considering porting your code into my environment,
> which already has the Power primitives.  Any objections?  (This would
> have the side effect of making a version available via perfbook.git.
> I would of course add comments referencing your git archive as the
> official version.)
> 

Yes, no objection. I am currently looking at your last patch, cleaning
it up and making the 32 and 64-bit code the same. Also trying to save a
few instructions. I'll keep you posted when it's ready and committed.

Mathieu

> > > Again, looks interesting!  Looks plausible, although I have not 100%
> > > convinced myself that it is perfectly bug-free.  But I do maintain
> > > a healthy skepticism of purported RCU algorithms, especially ones that
> > > I have written.  ;-)
> > > 
> > 
> > That's always good. I also tend to always be very skeptical about what I
> > write and review.
> > 
> > Thanks for the thorough review.
> 
> No problem -- it has been quite fun!  ;-)
> 
> 							Thanx, Paul
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09  4:53           ` Mathieu Desnoyers
@ 2009-02-09  5:17             ` Mathieu Desnoyers
  2009-02-09  7:03               ` Mathieu Desnoyers
  2009-02-09 13:23               ` Paul E. McKenney
  2009-02-09 13:16             ` Paul E. McKenney
  1 sibling, 2 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-09  5:17 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Mathieu Desnoyers (compudj@krystal.dyndns.org) wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Sun, Feb 08, 2009 at 05:44:19PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Fri, Feb 06, 2009 at 05:06:40AM -0800, Paul E. McKenney wrote:
> > > > > On Thu, Feb 05, 2009 at 11:58:41PM -0500, Mathieu Desnoyers wrote:
> > > > > > (sorry for repost, I got the ltt-dev email wrong in the previous one)
> > > > > > 
> > > > > > Hi Paul,
> > > > > > 
> > > > > > I figured out I needed some userspace RCU for the userspace tracing part
> > > > > > of LTTng (for quick read access to the control variables) to trace
> > > > > > userspace pthread applications. So I've done a quick-and-dirty userspace
> > > > > > RCU implementation.
> > > > > > 
> > > > > > It works so far, but I have not gone through any formal verification
> > > > > > phase. It seems to work on paper, and the tests are also OK (so far),
> > > > > > but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> > > > > > want to comment on it, it would be welcome. It's a userland-only
> > > > > > library. It's also currently x86-only, but only a few basic definitions
> > > > > > must be adapted in urcu.h to port it.
> > > > > > 
> > > > > > Here is the link to my git tree :
> > > > > > 
> > > > > > git://lttng.org/userspace-rcu.git
> > > > > > 
> > > > > > http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary
> > > > > 
> > > > > Very cool!!!  I will take a look!
> > > > > 
> > > > > I will also point you at a few that I have put together:
> > > > > 
> > > > > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > > > 
> > > > > (In the CodeSamples/defer directory.)
> > > > 
> > > > Interesting approach, using the signal to force memory-barrier execution!
> > > > 
> > > > o	One possible optimization would be to avoid sending a signal to
> > > > 	a blocked thread, as the context switch leading to blocking
> > > > 	will have implied a memory barrier -- otherwise it would not
> > > > 	be safe to resume the thread on some other CPU.  That said,
> > > > 	not sure whether checking to see whether a thread is blocked is
> > > > 	any faster than sending it a signal and forcing it to wake up.
> > > 
> > > I'm not sure it will be any faster, and it could be racy too. How would
> > > you envision querying the execution state of another thread ?
> > 
> > For my 64-bit implementation (or the old slow 32-bit version), the trick
> > would be to observe that the thread didn't do an RCU read-side critical
> > section during the past grace period.  This observation would be by
> > comparing counters.
> > 
> > For the new 32-bit implementation, the only way I know of is to grovel
> > through /proc, which would probably be slower than just sending the
> > signal.
> > 
> 
> Yes, I guess the signal is not so bad.
> 
> > > > 	Of course, this approach does require that the enclosing
> > > > 	application be willing to give up a signal.  I suspect that most
> > > > 	applications would be OK with this, though some might not.
> > > 
> > > If we want to make this transparent to the application, we'll have to
> > > investigate further in sigaction() and signal() library override I
> > > guess.
> > 
> > Certainly seems like it is worth a try!
> > 
> > > > 	Of course, I cannot resist pointing to an old LKML thread:
> > > > 
> > > > 		http://lkml.org/lkml/2001/10/8/189
> > > > 
> > > > 	But I think that the time is now right.  ;-)
> > > > 
> > > > o	I don't understand the purpose of rcu_write_lock() and
> > > > 	rcu_write_unlock().  I am concerned that it will lead people
> > > > 	to decide that a single global lock must protect RCU updates,
> > > > 	which is of course absolutely not the case.  I strongly
> > > > 	suggest making these internal to the urcu.c file.  Yes,
> > > > 	uses of urcu_publish_content() would then hit two locks (the
> > > > 	internal-to-urcu.c one and whatever they are using to protect
> > > > 	their data structure), but let's face it, if you are sending a
> > > > 	signal to each and every thread, the additional overhead of the
> > > > 	extra lock is the least of your worries.
> > > > 
> > > 
> > > Ok, just changed it.
> > 
> > Thank you!!!
> > 
> > > > 	If you really want to heavily optimize this, I would suggest
> > > > 	setting up a state machine that permits multiple concurrent
> > > > 	calls to urcu_publish_content() to share the same set of signal
> > > > 	invocations.  That way, if the caller has partitioned the
> > > > 	data structure, global locking might be avoided completely
> > > > 	(or at least greatly restricted in scope).
> > > > 
> > > 
> > > That brings an interesting question about urcu_publish_content :
> > > 
> > > void *urcu_publish_content(void **ptr, void *new)
> > > {
> > >         void *oldptr;
> > > 
> > >         internal_urcu_lock();
> > >         oldptr = *ptr;
> > >         *ptr = new;
> > > 
> > >         switch_qparity();
> > >         switch_qparity();
> > >         internal_urcu_unlock();
> > > 
> > >         return oldptr;
> > > }
> > > 
> > > Given that we take a global lock around the pointer assignment, we can
> > > safely assume, from the caller's perspective, that the update will
> > > happen as an "xchg" operation. So if the caller does not have to copy
> > > the old data, it can simply publish the new data without taking any
> > > lock itself.
> > > 
> > > So the question that arises if we want to remove global locking is :
> > > should we change this 
> > > 
> > >         oldptr = *ptr;
> > >         *ptr = new;
> > > 
> > > for an atomic xchg ?
> > 
> > Makes sense to me!
> > 
> > > > 	Of course, if updates are rare, the optimization would not
> > > > 	help, but in that case, acquiring two locks would be even less
> > > > 	of a problem.
> > > 
> > > I plan updates to be quite rare, but it's always good to foresee how
> > > that kind of infrastructure could be misused. :-)
> > 
> > ;-)  ;-)  ;-)
> > 
> > > > o	Is urcu_qparity relying on initialization to zero?  Or on the
> > > > 	fact that, for all x, 1-x!=x mod 2^32?  Ah, given that this is
> > > > 	used to index urcu_active_readers[], you must be relying on
> > > > 	initialization to zero.
> > > 
> > > Yes, starts at 0.
> > 
> > Whew!  ;-)
> > 
> > > > o	In rcu_read_lock(), why is a non-atomic increment of the
> > > > 	urcu_active_readers[urcu_parity] element safe?  Are you
> > > > 	relying on the compiler generating an x86 add-to-memory
> > > > 	instruction?
> > > > 
> > > > 	Ditto for rcu_read_unlock().
> > > > 
> > > > 	Ah, never mind!!!  I now see the __thread specification,
> > > > 	and the keeping of references to it in the reader_data list.
> > > 
> > > Exactly :)
> > 
> > Getting old and blind, what can I say?
> > 
> > > > o	Combining the equivalent of rcu_assign_pointer() and
> > > > 	synchronize_rcu() into urcu_publish_content() is an interesting
> > > > 	approach.  Not yet sure whether or not it is a good idea.  I
> > > > 	guess trying it out on several applications would be the way
> > > > 	to find out.  ;-)
> > > > 
> > > > 	That said, I suspect that it would be very convenient in a
> > > > 	number of situations.
> > > 
> > > I thought so. It seemed to be a natural way to express it to me. Usage
> > > will tell.
> > 
> > ;-)
> > 
> > > > o	It would be good to avoid having to pass the return value
> > > > 	of rcu_read_lock() into rcu_read_unlock().  It should be
> > > > 	possible to avoid this via counter value tricks, though this
> > > > 	would add a bit more code in rcu_read_lock() on 32-bit machines.
> > > > 	(64-bit machines don't have to worry about counter overflow.)
> > > > 
> > > > 	See the recently updated version of CodeSamples/defer/rcu_nest.[ch]
> > > > 	in the aforementioned git archive for a way to do this.
> > > > 	(And perhaps I should apply this change to SRCU...)
> > > 
> > > See my other mail about this.
> > 
> > And likewise!
> > 
> > > > o	Your test looks a bit strange, not sure why you test all the
> > > > 	different variables.  It would be nice to take a test duration
> > > > 	as an argument and run the test for that time.
> > > 
> > > I made a smaller version which only reads a single variable. I agree
> > > that the initial test was a bit strange on that aspect.
> > > 
> > > I'll do a version which takes a duration as parameter.
> > 
> > I strongly recommend taking a look at my CodeSamples/defer/rcutorture.h
> > file in my git archive:
> > 
> > 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > 
> > This torture test detects the missing second flip 15 times during a
> > 10-second test on a two-processor machine.
> > 
> > The first part of the rcutorture.h file is performance tests -- search
> > for the string "Stress test" to find the torture test.
> > 
> 
> I will.
> 
> > > > 	I killed the test after better part of an hour on my laptop,
> > > > 	will retry on a larger machine (after noting the 18 threads
> > > > 	created!).  (And yes, I first tried Power, which objected
> > > > 	strenously to the "mfence" and "lock; incl" instructions,
> > > > 	so getting an x86 machine to try on.)
> > > 
> > > That should be easy enough to fix. A bit of primitive cut'n'paste would
> > > do.
> > 
> > Yep.  Actually, I was considering porting your code into my environment,
> > which already has the Power primitives.  Any objections?  (This would
> > have the side effect of making a version available via perfbook.git.
> > I would of course add comments referencing your git archive as the
> > official version.)
> > 
> 
> Yes, no objection. I am currently looking at your last patch, cleaning
> it up and making the 32 and 64-bit code the same. Also trying to save a
> few instructions. I'll keep you posted when it's ready and committed.
> 

The new version is pushed into the repository. I changed you patch a
bit. Flaming is welcome. :)

Mathieu

> Mathieu
> 
> > > > Again, looks interesting!  Looks plausible, although I have not 100%
> > > > convinced myself that it is perfectly bug-free.  But I do maintain
> > > > a healthy skepticism of purported RCU algorithms, especially ones that
> > > > I have written.  ;-)
> > > > 
> > > 
> > > That's always good. I also tend to always be very skeptical about what I
> > > write and review.
> > > 
> > > Thanks for the thorough review.
> > 
> > No problem -- it has been quite fun!  ;-)
> > 
> > 							Thanx, Paul
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09  5:17             ` [ltt-dev] " Mathieu Desnoyers
@ 2009-02-09  7:03               ` Mathieu Desnoyers
  2009-02-09 15:33                 ` Paul E. McKenney
  2009-02-09 13:23               ` Paul E. McKenney
  1 sibling, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-09  7:03 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Mathieu Desnoyers (compudj@krystal.dyndns.org) wrote:
> * Mathieu Desnoyers (compudj@krystal.dyndns.org) wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Sun, Feb 08, 2009 at 05:44:19PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Fri, Feb 06, 2009 at 05:06:40AM -0800, Paul E. McKenney wrote:
> > > > > > On Thu, Feb 05, 2009 at 11:58:41PM -0500, Mathieu Desnoyers wrote:
> > > > > > > (sorry for repost, I got the ltt-dev email wrong in the previous one)
> > > > > > > 
> > > > > > > Hi Paul,
> > > > > > > 
> > > > > > > I figured out I needed some userspace RCU for the userspace tracing part
> > > > > > > of LTTng (for quick read access to the control variables) to trace
> > > > > > > userspace pthread applications. So I've done a quick-and-dirty userspace
> > > > > > > RCU implementation.
> > > > > > > 
> > > > > > > It works so far, but I have not gone through any formal verification
> > > > > > > phase. It seems to work on paper, and the tests are also OK (so far),
> > > > > > > but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> > > > > > > want to comment on it, it would be welcome. It's a userland-only
> > > > > > > library. It's also currently x86-only, but only a few basic definitions
> > > > > > > must be adapted in urcu.h to port it.
> > > > > > > 
> > > > > > > Here is the link to my git tree :
> > > > > > > 
> > > > > > > git://lttng.org/userspace-rcu.git
> > > > > > > 
> > > > > > > http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary
> > > > > > 
> > > > > > Very cool!!!  I will take a look!
> > > > > > 
> > > > > > I will also point you at a few that I have put together:
> > > > > > 
> > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > > > > 
> > > > > > (In the CodeSamples/defer directory.)
> > > > > 
> > > > > Interesting approach, using the signal to force memory-barrier execution!
> > > > > 
> > > > > o	One possible optimization would be to avoid sending a signal to
> > > > > 	a blocked thread, as the context switch leading to blocking
> > > > > 	will have implied a memory barrier -- otherwise it would not
> > > > > 	be safe to resume the thread on some other CPU.  That said,
> > > > > 	not sure whether checking to see whether a thread is blocked is
> > > > > 	any faster than sending it a signal and forcing it to wake up.
> > > > 
> > > > I'm not sure it will be any faster, and it could be racy too. How would
> > > > you envision querying the execution state of another thread ?
> > > 
> > > For my 64-bit implementation (or the old slow 32-bit version), the trick
> > > would be to observe that the thread didn't do an RCU read-side critical
> > > section during the past grace period.  This observation would be by
> > > comparing counters.
> > > 
> > > For the new 32-bit implementation, the only way I know of is to grovel
> > > through /proc, which would probably be slower than just sending the
> > > signal.
> > > 
> > 
> > Yes, I guess the signal is not so bad.
> > 
> > > > > 	Of course, this approach does require that the enclosing
> > > > > 	application be willing to give up a signal.  I suspect that most
> > > > > 	applications would be OK with this, though some might not.
> > > > 
> > > > If we want to make this transparent to the application, we'll have to
> > > > investigate further in sigaction() and signal() library override I
> > > > guess.
> > > 
> > > Certainly seems like it is worth a try!
> > > 
> > > > > 	Of course, I cannot resist pointing to an old LKML thread:
> > > > > 
> > > > > 		http://lkml.org/lkml/2001/10/8/189
> > > > > 
> > > > > 	But I think that the time is now right.  ;-)
> > > > > 
> > > > > o	I don't understand the purpose of rcu_write_lock() and
> > > > > 	rcu_write_unlock().  I am concerned that it will lead people
> > > > > 	to decide that a single global lock must protect RCU updates,
> > > > > 	which is of course absolutely not the case.  I strongly
> > > > > 	suggest making these internal to the urcu.c file.  Yes,
> > > > > 	uses of urcu_publish_content() would then hit two locks (the
> > > > > 	internal-to-urcu.c one and whatever they are using to protect
> > > > > 	their data structure), but let's face it, if you are sending a
> > > > > 	signal to each and every thread, the additional overhead of the
> > > > > 	extra lock is the least of your worries.
> > > > > 
> > > > 
> > > > Ok, just changed it.
> > > 
> > > Thank you!!!
> > > 
> > > > > 	If you really want to heavily optimize this, I would suggest
> > > > > 	setting up a state machine that permits multiple concurrent
> > > > > 	calls to urcu_publish_content() to share the same set of signal
> > > > > 	invocations.  That way, if the caller has partitioned the
> > > > > 	data structure, global locking might be avoided completely
> > > > > 	(or at least greatly restricted in scope).
> > > > > 
> > > > 
> > > > That brings an interesting question about urcu_publish_content :
> > > > 
> > > > void *urcu_publish_content(void **ptr, void *new)
> > > > {
> > > >         void *oldptr;
> > > > 
> > > >         internal_urcu_lock();
> > > >         oldptr = *ptr;
> > > >         *ptr = new;
> > > > 
> > > >         switch_qparity();
> > > >         switch_qparity();
> > > >         internal_urcu_unlock();
> > > > 
> > > >         return oldptr;
> > > > }
> > > > 
> > > > Given that we take a global lock around the pointer assignment, we can
> > > > safely assume, from the caller's perspective, that the update will
> > > > happen as an "xchg" operation. So if the caller does not have to copy
> > > > the old data, it can simply publish the new data without taking any
> > > > lock itself.
> > > > 
> > > > So the question that arises if we want to remove global locking is :
> > > > should we change this 
> > > > 
> > > >         oldptr = *ptr;
> > > >         *ptr = new;
> > > > 
> > > > for an atomic xchg ?
> > > 
> > > Makes sense to me!
> > > 
> > > > > 	Of course, if updates are rare, the optimization would not
> > > > > 	help, but in that case, acquiring two locks would be even less
> > > > > 	of a problem.
> > > > 
> > > > I plan updates to be quite rare, but it's always good to foresee how
> > > > that kind of infrastructure could be misused. :-)
> > > 
> > > ;-)  ;-)  ;-)
> > > 
> > > > > o	Is urcu_qparity relying on initialization to zero?  Or on the
> > > > > 	fact that, for all x, 1-x!=x mod 2^32?  Ah, given that this is
> > > > > 	used to index urcu_active_readers[], you must be relying on
> > > > > 	initialization to zero.
> > > > 
> > > > Yes, starts at 0.
> > > 
> > > Whew!  ;-)
> > > 
> > > > > o	In rcu_read_lock(), why is a non-atomic increment of the
> > > > > 	urcu_active_readers[urcu_parity] element safe?  Are you
> > > > > 	relying on the compiler generating an x86 add-to-memory
> > > > > 	instruction?
> > > > > 
> > > > > 	Ditto for rcu_read_unlock().
> > > > > 
> > > > > 	Ah, never mind!!!  I now see the __thread specification,
> > > > > 	and the keeping of references to it in the reader_data list.
> > > > 
> > > > Exactly :)
> > > 
> > > Getting old and blind, what can I say?
> > > 
> > > > > o	Combining the equivalent of rcu_assign_pointer() and
> > > > > 	synchronize_rcu() into urcu_publish_content() is an interesting
> > > > > 	approach.  Not yet sure whether or not it is a good idea.  I
> > > > > 	guess trying it out on several applications would be the way
> > > > > 	to find out.  ;-)
> > > > > 
> > > > > 	That said, I suspect that it would be very convenient in a
> > > > > 	number of situations.
> > > > 
> > > > I thought so. It seemed to be a natural way to express it to me. Usage
> > > > will tell.
> > > 
> > > ;-)
> > > 
> > > > > o	It would be good to avoid having to pass the return value
> > > > > 	of rcu_read_lock() into rcu_read_unlock().  It should be
> > > > > 	possible to avoid this via counter value tricks, though this
> > > > > 	would add a bit more code in rcu_read_lock() on 32-bit machines.
> > > > > 	(64-bit machines don't have to worry about counter overflow.)
> > > > > 
> > > > > 	See the recently updated version of CodeSamples/defer/rcu_nest.[ch]
> > > > > 	in the aforementioned git archive for a way to do this.
> > > > > 	(And perhaps I should apply this change to SRCU...)
> > > > 
> > > > See my other mail about this.
> > > 
> > > And likewise!
> > > 
> > > > > o	Your test looks a bit strange, not sure why you test all the
> > > > > 	different variables.  It would be nice to take a test duration
> > > > > 	as an argument and run the test for that time.
> > > > 
> > > > I made a smaller version which only reads a single variable. I agree
> > > > that the initial test was a bit strange on that aspect.
> > > > 
> > > > I'll do a version which takes a duration as parameter.
> > > 
> > > I strongly recommend taking a look at my CodeSamples/defer/rcutorture.h
> > > file in my git archive:
> > > 
> > > 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > 
> > > This torture test detects the missing second flip 15 times during a
> > > 10-second test on a two-processor machine.
> > > 
> > > The first part of the rcutorture.h file is performance tests -- search
> > > for the string "Stress test" to find the torture test.
> > > 
> > 
> > I will.
> > 
> > > > > 	I killed the test after better part of an hour on my laptop,
> > > > > 	will retry on a larger machine (after noting the 18 threads
> > > > > 	created!).  (And yes, I first tried Power, which objected
> > > > > 	strenously to the "mfence" and "lock; incl" instructions,
> > > > > 	so getting an x86 machine to try on.)
> > > > 
> > > > That should be easy enough to fix. A bit of primitive cut'n'paste would
> > > > do.
> > > 
> > > Yep.  Actually, I was considering porting your code into my environment,
> > > which already has the Power primitives.  Any objections?  (This would
> > > have the side effect of making a version available via perfbook.git.
> > > I would of course add comments referencing your git archive as the
> > > official version.)
> > > 
> > 
> > Yes, no objection. I am currently looking at your last patch, cleaning
> > it up and making the 32 and 64-bit code the same. Also trying to save a
> > few instructions. I'll keep you posted when it's ready and committed.
> > 
> 
> The new version is pushed into the repository. I changed you patch a
> bit. Flaming is welcome. :)
> 

I just added modified rcutorture.h and api.h from your git tree
specifically for an urcutorture program to the repository. Some results :

8-way x86_64
E5405 @2 GHZ

./urcutorture 8 perf
n_reads: 1937650000  n_updates: 3  nreaders: 8  nupdaters: 1 duration: 1
ns/read: 4.12871  ns/update: 3.33333e+08

./urcutorture 8 uperf
n_reads: 0  n_updates: 4413892  nreaders: 0  nupdaters: 8 duration: 1
ns/read: nan  ns/update: 1812.46

n_reads: 98844204  n_updates: 10  n_mberror: 0
rcu_stress_count: 98844171 33 0 0 0 0 0 0 0 0 0

However, I've tried removing the second switch_qparity() call, and the
rcutorture test did not detect anything wrong. I also did a variation
which calls the "sched_yield" version of the urcu, "urcutorture-yield".

Mathieu

> Mathieu
> 
> > Mathieu
> > 
> > > > > Again, looks interesting!  Looks plausible, although I have not 100%
> > > > > convinced myself that it is perfectly bug-free.  But I do maintain
> > > > > a healthy skepticism of purported RCU algorithms, especially ones that
> > > > > I have written.  ;-)
> > > > > 
> > > > 
> > > > That's always good. I also tend to always be very skeptical about what I
> > > > write and review.
> > > > 
> > > > Thanks for the thorough review.
> > > 
> > > No problem -- it has been quite fun!  ;-)
> > > 
> > > 							Thanx, Paul
> > > 
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > 
> > _______________________________________________
> > ltt-dev mailing list
> > ltt-dev@lists.casi.polymtl.ca
> > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09  4:53           ` Mathieu Desnoyers
  2009-02-09  5:17             ` [ltt-dev] " Mathieu Desnoyers
@ 2009-02-09 13:16             ` Paul E. McKenney
  2009-02-09 17:19               ` Bert Wesarg
  1 sibling, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-09 13:16 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel, Robert Wisniewski

On Sun, Feb 08, 2009 at 11:53:52PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Sun, Feb 08, 2009 at 05:44:19PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Fri, Feb 06, 2009 at 05:06:40AM -0800, Paul E. McKenney wrote:
> > > > > On Thu, Feb 05, 2009 at 11:58:41PM -0500, Mathieu Desnoyers wrote:
> > > > > > (sorry for repost, I got the ltt-dev email wrong in the previous one)
> > > > > > 
> > > > > > Hi Paul,
> > > > > > 
> > > > > > I figured out I needed some userspace RCU for the userspace tracing part
> > > > > > of LTTng (for quick read access to the control variables) to trace
> > > > > > userspace pthread applications. So I've done a quick-and-dirty userspace
> > > > > > RCU implementation.
> > > > > > 
> > > > > > It works so far, but I have not gone through any formal verification
> > > > > > phase. It seems to work on paper, and the tests are also OK (so far),
> > > > > > but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> > > > > > want to comment on it, it would be welcome. It's a userland-only
> > > > > > library. It's also currently x86-only, but only a few basic definitions
> > > > > > must be adapted in urcu.h to port it.
> > > > > > 
> > > > > > Here is the link to my git tree :
> > > > > > 
> > > > > > git://lttng.org/userspace-rcu.git
> > > > > > 
> > > > > > http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary
> > > > > 
> > > > > Very cool!!!  I will take a look!
> > > > > 
> > > > > I will also point you at a few that I have put together:
> > > > > 
> > > > > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > > > 
> > > > > (In the CodeSamples/defer directory.)
> > > > 
> > > > Interesting approach, using the signal to force memory-barrier execution!
> > > > 
> > > > o	One possible optimization would be to avoid sending a signal to
> > > > 	a blocked thread, as the context switch leading to blocking
> > > > 	will have implied a memory barrier -- otherwise it would not
> > > > 	be safe to resume the thread on some other CPU.  That said,
> > > > 	not sure whether checking to see whether a thread is blocked is
> > > > 	any faster than sending it a signal and forcing it to wake up.
> > > 
> > > I'm not sure it will be any faster, and it could be racy too. How would
> > > you envision querying the execution state of another thread ?
> > 
> > For my 64-bit implementation (or the old slow 32-bit version), the trick
> > would be to observe that the thread didn't do an RCU read-side critical
> > section during the past grace period.  This observation would be by
> > comparing counters.
> > 
> > For the new 32-bit implementation, the only way I know of is to grovel
> > through /proc, which would probably be slower than just sending the
> > signal.
> 
> Yes, I guess the signal is not so bad.

Now if there were a /proc entry that listed out the tids of the
currently running threads, then it might be possible to do something,
especially for applications with many more threads than CPUs.

> > > > 	Of course, this approach does require that the enclosing
> > > > 	application be willing to give up a signal.  I suspect that most
> > > > 	applications would be OK with this, though some might not.
> > > 
> > > If we want to make this transparent to the application, we'll have to
> > > investigate further in sigaction() and signal() library override I
> > > guess.
> > 
> > Certainly seems like it is worth a try!
> > 
> > > > 	Of course, I cannot resist pointing to an old LKML thread:
> > > > 
> > > > 		http://lkml.org/lkml/2001/10/8/189
> > > > 
> > > > 	But I think that the time is now right.  ;-)
> > > > 
> > > > o	I don't understand the purpose of rcu_write_lock() and
> > > > 	rcu_write_unlock().  I am concerned that it will lead people
> > > > 	to decide that a single global lock must protect RCU updates,
> > > > 	which is of course absolutely not the case.  I strongly
> > > > 	suggest making these internal to the urcu.c file.  Yes,
> > > > 	uses of urcu_publish_content() would then hit two locks (the
> > > > 	internal-to-urcu.c one and whatever they are using to protect
> > > > 	their data structure), but let's face it, if you are sending a
> > > > 	signal to each and every thread, the additional overhead of the
> > > > 	extra lock is the least of your worries.
> > > > 
> > > 
> > > Ok, just changed it.
> > 
> > Thank you!!!
> > 
> > > > 	If you really want to heavily optimize this, I would suggest
> > > > 	setting up a state machine that permits multiple concurrent
> > > > 	calls to urcu_publish_content() to share the same set of signal
> > > > 	invocations.  That way, if the caller has partitioned the
> > > > 	data structure, global locking might be avoided completely
> > > > 	(or at least greatly restricted in scope).
> > > > 
> > > 
> > > That brings an interesting question about urcu_publish_content :
> > > 
> > > void *urcu_publish_content(void **ptr, void *new)
> > > {
> > >         void *oldptr;
> > > 
> > >         internal_urcu_lock();
> > >         oldptr = *ptr;
> > >         *ptr = new;
> > > 
> > >         switch_qparity();
> > >         switch_qparity();
> > >         internal_urcu_unlock();
> > > 
> > >         return oldptr;
> > > }
> > > 
> > > Given that we take a global lock around the pointer assignment, we can
> > > safely assume, from the caller's perspective, that the update will
> > > happen as an "xchg" operation. So if the caller does not have to copy
> > > the old data, it can simply publish the new data without taking any
> > > lock itself.
> > > 
> > > So the question that arises if we want to remove global locking is :
> > > should we change this 
> > > 
> > >         oldptr = *ptr;
> > >         *ptr = new;
> > > 
> > > for an atomic xchg ?
> > 
> > Makes sense to me!
> > 
> > > > 	Of course, if updates are rare, the optimization would not
> > > > 	help, but in that case, acquiring two locks would be even less
> > > > 	of a problem.
> > > 
> > > I plan updates to be quite rare, but it's always good to foresee how
> > > that kind of infrastructure could be misused. :-)
> > 
> > ;-)  ;-)  ;-)
> > 
> > > > o	Is urcu_qparity relying on initialization to zero?  Or on the
> > > > 	fact that, for all x, 1-x!=x mod 2^32?  Ah, given that this is
> > > > 	used to index urcu_active_readers[], you must be relying on
> > > > 	initialization to zero.
> > > 
> > > Yes, starts at 0.
> > 
> > Whew!  ;-)
> > 
> > > > o	In rcu_read_lock(), why is a non-atomic increment of the
> > > > 	urcu_active_readers[urcu_parity] element safe?  Are you
> > > > 	relying on the compiler generating an x86 add-to-memory
> > > > 	instruction?
> > > > 
> > > > 	Ditto for rcu_read_unlock().
> > > > 
> > > > 	Ah, never mind!!!  I now see the __thread specification,
> > > > 	and the keeping of references to it in the reader_data list.
> > > 
> > > Exactly :)
> > 
> > Getting old and blind, what can I say?
> > 
> > > > o	Combining the equivalent of rcu_assign_pointer() and
> > > > 	synchronize_rcu() into urcu_publish_content() is an interesting
> > > > 	approach.  Not yet sure whether or not it is a good idea.  I
> > > > 	guess trying it out on several applications would be the way
> > > > 	to find out.  ;-)
> > > > 
> > > > 	That said, I suspect that it would be very convenient in a
> > > > 	number of situations.
> > > 
> > > I thought so. It seemed to be a natural way to express it to me. Usage
> > > will tell.
> > 
> > ;-)
> > 
> > > > o	It would be good to avoid having to pass the return value
> > > > 	of rcu_read_lock() into rcu_read_unlock().  It should be
> > > > 	possible to avoid this via counter value tricks, though this
> > > > 	would add a bit more code in rcu_read_lock() on 32-bit machines.
> > > > 	(64-bit machines don't have to worry about counter overflow.)
> > > > 
> > > > 	See the recently updated version of CodeSamples/defer/rcu_nest.[ch]
> > > > 	in the aforementioned git archive for a way to do this.
> > > > 	(And perhaps I should apply this change to SRCU...)
> > > 
> > > See my other mail about this.
> > 
> > And likewise!
> > 
> > > > o	Your test looks a bit strange, not sure why you test all the
> > > > 	different variables.  It would be nice to take a test duration
> > > > 	as an argument and run the test for that time.
> > > 
> > > I made a smaller version which only reads a single variable. I agree
> > > that the initial test was a bit strange on that aspect.
> > > 
> > > I'll do a version which takes a duration as parameter.
> > 
> > I strongly recommend taking a look at my CodeSamples/defer/rcutorture.h
> > file in my git archive:
> > 
> > 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > 
> > This torture test detects the missing second flip 15 times during a
> > 10-second test on a two-processor machine.
> > 
> > The first part of the rcutorture.h file is performance tests -- search
> > for the string "Stress test" to find the torture test.
> 
> I will.
> 
> > > > 	I killed the test after better part of an hour on my laptop,
> > > > 	will retry on a larger machine (after noting the 18 threads
> > > > 	created!).  (And yes, I first tried Power, which objected
> > > > 	strenously to the "mfence" and "lock; incl" instructions,
> > > > 	so getting an x86 machine to try on.)
> > > 
> > > That should be easy enough to fix. A bit of primitive cut'n'paste would
> > > do.
> > 
> > Yep.  Actually, I was considering porting your code into my environment,
> > which already has the Power primitives.  Any objections?  (This would
> > have the side effect of making a version available via perfbook.git.
> > I would of course add comments referencing your git archive as the
> > official version.)
> 
> Yes, no objection. I am currently looking at your last patch, cleaning
> it up and making the 32 and 64-bit code the same. Also trying to save a
> few instructions. I'll keep you posted when it's ready and committed.

Sounds very good!  I made absolutely no attempt to micro-optimize,
so it would be no surprise to hear that you were able to shave a few
more cycles off.

							Thanx, Paul

> Mathieu
> 
> > > > Again, looks interesting!  Looks plausible, although I have not 100%
> > > > convinced myself that it is perfectly bug-free.  But I do maintain
> > > > a healthy skepticism of purported RCU algorithms, especially ones that
> > > > I have written.  ;-)
> > > > 
> > > 
> > > That's always good. I also tend to always be very skeptical about what I
> > > write and review.
> > > 
> > > Thanks for the thorough review.
> > 
> > No problem -- it has been quite fun!  ;-)
> > 
> > 							Thanx, Paul
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09  5:17             ` [ltt-dev] " Mathieu Desnoyers
  2009-02-09  7:03               ` Mathieu Desnoyers
@ 2009-02-09 13:23               ` Paul E. McKenney
  2009-02-09 17:28                 ` Mathieu Desnoyers
  1 sibling, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-09 13:23 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

On Mon, Feb 09, 2009 at 12:17:37AM -0500, Mathieu Desnoyers wrote:
> * Mathieu Desnoyers (compudj@krystal.dyndns.org) wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Sun, Feb 08, 2009 at 05:44:19PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Fri, Feb 06, 2009 at 05:06:40AM -0800, Paul E. McKenney wrote:
> > > > > > On Thu, Feb 05, 2009 at 11:58:41PM -0500, Mathieu Desnoyers wrote:
> > > > > > > (sorry for repost, I got the ltt-dev email wrong in the previous one)
> > > > > > > 
> > > > > > > Hi Paul,
> > > > > > > 
> > > > > > > I figured out I needed some userspace RCU for the userspace tracing part
> > > > > > > of LTTng (for quick read access to the control variables) to trace
> > > > > > > userspace pthread applications. So I've done a quick-and-dirty userspace
> > > > > > > RCU implementation.
> > > > > > > 
> > > > > > > It works so far, but I have not gone through any formal verification
> > > > > > > phase. It seems to work on paper, and the tests are also OK (so far),
> > > > > > > but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> > > > > > > want to comment on it, it would be welcome. It's a userland-only
> > > > > > > library. It's also currently x86-only, but only a few basic definitions
> > > > > > > must be adapted in urcu.h to port it.
> > > > > > > 
> > > > > > > Here is the link to my git tree :
> > > > > > > 
> > > > > > > git://lttng.org/userspace-rcu.git
> > > > > > > 
> > > > > > > http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary
> > > > > > 
> > > > > > Very cool!!!  I will take a look!
> > > > > > 
> > > > > > I will also point you at a few that I have put together:
> > > > > > 
> > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > > > > 
> > > > > > (In the CodeSamples/defer directory.)
> > > > > 
> > > > > Interesting approach, using the signal to force memory-barrier execution!
> > > > > 
> > > > > o	One possible optimization would be to avoid sending a signal to
> > > > > 	a blocked thread, as the context switch leading to blocking
> > > > > 	will have implied a memory barrier -- otherwise it would not
> > > > > 	be safe to resume the thread on some other CPU.  That said,
> > > > > 	not sure whether checking to see whether a thread is blocked is
> > > > > 	any faster than sending it a signal and forcing it to wake up.
> > > > 
> > > > I'm not sure it will be any faster, and it could be racy too. How would
> > > > you envision querying the execution state of another thread ?
> > > 
> > > For my 64-bit implementation (or the old slow 32-bit version), the trick
> > > would be to observe that the thread didn't do an RCU read-side critical
> > > section during the past grace period.  This observation would be by
> > > comparing counters.
> > > 
> > > For the new 32-bit implementation, the only way I know of is to grovel
> > > through /proc, which would probably be slower than just sending the
> > > signal.
> > > 
> > 
> > Yes, I guess the signal is not so bad.
> > 
> > > > > 	Of course, this approach does require that the enclosing
> > > > > 	application be willing to give up a signal.  I suspect that most
> > > > > 	applications would be OK with this, though some might not.
> > > > 
> > > > If we want to make this transparent to the application, we'll have to
> > > > investigate further in sigaction() and signal() library override I
> > > > guess.
> > > 
> > > Certainly seems like it is worth a try!
> > > 
> > > > > 	Of course, I cannot resist pointing to an old LKML thread:
> > > > > 
> > > > > 		http://lkml.org/lkml/2001/10/8/189
> > > > > 
> > > > > 	But I think that the time is now right.  ;-)
> > > > > 
> > > > > o	I don't understand the purpose of rcu_write_lock() and
> > > > > 	rcu_write_unlock().  I am concerned that it will lead people
> > > > > 	to decide that a single global lock must protect RCU updates,
> > > > > 	which is of course absolutely not the case.  I strongly
> > > > > 	suggest making these internal to the urcu.c file.  Yes,
> > > > > 	uses of urcu_publish_content() would then hit two locks (the
> > > > > 	internal-to-urcu.c one and whatever they are using to protect
> > > > > 	their data structure), but let's face it, if you are sending a
> > > > > 	signal to each and every thread, the additional overhead of the
> > > > > 	extra lock is the least of your worries.
> > > > > 
> > > > 
> > > > Ok, just changed it.
> > > 
> > > Thank you!!!
> > > 
> > > > > 	If you really want to heavily optimize this, I would suggest
> > > > > 	setting up a state machine that permits multiple concurrent
> > > > > 	calls to urcu_publish_content() to share the same set of signal
> > > > > 	invocations.  That way, if the caller has partitioned the
> > > > > 	data structure, global locking might be avoided completely
> > > > > 	(or at least greatly restricted in scope).
> > > > > 
> > > > 
> > > > That brings an interesting question about urcu_publish_content :
> > > > 
> > > > void *urcu_publish_content(void **ptr, void *new)
> > > > {
> > > >         void *oldptr;
> > > > 
> > > >         internal_urcu_lock();
> > > >         oldptr = *ptr;
> > > >         *ptr = new;
> > > > 
> > > >         switch_qparity();
> > > >         switch_qparity();
> > > >         internal_urcu_unlock();
> > > > 
> > > >         return oldptr;
> > > > }
> > > > 
> > > > Given that we take a global lock around the pointer assignment, we can
> > > > safely assume, from the caller's perspective, that the update will
> > > > happen as an "xchg" operation. So if the caller does not have to copy
> > > > the old data, it can simply publish the new data without taking any
> > > > lock itself.
> > > > 
> > > > So the question that arises if we want to remove global locking is :
> > > > should we change this 
> > > > 
> > > >         oldptr = *ptr;
> > > >         *ptr = new;
> > > > 
> > > > for an atomic xchg ?
> > > 
> > > Makes sense to me!
> > > 
> > > > > 	Of course, if updates are rare, the optimization would not
> > > > > 	help, but in that case, acquiring two locks would be even less
> > > > > 	of a problem.
> > > > 
> > > > I plan updates to be quite rare, but it's always good to foresee how
> > > > that kind of infrastructure could be misused. :-)
> > > 
> > > ;-)  ;-)  ;-)
> > > 
> > > > > o	Is urcu_qparity relying on initialization to zero?  Or on the
> > > > > 	fact that, for all x, 1-x!=x mod 2^32?  Ah, given that this is
> > > > > 	used to index urcu_active_readers[], you must be relying on
> > > > > 	initialization to zero.
> > > > 
> > > > Yes, starts at 0.
> > > 
> > > Whew!  ;-)
> > > 
> > > > > o	In rcu_read_lock(), why is a non-atomic increment of the
> > > > > 	urcu_active_readers[urcu_parity] element safe?  Are you
> > > > > 	relying on the compiler generating an x86 add-to-memory
> > > > > 	instruction?
> > > > > 
> > > > > 	Ditto for rcu_read_unlock().
> > > > > 
> > > > > 	Ah, never mind!!!  I now see the __thread specification,
> > > > > 	and the keeping of references to it in the reader_data list.
> > > > 
> > > > Exactly :)
> > > 
> > > Getting old and blind, what can I say?
> > > 
> > > > > o	Combining the equivalent of rcu_assign_pointer() and
> > > > > 	synchronize_rcu() into urcu_publish_content() is an interesting
> > > > > 	approach.  Not yet sure whether or not it is a good idea.  I
> > > > > 	guess trying it out on several applications would be the way
> > > > > 	to find out.  ;-)
> > > > > 
> > > > > 	That said, I suspect that it would be very convenient in a
> > > > > 	number of situations.
> > > > 
> > > > I thought so. It seemed to be a natural way to express it to me. Usage
> > > > will tell.
> > > 
> > > ;-)
> > > 
> > > > > o	It would be good to avoid having to pass the return value
> > > > > 	of rcu_read_lock() into rcu_read_unlock().  It should be
> > > > > 	possible to avoid this via counter value tricks, though this
> > > > > 	would add a bit more code in rcu_read_lock() on 32-bit machines.
> > > > > 	(64-bit machines don't have to worry about counter overflow.)
> > > > > 
> > > > > 	See the recently updated version of CodeSamples/defer/rcu_nest.[ch]
> > > > > 	in the aforementioned git archive for a way to do this.
> > > > > 	(And perhaps I should apply this change to SRCU...)
> > > > 
> > > > See my other mail about this.
> > > 
> > > And likewise!
> > > 
> > > > > o	Your test looks a bit strange, not sure why you test all the
> > > > > 	different variables.  It would be nice to take a test duration
> > > > > 	as an argument and run the test for that time.
> > > > 
> > > > I made a smaller version which only reads a single variable. I agree
> > > > that the initial test was a bit strange on that aspect.
> > > > 
> > > > I'll do a version which takes a duration as parameter.
> > > 
> > > I strongly recommend taking a look at my CodeSamples/defer/rcutorture.h
> > > file in my git archive:
> > > 
> > > 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > 
> > > This torture test detects the missing second flip 15 times during a
> > > 10-second test on a two-processor machine.
> > > 
> > > The first part of the rcutorture.h file is performance tests -- search
> > > for the string "Stress test" to find the torture test.
> > > 
> > 
> > I will.
> > 
> > > > > 	I killed the test after better part of an hour on my laptop,
> > > > > 	will retry on a larger machine (after noting the 18 threads
> > > > > 	created!).  (And yes, I first tried Power, which objected
> > > > > 	strenously to the "mfence" and "lock; incl" instructions,
> > > > > 	so getting an x86 machine to try on.)
> > > > 
> > > > That should be easy enough to fix. A bit of primitive cut'n'paste would
> > > > do.
> > > 
> > > Yep.  Actually, I was considering porting your code into my environment,
> > > which already has the Power primitives.  Any objections?  (This would
> > > have the side effect of making a version available via perfbook.git.
> > > I would of course add comments referencing your git archive as the
> > > official version.)
> > 
> > Yes, no objection. I am currently looking at your last patch, cleaning
> > it up and making the 32 and 64-bit code the same. Also trying to save a
> > few instructions. I'll keep you posted when it's ready and committed.
> 
> The new version is pushed into the repository. I changed you patch a
> bit. Flaming is welcome. :)

Looks reasonable at first glance.  Just out of curiosity, why are
urcu_gp_ctr and urcu_active_readers int rather than char?  I guess that
one reason would be that many architectures work better with int than
with char...

So, how many cycles did this save?  ;-)

						Thanx, Paul

> Mathieu
> 
> > Mathieu
> > 
> > > > > Again, looks interesting!  Looks plausible, although I have not 100%
> > > > > convinced myself that it is perfectly bug-free.  But I do maintain
> > > > > a healthy skepticism of purported RCU algorithms, especially ones that
> > > > > I have written.  ;-)
> > > > > 
> > > > 
> > > > That's always good. I also tend to always be very skeptical about what I
> > > > write and review.
> > > > 
> > > > Thanks for the thorough review.
> > > 
> > > No problem -- it has been quite fun!  ;-)
> > > 
> > > 							Thanx, Paul
> > > 
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > 
> > _______________________________________________
> > ltt-dev mailing list
> > ltt-dev@lists.casi.polymtl.ca
> > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09  7:03               ` Mathieu Desnoyers
@ 2009-02-09 15:33                 ` Paul E. McKenney
  2009-02-10 19:17                   ` Mathieu Desnoyers
  0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-09 15:33 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

On Mon, Feb 09, 2009 at 02:03:17AM -0500, Mathieu Desnoyers wrote:

[ . . . ]

> I just added modified rcutorture.h and api.h from your git tree
> specifically for an urcutorture program to the repository. Some results :
> 
> 8-way x86_64
> E5405 @2 GHZ
> 
> ./urcutorture 8 perf
> n_reads: 1937650000  n_updates: 3  nreaders: 8  nupdaters: 1 duration: 1
> ns/read: 4.12871  ns/update: 3.33333e+08
> 
> ./urcutorture 8 uperf
> n_reads: 0  n_updates: 4413892  nreaders: 0  nupdaters: 8 duration: 1
> ns/read: nan  ns/update: 1812.46
> 
> n_reads: 98844204  n_updates: 10  n_mberror: 0
> rcu_stress_count: 98844171 33 0 0 0 0 0 0 0 0 0
> 
> However, I've tried removing the second switch_qparity() call, and the
> rcutorture test did not detect anything wrong. I also did a variation
> which calls the "sched_yield" version of the urcu, "urcutorture-yield".

My confusion -- I was testing my old approach where the memory barriers
are in rcu_read_lock() and rcu_read_unlock().  To force the failures in
your signal-handler-memory-barrier approach, I suspect that you are
going to need a bigger hammer.  In this case, one such bigger hammer
would be:

o	Just before exit from the signal handler, do a
	pthread_cond_wait() under a pthread_mutex().

o	In force_mb_all_threads(), refrain from sending a signal to self.

	Then it should be safe in force_mb_all_threads() to do a
	pthread_cond_broadcast() under the same pthread_mutex().

This should raise the probability of seeing the failure in the case
where there is a single switch_qparity().

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09 13:16             ` Paul E. McKenney
@ 2009-02-09 17:19               ` Bert Wesarg
  2009-02-09 17:34                 ` Paul E. McKenney
  0 siblings, 1 reply; 116+ messages in thread
From: Bert Wesarg @ 2009-02-09 17:19 UTC (permalink / raw)
  To: paulmck; +Cc: Mathieu Desnoyers, ltt-dev, linux-kernel, Robert Wisniewski

On Mon, Feb 9, 2009 at 14:16, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Sun, Feb 08, 2009 at 11:53:52PM -0500, Mathieu Desnoyers wrote:
>> Yes, I guess the signal is not so bad.
>
> Now if there were a /proc entry that listed out the tids of the
> currently running threads, then it might be possible to do something,
> especially for applications with many more threads than CPUs.
Do you mean something like: `ls /proc/$pid/tasks/*`? Or is this not
atomic enough?

Bert

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09 13:23               ` Paul E. McKenney
@ 2009-02-09 17:28                 ` Mathieu Desnoyers
  2009-02-09 17:47                   ` Paul E. McKenney
  0 siblings, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-09 17:28 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Mon, Feb 09, 2009 at 12:17:37AM -0500, Mathieu Desnoyers wrote:
> > * Mathieu Desnoyers (compudj@krystal.dyndns.org) wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Sun, Feb 08, 2009 at 05:44:19PM -0500, Mathieu Desnoyers wrote:
> > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > On Fri, Feb 06, 2009 at 05:06:40AM -0800, Paul E. McKenney wrote:
> > > > > > > On Thu, Feb 05, 2009 at 11:58:41PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > (sorry for repost, I got the ltt-dev email wrong in the previous one)
> > > > > > > > 
> > > > > > > > Hi Paul,
> > > > > > > > 
> > > > > > > > I figured out I needed some userspace RCU for the userspace tracing part
> > > > > > > > of LTTng (for quick read access to the control variables) to trace
> > > > > > > > userspace pthread applications. So I've done a quick-and-dirty userspace
> > > > > > > > RCU implementation.
> > > > > > > > 
> > > > > > > > It works so far, but I have not gone through any formal verification
> > > > > > > > phase. It seems to work on paper, and the tests are also OK (so far),
> > > > > > > > but I offer no guarantee for this 300-lines-ish 1-day hack. :-) If you
> > > > > > > > want to comment on it, it would be welcome. It's a userland-only
> > > > > > > > library. It's also currently x86-only, but only a few basic definitions
> > > > > > > > must be adapted in urcu.h to port it.
> > > > > > > > 
> > > > > > > > Here is the link to my git tree :
> > > > > > > > 
> > > > > > > > git://lttng.org/userspace-rcu.git
> > > > > > > > 
> > > > > > > > http://lttng.org/cgi-bin/gitweb.cgi?p=userspace-rcu.git;a=summary
> > > > > > > 
> > > > > > > Very cool!!!  I will take a look!
> > > > > > > 
> > > > > > > I will also point you at a few that I have put together:
> > > > > > > 
> > > > > > > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > > > > > 
> > > > > > > (In the CodeSamples/defer directory.)
> > > > > > 
> > > > > > Interesting approach, using the signal to force memory-barrier execution!
> > > > > > 
> > > > > > o	One possible optimization would be to avoid sending a signal to
> > > > > > 	a blocked thread, as the context switch leading to blocking
> > > > > > 	will have implied a memory barrier -- otherwise it would not
> > > > > > 	be safe to resume the thread on some other CPU.  That said,
> > > > > > 	not sure whether checking to see whether a thread is blocked is
> > > > > > 	any faster than sending it a signal and forcing it to wake up.
> > > > > 
> > > > > I'm not sure it will be any faster, and it could be racy too. How would
> > > > > you envision querying the execution state of another thread ?
> > > > 
> > > > For my 64-bit implementation (or the old slow 32-bit version), the trick
> > > > would be to observe that the thread didn't do an RCU read-side critical
> > > > section during the past grace period.  This observation would be by
> > > > comparing counters.
> > > > 
> > > > For the new 32-bit implementation, the only way I know of is to grovel
> > > > through /proc, which would probably be slower than just sending the
> > > > signal.
> > > > 
> > > 
> > > Yes, I guess the signal is not so bad.
> > > 
> > > > > > 	Of course, this approach does require that the enclosing
> > > > > > 	application be willing to give up a signal.  I suspect that most
> > > > > > 	applications would be OK with this, though some might not.
> > > > > 
> > > > > If we want to make this transparent to the application, we'll have to
> > > > > investigate further in sigaction() and signal() library override I
> > > > > guess.
> > > > 
> > > > Certainly seems like it is worth a try!
> > > > 
> > > > > > 	Of course, I cannot resist pointing to an old LKML thread:
> > > > > > 
> > > > > > 		http://lkml.org/lkml/2001/10/8/189
> > > > > > 
> > > > > > 	But I think that the time is now right.  ;-)
> > > > > > 
> > > > > > o	I don't understand the purpose of rcu_write_lock() and
> > > > > > 	rcu_write_unlock().  I am concerned that it will lead people
> > > > > > 	to decide that a single global lock must protect RCU updates,
> > > > > > 	which is of course absolutely not the case.  I strongly
> > > > > > 	suggest making these internal to the urcu.c file.  Yes,
> > > > > > 	uses of urcu_publish_content() would then hit two locks (the
> > > > > > 	internal-to-urcu.c one and whatever they are using to protect
> > > > > > 	their data structure), but let's face it, if you are sending a
> > > > > > 	signal to each and every thread, the additional overhead of the
> > > > > > 	extra lock is the least of your worries.
> > > > > > 
> > > > > 
> > > > > Ok, just changed it.
> > > > 
> > > > Thank you!!!
> > > > 
> > > > > > 	If you really want to heavily optimize this, I would suggest
> > > > > > 	setting up a state machine that permits multiple concurrent
> > > > > > 	calls to urcu_publish_content() to share the same set of signal
> > > > > > 	invocations.  That way, if the caller has partitioned the
> > > > > > 	data structure, global locking might be avoided completely
> > > > > > 	(or at least greatly restricted in scope).
> > > > > > 
> > > > > 
> > > > > That brings an interesting question about urcu_publish_content :
> > > > > 
> > > > > void *urcu_publish_content(void **ptr, void *new)
> > > > > {
> > > > >         void *oldptr;
> > > > > 
> > > > >         internal_urcu_lock();
> > > > >         oldptr = *ptr;
> > > > >         *ptr = new;
> > > > > 
> > > > >         switch_qparity();
> > > > >         switch_qparity();
> > > > >         internal_urcu_unlock();
> > > > > 
> > > > >         return oldptr;
> > > > > }
> > > > > 
> > > > > Given that we take a global lock around the pointer assignment, we can
> > > > > safely assume, from the caller's perspective, that the update will
> > > > > happen as an "xchg" operation. So if the caller does not have to copy
> > > > > the old data, it can simply publish the new data without taking any
> > > > > lock itself.
> > > > > 
> > > > > So the question that arises if we want to remove global locking is :
> > > > > should we change this 
> > > > > 
> > > > >         oldptr = *ptr;
> > > > >         *ptr = new;
> > > > > 
> > > > > for an atomic xchg ?
> > > > 
> > > > Makes sense to me!
> > > > 
> > > > > > 	Of course, if updates are rare, the optimization would not
> > > > > > 	help, but in that case, acquiring two locks would be even less
> > > > > > 	of a problem.
> > > > > 
> > > > > I plan updates to be quite rare, but it's always good to foresee how
> > > > > that kind of infrastructure could be misused. :-)
> > > > 
> > > > ;-)  ;-)  ;-)
> > > > 
> > > > > > o	Is urcu_qparity relying on initialization to zero?  Or on the
> > > > > > 	fact that, for all x, 1-x!=x mod 2^32?  Ah, given that this is
> > > > > > 	used to index urcu_active_readers[], you must be relying on
> > > > > > 	initialization to zero.
> > > > > 
> > > > > Yes, starts at 0.
> > > > 
> > > > Whew!  ;-)
> > > > 
> > > > > > o	In rcu_read_lock(), why is a non-atomic increment of the
> > > > > > 	urcu_active_readers[urcu_parity] element safe?  Are you
> > > > > > 	relying on the compiler generating an x86 add-to-memory
> > > > > > 	instruction?
> > > > > > 
> > > > > > 	Ditto for rcu_read_unlock().
> > > > > > 
> > > > > > 	Ah, never mind!!!  I now see the __thread specification,
> > > > > > 	and the keeping of references to it in the reader_data list.
> > > > > 
> > > > > Exactly :)
> > > > 
> > > > Getting old and blind, what can I say?
> > > > 
> > > > > > o	Combining the equivalent of rcu_assign_pointer() and
> > > > > > 	synchronize_rcu() into urcu_publish_content() is an interesting
> > > > > > 	approach.  Not yet sure whether or not it is a good idea.  I
> > > > > > 	guess trying it out on several applications would be the way
> > > > > > 	to find out.  ;-)
> > > > > > 
> > > > > > 	That said, I suspect that it would be very convenient in a
> > > > > > 	number of situations.
> > > > > 
> > > > > I thought so. It seemed to be a natural way to express it to me. Usage
> > > > > will tell.
> > > > 
> > > > ;-)
> > > > 
> > > > > > o	It would be good to avoid having to pass the return value
> > > > > > 	of rcu_read_lock() into rcu_read_unlock().  It should be
> > > > > > 	possible to avoid this via counter value tricks, though this
> > > > > > 	would add a bit more code in rcu_read_lock() on 32-bit machines.
> > > > > > 	(64-bit machines don't have to worry about counter overflow.)
> > > > > > 
> > > > > > 	See the recently updated version of CodeSamples/defer/rcu_nest.[ch]
> > > > > > 	in the aforementioned git archive for a way to do this.
> > > > > > 	(And perhaps I should apply this change to SRCU...)
> > > > > 
> > > > > See my other mail about this.
> > > > 
> > > > And likewise!
> > > > 
> > > > > > o	Your test looks a bit strange, not sure why you test all the
> > > > > > 	different variables.  It would be nice to take a test duration
> > > > > > 	as an argument and run the test for that time.
> > > > > 
> > > > > I made a smaller version which only reads a single variable. I agree
> > > > > that the initial test was a bit strange on that aspect.
> > > > > 
> > > > > I'll do a version which takes a duration as parameter.
> > > > 
> > > > I strongly recommend taking a look at my CodeSamples/defer/rcutorture.h
> > > > file in my git archive:
> > > > 
> > > > 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > > 
> > > > This torture test detects the missing second flip 15 times during a
> > > > 10-second test on a two-processor machine.
> > > > 
> > > > The first part of the rcutorture.h file is performance tests -- search
> > > > for the string "Stress test" to find the torture test.
> > > > 
> > > 
> > > I will.
> > > 
> > > > > > 	I killed the test after better part of an hour on my laptop,
> > > > > > 	will retry on a larger machine (after noting the 18 threads
> > > > > > 	created!).  (And yes, I first tried Power, which objected
> > > > > > 	strenously to the "mfence" and "lock; incl" instructions,
> > > > > > 	so getting an x86 machine to try on.)
> > > > > 
> > > > > That should be easy enough to fix. A bit of primitive cut'n'paste would
> > > > > do.
> > > > 
> > > > Yep.  Actually, I was considering porting your code into my environment,
> > > > which already has the Power primitives.  Any objections?  (This would
> > > > have the side effect of making a version available via perfbook.git.
> > > > I would of course add comments referencing your git archive as the
> > > > official version.)
> > > 
> > > Yes, no objection. I am currently looking at your last patch, cleaning
> > > it up and making the 32 and 64-bit code the same. Also trying to save a
> > > few instructions. I'll keep you posted when it's ready and committed.
> > 
> > The new version is pushed into the repository. I changed you patch a
> > bit. Flaming is welcome. :)
> 
> Looks reasonable at first glance.  Just out of curiosity, why are
> urcu_gp_ctr and urcu_active_readers int rather than char?  I guess that
> one reason would be that many architectures work better with int than
> with char...
> 

Exactly. This is done to make sure we don't end up having false register
dependencies causing stalls on such architectures. I'll add a comment.

> So, how many cycles did this save?  ;-)
> 

On x86_64, it's pretty much the same as before. It just helps having the
32 and 64 bits algorithms being exactly the same, which I think is a
very good thing.

BTW, my tests were done without any CMOV instruction due to the standard
gcc options I used. Given think past discussion about CMOV :

http://ondioline.org/mail/cmov-a-bad-idea-on-out-of-order-cpus

It does not seem like such a good idea to use it anyway, given it can
take 10 cycles to run on a P4a

BTW, do you think having the 256 nested rcu read locks limitation could
become a problem ? I really think an application has recursion problem
if it does, but this is not impossible, especially on a particularly
badly designed tree-traversal algorithm on a 64-bits arch...

Mathieu

> 						Thanx, Paul
> 
> > Mathieu
> > 
> > > Mathieu
> > > 
> > > > > > Again, looks interesting!  Looks plausible, although I have not 100%
> > > > > > convinced myself that it is perfectly bug-free.  But I do maintain
> > > > > > a healthy skepticism of purported RCU algorithms, especially ones that
> > > > > > I have written.  ;-)
> > > > > > 
> > > > > 
> > > > > That's always good. I also tend to always be very skeptical about what I
> > > > > write and review.
> > > > > 
> > > > > Thanks for the thorough review.
> > > > 
> > > > No problem -- it has been quite fun!  ;-)
> > > > 
> > > > 							Thanx, Paul
> > > > 
> > > 
> > > -- 
> > > Mathieu Desnoyers
> > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > > 
> > > _______________________________________________
> > > ltt-dev mailing list
> > > ltt-dev@lists.casi.polymtl.ca
> > > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > > 
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09 17:19               ` Bert Wesarg
@ 2009-02-09 17:34                 ` Paul E. McKenney
  2009-02-09 17:35                   ` Bert Wesarg
  0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-09 17:34 UTC (permalink / raw)
  To: Bert Wesarg; +Cc: Mathieu Desnoyers, ltt-dev, linux-kernel, Robert Wisniewski

On Mon, Feb 09, 2009 at 06:19:45PM +0100, Bert Wesarg wrote:
> On Mon, Feb 9, 2009 at 14:16, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Sun, Feb 08, 2009 at 11:53:52PM -0500, Mathieu Desnoyers wrote:
> >> Yes, I guess the signal is not so bad.
> >
> > Now if there were a /proc entry that listed out the tids of the
> > currently running threads, then it might be possible to do something,
> > especially for applications with many more threads than CPUs.
>
> Do you mean something like: `ls /proc/$pid/tasks/*`? Or is this not
> atomic enough?

Won't that give me all the threads rather than only the ones currently
running?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09 17:34                 ` Paul E. McKenney
@ 2009-02-09 17:35                   ` Bert Wesarg
  2009-02-09 17:40                     ` Paul E. McKenney
  0 siblings, 1 reply; 116+ messages in thread
From: Bert Wesarg @ 2009-02-09 17:35 UTC (permalink / raw)
  To: paulmck; +Cc: Mathieu Desnoyers, ltt-dev, linux-kernel, Robert Wisniewski

On Mon, Feb 9, 2009 at 18:34, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Mon, Feb 09, 2009 at 06:19:45PM +0100, Bert Wesarg wrote:
>> On Mon, Feb 9, 2009 at 14:16, Paul E. McKenney
>> <paulmck@linux.vnet.ibm.com> wrote:
>> > On Sun, Feb 08, 2009 at 11:53:52PM -0500, Mathieu Desnoyers wrote:
>> >> Yes, I guess the signal is not so bad.
>> >
>> > Now if there were a /proc entry that listed out the tids of the
>> > currently running threads, then it might be possible to do something,
>> > especially for applications with many more threads than CPUs.
>>
>> Do you mean something like: `ls /proc/$pid/tasks/*`? Or is this not
>> atomic enough?
>
> Won't that give me all the threads rather than only the ones currently
> running?
What do you mean by 'running'?

Bert
>
>                                                        Thanx, Paul
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09 17:35                   ` Bert Wesarg
@ 2009-02-09 17:40                     ` Paul E. McKenney
  2009-02-09 17:42                       ` Mathieu Desnoyers
  2009-02-09 17:45                       ` Bert Wesarg
  0 siblings, 2 replies; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-09 17:40 UTC (permalink / raw)
  To: Bert Wesarg; +Cc: Mathieu Desnoyers, ltt-dev, linux-kernel, Robert Wisniewski

On Mon, Feb 09, 2009 at 06:35:38PM +0100, Bert Wesarg wrote:
> On Mon, Feb 9, 2009 at 18:34, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Mon, Feb 09, 2009 at 06:19:45PM +0100, Bert Wesarg wrote:
> >> On Mon, Feb 9, 2009 at 14:16, Paul E. McKenney
> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> > On Sun, Feb 08, 2009 at 11:53:52PM -0500, Mathieu Desnoyers wrote:
> >> >> Yes, I guess the signal is not so bad.
> >> >
> >> > Now if there were a /proc entry that listed out the tids of the
> >> > currently running threads, then it might be possible to do something,
> >> > especially for applications with many more threads than CPUs.
> >>
> >> Do you mean something like: `ls /proc/$pid/tasks/*`? Or is this not
> >> atomic enough?
> >
> > Won't that give me all the threads rather than only the ones currently
> > running?
>
> What do you mean by 'running'?

Sitting on a CPU and executing, as opposed to blocked or preempted.

It is pretty easy to scan the running tasks within the kernel, but I
don't know of an efficient way to do it from user mode.  The only way
I know of would be to cat out the /proc/$pid/tasks/*/status (IIRC)
and look for the task state.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09 17:40                     ` Paul E. McKenney
@ 2009-02-09 17:42                       ` Mathieu Desnoyers
  2009-02-09 18:00                         ` Paul E. McKenney
  2009-02-09 17:45                       ` Bert Wesarg
  1 sibling, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-09 17:42 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Bert Wesarg, ltt-dev, linux-kernel, Robert Wisniewski

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Mon, Feb 09, 2009 at 06:35:38PM +0100, Bert Wesarg wrote:
> > On Mon, Feb 9, 2009 at 18:34, Paul E. McKenney
> > <paulmck@linux.vnet.ibm.com> wrote:
> > > On Mon, Feb 09, 2009 at 06:19:45PM +0100, Bert Wesarg wrote:
> > >> On Mon, Feb 9, 2009 at 14:16, Paul E. McKenney
> > >> <paulmck@linux.vnet.ibm.com> wrote:
> > >> > On Sun, Feb 08, 2009 at 11:53:52PM -0500, Mathieu Desnoyers wrote:
> > >> >> Yes, I guess the signal is not so bad.
> > >> >
> > >> > Now if there were a /proc entry that listed out the tids of the
> > >> > currently running threads, then it might be possible to do something,
> > >> > especially for applications with many more threads than CPUs.
> > >>
> > >> Do you mean something like: `ls /proc/$pid/tasks/*`? Or is this not
> > >> atomic enough?
> > >
> > > Won't that give me all the threads rather than only the ones currently
> > > running?
> >
> > What do you mean by 'running'?
> 
> Sitting on a CPU and executing, as opposed to blocked or preempted.
> 
> It is pretty easy to scan the running tasks within the kernel, but I
> don't know of an efficient way to do it from user mode.  The only way
> I know of would be to cat out the /proc/$pid/tasks/*/status (IIRC)
> and look for the task state.
> 

The thing I dislike about this approach is the non-portability. Ideally,
if we want to integrate urcu to pthreads, we should also aim at
BSD-based OSes.

Mathieu

> 							Thanx, Paul
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09 17:40                     ` Paul E. McKenney
  2009-02-09 17:42                       ` Mathieu Desnoyers
@ 2009-02-09 17:45                       ` Bert Wesarg
  2009-02-09 17:59                         ` Paul E. McKenney
  1 sibling, 1 reply; 116+ messages in thread
From: Bert Wesarg @ 2009-02-09 17:45 UTC (permalink / raw)
  To: paulmck; +Cc: Mathieu Desnoyers, ltt-dev, linux-kernel, Robert Wisniewski

On Mon, Feb 9, 2009 at 18:40, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Mon, Feb 09, 2009 at 06:35:38PM +0100, Bert Wesarg wrote:
>> On Mon, Feb 9, 2009 at 18:34, Paul E. McKenney
>> <paulmck@linux.vnet.ibm.com> wrote:
>> > On Mon, Feb 09, 2009 at 06:19:45PM +0100, Bert Wesarg wrote:
>> >> On Mon, Feb 9, 2009 at 14:16, Paul E. McKenney
>> >> <paulmck@linux.vnet.ibm.com> wrote:
>> >> > On Sun, Feb 08, 2009 at 11:53:52PM -0500, Mathieu Desnoyers wrote:
>> >> >> Yes, I guess the signal is not so bad.
>> >> >
>> >> > Now if there were a /proc entry that listed out the tids of the
>> >> > currently running threads, then it might be possible to do something,
>> >> > especially for applications with many more threads than CPUs.
>> >>
>> >> Do you mean something like: `ls /proc/$pid/tasks/*`? Or is this not
>> >> atomic enough?
>> >
>> > Won't that give me all the threads rather than only the ones currently
>> > running?
>>
>> What do you mean by 'running'?
>
> Sitting on a CPU and executing, as opposed to blocked or preempted.
Ok, me too.

>
> It is pretty easy to scan the running tasks within the kernel, but I
> don't know of an efficient way to do it from user mode.  The only way
> I know of would be to cat out the /proc/$pid/tasks/*/status (IIRC)
> and look for the task state.
Yes, me too.

Bert
>
>                                                        Thanx, Paul
>

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09 17:28                 ` Mathieu Desnoyers
@ 2009-02-09 17:47                   ` Paul E. McKenney
  2009-02-09 18:13                     ` Mathieu Desnoyers
  0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-09 17:47 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

On Mon, Feb 09, 2009 at 12:28:17PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Mon, Feb 09, 2009 at 12:17:37AM -0500, Mathieu Desnoyers wrote:

[ . . . ]

> > > The new version is pushed into the repository. I changed you patch a
> > > bit. Flaming is welcome. :)
> > 
> > Looks reasonable at first glance.  Just out of curiosity, why are
> > urcu_gp_ctr and urcu_active_readers int rather than char?  I guess that
> > one reason would be that many architectures work better with int than
> > with char...
> 
> Exactly. This is done to make sure we don't end up having false register
> dependencies causing stalls on such architectures. I'll add a comment.

Are there any 64-bit architectures that would prefer a long to an int?
(Other than really old Alpha CPUs, that is.)

> > So, how many cycles did this save?  ;-)
> 
> On x86_64, it's pretty much the same as before. It just helps having the
> 32 and 64 bits algorithms being exactly the same, which I think is a
> very good thing.

Good point!

> BTW, my tests were done without any CMOV instruction due to the standard
> gcc options I used. Given think past discussion about CMOV :
> 
> http://ondioline.org/mail/cmov-a-bad-idea-on-out-of-order-cpus
> 
> It does not seem like such a good idea to use it anyway, given it can
> take 10 cycles to run on a P4a

Fair enough!

> BTW, do you think having the 256 nested rcu read locks limitation could
> become a problem ? I really think an application has recursion problem
> if it does, but this is not impossible, especially on a particularly
> badly designed tree-traversal algorithm on a 64-bits arch...

I don't know of any code in the Linux kernel that nests rcu_read_lock()
anywhere near that deep.  And if someone does find such a case, it is
pretty easy to use 15 bits rather than 8 to hold the nesting depth, just
by changing the definition of RCU_GP_CTR_BIT.

							Thanx, Paul

> Mathieu
> 
> > 						Thanx, Paul
> > 
> > > Mathieu
> > > 
> > > > Mathieu
> > > > 
> > > > > > > Again, looks interesting!  Looks plausible, although I have not 100%
> > > > > > > convinced myself that it is perfectly bug-free.  But I do maintain
> > > > > > > a healthy skepticism of purported RCU algorithms, especially ones that
> > > > > > > I have written.  ;-)
> > > > > > > 
> > > > > > 
> > > > > > That's always good. I also tend to always be very skeptical about what I
> > > > > > write and review.
> > > > > > 
> > > > > > Thanks for the thorough review.
> > > > > 
> > > > > No problem -- it has been quite fun!  ;-)
> > > > > 
> > > > > 							Thanx, Paul
> > > > > 
> > > > 
> > > > -- 
> > > > Mathieu Desnoyers
> > > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > > > 
> > > > _______________________________________________
> > > > ltt-dev mailing list
> > > > ltt-dev@lists.casi.polymtl.ca
> > > > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > > > 
> > > 
> > > -- 
> > > Mathieu Desnoyers
> > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09 17:45                       ` Bert Wesarg
@ 2009-02-09 17:59                         ` Paul E. McKenney
  0 siblings, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-09 17:59 UTC (permalink / raw)
  To: Bert Wesarg; +Cc: Mathieu Desnoyers, ltt-dev, linux-kernel, Robert Wisniewski

On Mon, Feb 09, 2009 at 06:45:05PM +0100, Bert Wesarg wrote:
> On Mon, Feb 9, 2009 at 18:40, Paul E. McKenney
> <paulmck@linux.vnet.ibm.com> wrote:
> > On Mon, Feb 09, 2009 at 06:35:38PM +0100, Bert Wesarg wrote:
> >> On Mon, Feb 9, 2009 at 18:34, Paul E. McKenney
> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> > On Mon, Feb 09, 2009 at 06:19:45PM +0100, Bert Wesarg wrote:
> >> >> On Mon, Feb 9, 2009 at 14:16, Paul E. McKenney
> >> >> <paulmck@linux.vnet.ibm.com> wrote:
> >> >> > On Sun, Feb 08, 2009 at 11:53:52PM -0500, Mathieu Desnoyers wrote:
> >> >> >> Yes, I guess the signal is not so bad.
> >> >> >
> >> >> > Now if there were a /proc entry that listed out the tids of the
> >> >> > currently running threads, then it might be possible to do something,
> >> >> > especially for applications with many more threads than CPUs.
> >> >>
> >> >> Do you mean something like: `ls /proc/$pid/tasks/*`? Or is this not
> >> >> atomic enough?
> >> >
> >> > Won't that give me all the threads rather than only the ones currently
> >> > running?
> >>
> >> What do you mean by 'running'?
> >
> > Sitting on a CPU and executing, as opposed to blocked or preempted.
>
> Ok, me too.
> 
> > It is pretty easy to scan the running tasks within the kernel, but I
> > don't know of an efficient way to do it from user mode.  The only way
> > I know of would be to cat out the /proc/$pid/tasks/*/status (IIRC)
> > and look for the task state.
>
> Yes, me too.

I was afraid of that...  ;-)

Do you believe that something like a /proc/runningtids that listsd
currently running tasks be useful in general?  Or just to this algorithm?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09 17:42                       ` Mathieu Desnoyers
@ 2009-02-09 18:00                         ` Paul E. McKenney
  0 siblings, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-09 18:00 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Bert Wesarg, ltt-dev, linux-kernel, Robert Wisniewski

On Mon, Feb 09, 2009 at 12:42:02PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Mon, Feb 09, 2009 at 06:35:38PM +0100, Bert Wesarg wrote:
> > > On Mon, Feb 9, 2009 at 18:34, Paul E. McKenney
> > > <paulmck@linux.vnet.ibm.com> wrote:
> > > > On Mon, Feb 09, 2009 at 06:19:45PM +0100, Bert Wesarg wrote:
> > > >> On Mon, Feb 9, 2009 at 14:16, Paul E. McKenney
> > > >> <paulmck@linux.vnet.ibm.com> wrote:
> > > >> > On Sun, Feb 08, 2009 at 11:53:52PM -0500, Mathieu Desnoyers wrote:
> > > >> >> Yes, I guess the signal is not so bad.
> > > >> >
> > > >> > Now if there were a /proc entry that listed out the tids of the
> > > >> > currently running threads, then it might be possible to do something,
> > > >> > especially for applications with many more threads than CPUs.
> > > >>
> > > >> Do you mean something like: `ls /proc/$pid/tasks/*`? Or is this not
> > > >> atomic enough?
> > > >
> > > > Won't that give me all the threads rather than only the ones currently
> > > > running?
> > >
> > > What do you mean by 'running'?
> > 
> > Sitting on a CPU and executing, as opposed to blocked or preempted.
> > 
> > It is pretty easy to scan the running tasks within the kernel, but I
> > don't know of an efficient way to do it from user mode.  The only way
> > I know of would be to cat out the /proc/$pid/tasks/*/status (IIRC)
> > and look for the task state.
> 
> The thing I dislike about this approach is the non-portability. Ideally,
> if we want to integrate urcu to pthreads, we should also aim at
> BSD-based OSes.

But it could be portable.  If the /proc file in question could not be
opened (as would be the case on BSDs), you just send the signal to all
the tasks.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09 17:47                   ` Paul E. McKenney
@ 2009-02-09 18:13                     ` Mathieu Desnoyers
  2009-02-09 18:19                       ` Mathieu Desnoyers
  2009-02-09 18:37                       ` Paul E. McKenney
  0 siblings, 2 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-09 18:13 UTC (permalink / raw)
  To: Paul E. McKenney, H. Peter Anvin, Christoph Hellwig; +Cc: ltt-dev, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Mon, Feb 09, 2009 at 12:28:17PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Mon, Feb 09, 2009 at 12:17:37AM -0500, Mathieu Desnoyers wrote:
> 
> [ . . . ]
> 
> > > > The new version is pushed into the repository. I changed you patch a
> > > > bit. Flaming is welcome. :)
> > > 
> > > Looks reasonable at first glance.  Just out of curiosity, why are
> > > urcu_gp_ctr and urcu_active_readers int rather than char?  I guess that
> > > one reason would be that many architectures work better with int than
> > > with char...
> > 
> > Exactly. This is done to make sure we don't end up having false register
> > dependencies causing stalls on such architectures. I'll add a comment.
> 
> Are there any 64-bit architectures that would prefer a long to an int?
> (Other than really old Alpha CPUs, that is.)
> 

None that I am aware of, but Christoph or Peter would probably know more
than I do on this aspect.

> > > So, how many cycles did this save?  ;-)
> > 
> > On x86_64, it's pretty much the same as before. It just helps having the
> > 32 and 64 bits algorithms being exactly the same, which I think is a
> > very good thing.
> 
> Good point!
> 
> > BTW, my tests were done without any CMOV instruction due to the standard
> > gcc options I used. Given think past discussion about CMOV :
> > 
> > http://ondioline.org/mail/cmov-a-bad-idea-on-out-of-order-cpus
> > 
> > It does not seem like such a good idea to use it anyway, given it can
> > take 10 cycles to run on a P4a
> 
> Fair enough!
> 
> > BTW, do you think having the 256 nested rcu read locks limitation could
> > become a problem ? I really think an application has recursion problem
> > if it does, but this is not impossible, especially on a particularly
> > badly designed tree-traversal algorithm on a 64-bits arch...
> 
> I don't know of any code in the Linux kernel that nests rcu_read_lock()
> anywhere near that deep.  And if someone does find such a case, it is
> pretty easy to use 15 bits rather than 8 to hold the nesting depth, just
> by changing the definition of RCU_GP_CTR_BIT.
> 

You know what ? Changing RCU_GP_CTR_BIT to 16 uses a
testw %ax, %ax instead of a testb %al, %al. The trick here is that
RCU_GP_CTR_BIT must be a multiple of 8 so we can use a full 8-bits,
16-bits or 32-bits bitmask for the lower order bits.

On 64-bits, using a RCU_GP_CTR_BIT of 32 is also ok. It uses a testl.

To provide 32-bits compability and allow the deepest nesting possible, I
think it makes sense to use

/* Use the amount of bits equal to half of the architecture long size */
#define RCU_GP_CTR_BIT (sizeof(long) << 2)

Mathieu


> 							Thanx, Paul
> 
> > Mathieu
> > 
> > > 						Thanx, Paul
> > > 
> > > > Mathieu
> > > > 
> > > > > Mathieu
> > > > > 
> > > > > > > > Again, looks interesting!  Looks plausible, although I have not 100%
> > > > > > > > convinced myself that it is perfectly bug-free.  But I do maintain
> > > > > > > > a healthy skepticism of purported RCU algorithms, especially ones that
> > > > > > > > I have written.  ;-)
> > > > > > > > 
> > > > > > > 
> > > > > > > That's always good. I also tend to always be very skeptical about what I
> > > > > > > write and review.
> > > > > > > 
> > > > > > > Thanks for the thorough review.
> > > > > > 
> > > > > > No problem -- it has been quite fun!  ;-)
> > > > > > 
> > > > > > 							Thanx, Paul
> > > > > > 
> > > > > 
> > > > > -- 
> > > > > Mathieu Desnoyers
> > > > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > > > > 
> > > > > _______________________________________________
> > > > > ltt-dev mailing list
> > > > > ltt-dev@lists.casi.polymtl.ca
> > > > > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > > > > 
> > > > 
> > > > -- 
> > > > Mathieu Desnoyers
> > > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > > 
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09 18:13                     ` Mathieu Desnoyers
@ 2009-02-09 18:19                       ` Mathieu Desnoyers
  2009-02-09 18:37                       ` Paul E. McKenney
  1 sibling, 0 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-09 18:19 UTC (permalink / raw)
  To: Paul E. McKenney, H. Peter Anvin, Christoph Hellwig; +Cc: ltt-dev, linux-kernel

* Mathieu Desnoyers (compudj@krystal.dyndns.org) wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Mon, Feb 09, 2009 at 12:28:17PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Mon, Feb 09, 2009 at 12:17:37AM -0500, Mathieu Desnoyers wrote:
> > 
> > [ . . . ]
> > 
> > > > > The new version is pushed into the repository. I changed you patch a
> > > > > bit. Flaming is welcome. :)
> > > > 
> > > > Looks reasonable at first glance.  Just out of curiosity, why are
> > > > urcu_gp_ctr and urcu_active_readers int rather than char?  I guess that
> > > > one reason would be that many architectures work better with int than
> > > > with char...
> > > 
> > > Exactly. This is done to make sure we don't end up having false register
> > > dependencies causing stalls on such architectures. I'll add a comment.
> > 
> > Are there any 64-bit architectures that would prefer a long to an int?
> > (Other than really old Alpha CPUs, that is.)
> > 
> 
> None that I am aware of, but Christoph or Peter would probably know more
> than I do on this aspect.
> 

Well, I had to put back a "long" rather than a "int" since we not
support 2^32 nesting nevels on 64-bits. The following bit, used for
quiescent state "parity", ends up being the 33rd bit, which needs a
64-bit long.

Mathieu

> > > > So, how many cycles did this save?  ;-)
> > > 
> > > On x86_64, it's pretty much the same as before. It just helps having the
> > > 32 and 64 bits algorithms being exactly the same, which I think is a
> > > very good thing.
> > 
> > Good point!
> > 
> > > BTW, my tests were done without any CMOV instruction due to the standard
> > > gcc options I used. Given think past discussion about CMOV :
> > > 
> > > http://ondioline.org/mail/cmov-a-bad-idea-on-out-of-order-cpus
> > > 
> > > It does not seem like such a good idea to use it anyway, given it can
> > > take 10 cycles to run on a P4a
> > 
> > Fair enough!
> > 
> > > BTW, do you think having the 256 nested rcu read locks limitation could
> > > become a problem ? I really think an application has recursion problem
> > > if it does, but this is not impossible, especially on a particularly
> > > badly designed tree-traversal algorithm on a 64-bits arch...
> > 
> > I don't know of any code in the Linux kernel that nests rcu_read_lock()
> > anywhere near that deep.  And if someone does find such a case, it is
> > pretty easy to use 15 bits rather than 8 to hold the nesting depth, just
> > by changing the definition of RCU_GP_CTR_BIT.
> > 
> 
> You know what ? Changing RCU_GP_CTR_BIT to 16 uses a
> testw %ax, %ax instead of a testb %al, %al. The trick here is that
> RCU_GP_CTR_BIT must be a multiple of 8 so we can use a full 8-bits,
> 16-bits or 32-bits bitmask for the lower order bits.
> 
> On 64-bits, using a RCU_GP_CTR_BIT of 32 is also ok. It uses a testl.
> 
> To provide 32-bits compability and allow the deepest nesting possible, I
> think it makes sense to use
> 
> /* Use the amount of bits equal to half of the architecture long size */
> #define RCU_GP_CTR_BIT (sizeof(long) << 2)
> 
> Mathieu
> 
> 
> > 							Thanx, Paul
> > 
> > > Mathieu
> > > 
> > > > 						Thanx, Paul
> > > > 
> > > > > Mathieu
> > > > > 
> > > > > > Mathieu
> > > > > > 
> > > > > > > > > Again, looks interesting!  Looks plausible, although I have not 100%
> > > > > > > > > convinced myself that it is perfectly bug-free.  But I do maintain
> > > > > > > > > a healthy skepticism of purported RCU algorithms, especially ones that
> > > > > > > > > I have written.  ;-)
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > That's always good. I also tend to always be very skeptical about what I
> > > > > > > > write and review.
> > > > > > > > 
> > > > > > > > Thanks for the thorough review.
> > > > > > > 
> > > > > > > No problem -- it has been quite fun!  ;-)
> > > > > > > 
> > > > > > > 							Thanx, Paul
> > > > > > > 
> > > > > > 
> > > > > > -- 
> > > > > > Mathieu Desnoyers
> > > > > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > > > > > 
> > > > > > _______________________________________________
> > > > > > ltt-dev mailing list
> > > > > > ltt-dev@lists.casi.polymtl.ca
> > > > > > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > > > > > 
> > > > > 
> > > > > -- 
> > > > > Mathieu Desnoyers
> > > > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > > > 
> > > 
> > > -- 
> > > Mathieu Desnoyers
> > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09 18:13                     ` Mathieu Desnoyers
  2009-02-09 18:19                       ` Mathieu Desnoyers
@ 2009-02-09 18:37                       ` Paul E. McKenney
  2009-02-09 18:49                         ` Paul E. McKenney
  1 sibling, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-09 18:37 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: H. Peter Anvin, Christoph Hellwig, ltt-dev, linux-kernel

On Mon, Feb 09, 2009 at 01:13:41PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Mon, Feb 09, 2009 at 12:28:17PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Mon, Feb 09, 2009 at 12:17:37AM -0500, Mathieu Desnoyers wrote:
> > 
> > [ . . . ]
> > 
> > > > > The new version is pushed into the repository. I changed you patch a
> > > > > bit. Flaming is welcome. :)
> > > > 
> > > > Looks reasonable at first glance.  Just out of curiosity, why are
> > > > urcu_gp_ctr and urcu_active_readers int rather than char?  I guess that
> > > > one reason would be that many architectures work better with int than
> > > > with char...
> > > 
> > > Exactly. This is done to make sure we don't end up having false register
> > > dependencies causing stalls on such architectures. I'll add a comment.
> > 
> > Are there any 64-bit architectures that would prefer a long to an int?
> > (Other than really old Alpha CPUs, that is.)
> 
> None that I am aware of, but Christoph or Peter would probably know more
> than I do on this aspect.
> 
> > > > So, how many cycles did this save?  ;-)
> > > 
> > > On x86_64, it's pretty much the same as before. It just helps having the
> > > 32 and 64 bits algorithms being exactly the same, which I think is a
> > > very good thing.
> > 
> > Good point!
> > 
> > > BTW, my tests were done without any CMOV instruction due to the standard
> > > gcc options I used. Given think past discussion about CMOV :
> > > 
> > > http://ondioline.org/mail/cmov-a-bad-idea-on-out-of-order-cpus
> > > 
> > > It does not seem like such a good idea to use it anyway, given it can
> > > take 10 cycles to run on a P4a
> > 
> > Fair enough!
> > 
> > > BTW, do you think having the 256 nested rcu read locks limitation could
> > > become a problem ? I really think an application has recursion problem
> > > if it does, but this is not impossible, especially on a particularly
> > > badly designed tree-traversal algorithm on a 64-bits arch...
> > 
> > I don't know of any code in the Linux kernel that nests rcu_read_lock()
> > anywhere near that deep.  And if someone does find such a case, it is
> > pretty easy to use 15 bits rather than 8 to hold the nesting depth, just
> > by changing the definition of RCU_GP_CTR_BIT.
> > 
> 
> You know what ? Changing RCU_GP_CTR_BIT to 16 uses a
> testw %ax, %ax instead of a testb %al, %al. The trick here is that
> RCU_GP_CTR_BIT must be a multiple of 8 so we can use a full 8-bits,
> 16-bits or 32-bits bitmask for the lower order bits.
> 
> On 64-bits, using a RCU_GP_CTR_BIT of 32 is also ok. It uses a testl.
> 
> To provide 32-bits compability and allow the deepest nesting possible, I
> think it makes sense to use
> 
> /* Use the amount of bits equal to half of the architecture long size */
> #define RCU_GP_CTR_BIT (sizeof(long) << 2)

You lost me on this one:

	sizeof(long) << 2 = 0x10

I could believe the following (run on a 32-bit machine):

	1 << (sizeof(long) * 8 - 1) = 0x80000000

Or, if you were wanting to use a bit halfway up the word, perhaps this:

	1 << (sizeof(long) * 4 - 1) = 0x8000

Or am I confused?

							Thanx, Paul

> Mathieu
> 
> 
> > 							Thanx, Paul
> > 
> > > Mathieu
> > > 
> > > > 						Thanx, Paul
> > > > 
> > > > > Mathieu
> > > > > 
> > > > > > Mathieu
> > > > > > 
> > > > > > > > > Again, looks interesting!  Looks plausible, although I have not 100%
> > > > > > > > > convinced myself that it is perfectly bug-free.  But I do maintain
> > > > > > > > > a healthy skepticism of purported RCU algorithms, especially ones that
> > > > > > > > > I have written.  ;-)
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > That's always good. I also tend to always be very skeptical about what I
> > > > > > > > write and review.
> > > > > > > > 
> > > > > > > > Thanks for the thorough review.
> > > > > > > 
> > > > > > > No problem -- it has been quite fun!  ;-)
> > > > > > > 
> > > > > > > 							Thanx, Paul
> > > > > > > 
> > > > > > 
> > > > > > -- 
> > > > > > Mathieu Desnoyers
> > > > > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > > > > > 
> > > > > > _______________________________________________
> > > > > > ltt-dev mailing list
> > > > > > ltt-dev@lists.casi.polymtl.ca
> > > > > > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > > > > > 
> > > > > 
> > > > > -- 
> > > > > Mathieu Desnoyers
> > > > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > > > 
> > > 
> > > -- 
> > > Mathieu Desnoyers
> > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09 18:37                       ` Paul E. McKenney
@ 2009-02-09 18:49                         ` Paul E. McKenney
  2009-02-09 19:05                           ` Mathieu Desnoyers
  0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-09 18:49 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: H. Peter Anvin, Christoph Hellwig, ltt-dev, linux-kernel

On Mon, Feb 09, 2009 at 10:37:42AM -0800, Paul E. McKenney wrote:
> On Mon, Feb 09, 2009 at 01:13:41PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:

[ . . . ]

> > You know what ? Changing RCU_GP_CTR_BIT to 16 uses a
> > testw %ax, %ax instead of a testb %al, %al. The trick here is that
> > RCU_GP_CTR_BIT must be a multiple of 8 so we can use a full 8-bits,
> > 16-bits or 32-bits bitmask for the lower order bits.
> > 
> > On 64-bits, using a RCU_GP_CTR_BIT of 32 is also ok. It uses a testl.
> > 
> > To provide 32-bits compability and allow the deepest nesting possible, I
> > think it makes sense to use
> > 
> > /* Use the amount of bits equal to half of the architecture long size */
> > #define RCU_GP_CTR_BIT (sizeof(long) << 2)
> 
> You lost me on this one:
> 
> 	sizeof(long) << 2 = 0x10
> 
> I could believe the following (run on a 32-bit machine):
> 
> 	1 << (sizeof(long) * 8 - 1) = 0x80000000
> 
> Or, if you were wanting to use a bit halfway up the word, perhaps this:
> 
> 	1 << (sizeof(long) * 4 - 1) = 0x8000
> 
> Or am I confused?

Well, I am at least partly confused.  You were wanting a low-order bit,
so you want to lose the "- 1" above.  Here are some of the possibilities:

	sizeof(long) = 0x4
	sizeof(long) << 2 = 0x10
	1 << (sizeof(long) * 8 - 1) = 0x80000000
	1 << (sizeof(long) * 4) = 0x10000
	1 << (sizeof(long) * 4 - 1) = 0x8000
	1 << (sizeof(long) * 2) = 0x100
	1 << (sizeof(long) * 2 - 1) = 0x80

My guess is that 1 << (sizeof(long) * 4) and 1 << (sizeof(long) * 2)
are of the most interest.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09 18:49                         ` Paul E. McKenney
@ 2009-02-09 19:05                           ` Mathieu Desnoyers
  2009-02-09 19:15                             ` Mathieu Desnoyers
  2009-02-09 19:23                             ` Paul E. McKenney
  0 siblings, 2 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-09 19:05 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: H. Peter Anvin, Christoph Hellwig, ltt-dev, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Mon, Feb 09, 2009 at 10:37:42AM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 09, 2009 at 01:13:41PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> 
> [ . . . ]
> 
> > > You know what ? Changing RCU_GP_CTR_BIT to 16 uses a
> > > testw %ax, %ax instead of a testb %al, %al. The trick here is that
> > > RCU_GP_CTR_BIT must be a multiple of 8 so we can use a full 8-bits,
> > > 16-bits or 32-bits bitmask for the lower order bits.
> > > 
> > > On 64-bits, using a RCU_GP_CTR_BIT of 32 is also ok. It uses a testl.
> > > 
> > > To provide 32-bits compability and allow the deepest nesting possible, I
> > > think it makes sense to use
> > > 
> > > /* Use the amount of bits equal to half of the architecture long size */
> > > #define RCU_GP_CTR_BIT (sizeof(long) << 2)
> > 
> > You lost me on this one:
> > 
> > 	sizeof(long) << 2 = 0x10
> > 
> > I could believe the following (run on a 32-bit machine):
> > 
> > 	1 << (sizeof(long) * 8 - 1) = 0x80000000
> > 
> > Or, if you were wanting to use a bit halfway up the word, perhaps this:
> > 
> > 	1 << (sizeof(long) * 4 - 1) = 0x8000
> > 
> > Or am I confused?
> 
> Well, I am at least partly confused.  You were wanting a low-order bit,
> so you want to lose the "- 1" above.  Here are some of the possibilities:
> 
> 	sizeof(long) = 0x4
> 	sizeof(long) << 2 = 0x10
> 	1 << (sizeof(long) * 8 - 1) = 0x80000000
> 	1 << (sizeof(long) * 4) = 0x10000
> 	1 << (sizeof(long) * 4 - 1) = 0x8000
> 	1 << (sizeof(long) * 2) = 0x100
> 	1 << (sizeof(long) * 2 - 1) = 0x80
> 
> My guess is that 1 << (sizeof(long) * 4) and 1 << (sizeof(long) * 2)
> are of the most interest.
> 

Exactly. I'll change it to :

#define RCU_GP_CTR_BIT          (1 << (sizeof(long) << 2))

I somehow thought this define was used as a bit number rather than the
bit mask.

Thanks,

Mathieu



> 							Thanx, Paul
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09 19:05                           ` Mathieu Desnoyers
@ 2009-02-09 19:15                             ` Mathieu Desnoyers
  2009-02-09 19:35                               ` Paul E. McKenney
  2009-02-09 19:23                             ` Paul E. McKenney
  1 sibling, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-09 19:15 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Christoph Hellwig, ltt-dev, linux-kernel, H. Peter Anvin

* Mathieu Desnoyers (compudj@krystal.dyndns.org) wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Mon, Feb 09, 2009 at 10:37:42AM -0800, Paul E. McKenney wrote:
> > > On Mon, Feb 09, 2009 at 01:13:41PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > 
> > [ . . . ]
> > 
> > > > You know what ? Changing RCU_GP_CTR_BIT to 16 uses a
> > > > testw %ax, %ax instead of a testb %al, %al. The trick here is that
> > > > RCU_GP_CTR_BIT must be a multiple of 8 so we can use a full 8-bits,
> > > > 16-bits or 32-bits bitmask for the lower order bits.
> > > > 
> > > > On 64-bits, using a RCU_GP_CTR_BIT of 32 is also ok. It uses a testl.
> > > > 
> > > > To provide 32-bits compability and allow the deepest nesting possible, I
> > > > think it makes sense to use
> > > > 
> > > > /* Use the amount of bits equal to half of the architecture long size */
> > > > #define RCU_GP_CTR_BIT (sizeof(long) << 2)
> > > 
> > > You lost me on this one:
> > > 
> > > 	sizeof(long) << 2 = 0x10
> > > 
> > > I could believe the following (run on a 32-bit machine):
> > > 
> > > 	1 << (sizeof(long) * 8 - 1) = 0x80000000
> > > 
> > > Or, if you were wanting to use a bit halfway up the word, perhaps this:
> > > 
> > > 	1 << (sizeof(long) * 4 - 1) = 0x8000
> > > 
> > > Or am I confused?
> > 
> > Well, I am at least partly confused.  You were wanting a low-order bit,
> > so you want to lose the "- 1" above.  Here are some of the possibilities:
> > 
> > 	sizeof(long) = 0x4
> > 	sizeof(long) << 2 = 0x10
> > 	1 << (sizeof(long) * 8 - 1) = 0x80000000
> > 	1 << (sizeof(long) * 4) = 0x10000
> > 	1 << (sizeof(long) * 4 - 1) = 0x8000
> > 	1 << (sizeof(long) * 2) = 0x100
> > 	1 << (sizeof(long) * 2 - 1) = 0x80
> > 
> > My guess is that 1 << (sizeof(long) * 4) and 1 << (sizeof(long) * 2)
> > are of the most interest.
> > 
> 
> Exactly. I'll change it to :
> 
> #define RCU_GP_CTR_BIT          (1 << (sizeof(long) << 2))
> 
> I somehow thought this define was used as a bit number rather than the
> bit mask.
> 
> Thanks,
> 
> Mathieu
> 

It's pushed in the git tree. I also removed an increment in the fast
path by initializing urcu_gp_ctr to RCU_GP_COUNT.

It brings benchmarks to :

Time per read : 6.87183 to 7.25318 cycles

So we seem to save about half a cycle to a cycle with this.

Mathieu


> 
> 
> > 							Thanx, Paul
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09 19:05                           ` Mathieu Desnoyers
  2009-02-09 19:15                             ` Mathieu Desnoyers
@ 2009-02-09 19:23                             ` Paul E. McKenney
  1 sibling, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-09 19:23 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: H. Peter Anvin, Christoph Hellwig, ltt-dev, linux-kernel

On Mon, Feb 09, 2009 at 02:05:09PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Mon, Feb 09, 2009 at 10:37:42AM -0800, Paul E. McKenney wrote:
> > > On Mon, Feb 09, 2009 at 01:13:41PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > 
> > [ . . . ]
> > 
> > > > You know what ? Changing RCU_GP_CTR_BIT to 16 uses a
> > > > testw %ax, %ax instead of a testb %al, %al. The trick here is that
> > > > RCU_GP_CTR_BIT must be a multiple of 8 so we can use a full 8-bits,
> > > > 16-bits or 32-bits bitmask for the lower order bits.
> > > > 
> > > > On 64-bits, using a RCU_GP_CTR_BIT of 32 is also ok. It uses a testl.
> > > > 
> > > > To provide 32-bits compability and allow the deepest nesting possible, I
> > > > think it makes sense to use
> > > > 
> > > > /* Use the amount of bits equal to half of the architecture long size */
> > > > #define RCU_GP_CTR_BIT (sizeof(long) << 2)
> > > 
> > > You lost me on this one:
> > > 
> > > 	sizeof(long) << 2 = 0x10
> > > 
> > > I could believe the following (run on a 32-bit machine):
> > > 
> > > 	1 << (sizeof(long) * 8 - 1) = 0x80000000
> > > 
> > > Or, if you were wanting to use a bit halfway up the word, perhaps this:
> > > 
> > > 	1 << (sizeof(long) * 4 - 1) = 0x8000
> > > 
> > > Or am I confused?
> > 
> > Well, I am at least partly confused.  You were wanting a low-order bit,
> > so you want to lose the "- 1" above.  Here are some of the possibilities:
> > 
> > 	sizeof(long) = 0x4
> > 	sizeof(long) << 2 = 0x10
> > 	1 << (sizeof(long) * 8 - 1) = 0x80000000
> > 	1 << (sizeof(long) * 4) = 0x10000
> > 	1 << (sizeof(long) * 4 - 1) = 0x8000
> > 	1 << (sizeof(long) * 2) = 0x100
> > 	1 << (sizeof(long) * 2 - 1) = 0x80
> > 
> > My guess is that 1 << (sizeof(long) * 4) and 1 << (sizeof(long) * 2)
> > are of the most interest.
> > 
> 
> Exactly. I'll change it to :
> 
> #define RCU_GP_CTR_BIT          (1 << (sizeof(long) << 2))
> 
> I somehow thought this define was used as a bit number rather than the
> bit mask.

Ah!  Been there, done that!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09 19:15                             ` Mathieu Desnoyers
@ 2009-02-09 19:35                               ` Paul E. McKenney
  0 siblings, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-09 19:35 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Christoph Hellwig, ltt-dev, linux-kernel, H. Peter Anvin

On Mon, Feb 09, 2009 at 02:15:26PM -0500, Mathieu Desnoyers wrote:
> * Mathieu Desnoyers (compudj@krystal.dyndns.org) wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Mon, Feb 09, 2009 at 10:37:42AM -0800, Paul E. McKenney wrote:
> > > > On Mon, Feb 09, 2009 at 01:13:41PM -0500, Mathieu Desnoyers wrote:
> > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > 
> > > [ . . . ]
> > > 
> > > > > You know what ? Changing RCU_GP_CTR_BIT to 16 uses a
> > > > > testw %ax, %ax instead of a testb %al, %al. The trick here is that
> > > > > RCU_GP_CTR_BIT must be a multiple of 8 so we can use a full 8-bits,
> > > > > 16-bits or 32-bits bitmask for the lower order bits.
> > > > > 
> > > > > On 64-bits, using a RCU_GP_CTR_BIT of 32 is also ok. It uses a testl.
> > > > > 
> > > > > To provide 32-bits compability and allow the deepest nesting possible, I
> > > > > think it makes sense to use
> > > > > 
> > > > > /* Use the amount of bits equal to half of the architecture long size */
> > > > > #define RCU_GP_CTR_BIT (sizeof(long) << 2)
> > > > 
> > > > You lost me on this one:
> > > > 
> > > > 	sizeof(long) << 2 = 0x10
> > > > 
> > > > I could believe the following (run on a 32-bit machine):
> > > > 
> > > > 	1 << (sizeof(long) * 8 - 1) = 0x80000000
> > > > 
> > > > Or, if you were wanting to use a bit halfway up the word, perhaps this:
> > > > 
> > > > 	1 << (sizeof(long) * 4 - 1) = 0x8000
> > > > 
> > > > Or am I confused?
> > > 
> > > Well, I am at least partly confused.  You were wanting a low-order bit,
> > > so you want to lose the "- 1" above.  Here are some of the possibilities:
> > > 
> > > 	sizeof(long) = 0x4
> > > 	sizeof(long) << 2 = 0x10
> > > 	1 << (sizeof(long) * 8 - 1) = 0x80000000
> > > 	1 << (sizeof(long) * 4) = 0x10000
> > > 	1 << (sizeof(long) * 4 - 1) = 0x8000
> > > 	1 << (sizeof(long) * 2) = 0x100
> > > 	1 << (sizeof(long) * 2 - 1) = 0x80
> > > 
> > > My guess is that 1 << (sizeof(long) * 4) and 1 << (sizeof(long) * 2)
> > > are of the most interest.
> > > 
> > 
> > Exactly. I'll change it to :
> > 
> > #define RCU_GP_CTR_BIT          (1 << (sizeof(long) << 2))
> > 
> > I somehow thought this define was used as a bit number rather than the
> > bit mask.
> > 
> > Thanks,
> > 
> > Mathieu
> > 
> 
> It's pushed in the git tree. I also removed an increment in the fast
> path by initializing urcu_gp_ctr to RCU_GP_COUNT.
> 
> It brings benchmarks to :
> 
> Time per read : 6.87183 to 7.25318 cycles
> 
> So we seem to save about half a cycle to a cycle with this.

I like it!!!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-09 15:33                 ` Paul E. McKenney
@ 2009-02-10 19:17                   ` Mathieu Desnoyers
  2009-02-10 21:16                     ` Paul E. McKenney
  2009-02-11  5:08                     ` Lai Jiangshan
  0 siblings, 2 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-10 19:17 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Mon, Feb 09, 2009 at 02:03:17AM -0500, Mathieu Desnoyers wrote:
> 
> [ . . . ]
> 
> > I just added modified rcutorture.h and api.h from your git tree
> > specifically for an urcutorture program to the repository. Some results :
> > 
> > 8-way x86_64
> > E5405 @2 GHZ
> > 
> > ./urcutorture 8 perf
> > n_reads: 1937650000  n_updates: 3  nreaders: 8  nupdaters: 1 duration: 1
> > ns/read: 4.12871  ns/update: 3.33333e+08
> > 
> > ./urcutorture 8 uperf
> > n_reads: 0  n_updates: 4413892  nreaders: 0  nupdaters: 8 duration: 1
> > ns/read: nan  ns/update: 1812.46
> > 
> > n_reads: 98844204  n_updates: 10  n_mberror: 0
> > rcu_stress_count: 98844171 33 0 0 0 0 0 0 0 0 0
> > 
> > However, I've tried removing the second switch_qparity() call, and the
> > rcutorture test did not detect anything wrong. I also did a variation
> > which calls the "sched_yield" version of the urcu, "urcutorture-yield".
> 
> My confusion -- I was testing my old approach where the memory barriers
> are in rcu_read_lock() and rcu_read_unlock().  To force the failures in
> your signal-handler-memory-barrier approach, I suspect that you are
> going to need a bigger hammer.  In this case, one such bigger hammer
> would be:
> 
> o	Just before exit from the signal handler, do a
> 	pthread_cond_wait() under a pthread_mutex().
> 
> o	In force_mb_all_threads(), refrain from sending a signal to self.
> 
> 	Then it should be safe in force_mb_all_threads() to do a
> 	pthread_cond_broadcast() under the same pthread_mutex().
> 
> This should raise the probability of seeing the failure in the case
> where there is a single switch_qparity().
> 

I just did a mb() version of the urcu :

(uncomment CFLAGS=+-DDEBUG_FULL_MB in the Makefile)

Time per read : 48.4086 cycles
(about 6-7 times slower, as expected)

This will be useful especially to increase the chance to trigger races.

I tried removing the second parity switch from the writer. The rcu
torture test did not find the problem yet (maybe I am not using the
correct parameters ? It does not run for more than 5 seconds).

So I added a "-n" option to test_urcu, so it can make the usleep(1)
between the writes optional. I also changed the yield for a usleep with
random delay. I also now use a circular buffer rather than malloc so we
are sure the memory is not quickly reused by the writer and stays longer
in an invalid state.

So what really make the problem appear quickly is to add a delay between
the rcu_dereference and the assertion on the data validity in thr_reader.

It now appears after just a few seconds when running
./test_urcu_yield 20 -r -n
Compiled with CFLAGS=+-DDEBUG_FULL_MB

It seem to be much harder to trigger with the signal-based version. It's
expected, because the writer takes about 50 times longer to execute than
with the -DDEBUG_FULL_MB version.

So I'll let the ./test_urcu_yield NN -r -n run for a while on the
correct version (with DEBUG_FULL_MB) and see what it gives.

Mathieu


> 							Thanx, Paul
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-10 19:17                   ` Mathieu Desnoyers
@ 2009-02-10 21:16                     ` Paul E. McKenney
  2009-02-10 21:28                       ` Mathieu Desnoyers
  2009-02-11  5:08                     ` Lai Jiangshan
  1 sibling, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-10 21:16 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

On Tue, Feb 10, 2009 at 02:17:31PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Mon, Feb 09, 2009 at 02:03:17AM -0500, Mathieu Desnoyers wrote:
> > 
> > [ . . . ]
> > 
> > > I just added modified rcutorture.h and api.h from your git tree
> > > specifically for an urcutorture program to the repository. Some results :
> > > 
> > > 8-way x86_64
> > > E5405 @2 GHZ
> > > 
> > > ./urcutorture 8 perf
> > > n_reads: 1937650000  n_updates: 3  nreaders: 8  nupdaters: 1 duration: 1
> > > ns/read: 4.12871  ns/update: 3.33333e+08
> > > 
> > > ./urcutorture 8 uperf
> > > n_reads: 0  n_updates: 4413892  nreaders: 0  nupdaters: 8 duration: 1
> > > ns/read: nan  ns/update: 1812.46
> > > 
> > > n_reads: 98844204  n_updates: 10  n_mberror: 0
> > > rcu_stress_count: 98844171 33 0 0 0 0 0 0 0 0 0
> > > 
> > > However, I've tried removing the second switch_qparity() call, and the
> > > rcutorture test did not detect anything wrong. I also did a variation
> > > which calls the "sched_yield" version of the urcu, "urcutorture-yield".
> > 
> > My confusion -- I was testing my old approach where the memory barriers
> > are in rcu_read_lock() and rcu_read_unlock().  To force the failures in
> > your signal-handler-memory-barrier approach, I suspect that you are
> > going to need a bigger hammer.  In this case, one such bigger hammer
> > would be:
> > 
> > o	Just before exit from the signal handler, do a
> > 	pthread_cond_wait() under a pthread_mutex().
> > 
> > o	In force_mb_all_threads(), refrain from sending a signal to self.
> > 
> > 	Then it should be safe in force_mb_all_threads() to do a
> > 	pthread_cond_broadcast() under the same pthread_mutex().
> > 
> > This should raise the probability of seeing the failure in the case
> > where there is a single switch_qparity().
> > 
> 
> I just did a mb() version of the urcu :
> 
> (uncomment CFLAGS=+-DDEBUG_FULL_MB in the Makefile)
> 
> Time per read : 48.4086 cycles
> (about 6-7 times slower, as expected)
> 
> This will be useful especially to increase the chance to trigger races.
> 
> I tried removing the second parity switch from the writer. The rcu
> torture test did not find the problem yet (maybe I am not using the
> correct parameters ? It does not run for more than 5 seconds).
> 
> So I added a "-n" option to test_urcu, so it can make the usleep(1)
> between the writes optional. I also changed the yield for a usleep with
> random delay. I also now use a circular buffer rather than malloc so we
> are sure the memory is not quickly reused by the writer and stays longer
> in an invalid state.
> 
> So what really make the problem appear quickly is to add a delay between
> the rcu_dereference and the assertion on the data validity in thr_reader.
> 
> It now appears after just a few seconds when running
> ./test_urcu_yield 20 -r -n
> Compiled with CFLAGS=+-DDEBUG_FULL_MB
> 
> It seem to be much harder to trigger with the signal-based version. It's
> expected, because the writer takes about 50 times longer to execute than
> with the -DDEBUG_FULL_MB version.
> 
> So I'll let the ./test_urcu_yield NN -r -n run for a while on the
> correct version (with DEBUG_FULL_MB) and see what it gives.

Hmmm...  I had worse luck this time, took three 10-second tries to
see a failure:

paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ ./rcu_nest32 1 stress
n_reads: 44682055  n_updates: 9609503  n_mberror: 0
rcu_stress_count: 44679377 2678 0 0 0 0 0 0 0 0 0
paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
./rcu_nest32 1 stress
n_reads: 42281884  n_updates: 9870129  n_mberror: 0
rcu_stress_count: 42277756 4128 0 0 0 0 0 0 0 0 0
paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
./rcu_nest32 1 stress
n_reads: 41384304  n_updates: 10040805  n_mberror: 0
rcu_stress_count: 41380075 4228 1 0 0 0 0 0 0 0 0
paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$

This is my prototype version, with read-side memory barriers, no
signals, and without your initialization-value speedup.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-10 21:16                     ` Paul E. McKenney
@ 2009-02-10 21:28                       ` Mathieu Desnoyers
  2009-02-10 22:21                         ` Paul E. McKenney
  0 siblings, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-10 21:28 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Tue, Feb 10, 2009 at 02:17:31PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Mon, Feb 09, 2009 at 02:03:17AM -0500, Mathieu Desnoyers wrote:
> > > 
> > > [ . . . ]
> > > 
> > > > I just added modified rcutorture.h and api.h from your git tree
> > > > specifically for an urcutorture program to the repository. Some results :
> > > > 
> > > > 8-way x86_64
> > > > E5405 @2 GHZ
> > > > 
> > > > ./urcutorture 8 perf
> > > > n_reads: 1937650000  n_updates: 3  nreaders: 8  nupdaters: 1 duration: 1
> > > > ns/read: 4.12871  ns/update: 3.33333e+08
> > > > 
> > > > ./urcutorture 8 uperf
> > > > n_reads: 0  n_updates: 4413892  nreaders: 0  nupdaters: 8 duration: 1
> > > > ns/read: nan  ns/update: 1812.46
> > > > 
> > > > n_reads: 98844204  n_updates: 10  n_mberror: 0
> > > > rcu_stress_count: 98844171 33 0 0 0 0 0 0 0 0 0
> > > > 
> > > > However, I've tried removing the second switch_qparity() call, and the
> > > > rcutorture test did not detect anything wrong. I also did a variation
> > > > which calls the "sched_yield" version of the urcu, "urcutorture-yield".
> > > 
> > > My confusion -- I was testing my old approach where the memory barriers
> > > are in rcu_read_lock() and rcu_read_unlock().  To force the failures in
> > > your signal-handler-memory-barrier approach, I suspect that you are
> > > going to need a bigger hammer.  In this case, one such bigger hammer
> > > would be:
> > > 
> > > o	Just before exit from the signal handler, do a
> > > 	pthread_cond_wait() under a pthread_mutex().
> > > 
> > > o	In force_mb_all_threads(), refrain from sending a signal to self.
> > > 
> > > 	Then it should be safe in force_mb_all_threads() to do a
> > > 	pthread_cond_broadcast() under the same pthread_mutex().
> > > 
> > > This should raise the probability of seeing the failure in the case
> > > where there is a single switch_qparity().
> > > 
> > 
> > I just did a mb() version of the urcu :
> > 
> > (uncomment CFLAGS=+-DDEBUG_FULL_MB in the Makefile)
> > 
> > Time per read : 48.4086 cycles
> > (about 6-7 times slower, as expected)
> > 
> > This will be useful especially to increase the chance to trigger races.
> > 
> > I tried removing the second parity switch from the writer. The rcu
> > torture test did not find the problem yet (maybe I am not using the
> > correct parameters ? It does not run for more than 5 seconds).
> > 
> > So I added a "-n" option to test_urcu, so it can make the usleep(1)
> > between the writes optional. I also changed the yield for a usleep with
> > random delay. I also now use a circular buffer rather than malloc so we
> > are sure the memory is not quickly reused by the writer and stays longer
> > in an invalid state.
> > 
> > So what really make the problem appear quickly is to add a delay between
> > the rcu_dereference and the assertion on the data validity in thr_reader.
> > 
> > It now appears after just a few seconds when running
> > ./test_urcu_yield 20 -r -n
> > Compiled with CFLAGS=+-DDEBUG_FULL_MB
> > 
> > It seem to be much harder to trigger with the signal-based version. It's
> > expected, because the writer takes about 50 times longer to execute than
> > with the -DDEBUG_FULL_MB version.
> > 
> > So I'll let the ./test_urcu_yield NN -r -n run for a while on the
> > correct version (with DEBUG_FULL_MB) and see what it gives.
> 
> Hmmm...  I had worse luck this time, took three 10-second tries to
> see a failure:
> 
> paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ ./rcu_nest32 1 stress
> n_reads: 44682055  n_updates: 9609503  n_mberror: 0
> rcu_stress_count: 44679377 2678 0 0 0 0 0 0 0 0 0
> paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> ./rcu_nest32 1 stress
> n_reads: 42281884  n_updates: 9870129  n_mberror: 0
> rcu_stress_count: 42277756 4128 0 0 0 0 0 0 0 0 0
> paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> ./rcu_nest32 1 stress
> n_reads: 41384304  n_updates: 10040805  n_mberror: 0
> rcu_stress_count: 41380075 4228 1 0 0 0 0 0 0 0 0
> paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$
> 
> This is my prototype version, with read-side memory barriers, no
> signals, and without your initialization-value speedup.
> 

It would be interesting to re-sync our trees, or if you can point me to
a current version of your prototype, I could review it.

Thanks,

Mathieu

> 							Thanx, Paul
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-10 21:28                       ` Mathieu Desnoyers
@ 2009-02-10 22:21                         ` Paul E. McKenney
  2009-02-10 22:58                           ` Paul E. McKenney
  2009-02-11  0:57                           ` Mathieu Desnoyers
  0 siblings, 2 replies; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-10 22:21 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

On Tue, Feb 10, 2009 at 04:28:33PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Tue, Feb 10, 2009 at 02:17:31PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Mon, Feb 09, 2009 at 02:03:17AM -0500, Mathieu Desnoyers wrote:
> > > > 
> > > > [ . . . ]
> > > > 
> > > > > I just added modified rcutorture.h and api.h from your git tree
> > > > > specifically for an urcutorture program to the repository. Some results :
> > > > > 
> > > > > 8-way x86_64
> > > > > E5405 @2 GHZ
> > > > > 
> > > > > ./urcutorture 8 perf
> > > > > n_reads: 1937650000  n_updates: 3  nreaders: 8  nupdaters: 1 duration: 1
> > > > > ns/read: 4.12871  ns/update: 3.33333e+08
> > > > > 
> > > > > ./urcutorture 8 uperf
> > > > > n_reads: 0  n_updates: 4413892  nreaders: 0  nupdaters: 8 duration: 1
> > > > > ns/read: nan  ns/update: 1812.46
> > > > > 
> > > > > n_reads: 98844204  n_updates: 10  n_mberror: 0
> > > > > rcu_stress_count: 98844171 33 0 0 0 0 0 0 0 0 0
> > > > > 
> > > > > However, I've tried removing the second switch_qparity() call, and the
> > > > > rcutorture test did not detect anything wrong. I also did a variation
> > > > > which calls the "sched_yield" version of the urcu, "urcutorture-yield".
> > > > 
> > > > My confusion -- I was testing my old approach where the memory barriers
> > > > are in rcu_read_lock() and rcu_read_unlock().  To force the failures in
> > > > your signal-handler-memory-barrier approach, I suspect that you are
> > > > going to need a bigger hammer.  In this case, one such bigger hammer
> > > > would be:
> > > > 
> > > > o	Just before exit from the signal handler, do a
> > > > 	pthread_cond_wait() under a pthread_mutex().
> > > > 
> > > > o	In force_mb_all_threads(), refrain from sending a signal to self.
> > > > 
> > > > 	Then it should be safe in force_mb_all_threads() to do a
> > > > 	pthread_cond_broadcast() under the same pthread_mutex().
> > > > 
> > > > This should raise the probability of seeing the failure in the case
> > > > where there is a single switch_qparity().
> > > > 
> > > 
> > > I just did a mb() version of the urcu :
> > > 
> > > (uncomment CFLAGS=+-DDEBUG_FULL_MB in the Makefile)
> > > 
> > > Time per read : 48.4086 cycles
> > > (about 6-7 times slower, as expected)
> > > 
> > > This will be useful especially to increase the chance to trigger races.
> > > 
> > > I tried removing the second parity switch from the writer. The rcu
> > > torture test did not find the problem yet (maybe I am not using the
> > > correct parameters ? It does not run for more than 5 seconds).
> > > 
> > > So I added a "-n" option to test_urcu, so it can make the usleep(1)
> > > between the writes optional. I also changed the yield for a usleep with
> > > random delay. I also now use a circular buffer rather than malloc so we
> > > are sure the memory is not quickly reused by the writer and stays longer
> > > in an invalid state.
> > > 
> > > So what really make the problem appear quickly is to add a delay between
> > > the rcu_dereference and the assertion on the data validity in thr_reader.
> > > 
> > > It now appears after just a few seconds when running
> > > ./test_urcu_yield 20 -r -n
> > > Compiled with CFLAGS=+-DDEBUG_FULL_MB
> > > 
> > > It seem to be much harder to trigger with the signal-based version. It's
> > > expected, because the writer takes about 50 times longer to execute than
> > > with the -DDEBUG_FULL_MB version.
> > > 
> > > So I'll let the ./test_urcu_yield NN -r -n run for a while on the
> > > correct version (with DEBUG_FULL_MB) and see what it gives.
> > 
> > Hmmm...  I had worse luck this time, took three 10-second tries to
> > see a failure:
> > 
> > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ ./rcu_nest32 1 stress
> > n_reads: 44682055  n_updates: 9609503  n_mberror: 0
> > rcu_stress_count: 44679377 2678 0 0 0 0 0 0 0 0 0
> > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > ./rcu_nest32 1 stress
> > n_reads: 42281884  n_updates: 9870129  n_mberror: 0
> > rcu_stress_count: 42277756 4128 0 0 0 0 0 0 0 0 0
> > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > ./rcu_nest32 1 stress
> > n_reads: 41384304  n_updates: 10040805  n_mberror: 0
> > rcu_stress_count: 41380075 4228 1 0 0 0 0 0 0 0 0
> > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$
> > 
> > This is my prototype version, with read-side memory barriers, no
> > signals, and without your initialization-value speedup.
> > 
> 
> It would be interesting to re-sync our trees, or if you can point me to
> a current version of your prototype, I could review it.

Look at:

	CodeSamples/defer/rcu_nest32.[hc]

In the git archive:

	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-10 22:21                         ` Paul E. McKenney
@ 2009-02-10 22:58                           ` Paul E. McKenney
  2009-02-10 23:01                             ` Paul E. McKenney
  2009-02-11  0:57                           ` Mathieu Desnoyers
  1 sibling, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-10 22:58 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 5593 bytes --]

On Tue, Feb 10, 2009 at 02:21:15PM -0800, Paul E. McKenney wrote:
> On Tue, Feb 10, 2009 at 04:28:33PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Tue, Feb 10, 2009 at 02:17:31PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Mon, Feb 09, 2009 at 02:03:17AM -0500, Mathieu Desnoyers wrote:
> > > > > 
> > > > > [ . . . ]
> > > > > 
> > > > > > I just added modified rcutorture.h and api.h from your git tree
> > > > > > specifically for an urcutorture program to the repository. Some results :
> > > > > > 
> > > > > > 8-way x86_64
> > > > > > E5405 @2 GHZ
> > > > > > 
> > > > > > ./urcutorture 8 perf
> > > > > > n_reads: 1937650000  n_updates: 3  nreaders: 8  nupdaters: 1 duration: 1
> > > > > > ns/read: 4.12871  ns/update: 3.33333e+08
> > > > > > 
> > > > > > ./urcutorture 8 uperf
> > > > > > n_reads: 0  n_updates: 4413892  nreaders: 0  nupdaters: 8 duration: 1
> > > > > > ns/read: nan  ns/update: 1812.46
> > > > > > 
> > > > > > n_reads: 98844204  n_updates: 10  n_mberror: 0
> > > > > > rcu_stress_count: 98844171 33 0 0 0 0 0 0 0 0 0
> > > > > > 
> > > > > > However, I've tried removing the second switch_qparity() call, and the
> > > > > > rcutorture test did not detect anything wrong. I also did a variation
> > > > > > which calls the "sched_yield" version of the urcu, "urcutorture-yield".
> > > > > 
> > > > > My confusion -- I was testing my old approach where the memory barriers
> > > > > are in rcu_read_lock() and rcu_read_unlock().  To force the failures in
> > > > > your signal-handler-memory-barrier approach, I suspect that you are
> > > > > going to need a bigger hammer.  In this case, one such bigger hammer
> > > > > would be:
> > > > > 
> > > > > o	Just before exit from the signal handler, do a
> > > > > 	pthread_cond_wait() under a pthread_mutex().
> > > > > 
> > > > > o	In force_mb_all_threads(), refrain from sending a signal to self.
> > > > > 
> > > > > 	Then it should be safe in force_mb_all_threads() to do a
> > > > > 	pthread_cond_broadcast() under the same pthread_mutex().
> > > > > 
> > > > > This should raise the probability of seeing the failure in the case
> > > > > where there is a single switch_qparity().
> > > > > 
> > > > 
> > > > I just did a mb() version of the urcu :
> > > > 
> > > > (uncomment CFLAGS=+-DDEBUG_FULL_MB in the Makefile)
> > > > 
> > > > Time per read : 48.4086 cycles
> > > > (about 6-7 times slower, as expected)
> > > > 
> > > > This will be useful especially to increase the chance to trigger races.
> > > > 
> > > > I tried removing the second parity switch from the writer. The rcu
> > > > torture test did not find the problem yet (maybe I am not using the
> > > > correct parameters ? It does not run for more than 5 seconds).
> > > > 
> > > > So I added a "-n" option to test_urcu, so it can make the usleep(1)
> > > > between the writes optional. I also changed the yield for a usleep with
> > > > random delay. I also now use a circular buffer rather than malloc so we
> > > > are sure the memory is not quickly reused by the writer and stays longer
> > > > in an invalid state.
> > > > 
> > > > So what really make the problem appear quickly is to add a delay between
> > > > the rcu_dereference and the assertion on the data validity in thr_reader.
> > > > 
> > > > It now appears after just a few seconds when running
> > > > ./test_urcu_yield 20 -r -n
> > > > Compiled with CFLAGS=+-DDEBUG_FULL_MB
> > > > 
> > > > It seem to be much harder to trigger with the signal-based version. It's
> > > > expected, because the writer takes about 50 times longer to execute than
> > > > with the -DDEBUG_FULL_MB version.
> > > > 
> > > > So I'll let the ./test_urcu_yield NN -r -n run for a while on the
> > > > correct version (with DEBUG_FULL_MB) and see what it gives.
> > > 
> > > Hmmm...  I had worse luck this time, took three 10-second tries to
> > > see a failure:
> > > 
> > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ ./rcu_nest32 1 stress
> > > n_reads: 44682055  n_updates: 9609503  n_mberror: 0
> > > rcu_stress_count: 44679377 2678 0 0 0 0 0 0 0 0 0
> > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > ./rcu_nest32 1 stress
> > > n_reads: 42281884  n_updates: 9870129  n_mberror: 0
> > > rcu_stress_count: 42277756 4128 0 0 0 0 0 0 0 0 0
> > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > ./rcu_nest32 1 stress
> > > n_reads: 41384304  n_updates: 10040805  n_mberror: 0
> > > rcu_stress_count: 41380075 4228 1 0 0 0 0 0 0 0 0
> > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$
> > > 
> > > This is my prototype version, with read-side memory barriers, no
> > > signals, and without your initialization-value speedup.
> > > 
> > 
> > It would be interesting to re-sync our trees, or if you can point me to
> > a current version of your prototype, I could review it.
> 
> Look at:
> 
> 	CodeSamples/defer/rcu_nest32.[hc]
> 
> In the git archive:
> 
> 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git

And attached is an attempted Promela-based proof, along with a script
that runs it.  It currently says that this version of RCU works.  Not yet
sure whether to believe it.  ;-)

It notes that lines 37 and 92 are unreached.  37 is unreached because
the Promela code currently doesn't exercise nested RCU read-side
critical sections, and 92 is unreached because there is an infinite
loop processing memory-barrier requests at the end of the reader code.

Thoughts?

							Thanx, Paul

[-- Attachment #2: urcu.spin --]
[-- Type: text/plain, Size: 2492 bytes --]

bit removed = 0;
bit free = 0;

#define RCU_GP_CTR_BIT (1 << 7)
#define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BIT - 1)

bit need_mb = 0;
byte urcu_gp_ctr = 1;
byte urcu_active_readers = 0;

bit reader_progress[4];

proctype urcu_reader()
{
	bit done = 0;
	byte tmp;
	byte tmp_removed;
	byte tmp_free;

	do
	:: 1 ->
		if
		:: need_mb == 1 ->
			need_mb = 0;
		:: else -> break;
		fi
	od;
	do
	:: 1 ->
		if
		:: reader_progress[0] == 0 ->
			tmp = urcu_active_readers;
			if
			:: (tmp & RCU_GP_CTR_NEST_MASK) == 0 ->
				urcu_active_readers = urcu_gp_ctr;
			:: else ->
				urcu_active_readers = tmp + 1;
			fi;
			reader_progress[0] = 1;
		:: reader_progress[1] == 0 ->
			tmp_removed = removed;
			reader_progress[1] = 1;
		:: reader_progress[2] == 0 ->
			tmp_free = free;
			reader_progress[2] = 1;
		:: ((reader_progress[0] == 1) && (reader_progress[3] == 0)) ->
			urcu_active_readers = urcu_active_readers - 1;
		:: else -> break;
		fi;
		atomic {
			tmp = 0;
			do
			:: reader_progress[tmp] == 0 ->
				tmp = tmp + 1;
				break;
			:: reader_progress[tmp] == 1 && tmp < 4 ->
				tmp = tmp + 1;
			:: tmp >= 4 ->
				done = 1;
				break;
			od;
			do
			:: tmp < 4 && reader_progress[tmp] == 0 ->
				tmp = tmp + 1;
			:: tmp < 4 && reader_progress[tmp] == 1 ->
				break;
			:: tmp >= 4 ->
				if
				:: need_mb == 1 ->
					need_mb = 0;
				:: else -> skip;
				fi;
				done = 1;
				break;
			od

		}
		if
		:: done == 1 -> break;
		:: else -> skip;
		fi
	od;
	do
	:: 1 ->
		if
		:: need_mb == 1 ->
			need_mb = 0;
		:: else -> skip;
		fi;
		assert((free == 0) || (removed == 1));
	od;
}

proctype urcu_updater()
{
	removed = 1;
	need_mb = 1;
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;
	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
	do
	:: 1 ->
		if
		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
			skip;
		:: else -> break;
		fi
	od;

	need_mb = 1;
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;
	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
	do
	:: 1 ->
		if
		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
			skip;
		:: else -> break;
		fi;
	od;

	free = 1;
}

init {
	atomic {
		reader_progress[0] = 0;
		reader_progress[1] = 0;
		reader_progress[2] = 0;
		reader_progress[3] = 0;
		run urcu_reader();
		run urcu_updater();
	}
}

[-- Attachment #3: urcu.sh --]
[-- Type: application/x-sh, Size: 53 bytes --]

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-10 22:58                           ` Paul E. McKenney
@ 2009-02-10 23:01                             ` Paul E. McKenney
  0 siblings, 0 replies; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-10 23:01 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 6085 bytes --]

On Tue, Feb 10, 2009 at 02:58:39PM -0800, Paul E. McKenney wrote:
> On Tue, Feb 10, 2009 at 02:21:15PM -0800, Paul E. McKenney wrote:
> > On Tue, Feb 10, 2009 at 04:28:33PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Tue, Feb 10, 2009 at 02:17:31PM -0500, Mathieu Desnoyers wrote:
> > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > On Mon, Feb 09, 2009 at 02:03:17AM -0500, Mathieu Desnoyers wrote:
> > > > > > 
> > > > > > [ . . . ]
> > > > > > 
> > > > > > > I just added modified rcutorture.h and api.h from your git tree
> > > > > > > specifically for an urcutorture program to the repository. Some results :
> > > > > > > 
> > > > > > > 8-way x86_64
> > > > > > > E5405 @2 GHZ
> > > > > > > 
> > > > > > > ./urcutorture 8 perf
> > > > > > > n_reads: 1937650000  n_updates: 3  nreaders: 8  nupdaters: 1 duration: 1
> > > > > > > ns/read: 4.12871  ns/update: 3.33333e+08
> > > > > > > 
> > > > > > > ./urcutorture 8 uperf
> > > > > > > n_reads: 0  n_updates: 4413892  nreaders: 0  nupdaters: 8 duration: 1
> > > > > > > ns/read: nan  ns/update: 1812.46
> > > > > > > 
> > > > > > > n_reads: 98844204  n_updates: 10  n_mberror: 0
> > > > > > > rcu_stress_count: 98844171 33 0 0 0 0 0 0 0 0 0
> > > > > > > 
> > > > > > > However, I've tried removing the second switch_qparity() call, and the
> > > > > > > rcutorture test did not detect anything wrong. I also did a variation
> > > > > > > which calls the "sched_yield" version of the urcu, "urcutorture-yield".
> > > > > > 
> > > > > > My confusion -- I was testing my old approach where the memory barriers
> > > > > > are in rcu_read_lock() and rcu_read_unlock().  To force the failures in
> > > > > > your signal-handler-memory-barrier approach, I suspect that you are
> > > > > > going to need a bigger hammer.  In this case, one such bigger hammer
> > > > > > would be:
> > > > > > 
> > > > > > o	Just before exit from the signal handler, do a
> > > > > > 	pthread_cond_wait() under a pthread_mutex().
> > > > > > 
> > > > > > o	In force_mb_all_threads(), refrain from sending a signal to self.
> > > > > > 
> > > > > > 	Then it should be safe in force_mb_all_threads() to do a
> > > > > > 	pthread_cond_broadcast() under the same pthread_mutex().
> > > > > > 
> > > > > > This should raise the probability of seeing the failure in the case
> > > > > > where there is a single switch_qparity().
> > > > > > 
> > > > > 
> > > > > I just did a mb() version of the urcu :
> > > > > 
> > > > > (uncomment CFLAGS=+-DDEBUG_FULL_MB in the Makefile)
> > > > > 
> > > > > Time per read : 48.4086 cycles
> > > > > (about 6-7 times slower, as expected)
> > > > > 
> > > > > This will be useful especially to increase the chance to trigger races.
> > > > > 
> > > > > I tried removing the second parity switch from the writer. The rcu
> > > > > torture test did not find the problem yet (maybe I am not using the
> > > > > correct parameters ? It does not run for more than 5 seconds).
> > > > > 
> > > > > So I added a "-n" option to test_urcu, so it can make the usleep(1)
> > > > > between the writes optional. I also changed the yield for a usleep with
> > > > > random delay. I also now use a circular buffer rather than malloc so we
> > > > > are sure the memory is not quickly reused by the writer and stays longer
> > > > > in an invalid state.
> > > > > 
> > > > > So what really make the problem appear quickly is to add a delay between
> > > > > the rcu_dereference and the assertion on the data validity in thr_reader.
> > > > > 
> > > > > It now appears after just a few seconds when running
> > > > > ./test_urcu_yield 20 -r -n
> > > > > Compiled with CFLAGS=+-DDEBUG_FULL_MB
> > > > > 
> > > > > It seem to be much harder to trigger with the signal-based version. It's
> > > > > expected, because the writer takes about 50 times longer to execute than
> > > > > with the -DDEBUG_FULL_MB version.
> > > > > 
> > > > > So I'll let the ./test_urcu_yield NN -r -n run for a while on the
> > > > > correct version (with DEBUG_FULL_MB) and see what it gives.
> > > > 
> > > > Hmmm...  I had worse luck this time, took three 10-second tries to
> > > > see a failure:
> > > > 
> > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ ./rcu_nest32 1 stress
> > > > n_reads: 44682055  n_updates: 9609503  n_mberror: 0
> > > > rcu_stress_count: 44679377 2678 0 0 0 0 0 0 0 0 0
> > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > > ./rcu_nest32 1 stress
> > > > n_reads: 42281884  n_updates: 9870129  n_mberror: 0
> > > > rcu_stress_count: 42277756 4128 0 0 0 0 0 0 0 0 0
> > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > > ./rcu_nest32 1 stress
> > > > n_reads: 41384304  n_updates: 10040805  n_mberror: 0
> > > > rcu_stress_count: 41380075 4228 1 0 0 0 0 0 0 0 0
> > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$
> > > > 
> > > > This is my prototype version, with read-side memory barriers, no
> > > > signals, and without your initialization-value speedup.
> > > > 
> > > 
> > > It would be interesting to re-sync our trees, or if you can point me to
> > > a current version of your prototype, I could review it.
> > 
> > Look at:
> > 
> > 	CodeSamples/defer/rcu_nest32.[hc]
> > 
> > In the git archive:
> > 
> > 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> 
> And attached is an attempted Promela-based proof, along with a script
> that runs it.  It currently says that this version of RCU works.  Not yet
> sure whether to believe it.  ;-)
> 
> It notes that lines 37 and 92 are unreached.  37 is unreached because
> the Promela code currently doesn't exercise nested RCU read-side
> critical sections, and 92 is unreached because there is an infinite
> loop processing memory-barrier requests at the end of the reader code.
> 
> Thoughts?

And of course it is trivial to add nested RCU read-side critical
sections, as in the attached.  Still passes, so up to you to figure
out what errors I have in my Promela code.  ;-)

							Thanx, Paul

[-- Attachment #2: urcu.spin --]
[-- Type: text/plain, Size: 2513 bytes --]

bit removed = 0;
bit free = 0;

#define RCU_GP_CTR_BIT (1 << 7)
#define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BIT - 1)

bit need_mb = 0;
byte urcu_gp_ctr = 1;
byte urcu_active_readers = 0;

byte reader_progress[4];

proctype urcu_reader()
{
	bit done = 0;
	byte tmp;
	byte tmp_removed;
	byte tmp_free;

	do
	:: 1 ->
		if
		:: need_mb == 1 ->
			need_mb = 0;
		:: else -> break;
		fi
	od;
	do
	:: 1 ->
		if
		:: reader_progress[0] < 2 ->
			tmp = urcu_active_readers;
			if
			:: (tmp & RCU_GP_CTR_NEST_MASK) == 0 ->
				urcu_active_readers = urcu_gp_ctr;
			:: else ->
				urcu_active_readers = tmp + 1;
			fi;
			reader_progress[0] = 1;
		:: reader_progress[1] == 0 ->
			tmp_removed = removed;
			reader_progress[1] = 1;
		:: reader_progress[2] == 0 ->
			tmp_free = free;
			reader_progress[2] = 1;
		:: ((reader_progress[0] > reader_progress[3]) &&
		    (reader_progress[3] < 2)) ->
			urcu_active_readers = urcu_active_readers - 1;
		:: else -> break;
		fi;
		atomic {
			tmp = 0;
			do
			:: reader_progress[tmp] == 0 ->
				tmp = tmp + 1;
				break;
			:: reader_progress[tmp] == 1 && tmp < 4 ->
				tmp = tmp + 1;
			:: tmp >= 4 ->
				done = 1;
				break;
			od;
			do
			:: tmp < 4 && reader_progress[tmp] == 0 ->
				tmp = tmp + 1;
			:: tmp < 4 && reader_progress[tmp] == 1 ->
				break;
			:: tmp >= 4 ->
				if
				:: need_mb == 1 ->
					need_mb = 0;
				:: else -> skip;
				fi;
				done = 1;
				break;
			od

		}
		if
		:: done == 1 -> break;
		:: else -> skip;
		fi
	od;
	do
	:: 1 ->
		if
		:: need_mb == 1 ->
			need_mb = 0;
		:: else -> skip;
		fi;
		assert((free == 0) || (removed == 1));
	od;
}

proctype urcu_updater()
{
	removed = 1;
	need_mb = 1;
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;
	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
	do
	:: 1 ->
		if
		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
			skip;
		:: else -> break;
		fi
	od;

	need_mb = 1;
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;
	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
	do
	:: 1 ->
		if
		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
			skip;
		:: else -> break;
		fi;
	od;

	free = 1;
}

init {
	atomic {
		reader_progress[0] = 0;
		reader_progress[1] = 0;
		reader_progress[2] = 0;
		reader_progress[3] = 0;
		run urcu_reader();
		run urcu_updater();
	}
}

[-- Attachment #3: urcu.sh --]
[-- Type: application/x-sh, Size: 53 bytes --]

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-10 22:21                         ` Paul E. McKenney
  2009-02-10 22:58                           ` Paul E. McKenney
@ 2009-02-11  0:57                           ` Mathieu Desnoyers
  2009-02-11  5:28                             ` Paul E. McKenney
  1 sibling, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-11  0:57 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Tue, Feb 10, 2009 at 04:28:33PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Tue, Feb 10, 2009 at 02:17:31PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Mon, Feb 09, 2009 at 02:03:17AM -0500, Mathieu Desnoyers wrote:
> > > > > 
> > > > > [ . . . ]
> > > > > 
> > > > > > I just added modified rcutorture.h and api.h from your git tree
> > > > > > specifically for an urcutorture program to the repository. Some results :
> > > > > > 
> > > > > > 8-way x86_64
> > > > > > E5405 @2 GHZ
> > > > > > 
> > > > > > ./urcutorture 8 perf
> > > > > > n_reads: 1937650000  n_updates: 3  nreaders: 8  nupdaters: 1 duration: 1
> > > > > > ns/read: 4.12871  ns/update: 3.33333e+08
> > > > > > 
> > > > > > ./urcutorture 8 uperf
> > > > > > n_reads: 0  n_updates: 4413892  nreaders: 0  nupdaters: 8 duration: 1
> > > > > > ns/read: nan  ns/update: 1812.46
> > > > > > 
> > > > > > n_reads: 98844204  n_updates: 10  n_mberror: 0
> > > > > > rcu_stress_count: 98844171 33 0 0 0 0 0 0 0 0 0
> > > > > > 
> > > > > > However, I've tried removing the second switch_qparity() call, and the
> > > > > > rcutorture test did not detect anything wrong. I also did a variation
> > > > > > which calls the "sched_yield" version of the urcu, "urcutorture-yield".
> > > > > 
> > > > > My confusion -- I was testing my old approach where the memory barriers
> > > > > are in rcu_read_lock() and rcu_read_unlock().  To force the failures in
> > > > > your signal-handler-memory-barrier approach, I suspect that you are
> > > > > going to need a bigger hammer.  In this case, one such bigger hammer
> > > > > would be:
> > > > > 
> > > > > o	Just before exit from the signal handler, do a
> > > > > 	pthread_cond_wait() under a pthread_mutex().
> > > > > 
> > > > > o	In force_mb_all_threads(), refrain from sending a signal to self.
> > > > > 
> > > > > 	Then it should be safe in force_mb_all_threads() to do a
> > > > > 	pthread_cond_broadcast() under the same pthread_mutex().
> > > > > 
> > > > > This should raise the probability of seeing the failure in the case
> > > > > where there is a single switch_qparity().
> > > > > 
> > > > 
> > > > I just did a mb() version of the urcu :
> > > > 
> > > > (uncomment CFLAGS=+-DDEBUG_FULL_MB in the Makefile)
> > > > 
> > > > Time per read : 48.4086 cycles
> > > > (about 6-7 times slower, as expected)
> > > > 
> > > > This will be useful especially to increase the chance to trigger races.
> > > > 
> > > > I tried removing the second parity switch from the writer. The rcu
> > > > torture test did not find the problem yet (maybe I am not using the
> > > > correct parameters ? It does not run for more than 5 seconds).
> > > > 
> > > > So I added a "-n" option to test_urcu, so it can make the usleep(1)
> > > > between the writes optional. I also changed the yield for a usleep with
> > > > random delay. I also now use a circular buffer rather than malloc so we
> > > > are sure the memory is not quickly reused by the writer and stays longer
> > > > in an invalid state.
> > > > 
> > > > So what really make the problem appear quickly is to add a delay between
> > > > the rcu_dereference and the assertion on the data validity in thr_reader.
> > > > 
> > > > It now appears after just a few seconds when running
> > > > ./test_urcu_yield 20 -r -n
> > > > Compiled with CFLAGS=+-DDEBUG_FULL_MB
> > > > 
> > > > It seem to be much harder to trigger with the signal-based version. It's
> > > > expected, because the writer takes about 50 times longer to execute than
> > > > with the -DDEBUG_FULL_MB version.
> > > > 
> > > > So I'll let the ./test_urcu_yield NN -r -n run for a while on the
> > > > correct version (with DEBUG_FULL_MB) and see what it gives.
> > > 
> > > Hmmm...  I had worse luck this time, took three 10-second tries to
> > > see a failure:
> > > 
> > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ ./rcu_nest32 1 stress
> > > n_reads: 44682055  n_updates: 9609503  n_mberror: 0
> > > rcu_stress_count: 44679377 2678 0 0 0 0 0 0 0 0 0
> > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > ./rcu_nest32 1 stress
> > > n_reads: 42281884  n_updates: 9870129  n_mberror: 0
> > > rcu_stress_count: 42277756 4128 0 0 0 0 0 0 0 0 0
> > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > ./rcu_nest32 1 stress
> > > n_reads: 41384304  n_updates: 10040805  n_mberror: 0
> > > rcu_stress_count: 41380075 4228 1 0 0 0 0 0 0 0 0
> > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$
> > > 
> > > This is my prototype version, with read-side memory barriers, no
> > > signals, and without your initialization-value speedup.
> > > 
> > 
> > It would be interesting to re-sync our trees, or if you can point me to
> > a current version of your prototype, I could review it.
> 
> Look at:
> 
> 	CodeSamples/defer/rcu_nest32.[hc]
> 
> In the git archive:
> 
> 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> 

flip_counter_and_wait : yours do rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT
mine : rcu_gp_ctr ^= RCU_GP_CTR_BOTTOM_BIT.

Another major difference between our tree is the lack of smp_mb() at the
end of flip_counter_and_wait() (in your tree).

Your code does :

  smp_mb()
  switch parity
  smp_mb()
  wait for each thread ongoing old gp
    <<<<<<< ---- missing smp_mb.
  switch parity
  smp_mb()
  wait for each thread ongoing old gp
  smp_mb()

I also wonder why you have a smp_mb() after spin_unlock() in your
synchronize_rcu() -> if you follow the Linux kernel semantics for
spinlocks, the smp_mb() should be implied. (but I have not looked at
your spin_lock/unlock primitives yet).

Mathieu

> 							Thanx, Paul
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-10 19:17                   ` Mathieu Desnoyers
  2009-02-10 21:16                     ` Paul E. McKenney
@ 2009-02-11  5:08                     ` Lai Jiangshan
  2009-02-11  8:58                       ` Mathieu Desnoyers
  1 sibling, 1 reply; 116+ messages in thread
From: Lai Jiangshan @ 2009-02-11  5:08 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: Paul E. McKenney, ltt-dev, linux-kernel

Mathieu Desnoyers wrote:
> 
> I just did a mb() version of the urcu :
> 
> (uncomment CFLAGS=+-DDEBUG_FULL_MB in the Makefile)
> 
> Time per read : 48.4086 cycles
> (about 6-7 times slower, as expected)
> 

I had read many papers of Paul.
(http://www.rdrop.com/users/paulmck/RCU/)
and I know Paul did his endeavor to remove memory barrier in
RCU read site in kernel. His work is of consequence.

But, I think,
1) Userspace RCU's read site can pay for the latency of
memory barrier(include atomic operator).
   Userspace does not access to shared data so frequently as kernel.
and userspace's read site is not so fast as kernel.

2) Userspace uses RCU is for RCU's excellence, not saving a little cpu cycles
   (http://lwn.net/Articles/263130/)
   One of the most important excellence is lock-free.


If my thinking is right, the following opinion has some meaning too.

Use All-SYSTEM 's RCU for Userspace RCU.

All-SYSTEM 's RCU is QRCU which is implemented by Paul.
http://lwn.net/Articles/223752/

Any system which has mechanisms equivalent to atomic_op,
__wait_event, wake_up, mutex, This system can also implement QRCU.
So most system can implement QRCU, and I say QRCU is All-SYSTEM 's RCU.

Obviously, we can implement a portable QRCU highly simply in NPTL.
and read lock is:
	for (;;) {
		int idx = qp->completed & 0x1;
		if (likely(atomic_inc_not_zero(qp->ctr + idx)))
			return idx;
	}
"atomic_inc_not_zero" is called once likely, it's fast enough.

Lai.




^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-11  0:57                           ` Mathieu Desnoyers
@ 2009-02-11  5:28                             ` Paul E. McKenney
  2009-02-11  6:35                               ` Mathieu Desnoyers
  0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-11  5:28 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

On Tue, Feb 10, 2009 at 07:57:01PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Tue, Feb 10, 2009 at 04:28:33PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Tue, Feb 10, 2009 at 02:17:31PM -0500, Mathieu Desnoyers wrote:
> > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > On Mon, Feb 09, 2009 at 02:03:17AM -0500, Mathieu Desnoyers wrote:
> > > > > > 
> > > > > > [ . . . ]
> > > > > > 
> > > > > > > I just added modified rcutorture.h and api.h from your git tree
> > > > > > > specifically for an urcutorture program to the repository. Some results :
> > > > > > > 
> > > > > > > 8-way x86_64
> > > > > > > E5405 @2 GHZ
> > > > > > > 
> > > > > > > ./urcutorture 8 perf
> > > > > > > n_reads: 1937650000  n_updates: 3  nreaders: 8  nupdaters: 1 duration: 1
> > > > > > > ns/read: 4.12871  ns/update: 3.33333e+08
> > > > > > > 
> > > > > > > ./urcutorture 8 uperf
> > > > > > > n_reads: 0  n_updates: 4413892  nreaders: 0  nupdaters: 8 duration: 1
> > > > > > > ns/read: nan  ns/update: 1812.46
> > > > > > > 
> > > > > > > n_reads: 98844204  n_updates: 10  n_mberror: 0
> > > > > > > rcu_stress_count: 98844171 33 0 0 0 0 0 0 0 0 0
> > > > > > > 
> > > > > > > However, I've tried removing the second switch_qparity() call, and the
> > > > > > > rcutorture test did not detect anything wrong. I also did a variation
> > > > > > > which calls the "sched_yield" version of the urcu, "urcutorture-yield".
> > > > > > 
> > > > > > My confusion -- I was testing my old approach where the memory barriers
> > > > > > are in rcu_read_lock() and rcu_read_unlock().  To force the failures in
> > > > > > your signal-handler-memory-barrier approach, I suspect that you are
> > > > > > going to need a bigger hammer.  In this case, one such bigger hammer
> > > > > > would be:
> > > > > > 
> > > > > > o	Just before exit from the signal handler, do a
> > > > > > 	pthread_cond_wait() under a pthread_mutex().
> > > > > > 
> > > > > > o	In force_mb_all_threads(), refrain from sending a signal to self.
> > > > > > 
> > > > > > 	Then it should be safe in force_mb_all_threads() to do a
> > > > > > 	pthread_cond_broadcast() under the same pthread_mutex().
> > > > > > 
> > > > > > This should raise the probability of seeing the failure in the case
> > > > > > where there is a single switch_qparity().
> > > > > > 
> > > > > 
> > > > > I just did a mb() version of the urcu :
> > > > > 
> > > > > (uncomment CFLAGS=+-DDEBUG_FULL_MB in the Makefile)
> > > > > 
> > > > > Time per read : 48.4086 cycles
> > > > > (about 6-7 times slower, as expected)
> > > > > 
> > > > > This will be useful especially to increase the chance to trigger races.
> > > > > 
> > > > > I tried removing the second parity switch from the writer. The rcu
> > > > > torture test did not find the problem yet (maybe I am not using the
> > > > > correct parameters ? It does not run for more than 5 seconds).
> > > > > 
> > > > > So I added a "-n" option to test_urcu, so it can make the usleep(1)
> > > > > between the writes optional. I also changed the yield for a usleep with
> > > > > random delay. I also now use a circular buffer rather than malloc so we
> > > > > are sure the memory is not quickly reused by the writer and stays longer
> > > > > in an invalid state.
> > > > > 
> > > > > So what really make the problem appear quickly is to add a delay between
> > > > > the rcu_dereference and the assertion on the data validity in thr_reader.
> > > > > 
> > > > > It now appears after just a few seconds when running
> > > > > ./test_urcu_yield 20 -r -n
> > > > > Compiled with CFLAGS=+-DDEBUG_FULL_MB
> > > > > 
> > > > > It seem to be much harder to trigger with the signal-based version. It's
> > > > > expected, because the writer takes about 50 times longer to execute than
> > > > > with the -DDEBUG_FULL_MB version.
> > > > > 
> > > > > So I'll let the ./test_urcu_yield NN -r -n run for a while on the
> > > > > correct version (with DEBUG_FULL_MB) and see what it gives.
> > > > 
> > > > Hmmm...  I had worse luck this time, took three 10-second tries to
> > > > see a failure:
> > > > 
> > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ ./rcu_nest32 1 stress
> > > > n_reads: 44682055  n_updates: 9609503  n_mberror: 0
> > > > rcu_stress_count: 44679377 2678 0 0 0 0 0 0 0 0 0
> > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > > ./rcu_nest32 1 stress
> > > > n_reads: 42281884  n_updates: 9870129  n_mberror: 0
> > > > rcu_stress_count: 42277756 4128 0 0 0 0 0 0 0 0 0
> > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > > ./rcu_nest32 1 stress
> > > > n_reads: 41384304  n_updates: 10040805  n_mberror: 0
> > > > rcu_stress_count: 41380075 4228 1 0 0 0 0 0 0 0 0
> > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$
> > > > 
> > > > This is my prototype version, with read-side memory barriers, no
> > > > signals, and without your initialization-value speedup.
> > > > 
> > > 
> > > It would be interesting to re-sync our trees, or if you can point me to
> > > a current version of your prototype, I could review it.
> > 
> > Look at:
> > 
> > 	CodeSamples/defer/rcu_nest32.[hc]
> > 
> > In the git archive:
> > 
> > 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> 
> flip_counter_and_wait : yours do rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT
> mine : rcu_gp_ctr ^= RCU_GP_CTR_BOTTOM_BIT.

Yep, this is before your optimization.

> Another major difference between our tree is the lack of smp_mb() at the
> end of flip_counter_and_wait() (in your tree).
> 
> Your code does :
> 
>   smp_mb()
>   switch parity
>   smp_mb()
>   wait for each thread ongoing old gp
>     <<<<<<< ---- missing smp_mb.
>   switch parity
>   smp_mb()
>   wait for each thread ongoing old gp
>   smp_mb()

This should be OK -- or am I missing a failure scenario?
Keep in mind that I get failures only when omitting a counter
flip, not with the above code.

> I also wonder why you have a smp_mb() after spin_unlock() in your
> synchronize_rcu() -> if you follow the Linux kernel semantics for
> spinlocks, the smp_mb() should be implied. (but I have not looked at
> your spin_lock/unlock primitives yet).

Perhaps things have changed, but last I knew, spin_lock() and
spin_unlock() were only required to keep the critical section in, not
to keep things out of the critical section.

							Thanx, Paul

> Mathieu
> 
> > 							Thanx, Paul
> > 
> > _______________________________________________
> > ltt-dev mailing list
> > ltt-dev@lists.casi.polymtl.ca
> > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-11  5:28                             ` Paul E. McKenney
@ 2009-02-11  6:35                               ` Mathieu Desnoyers
  2009-02-11 15:32                                 ` Paul E. McKenney
  0 siblings, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-11  6:35 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Tue, Feb 10, 2009 at 07:57:01PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Tue, Feb 10, 2009 at 04:28:33PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Tue, Feb 10, 2009 at 02:17:31PM -0500, Mathieu Desnoyers wrote:
> > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > On Mon, Feb 09, 2009 at 02:03:17AM -0500, Mathieu Desnoyers wrote:
> > > > > > > 
> > > > > > > [ . . . ]
> > > > > > > 
> > > > > > > > I just added modified rcutorture.h and api.h from your git tree
> > > > > > > > specifically for an urcutorture program to the repository. Some results :
> > > > > > > > 
> > > > > > > > 8-way x86_64
> > > > > > > > E5405 @2 GHZ
> > > > > > > > 
> > > > > > > > ./urcutorture 8 perf
> > > > > > > > n_reads: 1937650000  n_updates: 3  nreaders: 8  nupdaters: 1 duration: 1
> > > > > > > > ns/read: 4.12871  ns/update: 3.33333e+08
> > > > > > > > 
> > > > > > > > ./urcutorture 8 uperf
> > > > > > > > n_reads: 0  n_updates: 4413892  nreaders: 0  nupdaters: 8 duration: 1
> > > > > > > > ns/read: nan  ns/update: 1812.46
> > > > > > > > 
> > > > > > > > n_reads: 98844204  n_updates: 10  n_mberror: 0
> > > > > > > > rcu_stress_count: 98844171 33 0 0 0 0 0 0 0 0 0
> > > > > > > > 
> > > > > > > > However, I've tried removing the second switch_qparity() call, and the
> > > > > > > > rcutorture test did not detect anything wrong. I also did a variation
> > > > > > > > which calls the "sched_yield" version of the urcu, "urcutorture-yield".
> > > > > > > 
> > > > > > > My confusion -- I was testing my old approach where the memory barriers
> > > > > > > are in rcu_read_lock() and rcu_read_unlock().  To force the failures in
> > > > > > > your signal-handler-memory-barrier approach, I suspect that you are
> > > > > > > going to need a bigger hammer.  In this case, one such bigger hammer
> > > > > > > would be:
> > > > > > > 
> > > > > > > o	Just before exit from the signal handler, do a
> > > > > > > 	pthread_cond_wait() under a pthread_mutex().
> > > > > > > 
> > > > > > > o	In force_mb_all_threads(), refrain from sending a signal to self.
> > > > > > > 
> > > > > > > 	Then it should be safe in force_mb_all_threads() to do a
> > > > > > > 	pthread_cond_broadcast() under the same pthread_mutex().
> > > > > > > 
> > > > > > > This should raise the probability of seeing the failure in the case
> > > > > > > where there is a single switch_qparity().
> > > > > > > 
> > > > > > 
> > > > > > I just did a mb() version of the urcu :
> > > > > > 
> > > > > > (uncomment CFLAGS=+-DDEBUG_FULL_MB in the Makefile)
> > > > > > 
> > > > > > Time per read : 48.4086 cycles
> > > > > > (about 6-7 times slower, as expected)
> > > > > > 
> > > > > > This will be useful especially to increase the chance to trigger races.
> > > > > > 
> > > > > > I tried removing the second parity switch from the writer. The rcu
> > > > > > torture test did not find the problem yet (maybe I am not using the
> > > > > > correct parameters ? It does not run for more than 5 seconds).
> > > > > > 
> > > > > > So I added a "-n" option to test_urcu, so it can make the usleep(1)
> > > > > > between the writes optional. I also changed the yield for a usleep with
> > > > > > random delay. I also now use a circular buffer rather than malloc so we
> > > > > > are sure the memory is not quickly reused by the writer and stays longer
> > > > > > in an invalid state.
> > > > > > 
> > > > > > So what really make the problem appear quickly is to add a delay between
> > > > > > the rcu_dereference and the assertion on the data validity in thr_reader.
> > > > > > 
> > > > > > It now appears after just a few seconds when running
> > > > > > ./test_urcu_yield 20 -r -n
> > > > > > Compiled with CFLAGS=+-DDEBUG_FULL_MB
> > > > > > 
> > > > > > It seem to be much harder to trigger with the signal-based version. It's
> > > > > > expected, because the writer takes about 50 times longer to execute than
> > > > > > with the -DDEBUG_FULL_MB version.
> > > > > > 
> > > > > > So I'll let the ./test_urcu_yield NN -r -n run for a while on the
> > > > > > correct version (with DEBUG_FULL_MB) and see what it gives.
> > > > > 
> > > > > Hmmm...  I had worse luck this time, took three 10-second tries to
> > > > > see a failure:
> > > > > 
> > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ ./rcu_nest32 1 stress
> > > > > n_reads: 44682055  n_updates: 9609503  n_mberror: 0
> > > > > rcu_stress_count: 44679377 2678 0 0 0 0 0 0 0 0 0
> > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > > > ./rcu_nest32 1 stress
> > > > > n_reads: 42281884  n_updates: 9870129  n_mberror: 0
> > > > > rcu_stress_count: 42277756 4128 0 0 0 0 0 0 0 0 0
> > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > > > ./rcu_nest32 1 stress
> > > > > n_reads: 41384304  n_updates: 10040805  n_mberror: 0
> > > > > rcu_stress_count: 41380075 4228 1 0 0 0 0 0 0 0 0
> > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$
> > > > > 
> > > > > This is my prototype version, with read-side memory barriers, no
> > > > > signals, and without your initialization-value speedup.
> > > > > 
> > > > 
> > > > It would be interesting to re-sync our trees, or if you can point me to
> > > > a current version of your prototype, I could review it.
> > > 
> > > Look at:
> > > 
> > > 	CodeSamples/defer/rcu_nest32.[hc]
> > > 
> > > In the git archive:
> > > 
> > > 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > 
> > flip_counter_and_wait : yours do rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT
> > mine : rcu_gp_ctr ^= RCU_GP_CTR_BOTTOM_BIT.
> 
> Yep, this is before your optimization.
> 

Hrm, and given the RCU_GP_CTR_BOTTOM_BIT is in the MSBs, there is no
possible effect on the LSBs. That should work even if it overflows. OK.
That should even work with my optimization. But I somehow prefer the xor
(if it's not slower), because we really only need 1 bit to flip on and
off.

> > Another major difference between our tree is the lack of smp_mb() at the
> > end of flip_counter_and_wait() (in your tree).
> > 
> > Your code does :
> > 
> >   smp_mb()
> >   switch parity
> >   smp_mb()
> >   wait for each thread ongoing old gp
> >     <<<<<<< ---- missing smp_mb.
> >   switch parity
> >   smp_mb()
> >   wait for each thread ongoing old gp
> >   smp_mb()
> 
> This should be OK -- or am I missing a failure scenario?
> Keep in mind that I get failures only when omitting a counter
> flip, not with the above code.
> 

OK, it's good that you point out that the failure only occurs when
omitting the counter flip.

So if we leave out the mb() we can end up in a situation where a reader
thread is still in an ongoing old gp and we switch the parity. The big
question is : should we be concerned about this ?

>From the writer point of view :

Given there is no data dependency between the parity update and the
per_thread(rcu_reader_gp, t) read done in the while loop waiting for
threads, and given even the compiler barrier() has no effect wrt the
last test done after the last iteration of the loop, we could think of
compiler optimizations doing the following to our code (let's focus on a
single loop of for_each_thread) :

transforming

                while (rcu_old_gp_ongoing(t))
                        barrier();
                rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT;

into

                if (!rcu_old_gp_ongoing(t))
                  goto end;
                while (rcu_old_gp_ongoing(t))
                        barrier();
end:
                rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT;

This leaves the choice to the compiler to perform the rcu_gp_ctr
increment before the per_thread(rcu_reader_gp, t) read, because there is
no barrier.

Not only does this apply to the compiler, but also to the memory
barriers. We can end up in a situation where the CPU decides to to the
rcu_gp_ctr increment before reading the last rcu_old_gp_ongoing value,
given there is no data dependency between those two.

You could argue that ACCESS_ONCE() around the per_thread(rcu_reader_gp,
t) read will order reads, but I don't think we should rely on this on
SMP. This is really supposed to be there just to make sure we don't end
up doing multiple variable reads on UP wrt to local interrupts.

You could also argue that rcu_gp_ctr is read within
rcu_old_gp_ongoing(), which should normally order the memory accesses.
It actually does only order memory access to the rcu_gp_ctr variable,
not the per_thread(rcu_reader_gp, t), because, here again, there if no
data dependency whatsoever between per_thread(rcu_reader_gp, t) and
rcu_gp_ctr. A possible scenario : rcu_gp_ctr could be read, then we have
the rcu_gp_ctr increment, and only then could the
per_thread(rcu_reader_gp, t) variable be read to perform the test.

But I see that even in rcu_read_lock, there is no strict ordering
between __get_thread_var(rcu_reader_gp) and rcu_gp_ctr read. Therefore,
I conclude that ordering between those two variables does not matter at
all. I also suspect that this is the core reason for doing 2 q.s. period
flip at each update.

Am I correct ?


> > I also wonder why you have a smp_mb() after spin_unlock() in your
> > synchronize_rcu() -> if you follow the Linux kernel semantics for
> > spinlocks, the smp_mb() should be implied. (but I have not looked at
> > your spin_lock/unlock primitives yet).
> 
> Perhaps things have changed, but last I knew, spin_lock() and
> spin_unlock() were only required to keep the critical section in, not
> to keep things out of the critical section.
> 

Hrm, reading Documentation/memory-barriers.txt again tells me things
might have changed (if I am reading correctly the section LOCKS VS
MEMORY ACCESSES).

Correct me if I am wrong, but I don't think it makes sense to insure
memory barriers to keep accesses within the critical section and not
outside, because such memory access could well be another spinlock.

Therefore, we could end up in a situation where we have two locks, A and
B, taken in the following order in the source code :

LOCK A

UNLOCK A

LOCK B

UNLOCK B

Then, following your assumption, it would be possible for a CPU to do
the memory accesses associated to lock A and B in a random order one vs
the other. Given there would be no requirement to keep things out of
those respective critical sections, LOCK A could be taken within LOCK B,
and the opposite would also be valid.

Valid memory access orders :

1)
LOCK A
LOCK B
UNLOCK B
UNLOCK A

2)
LOCK B
LOCK A
UNLOCK A
UNLOCK B

The only constraint that ensures we won't end up in this situation is
the fact that memory accesses done outside of the critical section stays
outside of the critical section.

Mathieu



> 							Thanx, Paul
> 
> > Mathieu
> > 
> > > 							Thanx, Paul
> > > 
> > > _______________________________________________
> > > ltt-dev mailing list
> > > ltt-dev@lists.casi.polymtl.ca
> > > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > > 
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-11  5:08                     ` Lai Jiangshan
@ 2009-02-11  8:58                       ` Mathieu Desnoyers
  0 siblings, 0 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-11  8:58 UTC (permalink / raw)
  To: Lai Jiangshan; +Cc: Paul E. McKenney, ltt-dev, linux-kernel

* Lai Jiangshan (laijs@cn.fujitsu.com) wrote:
> Mathieu Desnoyers wrote:
> > 
> > I just did a mb() version of the urcu :
> > 
> > (uncomment CFLAGS=+-DDEBUG_FULL_MB in the Makefile)
> > 
> > Time per read : 48.4086 cycles
> > (about 6-7 times slower, as expected)
> > 
> 
> I had read many papers of Paul.
> (http://www.rdrop.com/users/paulmck/RCU/)
> and I know Paul did his endeavor to remove memory barrier in
> RCU read site in kernel. His work is of consequence.
> 
> But, I think,
> 1) Userspace RCU's read site can pay for the latency of
> memory barrier(include atomic operator).
>    Userspace does not access to shared data so frequently as kernel.
> and userspace's read site is not so fast as kernel.
> 
> 2) Userspace uses RCU is for RCU's excellence, not saving a little cpu cycles
>    (http://lwn.net/Articles/263130/)
>    One of the most important excellence is lock-free.
> 
> 
> If my thinking is right, the following opinion has some meaning too.
> 
> Use All-SYSTEM 's RCU for Userspace RCU.
> 
> All-SYSTEM 's RCU is QRCU which is implemented by Paul.
> http://lwn.net/Articles/223752/
> 
> Any system which has mechanisms equivalent to atomic_op,
> __wait_event, wake_up, mutex, This system can also implement QRCU.
> So most system can implement QRCU, and I say QRCU is All-SYSTEM 's RCU.
> 
> Obviously, we can implement a portable QRCU highly simply in NPTL.
> and read lock is:
> 	for (;;) {
> 		int idx = qp->completed & 0x1;
> 		if (likely(atomic_inc_not_zero(qp->ctr + idx)))
> 			return idx;
> 	}
> "atomic_inc_not_zero" is called once likely, it's fast enough.
> 

Hi Lai,

There are a few reasons why we need rcu in userspace for tracing :

- We need very fast per-cpu read-side synchronization for data structure
  handling. Updates are rare (enabling/disabling tracing). Therefore,
  your argument about userspace not needing "fast" rcu does not hold in
  this case. Note that LTTng has the performance it has today in the
  kernel because I made sure to use no memory barriers when unnecessary
  and because I used the minimal amount of atomic operations required.
  Those represent costly synchronization primitives on quite a few
  architectures.
- Being lock-free (atomic). To trace code executed in signal handlers,
  we need to be able to nest over any user code. With the solution you
  propose above, the busy-loop in the read-lock does not seems to be
  signal-safe : if it nests over a writer, it could busy-loop forever.

Mathieu

> Lai.
> 
> 
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-11  6:35                               ` Mathieu Desnoyers
@ 2009-02-11 15:32                                 ` Paul E. McKenney
  2009-02-11 18:52                                   ` Mathieu Desnoyers
  0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-11 15:32 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

On Wed, Feb 11, 2009 at 01:35:20AM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Tue, Feb 10, 2009 at 07:57:01PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Tue, Feb 10, 2009 at 04:28:33PM -0500, Mathieu Desnoyers wrote:
> > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > On Tue, Feb 10, 2009 at 02:17:31PM -0500, Mathieu Desnoyers wrote:
> > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > > On Mon, Feb 09, 2009 at 02:03:17AM -0500, Mathieu Desnoyers wrote:
> > > > > > > > 
> > > > > > > > [ . . . ]
> > > > > > > > 
> > > > > > > > > I just added modified rcutorture.h and api.h from your git tree
> > > > > > > > > specifically for an urcutorture program to the repository. Some results :
> > > > > > > > > 
> > > > > > > > > 8-way x86_64
> > > > > > > > > E5405 @2 GHZ
> > > > > > > > > 
> > > > > > > > > ./urcutorture 8 perf
> > > > > > > > > n_reads: 1937650000  n_updates: 3  nreaders: 8  nupdaters: 1 duration: 1
> > > > > > > > > ns/read: 4.12871  ns/update: 3.33333e+08
> > > > > > > > > 
> > > > > > > > > ./urcutorture 8 uperf
> > > > > > > > > n_reads: 0  n_updates: 4413892  nreaders: 0  nupdaters: 8 duration: 1
> > > > > > > > > ns/read: nan  ns/update: 1812.46
> > > > > > > > > 
> > > > > > > > > n_reads: 98844204  n_updates: 10  n_mberror: 0
> > > > > > > > > rcu_stress_count: 98844171 33 0 0 0 0 0 0 0 0 0
> > > > > > > > > 
> > > > > > > > > However, I've tried removing the second switch_qparity() call, and the
> > > > > > > > > rcutorture test did not detect anything wrong. I also did a variation
> > > > > > > > > which calls the "sched_yield" version of the urcu, "urcutorture-yield".
> > > > > > > > 
> > > > > > > > My confusion -- I was testing my old approach where the memory barriers
> > > > > > > > are in rcu_read_lock() and rcu_read_unlock().  To force the failures in
> > > > > > > > your signal-handler-memory-barrier approach, I suspect that you are
> > > > > > > > going to need a bigger hammer.  In this case, one such bigger hammer
> > > > > > > > would be:
> > > > > > > > 
> > > > > > > > o	Just before exit from the signal handler, do a
> > > > > > > > 	pthread_cond_wait() under a pthread_mutex().
> > > > > > > > 
> > > > > > > > o	In force_mb_all_threads(), refrain from sending a signal to self.
> > > > > > > > 
> > > > > > > > 	Then it should be safe in force_mb_all_threads() to do a
> > > > > > > > 	pthread_cond_broadcast() under the same pthread_mutex().
> > > > > > > > 
> > > > > > > > This should raise the probability of seeing the failure in the case
> > > > > > > > where there is a single switch_qparity().
> > > > > > > > 
> > > > > > > 
> > > > > > > I just did a mb() version of the urcu :
> > > > > > > 
> > > > > > > (uncomment CFLAGS=+-DDEBUG_FULL_MB in the Makefile)
> > > > > > > 
> > > > > > > Time per read : 48.4086 cycles
> > > > > > > (about 6-7 times slower, as expected)
> > > > > > > 
> > > > > > > This will be useful especially to increase the chance to trigger races.
> > > > > > > 
> > > > > > > I tried removing the second parity switch from the writer. The rcu
> > > > > > > torture test did not find the problem yet (maybe I am not using the
> > > > > > > correct parameters ? It does not run for more than 5 seconds).
> > > > > > > 
> > > > > > > So I added a "-n" option to test_urcu, so it can make the usleep(1)
> > > > > > > between the writes optional. I also changed the yield for a usleep with
> > > > > > > random delay. I also now use a circular buffer rather than malloc so we
> > > > > > > are sure the memory is not quickly reused by the writer and stays longer
> > > > > > > in an invalid state.
> > > > > > > 
> > > > > > > So what really make the problem appear quickly is to add a delay between
> > > > > > > the rcu_dereference and the assertion on the data validity in thr_reader.
> > > > > > > 
> > > > > > > It now appears after just a few seconds when running
> > > > > > > ./test_urcu_yield 20 -r -n
> > > > > > > Compiled with CFLAGS=+-DDEBUG_FULL_MB
> > > > > > > 
> > > > > > > It seem to be much harder to trigger with the signal-based version. It's
> > > > > > > expected, because the writer takes about 50 times longer to execute than
> > > > > > > with the -DDEBUG_FULL_MB version.
> > > > > > > 
> > > > > > > So I'll let the ./test_urcu_yield NN -r -n run for a while on the
> > > > > > > correct version (with DEBUG_FULL_MB) and see what it gives.
> > > > > > 
> > > > > > Hmmm...  I had worse luck this time, took three 10-second tries to
> > > > > > see a failure:
> > > > > > 
> > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ ./rcu_nest32 1 stress
> > > > > > n_reads: 44682055  n_updates: 9609503  n_mberror: 0
> > > > > > rcu_stress_count: 44679377 2678 0 0 0 0 0 0 0 0 0
> > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > > > > ./rcu_nest32 1 stress
> > > > > > n_reads: 42281884  n_updates: 9870129  n_mberror: 0
> > > > > > rcu_stress_count: 42277756 4128 0 0 0 0 0 0 0 0 0
> > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > > > > ./rcu_nest32 1 stress
> > > > > > n_reads: 41384304  n_updates: 10040805  n_mberror: 0
> > > > > > rcu_stress_count: 41380075 4228 1 0 0 0 0 0 0 0 0
> > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$
> > > > > > 
> > > > > > This is my prototype version, with read-side memory barriers, no
> > > > > > signals, and without your initialization-value speedup.
> > > > > > 
> > > > > 
> > > > > It would be interesting to re-sync our trees, or if you can point me to
> > > > > a current version of your prototype, I could review it.
> > > > 
> > > > Look at:
> > > > 
> > > > 	CodeSamples/defer/rcu_nest32.[hc]
> > > > 
> > > > In the git archive:
> > > > 
> > > > 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > 
> > > flip_counter_and_wait : yours do rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT
> > > mine : rcu_gp_ctr ^= RCU_GP_CTR_BOTTOM_BIT.
> > 
> > Yep, this is before your optimization.
> > 
> 
> Hrm, and given the RCU_GP_CTR_BOTTOM_BIT is in the MSBs, there is no
> possible effect on the LSBs. That should work even if it overflows. OK.
> That should even work with my optimization. But I somehow prefer the xor
> (if it's not slower), because we really only need 1 bit to flip on and
> off.
> 
> > > Another major difference between our tree is the lack of smp_mb() at the
> > > end of flip_counter_and_wait() (in your tree).
> > > 
> > > Your code does :
> > > 
> > >   smp_mb()
> > >   switch parity
> > >   smp_mb()
> > >   wait for each thread ongoing old gp
> > >     <<<<<<< ---- missing smp_mb.
> > >   switch parity
> > >   smp_mb()
> > >   wait for each thread ongoing old gp
> > >   smp_mb()
> > 
> > This should be OK -- or am I missing a failure scenario?
> > Keep in mind that I get failures only when omitting a counter
> > flip, not with the above code.
> > 
> 
> OK, it's good that you point out that the failure only occurs when
> omitting the counter flip.
> 
> So if we leave out the mb() we can end up in a situation where a reader
> thread is still in an ongoing old gp and we switch the parity. The big
> question is : should we be concerned about this ?
> 
> From the writer point of view :
> 
> Given there is no data dependency between the parity update and the
> per_thread(rcu_reader_gp, t) read done in the while loop waiting for
> threads, and given even the compiler barrier() has no effect wrt the
> last test done after the last iteration of the loop, we could think of
> compiler optimizations doing the following to our code (let's focus on a
> single loop of for_each_thread) :
> 
> transforming
> 
>                 while (rcu_old_gp_ongoing(t))
>                         barrier();
>                 rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT;
> 
> into
> 
>                 if (!rcu_old_gp_ongoing(t))
>                   goto end;
>                 while (rcu_old_gp_ongoing(t))
>                         barrier();
> end:
>                 rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT;
> 
> This leaves the choice to the compiler to perform the rcu_gp_ctr
> increment before the per_thread(rcu_reader_gp, t) read, because there is
> no barrier.
> 
> Not only does this apply to the compiler, but also to the memory
> barriers. We can end up in a situation where the CPU decides to to the
> rcu_gp_ctr increment before reading the last rcu_old_gp_ongoing value,
> given there is no data dependency between those two.
> 
> You could argue that ACCESS_ONCE() around the per_thread(rcu_reader_gp,
> t) read will order reads, but I don't think we should rely on this on
> SMP. This is really supposed to be there just to make sure we don't end
> up doing multiple variable reads on UP wrt to local interrupts.
> 
> You could also argue that rcu_gp_ctr is read within
> rcu_old_gp_ongoing(), which should normally order the memory accesses.
> It actually does only order memory access to the rcu_gp_ctr variable,
> not the per_thread(rcu_reader_gp, t), because, here again, there if no
> data dependency whatsoever between per_thread(rcu_reader_gp, t) and
> rcu_gp_ctr. A possible scenario : rcu_gp_ctr could be read, then we have
> the rcu_gp_ctr increment, and only then could the
> per_thread(rcu_reader_gp, t) variable be read to perform the test.
> 
> But I see that even in rcu_read_lock, there is no strict ordering
> between __get_thread_var(rcu_reader_gp) and rcu_gp_ctr read. Therefore,
> I conclude that ordering between those two variables does not matter at
> all. I also suspect that this is the core reason for doing 2 q.s. period
> flip at each update.
> 
> Am I correct ?

I do not believe so -- please see my earlier email calling out the
sequence of events leading to failure in the single-flip case:

	http://lkml.org/lkml/2009/2/7/67

> > > I also wonder why you have a smp_mb() after spin_unlock() in your
> > > synchronize_rcu() -> if you follow the Linux kernel semantics for
> > > spinlocks, the smp_mb() should be implied. (but I have not looked at
> > > your spin_lock/unlock primitives yet).
> > 
> > Perhaps things have changed, but last I knew, spin_lock() and
> > spin_unlock() were only required to keep the critical section in, not
> > to keep things out of the critical section.
> > 
> 
> Hrm, reading Documentation/memory-barriers.txt again tells me things
> might have changed (if I am reading correctly the section LOCKS VS
> MEMORY ACCESSES).

In the 2.6.26 version of Documentation/memory-barriers.txt, there is
the following near line 366:

 (5) LOCK operations.

     This acts as a one-way permeable barrier.  It guarantees that all memory
     operations after the LOCK operation will appear to happen after the LOCK
     operation with respect to the other components of the system.

     Memory operations that occur before a LOCK operation may appear to happen
     after it completes.

     A LOCK operation should almost always be paired with an UNLOCK operation.


 (6) UNLOCK operations.

     This also acts as a one-way permeable barrier.  It guarantees that all
     memory operations before the UNLOCK operation will appear to happen before
     the UNLOCK operation with respect to the other components of the system.

     Memory operations that occur after an UNLOCK operation may appear to
     happen before it completes.

     LOCK and UNLOCK operations are guaranteed to appear with respect to each
     other strictly in the order specified.

     The use of LOCK and UNLOCK operations generally precludes the need for
     other sorts of memory barrier (but note the exceptions mentioned in the
     subsection "MMIO write barrier").

> Correct me if I am wrong, but I don't think it makes sense to insure
> memory barriers to keep accesses within the critical section and not
> outside, because such memory access could well be another spinlock.

Almost, but not quite.  ;-)

> Therefore, we could end up in a situation where we have two locks, A and
> B, taken in the following order in the source code :
> 
> LOCK A
> 
> UNLOCK A
> 
> LOCK B
> 
> UNLOCK B
> 
> Then, following your assumption, it would be possible for a CPU to do
> the memory accesses associated to lock A and B in a random order one vs
> the other. Given there would be no requirement to keep things out of
> those respective critical sections, LOCK A could be taken within LOCK B,
> and the opposite would also be valid.
> 
> Valid memory access orders :
> 
> 1)
> LOCK A
> LOCK B
> UNLOCK B
> UNLOCK A
> 
> 2)
> LOCK B
> LOCK A
> UNLOCK A
> UNLOCK B

#2 is wrong -- LOCK A is guaranteed to prohibit LOCK B from passing it,
as that would be equivalent to letting LOCK A's critical section leak out.

> The only constraint that ensures we won't end up in this situation is
> the fact that memory accesses done outside of the critical section stays
> outside of the critical section.

Let's take it one transformation at a time:

1.	LOCK A; UNLOCK A; LOCK B; UNLOCK B

2.	LOCK A; LOCK B; UNLOCK A; UNLOCK B

	This one is OK, because both the LOCK B and the UNLOCK A
	are permitted to allow more stuff to enter their respective
	critical sections.

3.	LOCK B; LOCK A; UNLOCK A; UNLOCK B

	This is -not- legal!  LOCK A is forbidden to allow LOCK B
	to escape its critical section.

Does this make sense?

							Thanx, Paul

> Mathieu
> 
> 
> 
> > 							Thanx, Paul
> > 
> > > Mathieu
> > > 
> > > > 							Thanx, Paul
> > > > 
> > > > _______________________________________________
> > > > ltt-dev mailing list
> > > > ltt-dev@lists.casi.polymtl.ca
> > > > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > > > 
> > > 
> > > -- 
> > > Mathieu Desnoyers
> > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-11 15:32                                 ` Paul E. McKenney
@ 2009-02-11 18:52                                   ` Mathieu Desnoyers
  2009-02-11 20:09                                     ` Paul E. McKenney
  0 siblings, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-11 18:52 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Wed, Feb 11, 2009 at 01:35:20AM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Tue, Feb 10, 2009 at 07:57:01PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Tue, Feb 10, 2009 at 04:28:33PM -0500, Mathieu Desnoyers wrote:
> > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > On Tue, Feb 10, 2009 at 02:17:31PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > > > On Mon, Feb 09, 2009 at 02:03:17AM -0500, Mathieu Desnoyers wrote:
> > > > > > > > > 
> > > > > > > > > [ . . . ]
> > > > > > > > > 
> > > > > > > > > > I just added modified rcutorture.h and api.h from your git tree
> > > > > > > > > > specifically for an urcutorture program to the repository. Some results :
> > > > > > > > > > 
> > > > > > > > > > 8-way x86_64
> > > > > > > > > > E5405 @2 GHZ
> > > > > > > > > > 
> > > > > > > > > > ./urcutorture 8 perf
> > > > > > > > > > n_reads: 1937650000  n_updates: 3  nreaders: 8  nupdaters: 1 duration: 1
> > > > > > > > > > ns/read: 4.12871  ns/update: 3.33333e+08
> > > > > > > > > > 
> > > > > > > > > > ./urcutorture 8 uperf
> > > > > > > > > > n_reads: 0  n_updates: 4413892  nreaders: 0  nupdaters: 8 duration: 1
> > > > > > > > > > ns/read: nan  ns/update: 1812.46
> > > > > > > > > > 
> > > > > > > > > > n_reads: 98844204  n_updates: 10  n_mberror: 0
> > > > > > > > > > rcu_stress_count: 98844171 33 0 0 0 0 0 0 0 0 0
> > > > > > > > > > 
> > > > > > > > > > However, I've tried removing the second switch_qparity() call, and the
> > > > > > > > > > rcutorture test did not detect anything wrong. I also did a variation
> > > > > > > > > > which calls the "sched_yield" version of the urcu, "urcutorture-yield".
> > > > > > > > > 
> > > > > > > > > My confusion -- I was testing my old approach where the memory barriers
> > > > > > > > > are in rcu_read_lock() and rcu_read_unlock().  To force the failures in
> > > > > > > > > your signal-handler-memory-barrier approach, I suspect that you are
> > > > > > > > > going to need a bigger hammer.  In this case, one such bigger hammer
> > > > > > > > > would be:
> > > > > > > > > 
> > > > > > > > > o	Just before exit from the signal handler, do a
> > > > > > > > > 	pthread_cond_wait() under a pthread_mutex().
> > > > > > > > > 
> > > > > > > > > o	In force_mb_all_threads(), refrain from sending a signal to self.
> > > > > > > > > 
> > > > > > > > > 	Then it should be safe in force_mb_all_threads() to do a
> > > > > > > > > 	pthread_cond_broadcast() under the same pthread_mutex().
> > > > > > > > > 
> > > > > > > > > This should raise the probability of seeing the failure in the case
> > > > > > > > > where there is a single switch_qparity().
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > I just did a mb() version of the urcu :
> > > > > > > > 
> > > > > > > > (uncomment CFLAGS=+-DDEBUG_FULL_MB in the Makefile)
> > > > > > > > 
> > > > > > > > Time per read : 48.4086 cycles
> > > > > > > > (about 6-7 times slower, as expected)
> > > > > > > > 
> > > > > > > > This will be useful especially to increase the chance to trigger races.
> > > > > > > > 
> > > > > > > > I tried removing the second parity switch from the writer. The rcu
> > > > > > > > torture test did not find the problem yet (maybe I am not using the
> > > > > > > > correct parameters ? It does not run for more than 5 seconds).
> > > > > > > > 
> > > > > > > > So I added a "-n" option to test_urcu, so it can make the usleep(1)
> > > > > > > > between the writes optional. I also changed the yield for a usleep with
> > > > > > > > random delay. I also now use a circular buffer rather than malloc so we
> > > > > > > > are sure the memory is not quickly reused by the writer and stays longer
> > > > > > > > in an invalid state.
> > > > > > > > 
> > > > > > > > So what really make the problem appear quickly is to add a delay between
> > > > > > > > the rcu_dereference and the assertion on the data validity in thr_reader.
> > > > > > > > 
> > > > > > > > It now appears after just a few seconds when running
> > > > > > > > ./test_urcu_yield 20 -r -n
> > > > > > > > Compiled with CFLAGS=+-DDEBUG_FULL_MB
> > > > > > > > 
> > > > > > > > It seem to be much harder to trigger with the signal-based version. It's
> > > > > > > > expected, because the writer takes about 50 times longer to execute than
> > > > > > > > with the -DDEBUG_FULL_MB version.
> > > > > > > > 
> > > > > > > > So I'll let the ./test_urcu_yield NN -r -n run for a while on the
> > > > > > > > correct version (with DEBUG_FULL_MB) and see what it gives.
> > > > > > > 
> > > > > > > Hmmm...  I had worse luck this time, took three 10-second tries to
> > > > > > > see a failure:
> > > > > > > 
> > > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ ./rcu_nest32 1 stress
> > > > > > > n_reads: 44682055  n_updates: 9609503  n_mberror: 0
> > > > > > > rcu_stress_count: 44679377 2678 0 0 0 0 0 0 0 0 0
> > > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > > > > > ./rcu_nest32 1 stress
> > > > > > > n_reads: 42281884  n_updates: 9870129  n_mberror: 0
> > > > > > > rcu_stress_count: 42277756 4128 0 0 0 0 0 0 0 0 0
> > > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > > > > > ./rcu_nest32 1 stress
> > > > > > > n_reads: 41384304  n_updates: 10040805  n_mberror: 0
> > > > > > > rcu_stress_count: 41380075 4228 1 0 0 0 0 0 0 0 0
> > > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$
> > > > > > > 
> > > > > > > This is my prototype version, with read-side memory barriers, no
> > > > > > > signals, and without your initialization-value speedup.
> > > > > > > 
> > > > > > 
> > > > > > It would be interesting to re-sync our trees, or if you can point me to
> > > > > > a current version of your prototype, I could review it.
> > > > > 
> > > > > Look at:
> > > > > 
> > > > > 	CodeSamples/defer/rcu_nest32.[hc]
> > > > > 
> > > > > In the git archive:
> > > > > 
> > > > > 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > > 
> > > > flip_counter_and_wait : yours do rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT
> > > > mine : rcu_gp_ctr ^= RCU_GP_CTR_BOTTOM_BIT.
> > > 
> > > Yep, this is before your optimization.
> > > 
> > 
> > Hrm, and given the RCU_GP_CTR_BOTTOM_BIT is in the MSBs, there is no
> > possible effect on the LSBs. That should work even if it overflows. OK.
> > That should even work with my optimization. But I somehow prefer the xor
> > (if it's not slower), because we really only need 1 bit to flip on and
> > off.
> > 
> > > > Another major difference between our tree is the lack of smp_mb() at the
> > > > end of flip_counter_and_wait() (in your tree).
> > > > 
> > > > Your code does :
> > > > 
> > > >   smp_mb()
> > > >   switch parity
> > > >   smp_mb()
> > > >   wait for each thread ongoing old gp
> > > >     <<<<<<< ---- missing smp_mb.
> > > >   switch parity
> > > >   smp_mb()
> > > >   wait for each thread ongoing old gp
> > > >   smp_mb()
> > > 
> > > This should be OK -- or am I missing a failure scenario?
> > > Keep in mind that I get failures only when omitting a counter
> > > flip, not with the above code.
> > > 
> > 
> > OK, it's good that you point out that the failure only occurs when
> > omitting the counter flip.
> > 
> > So if we leave out the mb() we can end up in a situation where a reader
> > thread is still in an ongoing old gp and we switch the parity. The big
> > question is : should we be concerned about this ?
> > 
> > From the writer point of view :
> > 
> > Given there is no data dependency between the parity update and the
> > per_thread(rcu_reader_gp, t) read done in the while loop waiting for
> > threads, and given even the compiler barrier() has no effect wrt the
> > last test done after the last iteration of the loop, we could think of
> > compiler optimizations doing the following to our code (let's focus on a
> > single loop of for_each_thread) :
> > 
> > transforming
> > 
> >                 while (rcu_old_gp_ongoing(t))
> >                         barrier();
> >                 rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT;
> > 
> > into
> > 
> >                 if (!rcu_old_gp_ongoing(t))
> >                   goto end;
> >                 while (rcu_old_gp_ongoing(t))
> >                         barrier();
> > end:
> >                 rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT;
> > 
> > This leaves the choice to the compiler to perform the rcu_gp_ctr
> > increment before the per_thread(rcu_reader_gp, t) read, because there is
> > no barrier.
> > 
> > Not only does this apply to the compiler, but also to the memory
> > barriers. We can end up in a situation where the CPU decides to to the
> > rcu_gp_ctr increment before reading the last rcu_old_gp_ongoing value,
> > given there is no data dependency between those two.
> > 
> > You could argue that ACCESS_ONCE() around the per_thread(rcu_reader_gp,
> > t) read will order reads, but I don't think we should rely on this on
> > SMP. This is really supposed to be there just to make sure we don't end
> > up doing multiple variable reads on UP wrt to local interrupts.
> > 
> > You could also argue that rcu_gp_ctr is read within
> > rcu_old_gp_ongoing(), which should normally order the memory accesses.
> > It actually does only order memory access to the rcu_gp_ctr variable,
> > not the per_thread(rcu_reader_gp, t), because, here again, there if no
> > data dependency whatsoever between per_thread(rcu_reader_gp, t) and
> > rcu_gp_ctr. A possible scenario : rcu_gp_ctr could be read, then we have
> > the rcu_gp_ctr increment, and only then could the
> > per_thread(rcu_reader_gp, t) variable be read to perform the test.
> > 
> > But I see that even in rcu_read_lock, there is no strict ordering
> > between __get_thread_var(rcu_reader_gp) and rcu_gp_ctr read. Therefore,
> > I conclude that ordering between those two variables does not matter at
> > all. I also suspect that this is the core reason for doing 2 q.s. period
> > flip at each update.
> > 
> > Am I correct ?
> 
> I do not believe so -- please see my earlier email calling out the
> sequence of events leading to failure in the single-flip case:
> 
> 	http://lkml.org/lkml/2009/2/7/67
> 

Hrm, let me present it in a different, more straightfoward way :

In you Promela model (here : http://lkml.org/lkml/2009/2/10/419)

There is a memory barrier here in the updater :

	do
	:: 1 ->
		if
		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
			skip;
		:: else -> break;
		fi
	od;
	need_mb = 1;
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;
	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
	do
	:: 1 ->
		if
		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
			skip;
		:: else -> break;
		fi;
	od;

However, in your C code of nest_32.c, there is none. So it is at the
very least an inconsistency between your code and your model.


> > > > I also wonder why you have a smp_mb() after spin_unlock() in your
> > > > synchronize_rcu() -> if you follow the Linux kernel semantics for
> > > > spinlocks, the smp_mb() should be implied. (but I have not looked at
> > > > your spin_lock/unlock primitives yet).
> > > 
> > > Perhaps things have changed, but last I knew, spin_lock() and
> > > spin_unlock() were only required to keep the critical section in, not
> > > to keep things out of the critical section.
> > > 
> > 
> > Hrm, reading Documentation/memory-barriers.txt again tells me things
> > might have changed (if I am reading correctly the section LOCKS VS
> > MEMORY ACCESSES).
> 
> In the 2.6.26 version of Documentation/memory-barriers.txt, there is
> the following near line 366:
> 
>  (5) LOCK operations.
> 
>      This acts as a one-way permeable barrier.  It guarantees that all memory
>      operations after the LOCK operation will appear to happen after the LOCK
>      operation with respect to the other components of the system.
> 
>      Memory operations that occur before a LOCK operation may appear to happen
>      after it completes.
> 
>      A LOCK operation should almost always be paired with an UNLOCK operation.
> 
> 
>  (6) UNLOCK operations.
> 
>      This also acts as a one-way permeable barrier.  It guarantees that all
>      memory operations before the UNLOCK operation will appear to happen before
>      the UNLOCK operation with respect to the other components of the system.
> 
>      Memory operations that occur after an UNLOCK operation may appear to
>      happen before it completes.
> 
>      LOCK and UNLOCK operations are guaranteed to appear with respect to each
>      other strictly in the order specified.
> 
>      The use of LOCK and UNLOCK operations generally precludes the need for
>      other sorts of memory barrier (but note the exceptions mentioned in the
>      subsection "MMIO write barrier").
> 
> > Correct me if I am wrong, but I don't think it makes sense to insure
> > memory barriers to keep accesses within the critical section and not
> > outside, because such memory access could well be another spinlock.
> 
> Almost, but not quite.  ;-)
> 
> > Therefore, we could end up in a situation where we have two locks, A and
> > B, taken in the following order in the source code :
> > 
> > LOCK A
> > 
> > UNLOCK A
> > 
> > LOCK B
> > 
> > UNLOCK B
> > 
> > Then, following your assumption, it would be possible for a CPU to do
> > the memory accesses associated to lock A and B in a random order one vs
> > the other. Given there would be no requirement to keep things out of
> > those respective critical sections, LOCK A could be taken within LOCK B,
> > and the opposite would also be valid.
> > 
> > Valid memory access orders :
> > 
> > 1)
> > LOCK A
> > LOCK B
> > UNLOCK B
> > UNLOCK A
> > 
> > 2)
> > LOCK B
> > LOCK A
> > UNLOCK A
> > UNLOCK B
> 
> #2 is wrong -- LOCK A is guaranteed to prohibit LOCK B from passing it,
> as that would be equivalent to letting LOCK A's critical section leak out.
> 
> > The only constraint that ensures we won't end up in this situation is
> > the fact that memory accesses done outside of the critical section stays
> > outside of the critical section.
> 
> Let's take it one transformation at a time:
> 
> 1.	LOCK A; UNLOCK A; LOCK B; UNLOCK B
> 
> 2.	LOCK A; LOCK B; UNLOCK A; UNLOCK B
> 
> 	This one is OK, because both the LOCK B and the UNLOCK A
> 	are permitted to allow more stuff to enter their respective
> 	critical sections.
> 
> 3.	LOCK B; LOCK A; UNLOCK A; UNLOCK B
> 
> 	This is -not- legal!  LOCK A is forbidden to allow LOCK B
> 	to escape its critical section.
> 
> Does this make sense?
> 

Ah, yes. Thanks for the explanation.

Mathieu

> 							Thanx, Paul
> 
> > Mathieu
> > 
> > 
> > 
> > > 							Thanx, Paul
> > > 
> > > > Mathieu
> > > > 
> > > > > 							Thanx, Paul
> > > > > 
> > > > > _______________________________________________
> > > > > ltt-dev mailing list
> > > > > ltt-dev@lists.casi.polymtl.ca
> > > > > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > > > > 
> > > > 
> > > > -- 
> > > > Mathieu Desnoyers
> > > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > > 
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-11 18:52                                   ` Mathieu Desnoyers
@ 2009-02-11 20:09                                     ` Paul E. McKenney
  2009-02-11 21:42                                       ` Mathieu Desnoyers
  0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-11 20:09 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

On Wed, Feb 11, 2009 at 01:52:03PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Wed, Feb 11, 2009 at 01:35:20AM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Tue, Feb 10, 2009 at 07:57:01PM -0500, Mathieu Desnoyers wrote:
> > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > On Tue, Feb 10, 2009 at 04:28:33PM -0500, Mathieu Desnoyers wrote:
> > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > > On Tue, Feb 10, 2009 at 02:17:31PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > > > > On Mon, Feb 09, 2009 at 02:03:17AM -0500, Mathieu Desnoyers wrote:
> > > > > > > > > > 
> > > > > > > > > > [ . . . ]
> > > > > > > > > > 
> > > > > > > > > > > I just added modified rcutorture.h and api.h from your git tree
> > > > > > > > > > > specifically for an urcutorture program to the repository. Some results :
> > > > > > > > > > > 
> > > > > > > > > > > 8-way x86_64
> > > > > > > > > > > E5405 @2 GHZ
> > > > > > > > > > > 
> > > > > > > > > > > ./urcutorture 8 perf
> > > > > > > > > > > n_reads: 1937650000  n_updates: 3  nreaders: 8  nupdaters: 1 duration: 1
> > > > > > > > > > > ns/read: 4.12871  ns/update: 3.33333e+08
> > > > > > > > > > > 
> > > > > > > > > > > ./urcutorture 8 uperf
> > > > > > > > > > > n_reads: 0  n_updates: 4413892  nreaders: 0  nupdaters: 8 duration: 1
> > > > > > > > > > > ns/read: nan  ns/update: 1812.46
> > > > > > > > > > > 
> > > > > > > > > > > n_reads: 98844204  n_updates: 10  n_mberror: 0
> > > > > > > > > > > rcu_stress_count: 98844171 33 0 0 0 0 0 0 0 0 0
> > > > > > > > > > > 
> > > > > > > > > > > However, I've tried removing the second switch_qparity() call, and the
> > > > > > > > > > > rcutorture test did not detect anything wrong. I also did a variation
> > > > > > > > > > > which calls the "sched_yield" version of the urcu, "urcutorture-yield".
> > > > > > > > > > 
> > > > > > > > > > My confusion -- I was testing my old approach where the memory barriers
> > > > > > > > > > are in rcu_read_lock() and rcu_read_unlock().  To force the failures in
> > > > > > > > > > your signal-handler-memory-barrier approach, I suspect that you are
> > > > > > > > > > going to need a bigger hammer.  In this case, one such bigger hammer
> > > > > > > > > > would be:
> > > > > > > > > > 
> > > > > > > > > > o	Just before exit from the signal handler, do a
> > > > > > > > > > 	pthread_cond_wait() under a pthread_mutex().
> > > > > > > > > > 
> > > > > > > > > > o	In force_mb_all_threads(), refrain from sending a signal to self.
> > > > > > > > > > 
> > > > > > > > > > 	Then it should be safe in force_mb_all_threads() to do a
> > > > > > > > > > 	pthread_cond_broadcast() under the same pthread_mutex().
> > > > > > > > > > 
> > > > > > > > > > This should raise the probability of seeing the failure in the case
> > > > > > > > > > where there is a single switch_qparity().
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > I just did a mb() version of the urcu :
> > > > > > > > > 
> > > > > > > > > (uncomment CFLAGS=+-DDEBUG_FULL_MB in the Makefile)
> > > > > > > > > 
> > > > > > > > > Time per read : 48.4086 cycles
> > > > > > > > > (about 6-7 times slower, as expected)
> > > > > > > > > 
> > > > > > > > > This will be useful especially to increase the chance to trigger races.
> > > > > > > > > 
> > > > > > > > > I tried removing the second parity switch from the writer. The rcu
> > > > > > > > > torture test did not find the problem yet (maybe I am not using the
> > > > > > > > > correct parameters ? It does not run for more than 5 seconds).
> > > > > > > > > 
> > > > > > > > > So I added a "-n" option to test_urcu, so it can make the usleep(1)
> > > > > > > > > between the writes optional. I also changed the yield for a usleep with
> > > > > > > > > random delay. I also now use a circular buffer rather than malloc so we
> > > > > > > > > are sure the memory is not quickly reused by the writer and stays longer
> > > > > > > > > in an invalid state.
> > > > > > > > > 
> > > > > > > > > So what really make the problem appear quickly is to add a delay between
> > > > > > > > > the rcu_dereference and the assertion on the data validity in thr_reader.
> > > > > > > > > 
> > > > > > > > > It now appears after just a few seconds when running
> > > > > > > > > ./test_urcu_yield 20 -r -n
> > > > > > > > > Compiled with CFLAGS=+-DDEBUG_FULL_MB
> > > > > > > > > 
> > > > > > > > > It seem to be much harder to trigger with the signal-based version. It's
> > > > > > > > > expected, because the writer takes about 50 times longer to execute than
> > > > > > > > > with the -DDEBUG_FULL_MB version.
> > > > > > > > > 
> > > > > > > > > So I'll let the ./test_urcu_yield NN -r -n run for a while on the
> > > > > > > > > correct version (with DEBUG_FULL_MB) and see what it gives.
> > > > > > > > 
> > > > > > > > Hmmm...  I had worse luck this time, took three 10-second tries to
> > > > > > > > see a failure:
> > > > > > > > 
> > > > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ ./rcu_nest32 1 stress
> > > > > > > > n_reads: 44682055  n_updates: 9609503  n_mberror: 0
> > > > > > > > rcu_stress_count: 44679377 2678 0 0 0 0 0 0 0 0 0
> > > > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > > > > > > ./rcu_nest32 1 stress
> > > > > > > > n_reads: 42281884  n_updates: 9870129  n_mberror: 0
> > > > > > > > rcu_stress_count: 42277756 4128 0 0 0 0 0 0 0 0 0
> > > > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > > > > > > ./rcu_nest32 1 stress
> > > > > > > > n_reads: 41384304  n_updates: 10040805  n_mberror: 0
> > > > > > > > rcu_stress_count: 41380075 4228 1 0 0 0 0 0 0 0 0
> > > > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$
> > > > > > > > 
> > > > > > > > This is my prototype version, with read-side memory barriers, no
> > > > > > > > signals, and without your initialization-value speedup.
> > > > > > > > 
> > > > > > > 
> > > > > > > It would be interesting to re-sync our trees, or if you can point me to
> > > > > > > a current version of your prototype, I could review it.
> > > > > > 
> > > > > > Look at:
> > > > > > 
> > > > > > 	CodeSamples/defer/rcu_nest32.[hc]
> > > > > > 
> > > > > > In the git archive:
> > > > > > 
> > > > > > 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > > > 
> > > > > flip_counter_and_wait : yours do rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT
> > > > > mine : rcu_gp_ctr ^= RCU_GP_CTR_BOTTOM_BIT.
> > > > 
> > > > Yep, this is before your optimization.
> > > > 
> > > 
> > > Hrm, and given the RCU_GP_CTR_BOTTOM_BIT is in the MSBs, there is no
> > > possible effect on the LSBs. That should work even if it overflows. OK.
> > > That should even work with my optimization. But I somehow prefer the xor
> > > (if it's not slower), because we really only need 1 bit to flip on and
> > > off.
> > > 
> > > > > Another major difference between our tree is the lack of smp_mb() at the
> > > > > end of flip_counter_and_wait() (in your tree).
> > > > > 
> > > > > Your code does :
> > > > > 
> > > > >   smp_mb()
> > > > >   switch parity
> > > > >   smp_mb()
> > > > >   wait for each thread ongoing old gp
> > > > >     <<<<<<< ---- missing smp_mb.
> > > > >   switch parity
> > > > >   smp_mb()
> > > > >   wait for each thread ongoing old gp
> > > > >   smp_mb()
> > > > 
> > > > This should be OK -- or am I missing a failure scenario?
> > > > Keep in mind that I get failures only when omitting a counter
> > > > flip, not with the above code.
> > > > 
> > > 
> > > OK, it's good that you point out that the failure only occurs when
> > > omitting the counter flip.
> > > 
> > > So if we leave out the mb() we can end up in a situation where a reader
> > > thread is still in an ongoing old gp and we switch the parity. The big
> > > question is : should we be concerned about this ?
> > > 
> > > From the writer point of view :
> > > 
> > > Given there is no data dependency between the parity update and the
> > > per_thread(rcu_reader_gp, t) read done in the while loop waiting for
> > > threads, and given even the compiler barrier() has no effect wrt the
> > > last test done after the last iteration of the loop, we could think of
> > > compiler optimizations doing the following to our code (let's focus on a
> > > single loop of for_each_thread) :
> > > 
> > > transforming
> > > 
> > >                 while (rcu_old_gp_ongoing(t))
> > >                         barrier();
> > >                 rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT;
> > > 
> > > into
> > > 
> > >                 if (!rcu_old_gp_ongoing(t))
> > >                   goto end;
> > >                 while (rcu_old_gp_ongoing(t))
> > >                         barrier();
> > > end:
> > >                 rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT;
> > > 
> > > This leaves the choice to the compiler to perform the rcu_gp_ctr
> > > increment before the per_thread(rcu_reader_gp, t) read, because there is
> > > no barrier.
> > > 
> > > Not only does this apply to the compiler, but also to the memory
> > > barriers. We can end up in a situation where the CPU decides to to the
> > > rcu_gp_ctr increment before reading the last rcu_old_gp_ongoing value,
> > > given there is no data dependency between those two.
> > > 
> > > You could argue that ACCESS_ONCE() around the per_thread(rcu_reader_gp,
> > > t) read will order reads, but I don't think we should rely on this on
> > > SMP. This is really supposed to be there just to make sure we don't end
> > > up doing multiple variable reads on UP wrt to local interrupts.
> > > 
> > > You could also argue that rcu_gp_ctr is read within
> > > rcu_old_gp_ongoing(), which should normally order the memory accesses.
> > > It actually does only order memory access to the rcu_gp_ctr variable,
> > > not the per_thread(rcu_reader_gp, t), because, here again, there if no
> > > data dependency whatsoever between per_thread(rcu_reader_gp, t) and
> > > rcu_gp_ctr. A possible scenario : rcu_gp_ctr could be read, then we have
> > > the rcu_gp_ctr increment, and only then could the
> > > per_thread(rcu_reader_gp, t) variable be read to perform the test.
> > > 
> > > But I see that even in rcu_read_lock, there is no strict ordering
> > > between __get_thread_var(rcu_reader_gp) and rcu_gp_ctr read. Therefore,
> > > I conclude that ordering between those two variables does not matter at
> > > all. I also suspect that this is the core reason for doing 2 q.s. period
> > > flip at each update.
> > > 
> > > Am I correct ?
> > 
> > I do not believe so -- please see my earlier email calling out the
> > sequence of events leading to failure in the single-flip case:
> > 
> > 	http://lkml.org/lkml/2009/2/7/67
> > 
> 
> Hrm, let me present it in a different, more straightfoward way :
> 
> In you Promela model (here : http://lkml.org/lkml/2009/2/10/419)
> 
> There is a memory barrier here in the updater :
> 
> 	do
> 	:: 1 ->
> 		if
> 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> 			skip;
> 		:: else -> break;
> 		fi
> 	od;
> 	need_mb = 1;
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od;
> 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;

I believe you were actually looking for a memory barrier here, not?
I do not believe that your urcu.c has a memory barrier here, please
see below.

> 	do
> 	:: 1 ->
> 		if
> 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> 			skip;
> 		:: else -> break;
> 		fi;
> 	od;
> 
> However, in your C code of nest_32.c, there is none. So it is at the
> very least an inconsistency between your code and your model.

The urcu.c 3a9e6e9df706b8d39af94d2f027210e2e7d4106e lays out as follows:

synchronize_rcu()

	switch_qparity()

		force_mb_all_threads()

		switch_next_urcu_qparity()  [Just does counter flip]

		wait_for_quiescent_state()

			Wait for all threads

			force_mb_all_threads()
				My model does not represent this
				memory barrier, because it seemed to
				me that it was redundant with the
				following one.

				I added it, no effect.

	switch_qparity()

		force_mb_all_threads()

		switch_next_urcu_qparity()  [Just does counter flip]

		wait_for_quiescent_state()

			Wait for all threads

			force_mb_all_threads()

The rcu_nest32.c 6da793208a8f60ea41df60164ded85b4c5c5307d lays out as
follows:

synchronize_rcu()

	flip_counter_and_wait()

		flips counter

		smp_mb();

		Wait for threads

	flip_counter_and_wait()

		flips counter

		smp_mb();

		Wait for threads

So, if I am reading the code correctly, I have memory barriers
everywhere you don't and vice versa.  ;-)

The reason that I believe that I do not need a memory barrier between
the wait-for-threads and the subsequent flip is that the threads we
are waiting for have to have already committed to the earlier value of
the counter, and so changing the counter out of order has no effect.

Does this make sense, or am I confused?

(BTW, I do not trust my model yet, as it currently cannot detect the
failure case I pointed out earlier.  :-/  Here and I thought that the
point of such models was to detect additional failure cases!!!)

							Thanx, Paul

> > > > > I also wonder why you have a smp_mb() after spin_unlock() in your
> > > > > synchronize_rcu() -> if you follow the Linux kernel semantics for
> > > > > spinlocks, the smp_mb() should be implied. (but I have not looked at
> > > > > your spin_lock/unlock primitives yet).
> > > > 
> > > > Perhaps things have changed, but last I knew, spin_lock() and
> > > > spin_unlock() were only required to keep the critical section in, not
> > > > to keep things out of the critical section.
> > > > 
> > > 
> > > Hrm, reading Documentation/memory-barriers.txt again tells me things
> > > might have changed (if I am reading correctly the section LOCKS VS
> > > MEMORY ACCESSES).
> > 
> > In the 2.6.26 version of Documentation/memory-barriers.txt, there is
> > the following near line 366:
> > 
> >  (5) LOCK operations.
> > 
> >      This acts as a one-way permeable barrier.  It guarantees that all memory
> >      operations after the LOCK operation will appear to happen after the LOCK
> >      operation with respect to the other components of the system.
> > 
> >      Memory operations that occur before a LOCK operation may appear to happen
> >      after it completes.
> > 
> >      A LOCK operation should almost always be paired with an UNLOCK operation.
> > 
> > 
> >  (6) UNLOCK operations.
> > 
> >      This also acts as a one-way permeable barrier.  It guarantees that all
> >      memory operations before the UNLOCK operation will appear to happen before
> >      the UNLOCK operation with respect to the other components of the system.
> > 
> >      Memory operations that occur after an UNLOCK operation may appear to
> >      happen before it completes.
> > 
> >      LOCK and UNLOCK operations are guaranteed to appear with respect to each
> >      other strictly in the order specified.
> > 
> >      The use of LOCK and UNLOCK operations generally precludes the need for
> >      other sorts of memory barrier (but note the exceptions mentioned in the
> >      subsection "MMIO write barrier").
> > 
> > > Correct me if I am wrong, but I don't think it makes sense to insure
> > > memory barriers to keep accesses within the critical section and not
> > > outside, because such memory access could well be another spinlock.
> > 
> > Almost, but not quite.  ;-)
> > 
> > > Therefore, we could end up in a situation where we have two locks, A and
> > > B, taken in the following order in the source code :
> > > 
> > > LOCK A
> > > 
> > > UNLOCK A
> > > 
> > > LOCK B
> > > 
> > > UNLOCK B
> > > 
> > > Then, following your assumption, it would be possible for a CPU to do
> > > the memory accesses associated to lock A and B in a random order one vs
> > > the other. Given there would be no requirement to keep things out of
> > > those respective critical sections, LOCK A could be taken within LOCK B,
> > > and the opposite would also be valid.
> > > 
> > > Valid memory access orders :
> > > 
> > > 1)
> > > LOCK A
> > > LOCK B
> > > UNLOCK B
> > > UNLOCK A
> > > 
> > > 2)
> > > LOCK B
> > > LOCK A
> > > UNLOCK A
> > > UNLOCK B
> > 
> > #2 is wrong -- LOCK A is guaranteed to prohibit LOCK B from passing it,
> > as that would be equivalent to letting LOCK A's critical section leak out.
> > 
> > > The only constraint that ensures we won't end up in this situation is
> > > the fact that memory accesses done outside of the critical section stays
> > > outside of the critical section.
> > 
> > Let's take it one transformation at a time:
> > 
> > 1.	LOCK A; UNLOCK A; LOCK B; UNLOCK B
> > 
> > 2.	LOCK A; LOCK B; UNLOCK A; UNLOCK B
> > 
> > 	This one is OK, because both the LOCK B and the UNLOCK A
> > 	are permitted to allow more stuff to enter their respective
> > 	critical sections.
> > 
> > 3.	LOCK B; LOCK A; UNLOCK A; UNLOCK B
> > 
> > 	This is -not- legal!  LOCK A is forbidden to allow LOCK B
> > 	to escape its critical section.
> > 
> > Does this make sense?
> > 
> 
> Ah, yes. Thanks for the explanation.
> 
> Mathieu
> 
> > 							Thanx, Paul
> > 
> > > Mathieu
> > > 
> > > 
> > > 
> > > > 							Thanx, Paul
> > > > 
> > > > > Mathieu
> > > > > 
> > > > > > 							Thanx, Paul
> > > > > > 
> > > > > > _______________________________________________
> > > > > > ltt-dev mailing list
> > > > > > ltt-dev@lists.casi.polymtl.ca
> > > > > > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > > > > > 
> > > > > 
> > > > > -- 
> > > > > Mathieu Desnoyers
> > > > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > > > 
> > > 
> > > -- 
> > > Mathieu Desnoyers
> > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-11 20:09                                     ` Paul E. McKenney
@ 2009-02-11 21:42                                       ` Mathieu Desnoyers
  2009-02-11 22:08                                         ` Mathieu Desnoyers
       [not found]                                         ` <20090212003549.GU6694@linux.vnet.ibm.com>
  0 siblings, 2 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-11 21:42 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Wed, Feb 11, 2009 at 01:52:03PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Wed, Feb 11, 2009 at 01:35:20AM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Tue, Feb 10, 2009 at 07:57:01PM -0500, Mathieu Desnoyers wrote:
> > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > On Tue, Feb 10, 2009 at 04:28:33PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > > > On Tue, Feb 10, 2009 at 02:17:31PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > > > > > On Mon, Feb 09, 2009 at 02:03:17AM -0500, Mathieu Desnoyers wrote:
> > > > > > > > > > > 
> > > > > > > > > > > [ . . . ]
> > > > > > > > > > > 
> > > > > > > > > > > > I just added modified rcutorture.h and api.h from your git tree
> > > > > > > > > > > > specifically for an urcutorture program to the repository. Some results :
> > > > > > > > > > > > 
> > > > > > > > > > > > 8-way x86_64
> > > > > > > > > > > > E5405 @2 GHZ
> > > > > > > > > > > > 
> > > > > > > > > > > > ./urcutorture 8 perf
> > > > > > > > > > > > n_reads: 1937650000  n_updates: 3  nreaders: 8  nupdaters: 1 duration: 1
> > > > > > > > > > > > ns/read: 4.12871  ns/update: 3.33333e+08
> > > > > > > > > > > > 
> > > > > > > > > > > > ./urcutorture 8 uperf
> > > > > > > > > > > > n_reads: 0  n_updates: 4413892  nreaders: 0  nupdaters: 8 duration: 1
> > > > > > > > > > > > ns/read: nan  ns/update: 1812.46
> > > > > > > > > > > > 
> > > > > > > > > > > > n_reads: 98844204  n_updates: 10  n_mberror: 0
> > > > > > > > > > > > rcu_stress_count: 98844171 33 0 0 0 0 0 0 0 0 0
> > > > > > > > > > > > 
> > > > > > > > > > > > However, I've tried removing the second switch_qparity() call, and the
> > > > > > > > > > > > rcutorture test did not detect anything wrong. I also did a variation
> > > > > > > > > > > > which calls the "sched_yield" version of the urcu, "urcutorture-yield".
> > > > > > > > > > > 
> > > > > > > > > > > My confusion -- I was testing my old approach where the memory barriers
> > > > > > > > > > > are in rcu_read_lock() and rcu_read_unlock().  To force the failures in
> > > > > > > > > > > your signal-handler-memory-barrier approach, I suspect that you are
> > > > > > > > > > > going to need a bigger hammer.  In this case, one such bigger hammer
> > > > > > > > > > > would be:
> > > > > > > > > > > 
> > > > > > > > > > > o	Just before exit from the signal handler, do a
> > > > > > > > > > > 	pthread_cond_wait() under a pthread_mutex().
> > > > > > > > > > > 
> > > > > > > > > > > o	In force_mb_all_threads(), refrain from sending a signal to self.
> > > > > > > > > > > 
> > > > > > > > > > > 	Then it should be safe in force_mb_all_threads() to do a
> > > > > > > > > > > 	pthread_cond_broadcast() under the same pthread_mutex().
> > > > > > > > > > > 
> > > > > > > > > > > This should raise the probability of seeing the failure in the case
> > > > > > > > > > > where there is a single switch_qparity().
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > I just did a mb() version of the urcu :
> > > > > > > > > > 
> > > > > > > > > > (uncomment CFLAGS=+-DDEBUG_FULL_MB in the Makefile)
> > > > > > > > > > 
> > > > > > > > > > Time per read : 48.4086 cycles
> > > > > > > > > > (about 6-7 times slower, as expected)
> > > > > > > > > > 
> > > > > > > > > > This will be useful especially to increase the chance to trigger races.
> > > > > > > > > > 
> > > > > > > > > > I tried removing the second parity switch from the writer. The rcu
> > > > > > > > > > torture test did not find the problem yet (maybe I am not using the
> > > > > > > > > > correct parameters ? It does not run for more than 5 seconds).
> > > > > > > > > > 
> > > > > > > > > > So I added a "-n" option to test_urcu, so it can make the usleep(1)
> > > > > > > > > > between the writes optional. I also changed the yield for a usleep with
> > > > > > > > > > random delay. I also now use a circular buffer rather than malloc so we
> > > > > > > > > > are sure the memory is not quickly reused by the writer and stays longer
> > > > > > > > > > in an invalid state.
> > > > > > > > > > 
> > > > > > > > > > So what really make the problem appear quickly is to add a delay between
> > > > > > > > > > the rcu_dereference and the assertion on the data validity in thr_reader.
> > > > > > > > > > 
> > > > > > > > > > It now appears after just a few seconds when running
> > > > > > > > > > ./test_urcu_yield 20 -r -n
> > > > > > > > > > Compiled with CFLAGS=+-DDEBUG_FULL_MB
> > > > > > > > > > 
> > > > > > > > > > It seem to be much harder to trigger with the signal-based version. It's
> > > > > > > > > > expected, because the writer takes about 50 times longer to execute than
> > > > > > > > > > with the -DDEBUG_FULL_MB version.
> > > > > > > > > > 
> > > > > > > > > > So I'll let the ./test_urcu_yield NN -r -n run for a while on the
> > > > > > > > > > correct version (with DEBUG_FULL_MB) and see what it gives.
> > > > > > > > > 
> > > > > > > > > Hmmm...  I had worse luck this time, took three 10-second tries to
> > > > > > > > > see a failure:
> > > > > > > > > 
> > > > > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ ./rcu_nest32 1 stress
> > > > > > > > > n_reads: 44682055  n_updates: 9609503  n_mberror: 0
> > > > > > > > > rcu_stress_count: 44679377 2678 0 0 0 0 0 0 0 0 0
> > > > > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > > > > > > > ./rcu_nest32 1 stress
> > > > > > > > > n_reads: 42281884  n_updates: 9870129  n_mberror: 0
> > > > > > > > > rcu_stress_count: 42277756 4128 0 0 0 0 0 0 0 0 0
> > > > > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > > > > > > > ./rcu_nest32 1 stress
> > > > > > > > > n_reads: 41384304  n_updates: 10040805  n_mberror: 0
> > > > > > > > > rcu_stress_count: 41380075 4228 1 0 0 0 0 0 0 0 0
> > > > > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$
> > > > > > > > > 
> > > > > > > > > This is my prototype version, with read-side memory barriers, no
> > > > > > > > > signals, and without your initialization-value speedup.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > It would be interesting to re-sync our trees, or if you can point me to
> > > > > > > > a current version of your prototype, I could review it.
> > > > > > > 
> > > > > > > Look at:
> > > > > > > 
> > > > > > > 	CodeSamples/defer/rcu_nest32.[hc]
> > > > > > > 
> > > > > > > In the git archive:
> > > > > > > 
> > > > > > > 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > > > > 
> > > > > > flip_counter_and_wait : yours do rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT
> > > > > > mine : rcu_gp_ctr ^= RCU_GP_CTR_BOTTOM_BIT.
> > > > > 
> > > > > Yep, this is before your optimization.
> > > > > 
> > > > 
> > > > Hrm, and given the RCU_GP_CTR_BOTTOM_BIT is in the MSBs, there is no
> > > > possible effect on the LSBs. That should work even if it overflows. OK.
> > > > That should even work with my optimization. But I somehow prefer the xor
> > > > (if it's not slower), because we really only need 1 bit to flip on and
> > > > off.
> > > > 
> > > > > > Another major difference between our tree is the lack of smp_mb() at the
> > > > > > end of flip_counter_and_wait() (in your tree).
> > > > > > 
> > > > > > Your code does :
> > > > > > 
> > > > > >   smp_mb()
> > > > > >   switch parity
> > > > > >   smp_mb()
> > > > > >   wait for each thread ongoing old gp
> > > > > >     <<<<<<< ---- missing smp_mb.
> > > > > >   switch parity
> > > > > >   smp_mb()
> > > > > >   wait for each thread ongoing old gp
> > > > > >   smp_mb()
> > > > > 
> > > > > This should be OK -- or am I missing a failure scenario?
> > > > > Keep in mind that I get failures only when omitting a counter
> > > > > flip, not with the above code.
> > > > > 
> > > > 
> > > > OK, it's good that you point out that the failure only occurs when
> > > > omitting the counter flip.
> > > > 
> > > > So if we leave out the mb() we can end up in a situation where a reader
> > > > thread is still in an ongoing old gp and we switch the parity. The big
> > > > question is : should we be concerned about this ?
> > > > 
> > > > From the writer point of view :
> > > > 
> > > > Given there is no data dependency between the parity update and the
> > > > per_thread(rcu_reader_gp, t) read done in the while loop waiting for
> > > > threads, and given even the compiler barrier() has no effect wrt the
> > > > last test done after the last iteration of the loop, we could think of
> > > > compiler optimizations doing the following to our code (let's focus on a
> > > > single loop of for_each_thread) :
> > > > 
> > > > transforming
> > > > 
> > > >                 while (rcu_old_gp_ongoing(t))
> > > >                         barrier();
> > > >                 rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT;
> > > > 
> > > > into
> > > > 
> > > >                 if (!rcu_old_gp_ongoing(t))
> > > >                   goto end;
> > > >                 while (rcu_old_gp_ongoing(t))
> > > >                         barrier();
> > > > end:
> > > >                 rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT;
> > > > 
> > > > This leaves the choice to the compiler to perform the rcu_gp_ctr
> > > > increment before the per_thread(rcu_reader_gp, t) read, because there is
> > > > no barrier.
> > > > 
> > > > Not only does this apply to the compiler, but also to the memory
> > > > barriers. We can end up in a situation where the CPU decides to to the
> > > > rcu_gp_ctr increment before reading the last rcu_old_gp_ongoing value,
> > > > given there is no data dependency between those two.
> > > > 
> > > > You could argue that ACCESS_ONCE() around the per_thread(rcu_reader_gp,
> > > > t) read will order reads, but I don't think we should rely on this on
> > > > SMP. This is really supposed to be there just to make sure we don't end
> > > > up doing multiple variable reads on UP wrt to local interrupts.
> > > > 
> > > > You could also argue that rcu_gp_ctr is read within
> > > > rcu_old_gp_ongoing(), which should normally order the memory accesses.
> > > > It actually does only order memory access to the rcu_gp_ctr variable,
> > > > not the per_thread(rcu_reader_gp, t), because, here again, there if no
> > > > data dependency whatsoever between per_thread(rcu_reader_gp, t) and
> > > > rcu_gp_ctr. A possible scenario : rcu_gp_ctr could be read, then we have
> > > > the rcu_gp_ctr increment, and only then could the
> > > > per_thread(rcu_reader_gp, t) variable be read to perform the test.
> > > > 
> > > > But I see that even in rcu_read_lock, there is no strict ordering
> > > > between __get_thread_var(rcu_reader_gp) and rcu_gp_ctr read. Therefore,
> > > > I conclude that ordering between those two variables does not matter at
> > > > all. I also suspect that this is the core reason for doing 2 q.s. period
> > > > flip at each update.
> > > > 
> > > > Am I correct ?
> > > 
> > > I do not believe so -- please see my earlier email calling out the
> > > sequence of events leading to failure in the single-flip case:
> > > 
> > > 	http://lkml.org/lkml/2009/2/7/67
> > > 
> > 
> > Hrm, let me present it in a different, more straightfoward way :
> > 
> > In you Promela model (here : http://lkml.org/lkml/2009/2/10/419)
> > 
> > There is a memory barrier here in the updater :
> > 
> > 	do
> > 	:: 1 ->
> > 		if
> > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > 			skip;
> > 		:: else -> break;
> > 		fi
> > 	od;
> > 	need_mb = 1;
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od;
> > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> 
> I believe you were actually looking for a memory barrier here, not?
> I do not believe that your urcu.c has a memory barrier here, please
> see below.
> 
> > 	do
> > 	:: 1 ->
> > 		if
> > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > 			skip;
> > 		:: else -> break;
> > 		fi;
> > 	od;
> > 
> > However, in your C code of nest_32.c, there is none. So it is at the
> > very least an inconsistency between your code and your model.
> 
> The urcu.c 3a9e6e9df706b8d39af94d2f027210e2e7d4106e lays out as follows:
> 
> synchronize_rcu()
> 
> 	switch_qparity()
> 
> 		force_mb_all_threads()
> 
> 		switch_next_urcu_qparity()  [Just does counter flip]
> 

Hrm... there would potentially be a missing mb() here.

> 		wait_for_quiescent_state()
> 
> 			Wait for all threads
> 
> 			force_mb_all_threads()
> 				My model does not represent this
> 				memory barrier, because it seemed to
> 				me that it was redundant with the
> 				following one.
> 

Yes, this one is redundant.

> 				I added it, no effect.
> 
> 	switch_qparity()
> 
> 		force_mb_all_threads()
> 
> 		switch_next_urcu_qparity()  [Just does counter flip]
> 

Same as above, potentially missing mb().

> 		wait_for_quiescent_state()
> 
> 			Wait for all threads
> 
> 			force_mb_all_threads()
> 
> The rcu_nest32.c 6da793208a8f60ea41df60164ded85b4c5c5307d lays out as
> follows:
> 
> synchronize_rcu()
> 
> 	flip_counter_and_wait()
> 
> 		flips counter
> 
> 		smp_mb();
> 
> 		Wait for threads
> 

this is the point where I wonder if we should add a mb() to your code.

> 	flip_counter_and_wait()
> 
> 		flips counter
> 
> 		smp_mb();
> 
> 		Wait for threads
> 
> So, if I am reading the code correctly, I have memory barriers
> everywhere you don't and vice versa.  ;-)
> 

Exactly. You have mb() between 
flips counter and (next) Wait for threads

I have mb() between
(previous) Wait for threads and flips counter

Both might be required. Or none. :)

> The reason that I believe that I do not need a memory barrier between
> the wait-for-threads and the subsequent flip is that the threads we
> are waiting for have to have already committed to the earlier value of
> the counter, and so changing the counter out of order has no effect.
> 
> Does this make sense, or am I confused?
> 

So if we remove the mb() as in your code, between the flips counter and
(next) Wait for thread, we are doing these operations in random order at
the write site:

Sequence 1 - what we expect
A.1 - flip counter
A.2 - read counter
B   - read other threads urcu_active_readers

So what happens if the CPU decides to reorder the unrelated
operations? We get :

Sequence 2
B   - read other threads urcu_active_readers
A.1 - flip counter
A.2 - read counter

Sequence 3
A.1 - flip counter
A.2 - read counter
B   - read other threads urcu_active_readers

Sequence 4
A.1 - flip counter
B   - read other threads urcu_active_readers
A.2 - read counter


Sequence 1, 3 and 4 are OK because the counter flip happens before we
read other thread's urcu_active_readers counts.

However, we have to consider Sequence 2 carefully, because we will read
other threads uru_active_readers count before those readers see that we
flipped the counter.

The reader side does either :

seq. 1
R.1 - read urcu_active_readers
S.2 - read counter
RS.2- write urcu_active_readers, depends on read counter and read
      urcu_active_readers

(with R.1 and S.2 in random order)

or

seq. 2
R.1 - read urcu_active_readers
R.2 - write urcu_active_readers, depends on read urcu_active_readers


So we could have the following reader+writer sequence :

Interleaved writer Sequence 2 and reader seq. 1.

Reader:
R.1 - read urcu_active_readers
S.2 - read counter
Writer:
B   - read other threads urcu_active_readers (there are none)
A.1 - flip counter
A.2 - read counter
Reader:
RS.2- write urcu_active_readers, depends on read counter and read
      urcu_active_readers

Here, the reader would have updated its counter as belonging to the old
q.s. period, but the writer will later wait for the new period. But
given the writer will eventually do a second flip+wait, the reader in
the other q.s. window will be caught by the second flip.

Therefore, we could be tempted to think that those mb() could be
unnecessary, which would lead to a scheme where urcu_active_readers and
urcu_gp_ctr are done in a completely random order one vs the other.
Let's see what it gives :

synchronize_rcu()

  force_mb_all_threads()  /*
                           * Orders pointer publication and 
                           * (urcu_active_readers/urcu_gp_ctr accesses)
                           */
  switch_qparity()

    switch_next_urcu_qparity()  [just does counter flip 0->1]

    wait_for_quiescent_state()

      wait for all threads in parity 0

  switch_qparity()

    switch_next_urcu_qparity()  [Just does counter flip 1->0]

    wait_for_quiescent_state()

      Wait for all threads in parity 1

  force_mb_all_threads()  /*
                           * Orders
                           * (urcu_active_readers/urcu_gp_ctr accesses)
                           * and old data removal.
                           */



*but* ! There is a reason why we don't want to do this. If

    switch_next_urcu_qparity()  [Just does counter flip 1->0]

happens before the end of the previous

      Wait for all threads in parity 0

We enter in a situation where all newly coming readers will see the
parity bit as 0, although we are still waiting for that parity to end.
We end up in a state when the writer can be blocked forever (no possible
progress) if there are steadily readers subscribed for the data.

Basically, to put it differently, we could simply remove the bit
flipping from the writer and wait for *all* readers to exit their
critical section (even the ones simply interested in the new pointer).
But this shares the same problem the version above has, which is that we
end up in a situation where the writer won't progress if there are
always readers in a critical section.

The same applies to 

    switch_next_urcu_qparity()  [Just does counter flip 0->1]

      wait for all threads in parity 0

If we don't put a mb() between those two (as I mistakenly did), we can
end up waiting for readers in parity 0 while the parity bit wasn't
flipped yet. oops. Same potential no-progress situation.

The ordering of memory reads in the reader for
urcu_active_readers/urcu_gp_ctr accesses does not seem to matter because
the data contains information about which q.s. period parity it is in.
In whichever order those variables are read seems to all work fine.

In the end, it's to insure that the writer will always progress that we
have to enforce smp_mb() between *all* switch_next_urcu_qparity and wait
for threads. Mine and yours.

Or maybe there is a detail I haven't correctly understood that insures
this already without the mb() in your code ?

> (BTW, I do not trust my model yet, as it currently cannot detect the
> failure case I pointed out earlier.  :-/  Here and I thought that the
> point of such models was to detect additional failure cases!!!)
> 

Yes, I'll have to dig deeper into it.

Mathieu

> 							Thanx, Paul
> 
> > > > > > I also wonder why you have a smp_mb() after spin_unlock() in your
> > > > > > synchronize_rcu() -> if you follow the Linux kernel semantics for
> > > > > > spinlocks, the smp_mb() should be implied. (but I have not looked at
> > > > > > your spin_lock/unlock primitives yet).
> > > > > 
> > > > > Perhaps things have changed, but last I knew, spin_lock() and
> > > > > spin_unlock() were only required to keep the critical section in, not
> > > > > to keep things out of the critical section.
> > > > > 
> > > > 
> > > > Hrm, reading Documentation/memory-barriers.txt again tells me things
> > > > might have changed (if I am reading correctly the section LOCKS VS
> > > > MEMORY ACCESSES).
> > > 
> > > In the 2.6.26 version of Documentation/memory-barriers.txt, there is
> > > the following near line 366:
> > > 
> > >  (5) LOCK operations.
> > > 
> > >      This acts as a one-way permeable barrier.  It guarantees that all memory
> > >      operations after the LOCK operation will appear to happen after the LOCK
> > >      operation with respect to the other components of the system.
> > > 
> > >      Memory operations that occur before a LOCK operation may appear to happen
> > >      after it completes.
> > > 
> > >      A LOCK operation should almost always be paired with an UNLOCK operation.
> > > 
> > > 
> > >  (6) UNLOCK operations.
> > > 
> > >      This also acts as a one-way permeable barrier.  It guarantees that all
> > >      memory operations before the UNLOCK operation will appear to happen before
> > >      the UNLOCK operation with respect to the other components of the system.
> > > 
> > >      Memory operations that occur after an UNLOCK operation may appear to
> > >      happen before it completes.
> > > 
> > >      LOCK and UNLOCK operations are guaranteed to appear with respect to each
> > >      other strictly in the order specified.
> > > 
> > >      The use of LOCK and UNLOCK operations generally precludes the need for
> > >      other sorts of memory barrier (but note the exceptions mentioned in the
> > >      subsection "MMIO write barrier").
> > > 
> > > > Correct me if I am wrong, but I don't think it makes sense to insure
> > > > memory barriers to keep accesses within the critical section and not
> > > > outside, because such memory access could well be another spinlock.
> > > 
> > > Almost, but not quite.  ;-)
> > > 
> > > > Therefore, we could end up in a situation where we have two locks, A and
> > > > B, taken in the following order in the source code :
> > > > 
> > > > LOCK A
> > > > 
> > > > UNLOCK A
> > > > 
> > > > LOCK B
> > > > 
> > > > UNLOCK B
> > > > 
> > > > Then, following your assumption, it would be possible for a CPU to do
> > > > the memory accesses associated to lock A and B in a random order one vs
> > > > the other. Given there would be no requirement to keep things out of
> > > > those respective critical sections, LOCK A could be taken within LOCK B,
> > > > and the opposite would also be valid.
> > > > 
> > > > Valid memory access orders :
> > > > 
> > > > 1)
> > > > LOCK A
> > > > LOCK B
> > > > UNLOCK B
> > > > UNLOCK A
> > > > 
> > > > 2)
> > > > LOCK B
> > > > LOCK A
> > > > UNLOCK A
> > > > UNLOCK B
> > > 
> > > #2 is wrong -- LOCK A is guaranteed to prohibit LOCK B from passing it,
> > > as that would be equivalent to letting LOCK A's critical section leak out.
> > > 
> > > > The only constraint that ensures we won't end up in this situation is
> > > > the fact that memory accesses done outside of the critical section stays
> > > > outside of the critical section.
> > > 
> > > Let's take it one transformation at a time:
> > > 
> > > 1.	LOCK A; UNLOCK A; LOCK B; UNLOCK B
> > > 
> > > 2.	LOCK A; LOCK B; UNLOCK A; UNLOCK B
> > > 
> > > 	This one is OK, because both the LOCK B and the UNLOCK A
> > > 	are permitted to allow more stuff to enter their respective
> > > 	critical sections.
> > > 
> > > 3.	LOCK B; LOCK A; UNLOCK A; UNLOCK B
> > > 
> > > 	This is -not- legal!  LOCK A is forbidden to allow LOCK B
> > > 	to escape its critical section.
> > > 
> > > Does this make sense?
> > > 
> > 
> > Ah, yes. Thanks for the explanation.
> > 
> > Mathieu
> > 
> > > 							Thanx, Paul
> > > 
> > > > Mathieu
> > > > 
> > > > 
> > > > 
> > > > > 							Thanx, Paul
> > > > > 
> > > > > > Mathieu
> > > > > > 
> > > > > > > 							Thanx, Paul
> > > > > > > 
> > > > > > > _______________________________________________
> > > > > > > ltt-dev mailing list
> > > > > > > ltt-dev@lists.casi.polymtl.ca
> > > > > > > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > > > > > > 
> > > > > > 
> > > > > > -- 
> > > > > > Mathieu Desnoyers
> > > > > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > > > > 
> > > > 
> > > > -- 
> > > > Mathieu Desnoyers
> > > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > > 
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-11 21:42                                       ` Mathieu Desnoyers
@ 2009-02-11 22:08                                         ` Mathieu Desnoyers
       [not found]                                         ` <20090212003549.GU6694@linux.vnet.ibm.com>
  1 sibling, 0 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-11 22:08 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Mathieu Desnoyers (compudj@krystal.dyndns.org) wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Wed, Feb 11, 2009 at 01:52:03PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Wed, Feb 11, 2009 at 01:35:20AM -0500, Mathieu Desnoyers wrote:
> > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > On Tue, Feb 10, 2009 at 07:57:01PM -0500, Mathieu Desnoyers wrote:
> > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > > On Tue, Feb 10, 2009 at 04:28:33PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > > > > On Tue, Feb 10, 2009 at 02:17:31PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > > > > > > On Mon, Feb 09, 2009 at 02:03:17AM -0500, Mathieu Desnoyers wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > [ . . . ]
> > > > > > > > > > > > 
> > > > > > > > > > > > > I just added modified rcutorture.h and api.h from your git tree
> > > > > > > > > > > > > specifically for an urcutorture program to the repository. Some results :
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 8-way x86_64
> > > > > > > > > > > > > E5405 @2 GHZ
> > > > > > > > > > > > > 
> > > > > > > > > > > > > ./urcutorture 8 perf
> > > > > > > > > > > > > n_reads: 1937650000  n_updates: 3  nreaders: 8  nupdaters: 1 duration: 1
> > > > > > > > > > > > > ns/read: 4.12871  ns/update: 3.33333e+08
> > > > > > > > > > > > > 
> > > > > > > > > > > > > ./urcutorture 8 uperf
> > > > > > > > > > > > > n_reads: 0  n_updates: 4413892  nreaders: 0  nupdaters: 8 duration: 1
> > > > > > > > > > > > > ns/read: nan  ns/update: 1812.46
> > > > > > > > > > > > > 
> > > > > > > > > > > > > n_reads: 98844204  n_updates: 10  n_mberror: 0
> > > > > > > > > > > > > rcu_stress_count: 98844171 33 0 0 0 0 0 0 0 0 0
> > > > > > > > > > > > > 
> > > > > > > > > > > > > However, I've tried removing the second switch_qparity() call, and the
> > > > > > > > > > > > > rcutorture test did not detect anything wrong. I also did a variation
> > > > > > > > > > > > > which calls the "sched_yield" version of the urcu, "urcutorture-yield".
> > > > > > > > > > > > 
> > > > > > > > > > > > My confusion -- I was testing my old approach where the memory barriers
> > > > > > > > > > > > are in rcu_read_lock() and rcu_read_unlock().  To force the failures in
> > > > > > > > > > > > your signal-handler-memory-barrier approach, I suspect that you are
> > > > > > > > > > > > going to need a bigger hammer.  In this case, one such bigger hammer
> > > > > > > > > > > > would be:
> > > > > > > > > > > > 
> > > > > > > > > > > > o	Just before exit from the signal handler, do a
> > > > > > > > > > > > 	pthread_cond_wait() under a pthread_mutex().
> > > > > > > > > > > > 
> > > > > > > > > > > > o	In force_mb_all_threads(), refrain from sending a signal to self.
> > > > > > > > > > > > 
> > > > > > > > > > > > 	Then it should be safe in force_mb_all_threads() to do a
> > > > > > > > > > > > 	pthread_cond_broadcast() under the same pthread_mutex().
> > > > > > > > > > > > 
> > > > > > > > > > > > This should raise the probability of seeing the failure in the case
> > > > > > > > > > > > where there is a single switch_qparity().
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > I just did a mb() version of the urcu :
> > > > > > > > > > > 
> > > > > > > > > > > (uncomment CFLAGS=+-DDEBUG_FULL_MB in the Makefile)
> > > > > > > > > > > 
> > > > > > > > > > > Time per read : 48.4086 cycles
> > > > > > > > > > > (about 6-7 times slower, as expected)
> > > > > > > > > > > 
> > > > > > > > > > > This will be useful especially to increase the chance to trigger races.
> > > > > > > > > > > 
> > > > > > > > > > > I tried removing the second parity switch from the writer. The rcu
> > > > > > > > > > > torture test did not find the problem yet (maybe I am not using the
> > > > > > > > > > > correct parameters ? It does not run for more than 5 seconds).
> > > > > > > > > > > 
> > > > > > > > > > > So I added a "-n" option to test_urcu, so it can make the usleep(1)
> > > > > > > > > > > between the writes optional. I also changed the yield for a usleep with
> > > > > > > > > > > random delay. I also now use a circular buffer rather than malloc so we
> > > > > > > > > > > are sure the memory is not quickly reused by the writer and stays longer
> > > > > > > > > > > in an invalid state.
> > > > > > > > > > > 
> > > > > > > > > > > So what really make the problem appear quickly is to add a delay between
> > > > > > > > > > > the rcu_dereference and the assertion on the data validity in thr_reader.
> > > > > > > > > > > 
> > > > > > > > > > > It now appears after just a few seconds when running
> > > > > > > > > > > ./test_urcu_yield 20 -r -n
> > > > > > > > > > > Compiled with CFLAGS=+-DDEBUG_FULL_MB
> > > > > > > > > > > 
> > > > > > > > > > > It seem to be much harder to trigger with the signal-based version. It's
> > > > > > > > > > > expected, because the writer takes about 50 times longer to execute than
> > > > > > > > > > > with the -DDEBUG_FULL_MB version.
> > > > > > > > > > > 
> > > > > > > > > > > So I'll let the ./test_urcu_yield NN -r -n run for a while on the
> > > > > > > > > > > correct version (with DEBUG_FULL_MB) and see what it gives.
> > > > > > > > > > 
> > > > > > > > > > Hmmm...  I had worse luck this time, took three 10-second tries to
> > > > > > > > > > see a failure:
> > > > > > > > > > 
> > > > > > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ ./rcu_nest32 1 stress
> > > > > > > > > > n_reads: 44682055  n_updates: 9609503  n_mberror: 0
> > > > > > > > > > rcu_stress_count: 44679377 2678 0 0 0 0 0 0 0 0 0
> > > > > > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > > > > > > > > ./rcu_nest32 1 stress
> > > > > > > > > > n_reads: 42281884  n_updates: 9870129  n_mberror: 0
> > > > > > > > > > rcu_stress_count: 42277756 4128 0 0 0 0 0 0 0 0 0
> > > > > > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$ !!
> > > > > > > > > > ./rcu_nest32 1 stress
> > > > > > > > > > n_reads: 41384304  n_updates: 10040805  n_mberror: 0
> > > > > > > > > > rcu_stress_count: 41380075 4228 1 0 0 0 0 0 0 0 0
> > > > > > > > > > paulmck@paulmck-laptop:~/paper/perfbook/CodeSamples/defer$
> > > > > > > > > > 
> > > > > > > > > > This is my prototype version, with read-side memory barriers, no
> > > > > > > > > > signals, and without your initialization-value speedup.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > It would be interesting to re-sync our trees, or if you can point me to
> > > > > > > > > a current version of your prototype, I could review it.
> > > > > > > > 
> > > > > > > > Look at:
> > > > > > > > 
> > > > > > > > 	CodeSamples/defer/rcu_nest32.[hc]
> > > > > > > > 
> > > > > > > > In the git archive:
> > > > > > > > 
> > > > > > > > 	git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/perfbook.git
> > > > > > > 
> > > > > > > flip_counter_and_wait : yours do rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT
> > > > > > > mine : rcu_gp_ctr ^= RCU_GP_CTR_BOTTOM_BIT.
> > > > > > 
> > > > > > Yep, this is before your optimization.
> > > > > > 
> > > > > 
> > > > > Hrm, and given the RCU_GP_CTR_BOTTOM_BIT is in the MSBs, there is no
> > > > > possible effect on the LSBs. That should work even if it overflows. OK.
> > > > > That should even work with my optimization. But I somehow prefer the xor
> > > > > (if it's not slower), because we really only need 1 bit to flip on and
> > > > > off.
> > > > > 
> > > > > > > Another major difference between our tree is the lack of smp_mb() at the
> > > > > > > end of flip_counter_and_wait() (in your tree).
> > > > > > > 
> > > > > > > Your code does :
> > > > > > > 
> > > > > > >   smp_mb()
> > > > > > >   switch parity
> > > > > > >   smp_mb()
> > > > > > >   wait for each thread ongoing old gp
> > > > > > >     <<<<<<< ---- missing smp_mb.
> > > > > > >   switch parity
> > > > > > >   smp_mb()
> > > > > > >   wait for each thread ongoing old gp
> > > > > > >   smp_mb()
> > > > > > 
> > > > > > This should be OK -- or am I missing a failure scenario?
> > > > > > Keep in mind that I get failures only when omitting a counter
> > > > > > flip, not with the above code.
> > > > > > 
> > > > > 
> > > > > OK, it's good that you point out that the failure only occurs when
> > > > > omitting the counter flip.
> > > > > 
> > > > > So if we leave out the mb() we can end up in a situation where a reader
> > > > > thread is still in an ongoing old gp and we switch the parity. The big
> > > > > question is : should we be concerned about this ?
> > > > > 
> > > > > From the writer point of view :
> > > > > 
> > > > > Given there is no data dependency between the parity update and the
> > > > > per_thread(rcu_reader_gp, t) read done in the while loop waiting for
> > > > > threads, and given even the compiler barrier() has no effect wrt the
> > > > > last test done after the last iteration of the loop, we could think of
> > > > > compiler optimizations doing the following to our code (let's focus on a
> > > > > single loop of for_each_thread) :
> > > > > 
> > > > > transforming
> > > > > 
> > > > >                 while (rcu_old_gp_ongoing(t))
> > > > >                         barrier();
> > > > >                 rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT;
> > > > > 
> > > > > into
> > > > > 
> > > > >                 if (!rcu_old_gp_ongoing(t))
> > > > >                   goto end;
> > > > >                 while (rcu_old_gp_ongoing(t))
> > > > >                         barrier();
> > > > > end:
> > > > >                 rcu_gp_ctr += RCU_GP_CTR_BOTTOM_BIT;
> > > > > 
> > > > > This leaves the choice to the compiler to perform the rcu_gp_ctr
> > > > > increment before the per_thread(rcu_reader_gp, t) read, because there is
> > > > > no barrier.
> > > > > 
> > > > > Not only does this apply to the compiler, but also to the memory
> > > > > barriers. We can end up in a situation where the CPU decides to to the
> > > > > rcu_gp_ctr increment before reading the last rcu_old_gp_ongoing value,
> > > > > given there is no data dependency between those two.
> > > > > 
> > > > > You could argue that ACCESS_ONCE() around the per_thread(rcu_reader_gp,
> > > > > t) read will order reads, but I don't think we should rely on this on
> > > > > SMP. This is really supposed to be there just to make sure we don't end
> > > > > up doing multiple variable reads on UP wrt to local interrupts.
> > > > > 
> > > > > You could also argue that rcu_gp_ctr is read within
> > > > > rcu_old_gp_ongoing(), which should normally order the memory accesses.
> > > > > It actually does only order memory access to the rcu_gp_ctr variable,
> > > > > not the per_thread(rcu_reader_gp, t), because, here again, there if no
> > > > > data dependency whatsoever between per_thread(rcu_reader_gp, t) and
> > > > > rcu_gp_ctr. A possible scenario : rcu_gp_ctr could be read, then we have
> > > > > the rcu_gp_ctr increment, and only then could the
> > > > > per_thread(rcu_reader_gp, t) variable be read to perform the test.
> > > > > 
> > > > > But I see that even in rcu_read_lock, there is no strict ordering
> > > > > between __get_thread_var(rcu_reader_gp) and rcu_gp_ctr read. Therefore,
> > > > > I conclude that ordering between those two variables does not matter at
> > > > > all. I also suspect that this is the core reason for doing 2 q.s. period
> > > > > flip at each update.
> > > > > 
> > > > > Am I correct ?
> > > > 
> > > > I do not believe so -- please see my earlier email calling out the
> > > > sequence of events leading to failure in the single-flip case:
> > > > 
> > > > 	http://lkml.org/lkml/2009/2/7/67
> > > > 
> > > 
> > > Hrm, let me present it in a different, more straightfoward way :
> > > 
> > > In you Promela model (here : http://lkml.org/lkml/2009/2/10/419)
> > > 
> > > There is a memory barrier here in the updater :
> > > 
> > > 	do
> > > 	:: 1 ->
> > > 		if
> > > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > 			skip;
> > > 		:: else -> break;
> > > 		fi
> > > 	od;
> > > 	need_mb = 1;
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od;
> > > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > 
> > I believe you were actually looking for a memory barrier here, not?
> > I do not believe that your urcu.c has a memory barrier here, please
> > see below.
> > 
> > > 	do
> > > 	:: 1 ->
> > > 		if
> > > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > 			skip;
> > > 		:: else -> break;
> > > 		fi;
> > > 	od;
> > > 
> > > However, in your C code of nest_32.c, there is none. So it is at the
> > > very least an inconsistency between your code and your model.
> > 
> > The urcu.c 3a9e6e9df706b8d39af94d2f027210e2e7d4106e lays out as follows:
> > 
> > synchronize_rcu()
> > 
> > 	switch_qparity()
> > 
> > 		force_mb_all_threads()
> > 
> > 		switch_next_urcu_qparity()  [Just does counter flip]
> > 
> 
> Hrm... there would potentially be a missing mb() here.
> 
> > 		wait_for_quiescent_state()
> > 
> > 			Wait for all threads
> > 
> > 			force_mb_all_threads()
> > 				My model does not represent this
> > 				memory barrier, because it seemed to
> > 				me that it was redundant with the
> > 				following one.
> > 
> 
> Yes, this one is redundant.
> 
> > 				I added it, no effect.
> > 
> > 	switch_qparity()
> > 
> > 		force_mb_all_threads()
> > 
> > 		switch_next_urcu_qparity()  [Just does counter flip]
> > 
> 
> Same as above, potentially missing mb().
> 
> > 		wait_for_quiescent_state()
> > 
> > 			Wait for all threads
> > 
> > 			force_mb_all_threads()
> > 
> > The rcu_nest32.c 6da793208a8f60ea41df60164ded85b4c5c5307d lays out as
> > follows:
> > 
> > synchronize_rcu()
> > 
> > 	flip_counter_and_wait()
> > 
> > 		flips counter
> > 
> > 		smp_mb();
> > 
> > 		Wait for threads
> > 
> 
> this is the point where I wonder if we should add a mb() to your code.
> 
> > 	flip_counter_and_wait()
> > 
> > 		flips counter
> > 
> > 		smp_mb();
> > 
> > 		Wait for threads
> > 
> > So, if I am reading the code correctly, I have memory barriers
> > everywhere you don't and vice versa.  ;-)
> > 
> 
> Exactly. You have mb() between 
> flips counter and (next) Wait for threads
> 
> I have mb() between
> (previous) Wait for threads and flips counter
> 
> Both might be required. Or none. :)
> 
> > The reason that I believe that I do not need a memory barrier between
> > the wait-for-threads and the subsequent flip is that the threads we
> > are waiting for have to have already committed to the earlier value of
> > the counter, and so changing the counter out of order has no effect.
> > 
> > Does this make sense, or am I confused?
> > 
> 
> So if we remove the mb() as in your code, between the flips counter and
> (next) Wait for thread, we are doing these operations in random order at
> the write site:
> 
> Sequence 1 - what we expect
> A.1 - flip counter
> A.2 - read counter
> B   - read other threads urcu_active_readers
> 
> So what happens if the CPU decides to reorder the unrelated
> operations? We get :
> 
> Sequence 2
> B   - read other threads urcu_active_readers
> A.1 - flip counter
> A.2 - read counter
> 
> Sequence 3
> A.1 - flip counter
> A.2 - read counter
> B   - read other threads urcu_active_readers
> 
> Sequence 4
> A.1 - flip counter
> B   - read other threads urcu_active_readers
> A.2 - read counter
> 
> 
> Sequence 1, 3 and 4 are OK because the counter flip happens before we
> read other thread's urcu_active_readers counts.
> 
> However, we have to consider Sequence 2 carefully, because we will read
> other threads uru_active_readers count before those readers see that we
> flipped the counter.
> 
> The reader side does either :
> 
> seq. 1
> R.1 - read urcu_active_readers
> S.2 - read counter
> RS.2- write urcu_active_readers, depends on read counter and read
>       urcu_active_readers
> 
> (with R.1 and S.2 in random order)
> 
> or
> 
> seq. 2
> R.1 - read urcu_active_readers
> R.2 - write urcu_active_readers, depends on read urcu_active_readers
> 
> 
> So we could have the following reader+writer sequence :
> 
> Interleaved writer Sequence 2 and reader seq. 1.
> 
> Reader:
> R.1 - read urcu_active_readers
> S.2 - read counter
> Writer:
> B   - read other threads urcu_active_readers (there are none)
> A.1 - flip counter
> A.2 - read counter
> Reader:
> RS.2- write urcu_active_readers, depends on read counter and read
>       urcu_active_readers
> 
> Here, the reader would have updated its counter as belonging to the old
> q.s. period, but the writer will later wait for the new period. But
> given the writer will eventually do a second flip+wait, the reader in
> the other q.s. window will be caught by the second flip.
> 
> Therefore, we could be tempted to think that those mb() could be
> unnecessary, which would lead to a scheme where urcu_active_readers and
> urcu_gp_ctr are done in a completely random order one vs the other.
> Let's see what it gives :
> 
> synchronize_rcu()
> 
>   force_mb_all_threads()  /*
>                            * Orders pointer publication and 
>                            * (urcu_active_readers/urcu_gp_ctr accesses)
>                            */
>   switch_qparity()
> 
>     switch_next_urcu_qparity()  [just does counter flip 0->1]
> 
>     wait_for_quiescent_state()
> 
>       wait for all threads in parity 0
> 
>   switch_qparity()
> 
>     switch_next_urcu_qparity()  [Just does counter flip 1->0]
> 
>     wait_for_quiescent_state()
> 
>       Wait for all threads in parity 1
> 
>   force_mb_all_threads()  /*
>                            * Orders
>                            * (urcu_active_readers/urcu_gp_ctr accesses)
>                            * and old data removal.
>                            */
> 
> 
> 
> *but* ! There is a reason why we don't want to do this. If
> 
>     switch_next_urcu_qparity()  [Just does counter flip 1->0]
> 
> happens before the end of the previous
> 
>       Wait for all threads in parity 0
> 
> We enter in a situation where all newly coming readers will see the
> parity bit as 0, although we are still waiting for that parity to end.
> We end up in a state when the writer can be blocked forever (no possible
> progress) if there are steadily readers subscribed for the data.
> 
> Basically, to put it differently, we could simply remove the bit
> flipping from the writer and wait for *all* readers to exit their
> critical section (even the ones simply interested in the new pointer).
> But this shares the same problem the version above has, which is that we
> end up in a situation where the writer won't progress if there are
> always readers in a critical section.
> 
> The same applies to 
> 
>     switch_next_urcu_qparity()  [Just does counter flip 0->1]
> 
>       wait for all threads in parity 0
> 
> If we don't put a mb() between those two (as I mistakenly did), we can
> end up waiting for readers in parity 0 while the parity bit wasn't
> flipped yet. oops. Same potential no-progress situation.
> 
> The ordering of memory reads in the reader for
> urcu_active_readers/urcu_gp_ctr accesses does not seem to matter because
> the data contains information about which q.s. period parity it is in.
> In whichever order those variables are read seems to all work fine.
> 
> In the end, it's to insure that the writer will always progress that we
> have to enforce smp_mb() between *all* switch_next_urcu_qparity and wait
> for threads. Mine and yours.

On a related note :

The code in rcu_old_gp_ongoing (in my git tree) uses ACCESS_ONCE around
urcu_active_readers and urcu_gp_ctr reads. I think the ACCESS_ONCE
around urcu_gp_crt is useless because urcu_gp_ctr is only being modified
by ourself and we hold a mutex.

However, making sure that urcu_active_readers is only read once is
clearly required.

Mathieu


> 
> Or maybe there is a detail I haven't correctly understood that insures
> this already without the mb() in your code ?
> 
> > (BTW, I do not trust my model yet, as it currently cannot detect the
> > failure case I pointed out earlier.  :-/  Here and I thought that the
> > point of such models was to detect additional failure cases!!!)
> > 
> 
> Yes, I'll have to dig deeper into it.
> 
> Mathieu
> 
> > 							Thanx, Paul
> > 
> > > > > > > I also wonder why you have a smp_mb() after spin_unlock() in your
> > > > > > > synchronize_rcu() -> if you follow the Linux kernel semantics for
> > > > > > > spinlocks, the smp_mb() should be implied. (but I have not looked at
> > > > > > > your spin_lock/unlock primitives yet).
> > > > > > 
> > > > > > Perhaps things have changed, but last I knew, spin_lock() and
> > > > > > spin_unlock() were only required to keep the critical section in, not
> > > > > > to keep things out of the critical section.
> > > > > > 
> > > > > 
> > > > > Hrm, reading Documentation/memory-barriers.txt again tells me things
> > > > > might have changed (if I am reading correctly the section LOCKS VS
> > > > > MEMORY ACCESSES).
> > > > 
> > > > In the 2.6.26 version of Documentation/memory-barriers.txt, there is
> > > > the following near line 366:
> > > > 
> > > >  (5) LOCK operations.
> > > > 
> > > >      This acts as a one-way permeable barrier.  It guarantees that all memory
> > > >      operations after the LOCK operation will appear to happen after the LOCK
> > > >      operation with respect to the other components of the system.
> > > > 
> > > >      Memory operations that occur before a LOCK operation may appear to happen
> > > >      after it completes.
> > > > 
> > > >      A LOCK operation should almost always be paired with an UNLOCK operation.
> > > > 
> > > > 
> > > >  (6) UNLOCK operations.
> > > > 
> > > >      This also acts as a one-way permeable barrier.  It guarantees that all
> > > >      memory operations before the UNLOCK operation will appear to happen before
> > > >      the UNLOCK operation with respect to the other components of the system.
> > > > 
> > > >      Memory operations that occur after an UNLOCK operation may appear to
> > > >      happen before it completes.
> > > > 
> > > >      LOCK and UNLOCK operations are guaranteed to appear with respect to each
> > > >      other strictly in the order specified.
> > > > 
> > > >      The use of LOCK and UNLOCK operations generally precludes the need for
> > > >      other sorts of memory barrier (but note the exceptions mentioned in the
> > > >      subsection "MMIO write barrier").
> > > > 
> > > > > Correct me if I am wrong, but I don't think it makes sense to insure
> > > > > memory barriers to keep accesses within the critical section and not
> > > > > outside, because such memory access could well be another spinlock.
> > > > 
> > > > Almost, but not quite.  ;-)
> > > > 
> > > > > Therefore, we could end up in a situation where we have two locks, A and
> > > > > B, taken in the following order in the source code :
> > > > > 
> > > > > LOCK A
> > > > > 
> > > > > UNLOCK A
> > > > > 
> > > > > LOCK B
> > > > > 
> > > > > UNLOCK B
> > > > > 
> > > > > Then, following your assumption, it would be possible for a CPU to do
> > > > > the memory accesses associated to lock A and B in a random order one vs
> > > > > the other. Given there would be no requirement to keep things out of
> > > > > those respective critical sections, LOCK A could be taken within LOCK B,
> > > > > and the opposite would also be valid.
> > > > > 
> > > > > Valid memory access orders :
> > > > > 
> > > > > 1)
> > > > > LOCK A
> > > > > LOCK B
> > > > > UNLOCK B
> > > > > UNLOCK A
> > > > > 
> > > > > 2)
> > > > > LOCK B
> > > > > LOCK A
> > > > > UNLOCK A
> > > > > UNLOCK B
> > > > 
> > > > #2 is wrong -- LOCK A is guaranteed to prohibit LOCK B from passing it,
> > > > as that would be equivalent to letting LOCK A's critical section leak out.
> > > > 
> > > > > The only constraint that ensures we won't end up in this situation is
> > > > > the fact that memory accesses done outside of the critical section stays
> > > > > outside of the critical section.
> > > > 
> > > > Let's take it one transformation at a time:
> > > > 
> > > > 1.	LOCK A; UNLOCK A; LOCK B; UNLOCK B
> > > > 
> > > > 2.	LOCK A; LOCK B; UNLOCK A; UNLOCK B
> > > > 
> > > > 	This one is OK, because both the LOCK B and the UNLOCK A
> > > > 	are permitted to allow more stuff to enter their respective
> > > > 	critical sections.
> > > > 
> > > > 3.	LOCK B; LOCK A; UNLOCK A; UNLOCK B
> > > > 
> > > > 	This is -not- legal!  LOCK A is forbidden to allow LOCK B
> > > > 	to escape its critical section.
> > > > 
> > > > Does this make sense?
> > > > 
> > > 
> > > Ah, yes. Thanks for the explanation.
> > > 
> > > Mathieu
> > > 
> > > > 							Thanx, Paul
> > > > 
> > > > > Mathieu
> > > > > 
> > > > > 
> > > > > 
> > > > > > 							Thanx, Paul
> > > > > > 
> > > > > > > Mathieu
> > > > > > > 
> > > > > > > > 							Thanx, Paul
> > > > > > > > 
> > > > > > > > _______________________________________________
> > > > > > > > ltt-dev mailing list
> > > > > > > > ltt-dev@lists.casi.polymtl.ca
> > > > > > > > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > > > > > > > 
> > > > > > > 
> > > > > > > -- 
> > > > > > > Mathieu Desnoyers
> > > > > > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > > > > > 
> > > > > 
> > > > > -- 
> > > > > Mathieu Desnoyers
> > > > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > > > 
> > > 
> > > -- 
> > > Mathieu Desnoyers
> > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
       [not found]                                         ` <20090212003549.GU6694@linux.vnet.ibm.com>
@ 2009-02-12  2:33                                           ` Paul E. McKenney
  2009-02-12  2:37                                             ` Paul E. McKenney
  2009-02-12  4:08                                             ` Mathieu Desnoyers
  0 siblings, 2 replies; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-12  2:33 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 10333 bytes --]

On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> 
> [ . . . ]
> 
> > > > Hrm, let me present it in a different, more straightfoward way :
> > > > 
> > > > In you Promela model (here : http://lkml.org/lkml/2009/2/10/419)
> > > > 
> > > > There is a memory barrier here in the updater :
> > > > 
> > > > 	do
> > > > 	:: 1 ->
> > > > 		if
> > > > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > > 			skip;
> > > > 		:: else -> break;
> > > > 		fi
> > > > 	od;
> > > > 	need_mb = 1;
> > > > 	do
> > > > 	:: need_mb == 1 -> skip;
> > > > 	:: need_mb == 0 -> break;
> > > > 	od;
> > > > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > > 
> > > I believe you were actually looking for a memory barrier here, not?
> > > I do not believe that your urcu.c has a memory barrier here, please
> > > see below.
> > > 
> > > > 	do
> > > > 	:: 1 ->
> > > > 		if
> > > > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > > 			skip;
> > > > 		:: else -> break;
> > > > 		fi;
> > > > 	od;
> > > > 
> > > > However, in your C code of nest_32.c, there is none. So it is at the
> > > > very least an inconsistency between your code and your model.
> > > 
> > > The urcu.c 3a9e6e9df706b8d39af94d2f027210e2e7d4106e lays out as follows:
> > > 
> > > synchronize_rcu()
> > > 
> > > 	switch_qparity()
> > > 
> > > 		force_mb_all_threads()
> > > 
> > > 		switch_next_urcu_qparity()  [Just does counter flip]
> > > 
> > 
> > Hrm... there would potentially be a missing mb() here.
> 
> K, I added it to the model.
> 
> > > 		wait_for_quiescent_state()
> > > 
> > > 			Wait for all threads
> > > 
> > > 			force_mb_all_threads()
> > > 				My model does not represent this
> > > 				memory barrier, because it seemed to
> > > 				me that it was redundant with the
> > > 				following one.
> > > 
> > 
> > Yes, this one is redundant.
> 
> I left it in for now...
> 
> > > 				I added it, no effect.
> > > 
> > > 	switch_qparity()
> > > 
> > > 		force_mb_all_threads()
> > > 
> > > 		switch_next_urcu_qparity()  [Just does counter flip]
> > > 
> > 
> > Same as above, potentially missing mb().
> 
> I added it to the model.
> 
> > > 		wait_for_quiescent_state()
> > > 
> > > 			Wait for all threads
> > > 
> > > 			force_mb_all_threads()
> > > 
> > > The rcu_nest32.c 6da793208a8f60ea41df60164ded85b4c5c5307d lays out as
> > > follows:
> > > 
> > > synchronize_rcu()
> > > 
> > > 	flip_counter_and_wait()
> > > 
> > > 		flips counter
> > > 
> > > 		smp_mb();
> > > 
> > > 		Wait for threads
> > > 
> > 
> > this is the point where I wonder if we should add a mb() to your code.
> 
> Might well be, though I would argue for the very end, where I left out
> the smp_mb().  I clearly need to make another Promela model for this
> code, but we should probably focus on yours first, given that I don't
> have any use cases for mine.
> 
> > > 	flip_counter_and_wait()
> > > 
> > > 		flips counter
> > > 
> > > 		smp_mb();
> > > 
> > > 		Wait for threads
> 
> And I really do have an unlock followed by an smp_mb() at this point.
> 
> > > So, if I am reading the code correctly, I have memory barriers
> > > everywhere you don't and vice versa.  ;-)
> > > 
> > 
> > Exactly. You have mb() between 
> > flips counter and (next) Wait for threads
> > 
> > I have mb() between
> > (previous) Wait for threads and flips counter
> > 
> > Both might be required. Or none. :)
> 
> Well, adding in the two to yours still gets Promela failures, please
> see attached.  Nothing quite like a multi-thousand step failure case,
> I have to admit!  ;-)
> 
> > > The reason that I believe that I do not need a memory barrier between
> > > the wait-for-threads and the subsequent flip is that the threads we
> > > are waiting for have to have already committed to the earlier value of
> > > the counter, and so changing the counter out of order has no effect.
> > > 
> > > Does this make sense, or am I confused?
> > 
> > So if we remove the mb() as in your code, between the flips counter and
> > (next) Wait for thread, we are doing these operations in random order at
> > the write site:
> 
> I don't believe that I get to remove and mb()s from my code...
> 
> > Sequence 1 - what we expect
> > A.1 - flip counter
> > A.2 - read counter
> > B   - read other threads urcu_active_readers
> > 
> > So what happens if the CPU decides to reorder the unrelated
> > operations? We get :
> > 
> > Sequence 2
> > B   - read other threads urcu_active_readers
> > A.1 - flip counter
> > A.2 - read counter
> > 
> > Sequence 3
> > A.1 - flip counter
> > A.2 - read counter
> > B   - read other threads urcu_active_readers
> > 
> > Sequence 4
> > A.1 - flip counter
> > B   - read other threads urcu_active_readers
> > A.2 - read counter
> > 
> > 
> > Sequence 1, 3 and 4 are OK because the counter flip happens before we
> > read other thread's urcu_active_readers counts.
> > 
> > However, we have to consider Sequence 2 carefully, because we will read
> > other threads uru_active_readers count before those readers see that we
> > flipped the counter.
> > 
> > The reader side does either :
> > 
> > seq. 1
> > R.1 - read urcu_active_readers
> > S.2 - read counter
> > RS.2- write urcu_active_readers, depends on read counter and read
> >       urcu_active_readers
> > 
> > (with R.1 and S.2 in random order)
> > 
> > or
> > 
> > seq. 2
> > R.1 - read urcu_active_readers
> > R.2 - write urcu_active_readers, depends on read urcu_active_readers
> > 
> > 
> > So we could have the following reader+writer sequence :
> > 
> > Interleaved writer Sequence 2 and reader seq. 1.
> > 
> > Reader:
> > R.1 - read urcu_active_readers
> > S.2 - read counter
> > Writer:
> > B   - read other threads urcu_active_readers (there are none)
> > A.1 - flip counter
> > A.2 - read counter
> > Reader:
> > RS.2- write urcu_active_readers, depends on read counter and read
> >       urcu_active_readers
> > 
> > Here, the reader would have updated its counter as belonging to the old
> > q.s. period, but the writer will later wait for the new period. But
> > given the writer will eventually do a second flip+wait, the reader in
> > the other q.s. window will be caught by the second flip.
> > 
> > Therefore, we could be tempted to think that those mb() could be
> > unnecessary, which would lead to a scheme where urcu_active_readers and
> > urcu_gp_ctr are done in a completely random order one vs the other.
> > Let's see what it gives :
> > 
> > synchronize_rcu()
> > 
> >   force_mb_all_threads()  /*
> >                            * Orders pointer publication and 
> >                            * (urcu_active_readers/urcu_gp_ctr accesses)
> >                            */
> >   switch_qparity()
> > 
> >     switch_next_urcu_qparity()  [just does counter flip 0->1]
> > 
> >     wait_for_quiescent_state()
> > 
> >       wait for all threads in parity 0
> > 
> >   switch_qparity()
> > 
> >     switch_next_urcu_qparity()  [Just does counter flip 1->0]
> > 
> >     wait_for_quiescent_state()
> > 
> >       Wait for all threads in parity 1
> > 
> >   force_mb_all_threads()  /*
> >                            * Orders
> >                            * (urcu_active_readers/urcu_gp_ctr accesses)
> >                            * and old data removal.
> >                            */
> > 
> > 
> > 
> > *but* ! There is a reason why we don't want to do this. If
> > 
> >     switch_next_urcu_qparity()  [Just does counter flip 1->0]
> > 
> > happens before the end of the previous
> > 
> >       Wait for all threads in parity 0
> > 
> > We enter in a situation where all newly coming readers will see the
> > parity bit as 0, although we are still waiting for that parity to end.
> > We end up in a state when the writer can be blocked forever (no possible
> > progress) if there are steadily readers subscribed for the data.
> > 
> > Basically, to put it differently, we could simply remove the bit
> > flipping from the writer and wait for *all* readers to exit their
> > critical section (even the ones simply interested in the new pointer).
> > But this shares the same problem the version above has, which is that we
> > end up in a situation where the writer won't progress if there are
> > always readers in a critical section.
> > 
> > The same applies to 
> > 
> >     switch_next_urcu_qparity()  [Just does counter flip 0->1]
> > 
> >       wait for all threads in parity 0
> > 
> > If we don't put a mb() between those two (as I mistakenly did), we can
> > end up waiting for readers in parity 0 while the parity bit wasn't
> > flipped yet. oops. Same potential no-progress situation.
> > 
> > The ordering of memory reads in the reader for
> > urcu_active_readers/urcu_gp_ctr accesses does not seem to matter because
> > the data contains information about which q.s. period parity it is in.
> > In whichever order those variables are read seems to all work fine.
> > 
> > In the end, it's to insure that the writer will always progress that we
> > have to enforce smp_mb() between *all* switch_next_urcu_qparity and wait
> > for threads. Mine and yours.
> > 
> > Or maybe there is a detail I haven't correctly understood that insures
> > this already without the mb() in your code ?
> > 
> > > (BTW, I do not trust my model yet, as it currently cannot detect the
> > > failure case I pointed out earlier.  :-/  Here and I thought that the
> > > point of such models was to detect additional failure cases!!!)
> > > 
> > 
> > Yes, I'll have to dig deeper into it.
> 
> Well, as I said, I attached the current model and the error trail.

And I had bugs in my model that allowed the rcu_read_lock() model
to nest indefinitely, which overflowed into the top bit, messing
things up.  :-/

Attached is a fixed model.  This model validates correctly (woo-hoo!).
Even better, gives the expected error if you comment out line 180 and
uncomment line 213, this latter corresponding to the error case I called
out a few days ago.

I will play with removing models of mb...

							Thanx, Paul

[-- Attachment #2: urcu.spin --]
[-- Type: text/plain, Size: 6864 bytes --]

/*
 * urcu.spin: Promela code to validate urcu.  See commit number
 *	3a9e6e9df706b8d39af94d2f027210e2e7d4106e of Mathieu Desnoyer's
 *      git archive at git://lttng.org/userspace-rcu.git
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 *
 * Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
 */

/* Promela validation variables. */

bit removed = 0;  /* Has RCU removal happened, e.g., list_del_rcu()? */
bit free = 0;     /* Has RCU reclamation happened, e.g., kfree()? */
bit need_mb = 0;  /* =1 says need reader mb, =0 for reader response. */
byte reader_progress[4];
		  /* Count of read-side statement executions. */

/* urcu definitions and variables, taken straight from the algorithm. */

#define RCU_GP_CTR_BIT (1 << 7)
#define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BIT - 1)

byte urcu_gp_ctr = 1;
byte urcu_active_readers = 0;

/* Model the RCU read-side critical section. */

proctype urcu_reader()
{
	bit done = 0;
	bit mbok;
	byte tmp;
	byte tmp_removed;
	byte tmp_free;

	/* Absorb any early requests for memory barriers. */
	do
	:: need_mb == 1 ->
		need_mb = 0;
	:: 1 -> skip;
	:: 1 -> break;
	od;

	/*
	 * Each pass through this loop executes one read-side statement
	 * from the following code fragment:
	 *
	 *	rcu_read_lock(); [0a]
	 *	rcu_read_lock(); [0b]
	 *	p = rcu_dereference(global_p); [1]
	 *	x = p->data; [2]
	 *	rcu_read_unlock(); [3b]
	 *	rcu_read_unlock(); [3a]
	 *
	 * Because we are modeling a weak-memory machine, these statements
	 * can be seen in any order, the only restriction being that
	 * rcu_read_unlock() cannot precede the corresponding rcu_read_lock().
	 * The placement of the inner rcu_read_lock() and rcu_read_unlock()
	 * is non-deterministic, the above is but one possible placement.
	 * Intestingly enough, this model validates all possible placements
	 * of the inner rcu_read_lock() and rcu_read_unlock() statements,
	 * with the only constraint being that the rcu_read_lock() must
	 * precede the rcu_read_unlock().
	 *
	 * We also respond to memory-barrier requests, but only if our
	 * execution happens to be ordered.  If the current state is
	 * misordered, we ignore memory-barrier requests.
	 */
	do
	:: 1 ->
		if
		:: reader_progress[0] < 2 -> /* [0a and 0b] */
			tmp = urcu_active_readers;
			if
			:: (tmp & RCU_GP_CTR_NEST_MASK) == 0 ->
				tmp = urcu_gp_ctr;
				do
				:: (reader_progress[1] +
				    reader_progress[2] +
				    reader_progress[3] == 0) && need_mb == 1 ->
					need_mb = 0;
				:: 1 -> skip;
				:: 1 -> break;
				od;
				urcu_active_readers = tmp;
			 :: else ->
				urcu_active_readers = tmp + 1;
			fi;
			reader_progress[0] = reader_progress[0] + 1;
		:: reader_progress[1] == 0 -> /* [1] */
			tmp_removed = removed;
			reader_progress[1] = 1;
		:: reader_progress[2] == 0 -> /* [2] */
			tmp_free = free;
			reader_progress[2] = 1;
		:: ((reader_progress[0] > reader_progress[3]) &&
		    (reader_progress[3] < 2)) -> /* [3a and 3b] */
			tmp = urcu_active_readers - 1;
			urcu_active_readers = tmp;
			reader_progress[3] = reader_progress[3] + 1;
		:: else -> break;
		fi;

		/* Process memory-barrier requests, if it is safe to do so. */
		atomic {
			mbok = 0;
			tmp = 0;
			do
			:: tmp < 4 && reader_progress[tmp] == 0 ->
				tmp = tmp + 1;
				break;
			:: tmp < 4 && reader_progress[tmp] != 0 ->
				tmp = tmp + 1;
			:: tmp >= 4 ->
				done = 1;
				break;
			od;
			do
			:: tmp < 4 && reader_progress[tmp] == 0 ->
				tmp = tmp + 1;
			:: tmp < 4 && reader_progress[tmp] != 0 ->
				break;
			:: tmp >= 4 ->
				mbok = 1;
				break;
			od

		}

		if
		:: mbok == 1 ->
			/* We get here if mb processing is safe. */
			do
			:: need_mb == 1 ->
				need_mb = 0;
			:: 1 -> skip;
			:: 1 -> break;
			od;
		:: else -> skip;
		fi;

		/*
		 * Check to see if we have modeled the entire RCU read-side
		 * critical section, and leave if so.
		 */
		if
		:: done == 1 -> break;
		:: else -> skip;
		fi
	od;
	assert((tmp_free == 0) || (tmp_removed == 1));

	/* Process any late-arriving memory-barrier requests. */
	do
	:: need_mb == 1 ->
		need_mb = 0;
	:: 1 -> skip;
	:: 1 -> break;
	od;
}

/* Model the RCU update process. */

proctype urcu_updater()
{
	/* Removal statement, e.g., list_del_rcu(). */
	removed = 1;

	/* synchronize_rcu(), first counter flip. */
	need_mb = 1;
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;
	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
	need_mb = 1;
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;
	do
	:: 1 ->
		printf("urcu_gp_ctr=%x urcu_active_readers=%x\n", urcu_gp_ctr, urcu_active_readers);
		printf("urcu_gp_ctr&0x7f=%x urcu_active_readers&0x7f=%x\n", urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK, urcu_active_readers & ~RCU_GP_CTR_NEST_MASK);
		if
		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
			skip;
		:: else -> break;
		fi
	od;
	need_mb = 1;
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;

	/* Erroneous removal statement, e.g., list_del_rcu(). */
	/* removed = 1; */

	/* synchronize_rcu(), second counter flip. */
	need_mb = 1;
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;
	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
	need_mb = 1;
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;
	do
	:: 1 ->
		printf("urcu_gp_ctr=%x urcu_active_readers=%x\n", urcu_gp_ctr, urcu_active_readers);
		printf("urcu_gp_ctr&0x7f=%x urcu_active_readers&0x7f=%x\n", urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK, urcu_active_readers & ~RCU_GP_CTR_NEST_MASK);
		if
		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
			skip;
		:: else -> break;
		fi;
	od;
	need_mb = 1;
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;

	/* free-up step, e.g., kfree(). */
	free = 1;
}

/*
 * Initialize the array, spawn a reader and an updater.  Because readers
 * are independent of each other, only one reader is needed.
 */

init {
	atomic {
		reader_progress[0] = 0;
		reader_progress[1] = 0;
		reader_progress[2] = 0;
		reader_progress[3] = 0;
		run urcu_reader();
		run urcu_updater();
	}
}

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12  2:33                                           ` Paul E. McKenney
@ 2009-02-12  2:37                                             ` Paul E. McKenney
  2009-02-12  4:10                                               ` Mathieu Desnoyers
  2009-02-12  4:08                                             ` Mathieu Desnoyers
  1 sibling, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-12  2:37 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1277 bytes --]

On Wed, Feb 11, 2009 at 06:33:08PM -0800, Paul E. McKenney wrote:
> On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> > On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:

[ . . . ]

> > > > (BTW, I do not trust my model yet, as it currently cannot detect the
> > > > failure case I pointed out earlier.  :-/  Here and I thought that the
> > > > point of such models was to detect additional failure cases!!!)
> > > > 
> > > 
> > > Yes, I'll have to dig deeper into it.
> > 
> > Well, as I said, I attached the current model and the error trail.
> 
> And I had bugs in my model that allowed the rcu_read_lock() model
> to nest indefinitely, which overflowed into the top bit, messing
> things up.  :-/
> 
> Attached is a fixed model.  This model validates correctly (woo-hoo!).
> Even better, gives the expected error if you comment out line 180 and
> uncomment line 213, this latter corresponding to the error case I called
> out a few days ago.
> 
> I will play with removing models of mb...

And commenting out the models of mb between the counter flips and the
test for readers still passes validation, as expected, and as shown in
the attached Promela code.

						Thanx, Paul

[-- Attachment #2: urcu.spin --]
[-- Type: text/plain, Size: 6412 bytes --]

/*
 * urcu.spin: Promela code to validate urcu.  See commit number
 *	3a9e6e9df706b8d39af94d2f027210e2e7d4106e of Mathieu Desnoyer's
 *      git archive at git://lttng.org/userspace-rcu.git
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 *
 * Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
 */

/* Promela validation variables. */

bit removed = 0;  /* Has RCU removal happened, e.g., list_del_rcu()? */
bit free = 0;     /* Has RCU reclamation happened, e.g., kfree()? */
bit need_mb = 0;  /* =1 says need reader mb, =0 for reader response. */
byte reader_progress[4];
		  /* Count of read-side statement executions. */

/* urcu definitions and variables, taken straight from the algorithm. */

#define RCU_GP_CTR_BIT (1 << 7)
#define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BIT - 1)

byte urcu_gp_ctr = 1;
byte urcu_active_readers = 0;

/* Model the RCU read-side critical section. */

proctype urcu_reader()
{
	bit done = 0;
	bit mbok;
	byte tmp;
	byte tmp_removed;
	byte tmp_free;

	/* Absorb any early requests for memory barriers. */
	do
	:: need_mb == 1 ->
		need_mb = 0;
	:: 1 -> skip;
	:: 1 -> break;
	od;

	/*
	 * Each pass through this loop executes one read-side statement
	 * from the following code fragment:
	 *
	 *	rcu_read_lock(); [0a]
	 *	rcu_read_lock(); [0b]
	 *	p = rcu_dereference(global_p); [1]
	 *	x = p->data; [2]
	 *	rcu_read_unlock(); [3b]
	 *	rcu_read_unlock(); [3a]
	 *
	 * Because we are modeling a weak-memory machine, these statements
	 * can be seen in any order, the only restriction being that
	 * rcu_read_unlock() cannot precede the corresponding rcu_read_lock().
	 * The placement of the inner rcu_read_lock() and rcu_read_unlock()
	 * is non-deterministic, the above is but one possible placement.
	 * Intestingly enough, this model validates all possible placements
	 * of the inner rcu_read_lock() and rcu_read_unlock() statements,
	 * with the only constraint being that the rcu_read_lock() must
	 * precede the rcu_read_unlock().
	 *
	 * We also respond to memory-barrier requests, but only if our
	 * execution happens to be ordered.  If the current state is
	 * misordered, we ignore memory-barrier requests.
	 */
	do
	:: 1 ->
		if
		:: reader_progress[0] < 2 -> /* [0a and 0b] */
			tmp = urcu_active_readers;
			if
			:: (tmp & RCU_GP_CTR_NEST_MASK) == 0 ->
				tmp = urcu_gp_ctr;
				do
				:: (reader_progress[1] +
				    reader_progress[2] +
				    reader_progress[3] == 0) && need_mb == 1 ->
					need_mb = 0;
				:: 1 -> skip;
				:: 1 -> break;
				od;
				urcu_active_readers = tmp;
			 :: else ->
				urcu_active_readers = tmp + 1;
			fi;
			reader_progress[0] = reader_progress[0] + 1;
		:: reader_progress[1] == 0 -> /* [1] */
			tmp_removed = removed;
			reader_progress[1] = 1;
		:: reader_progress[2] == 0 -> /* [2] */
			tmp_free = free;
			reader_progress[2] = 1;
		:: ((reader_progress[0] > reader_progress[3]) &&
		    (reader_progress[3] < 2)) -> /* [3a and 3b] */
			tmp = urcu_active_readers - 1;
			urcu_active_readers = tmp;
			reader_progress[3] = reader_progress[3] + 1;
		:: else -> break;
		fi;

		/* Process memory-barrier requests, if it is safe to do so. */
		atomic {
			mbok = 0;
			tmp = 0;
			do
			:: tmp < 4 && reader_progress[tmp] == 0 ->
				tmp = tmp + 1;
				break;
			:: tmp < 4 && reader_progress[tmp] != 0 ->
				tmp = tmp + 1;
			:: tmp >= 4 ->
				done = 1;
				break;
			od;
			do
			:: tmp < 4 && reader_progress[tmp] == 0 ->
				tmp = tmp + 1;
			:: tmp < 4 && reader_progress[tmp] != 0 ->
				break;
			:: tmp >= 4 ->
				mbok = 1;
				break;
			od

		}

		if
		:: mbok == 1 ->
			/* We get here if mb processing is safe. */
			do
			:: need_mb == 1 ->
				need_mb = 0;
			:: 1 -> skip;
			:: 1 -> break;
			od;
		:: else -> skip;
		fi;

		/*
		 * Check to see if we have modeled the entire RCU read-side
		 * critical section, and leave if so.
		 */
		if
		:: done == 1 -> break;
		:: else -> skip;
		fi
	od;
	assert((tmp_free == 0) || (tmp_removed == 1));

	/* Process any late-arriving memory-barrier requests. */
	do
	:: need_mb == 1 ->
		need_mb = 0;
	:: 1 -> skip;
	:: 1 -> break;
	od;
}

/* Model the RCU update process. */

proctype urcu_updater()
{
	/* Removal statement, e.g., list_del_rcu(). */
	removed = 1;

	/* synchronize_rcu(), first counter flip. */
	need_mb = 1;
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;
	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
	/* need_mb = 1;
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od; */
	do
	:: 1 ->
		if
		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
			skip;
		:: else -> break;
		fi
	od;
	need_mb = 1;
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;

	/* Erroneous removal statement, e.g., list_del_rcu(). */
	/* removed = 1; */

	/* synchronize_rcu(), second counter flip. */
	need_mb = 1;
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;
	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
	/* need_mb = 1;
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od; */
	do
	:: 1 ->
		if
		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
			skip;
		:: else -> break;
		fi;
	od;
	need_mb = 1;
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;

	/* free-up step, e.g., kfree(). */
	free = 1;
}

/*
 * Initialize the array, spawn a reader and an updater.  Because readers
 * are independent of each other, only one reader is needed.
 */

init {
	atomic {
		reader_progress[0] = 0;
		reader_progress[1] = 0;
		reader_progress[2] = 0;
		reader_progress[3] = 0;
		run urcu_reader();
		run urcu_updater();
	}
}

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12  2:33                                           ` Paul E. McKenney
  2009-02-12  2:37                                             ` Paul E. McKenney
@ 2009-02-12  4:08                                             ` Mathieu Desnoyers
  2009-02-12  5:01                                               ` Paul E. McKenney
  1 sibling, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-12  4:08 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> > On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > 
> > [ . . . ]
> > 
> > > > > Hrm, let me present it in a different, more straightfoward way :
> > > > > 
> > > > > In you Promela model (here : http://lkml.org/lkml/2009/2/10/419)
> > > > > 
> > > > > There is a memory barrier here in the updater :
> > > > > 
> > > > > 	do
> > > > > 	:: 1 ->
> > > > > 		if
> > > > > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > > > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > > > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > > > 			skip;
> > > > > 		:: else -> break;
> > > > > 		fi
> > > > > 	od;
> > > > > 	need_mb = 1;
> > > > > 	do
> > > > > 	:: need_mb == 1 -> skip;
> > > > > 	:: need_mb == 0 -> break;
> > > > > 	od;
> > > > > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > > > 
> > > > I believe you were actually looking for a memory barrier here, not?
> > > > I do not believe that your urcu.c has a memory barrier here, please
> > > > see below.
> > > > 
> > > > > 	do
> > > > > 	:: 1 ->
> > > > > 		if
> > > > > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > > > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > > > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > > > 			skip;
> > > > > 		:: else -> break;
> > > > > 		fi;
> > > > > 	od;
> > > > > 
> > > > > However, in your C code of nest_32.c, there is none. So it is at the
> > > > > very least an inconsistency between your code and your model.
> > > > 
> > > > The urcu.c 3a9e6e9df706b8d39af94d2f027210e2e7d4106e lays out as follows:
> > > > 
> > > > synchronize_rcu()
> > > > 
> > > > 	switch_qparity()
> > > > 
> > > > 		force_mb_all_threads()
> > > > 
> > > > 		switch_next_urcu_qparity()  [Just does counter flip]
> > > > 
> > > 
> > > Hrm... there would potentially be a missing mb() here.
> > 
> > K, I added it to the model.
> > 
> > > > 		wait_for_quiescent_state()
> > > > 
> > > > 			Wait for all threads
> > > > 
> > > > 			force_mb_all_threads()
> > > > 				My model does not represent this
> > > > 				memory barrier, because it seemed to
> > > > 				me that it was redundant with the
> > > > 				following one.
> > > > 
> > > 
> > > Yes, this one is redundant.
> > 
> > I left it in for now...
> > 
> > > > 				I added it, no effect.
> > > > 
> > > > 	switch_qparity()
> > > > 
> > > > 		force_mb_all_threads()
> > > > 
> > > > 		switch_next_urcu_qparity()  [Just does counter flip]
> > > > 
> > > 
> > > Same as above, potentially missing mb().
> > 
> > I added it to the model.
> > 
> > > > 		wait_for_quiescent_state()
> > > > 
> > > > 			Wait for all threads
> > > > 
> > > > 			force_mb_all_threads()
> > > > 
> > > > The rcu_nest32.c 6da793208a8f60ea41df60164ded85b4c5c5307d lays out as
> > > > follows:
> > > > 
> > > > synchronize_rcu()
> > > > 
> > > > 	flip_counter_and_wait()
> > > > 
> > > > 		flips counter
> > > > 
> > > > 		smp_mb();
> > > > 
> > > > 		Wait for threads
> > > > 
> > > 
> > > this is the point where I wonder if we should add a mb() to your code.
> > 
> > Might well be, though I would argue for the very end, where I left out
> > the smp_mb().  I clearly need to make another Promela model for this
> > code, but we should probably focus on yours first, given that I don't
> > have any use cases for mine.
> > 
> > > > 	flip_counter_and_wait()
> > > > 
> > > > 		flips counter
> > > > 
> > > > 		smp_mb();
> > > > 
> > > > 		Wait for threads
> > 
> > And I really do have an unlock followed by an smp_mb() at this point.
> > 
> > > > So, if I am reading the code correctly, I have memory barriers
> > > > everywhere you don't and vice versa.  ;-)
> > > > 
> > > 
> > > Exactly. You have mb() between 
> > > flips counter and (next) Wait for threads
> > > 
> > > I have mb() between
> > > (previous) Wait for threads and flips counter
> > > 
> > > Both might be required. Or none. :)
> > 
> > Well, adding in the two to yours still gets Promela failures, please
> > see attached.  Nothing quite like a multi-thousand step failure case,
> > I have to admit!  ;-)
> > 
> > > > The reason that I believe that I do not need a memory barrier between
> > > > the wait-for-threads and the subsequent flip is that the threads we
> > > > are waiting for have to have already committed to the earlier value of
> > > > the counter, and so changing the counter out of order has no effect.
> > > > 
> > > > Does this make sense, or am I confused?
> > > 
> > > So if we remove the mb() as in your code, between the flips counter and
> > > (next) Wait for thread, we are doing these operations in random order at
> > > the write site:
> > 
> > I don't believe that I get to remove and mb()s from my code...
> > 
> > > Sequence 1 - what we expect
> > > A.1 - flip counter
> > > A.2 - read counter
> > > B   - read other threads urcu_active_readers
> > > 
> > > So what happens if the CPU decides to reorder the unrelated
> > > operations? We get :
> > > 
> > > Sequence 2
> > > B   - read other threads urcu_active_readers
> > > A.1 - flip counter
> > > A.2 - read counter
> > > 
> > > Sequence 3
> > > A.1 - flip counter
> > > A.2 - read counter
> > > B   - read other threads urcu_active_readers
> > > 
> > > Sequence 4
> > > A.1 - flip counter
> > > B   - read other threads urcu_active_readers
> > > A.2 - read counter
> > > 
> > > 
> > > Sequence 1, 3 and 4 are OK because the counter flip happens before we
> > > read other thread's urcu_active_readers counts.
> > > 
> > > However, we have to consider Sequence 2 carefully, because we will read
> > > other threads uru_active_readers count before those readers see that we
> > > flipped the counter.
> > > 
> > > The reader side does either :
> > > 
> > > seq. 1
> > > R.1 - read urcu_active_readers
> > > S.2 - read counter
> > > RS.2- write urcu_active_readers, depends on read counter and read
> > >       urcu_active_readers
> > > 
> > > (with R.1 and S.2 in random order)
> > > 
> > > or
> > > 
> > > seq. 2
> > > R.1 - read urcu_active_readers
> > > R.2 - write urcu_active_readers, depends on read urcu_active_readers
> > > 
> > > 
> > > So we could have the following reader+writer sequence :
> > > 
> > > Interleaved writer Sequence 2 and reader seq. 1.
> > > 
> > > Reader:
> > > R.1 - read urcu_active_readers
> > > S.2 - read counter
> > > Writer:
> > > B   - read other threads urcu_active_readers (there are none)
> > > A.1 - flip counter
> > > A.2 - read counter
> > > Reader:
> > > RS.2- write urcu_active_readers, depends on read counter and read
> > >       urcu_active_readers
> > > 
> > > Here, the reader would have updated its counter as belonging to the old
> > > q.s. period, but the writer will later wait for the new period. But
> > > given the writer will eventually do a second flip+wait, the reader in
> > > the other q.s. window will be caught by the second flip.
> > > 
> > > Therefore, we could be tempted to think that those mb() could be
> > > unnecessary, which would lead to a scheme where urcu_active_readers and
> > > urcu_gp_ctr are done in a completely random order one vs the other.
> > > Let's see what it gives :
> > > 
> > > synchronize_rcu()
> > > 
> > >   force_mb_all_threads()  /*
> > >                            * Orders pointer publication and 
> > >                            * (urcu_active_readers/urcu_gp_ctr accesses)
> > >                            */
> > >   switch_qparity()
> > > 
> > >     switch_next_urcu_qparity()  [just does counter flip 0->1]
> > > 
> > >     wait_for_quiescent_state()
> > > 
> > >       wait for all threads in parity 0
> > > 
> > >   switch_qparity()
> > > 
> > >     switch_next_urcu_qparity()  [Just does counter flip 1->0]
> > > 
> > >     wait_for_quiescent_state()
> > > 
> > >       Wait for all threads in parity 1
> > > 
> > >   force_mb_all_threads()  /*
> > >                            * Orders
> > >                            * (urcu_active_readers/urcu_gp_ctr accesses)
> > >                            * and old data removal.
> > >                            */
> > > 
> > > 
> > > 
> > > *but* ! There is a reason why we don't want to do this. If
> > > 
> > >     switch_next_urcu_qparity()  [Just does counter flip 1->0]
> > > 
> > > happens before the end of the previous
> > > 
> > >       Wait for all threads in parity 0
> > > 
> > > We enter in a situation where all newly coming readers will see the
> > > parity bit as 0, although we are still waiting for that parity to end.
> > > We end up in a state when the writer can be blocked forever (no possible
> > > progress) if there are steadily readers subscribed for the data.
> > > 
> > > Basically, to put it differently, we could simply remove the bit
> > > flipping from the writer and wait for *all* readers to exit their
> > > critical section (even the ones simply interested in the new pointer).
> > > But this shares the same problem the version above has, which is that we
> > > end up in a situation where the writer won't progress if there are
> > > always readers in a critical section.
> > > 
> > > The same applies to 
> > > 
> > >     switch_next_urcu_qparity()  [Just does counter flip 0->1]
> > > 
> > >       wait for all threads in parity 0
> > > 
> > > If we don't put a mb() between those two (as I mistakenly did), we can
> > > end up waiting for readers in parity 0 while the parity bit wasn't
> > > flipped yet. oops. Same potential no-progress situation.
> > > 
> > > The ordering of memory reads in the reader for
> > > urcu_active_readers/urcu_gp_ctr accesses does not seem to matter because
> > > the data contains information about which q.s. period parity it is in.
> > > In whichever order those variables are read seems to all work fine.
> > > 
> > > In the end, it's to insure that the writer will always progress that we
> > > have to enforce smp_mb() between *all* switch_next_urcu_qparity and wait
> > > for threads. Mine and yours.
> > > 
> > > Or maybe there is a detail I haven't correctly understood that insures
> > > this already without the mb() in your code ?
> > > 
> > > > (BTW, I do not trust my model yet, as it currently cannot detect the
> > > > failure case I pointed out earlier.  :-/  Here and I thought that the
> > > > point of such models was to detect additional failure cases!!!)
> > > > 
> > > 
> > > Yes, I'll have to dig deeper into it.
> > 
> > Well, as I said, I attached the current model and the error trail.
> 
> And I had bugs in my model that allowed the rcu_read_lock() model
> to nest indefinitely, which overflowed into the top bit, messing
> things up.  :-/
> 
> Attached is a fixed model.  This model validates correctly (woo-hoo!).
> Even better, gives the expected error if you comment out line 180 and
> uncomment line 213, this latter corresponding to the error case I called
> out a few days ago.
> 

Great ! :) I added this version to the git repository, hopefully it's ok
with you ?

> I will play with removing models of mb...
> 

OK, I see you already did..

Mathieu

> 							Thanx, Paul

Content-Description: urcu.spin
> /*
>  * urcu.spin: Promela code to validate urcu.  See commit number
>  *	3a9e6e9df706b8d39af94d2f027210e2e7d4106e of Mathieu Desnoyer's
>  *      git archive at git://lttng.org/userspace-rcu.git
>  *
>  * This program is free software; you can redistribute it and/or modify
>  * it under the terms of the GNU General Public License as published by
>  * the Free Software Foundation; either version 2 of the License, or
>  * (at your option) any later version.
>  *
>  * This program is distributed in the hope that it will be useful,
>  * but WITHOUT ANY WARRANTY; without even the implied warranty of
>  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>  * GNU General Public License for more details.
>  *
>  * You should have received a copy of the GNU General Public License
>  * along with this program; if not, write to the Free Software
>  * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
>  *
>  * Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
>  */
> 
> /* Promela validation variables. */
> 
> bit removed = 0;  /* Has RCU removal happened, e.g., list_del_rcu()? */
> bit free = 0;     /* Has RCU reclamation happened, e.g., kfree()? */
> bit need_mb = 0;  /* =1 says need reader mb, =0 for reader response. */
> byte reader_progress[4];
> 		  /* Count of read-side statement executions. */
> 
> /* urcu definitions and variables, taken straight from the algorithm. */
> 
> #define RCU_GP_CTR_BIT (1 << 7)
> #define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BIT - 1)
> 
> byte urcu_gp_ctr = 1;
> byte urcu_active_readers = 0;
> 
> /* Model the RCU read-side critical section. */
> 
> proctype urcu_reader()
> {
> 	bit done = 0;
> 	bit mbok;
> 	byte tmp;
> 	byte tmp_removed;
> 	byte tmp_free;
> 
> 	/* Absorb any early requests for memory barriers. */
> 	do
> 	:: need_mb == 1 ->
> 		need_mb = 0;
> 	:: 1 -> skip;
> 	:: 1 -> break;
> 	od;
> 
> 	/*
> 	 * Each pass through this loop executes one read-side statement
> 	 * from the following code fragment:
> 	 *
> 	 *	rcu_read_lock(); [0a]
> 	 *	rcu_read_lock(); [0b]
> 	 *	p = rcu_dereference(global_p); [1]
> 	 *	x = p->data; [2]
> 	 *	rcu_read_unlock(); [3b]
> 	 *	rcu_read_unlock(); [3a]
> 	 *
> 	 * Because we are modeling a weak-memory machine, these statements
> 	 * can be seen in any order, the only restriction being that
> 	 * rcu_read_unlock() cannot precede the corresponding rcu_read_lock().
> 	 * The placement of the inner rcu_read_lock() and rcu_read_unlock()
> 	 * is non-deterministic, the above is but one possible placement.
> 	 * Intestingly enough, this model validates all possible placements
> 	 * of the inner rcu_read_lock() and rcu_read_unlock() statements,
> 	 * with the only constraint being that the rcu_read_lock() must
> 	 * precede the rcu_read_unlock().
> 	 *
> 	 * We also respond to memory-barrier requests, but only if our
> 	 * execution happens to be ordered.  If the current state is
> 	 * misordered, we ignore memory-barrier requests.
> 	 */
> 	do
> 	:: 1 ->
> 		if
> 		:: reader_progress[0] < 2 -> /* [0a and 0b] */
> 			tmp = urcu_active_readers;
> 			if
> 			:: (tmp & RCU_GP_CTR_NEST_MASK) == 0 ->
> 				tmp = urcu_gp_ctr;
> 				do
> 				:: (reader_progress[1] +
> 				    reader_progress[2] +
> 				    reader_progress[3] == 0) && need_mb == 1 ->
> 					need_mb = 0;
> 				:: 1 -> skip;
> 				:: 1 -> break;
> 				od;
> 				urcu_active_readers = tmp;
> 			 :: else ->
> 				urcu_active_readers = tmp + 1;
> 			fi;
> 			reader_progress[0] = reader_progress[0] + 1;
> 		:: reader_progress[1] == 0 -> /* [1] */
> 			tmp_removed = removed;
> 			reader_progress[1] = 1;
> 		:: reader_progress[2] == 0 -> /* [2] */
> 			tmp_free = free;
> 			reader_progress[2] = 1;
> 		:: ((reader_progress[0] > reader_progress[3]) &&
> 		    (reader_progress[3] < 2)) -> /* [3a and 3b] */
> 			tmp = urcu_active_readers - 1;
> 			urcu_active_readers = tmp;
> 			reader_progress[3] = reader_progress[3] + 1;
> 		:: else -> break;
> 		fi;
> 
> 		/* Process memory-barrier requests, if it is safe to do so. */
> 		atomic {
> 			mbok = 0;
> 			tmp = 0;
> 			do
> 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> 				tmp = tmp + 1;
> 				break;
> 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> 				tmp = tmp + 1;
> 			:: tmp >= 4 ->
> 				done = 1;
> 				break;
> 			od;
> 			do
> 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> 				tmp = tmp + 1;
> 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> 				break;
> 			:: tmp >= 4 ->
> 				mbok = 1;
> 				break;
> 			od
> 
> 		}
> 
> 		if
> 		:: mbok == 1 ->
> 			/* We get here if mb processing is safe. */
> 			do
> 			:: need_mb == 1 ->
> 				need_mb = 0;
> 			:: 1 -> skip;
> 			:: 1 -> break;
> 			od;
> 		:: else -> skip;
> 		fi;
> 
> 		/*
> 		 * Check to see if we have modeled the entire RCU read-side
> 		 * critical section, and leave if so.
> 		 */
> 		if
> 		:: done == 1 -> break;
> 		:: else -> skip;
> 		fi
> 	od;
> 	assert((tmp_free == 0) || (tmp_removed == 1));
> 
> 	/* Process any late-arriving memory-barrier requests. */
> 	do
> 	:: need_mb == 1 ->
> 		need_mb = 0;
> 	:: 1 -> skip;
> 	:: 1 -> break;
> 	od;
> }
> 
> /* Model the RCU update process. */
> 
> proctype urcu_updater()
> {
> 	/* Removal statement, e.g., list_del_rcu(). */
> 	removed = 1;
> 
> 	/* synchronize_rcu(), first counter flip. */
> 	need_mb = 1;
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od;
> 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> 	need_mb = 1;
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od;
> 	do
> 	:: 1 ->
> 		printf("urcu_gp_ctr=%x urcu_active_readers=%x\n", urcu_gp_ctr, urcu_active_readers);
> 		printf("urcu_gp_ctr&0x7f=%x urcu_active_readers&0x7f=%x\n", urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK, urcu_active_readers & ~RCU_GP_CTR_NEST_MASK);
> 		if
> 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> 			skip;
> 		:: else -> break;
> 		fi
> 	od;
> 	need_mb = 1;
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od;
> 
> 	/* Erroneous removal statement, e.g., list_del_rcu(). */
> 	/* removed = 1; */
> 
> 	/* synchronize_rcu(), second counter flip. */
> 	need_mb = 1;
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od;
> 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> 	need_mb = 1;
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od;
> 	do
> 	:: 1 ->
> 		printf("urcu_gp_ctr=%x urcu_active_readers=%x\n", urcu_gp_ctr, urcu_active_readers);
> 		printf("urcu_gp_ctr&0x7f=%x urcu_active_readers&0x7f=%x\n", urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK, urcu_active_readers & ~RCU_GP_CTR_NEST_MASK);
> 		if
> 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> 			skip;
> 		:: else -> break;
> 		fi;
> 	od;
> 	need_mb = 1;
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od;
> 
> 	/* free-up step, e.g., kfree(). */
> 	free = 1;
> }
> 
> /*
>  * Initialize the array, spawn a reader and an updater.  Because readers
>  * are independent of each other, only one reader is needed.
>  */
> 
> init {
> 	atomic {
> 		reader_progress[0] = 0;
> 		reader_progress[1] = 0;
> 		reader_progress[2] = 0;
> 		reader_progress[3] = 0;
> 		run urcu_reader();
> 		run urcu_updater();
> 	}
> }


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12  2:37                                             ` Paul E. McKenney
@ 2009-02-12  4:10                                               ` Mathieu Desnoyers
  2009-02-12  5:09                                                 ` Paul E. McKenney
  0 siblings, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-12  4:10 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Wed, Feb 11, 2009 at 06:33:08PM -0800, Paul E. McKenney wrote:
> > On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> > > On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> 
> [ . . . ]
> 
> > > > > (BTW, I do not trust my model yet, as it currently cannot detect the
> > > > > failure case I pointed out earlier.  :-/  Here and I thought that the
> > > > > point of such models was to detect additional failure cases!!!)
> > > > > 
> > > > 
> > > > Yes, I'll have to dig deeper into it.
> > > 
> > > Well, as I said, I attached the current model and the error trail.
> > 
> > And I had bugs in my model that allowed the rcu_read_lock() model
> > to nest indefinitely, which overflowed into the top bit, messing
> > things up.  :-/
> > 
> > Attached is a fixed model.  This model validates correctly (woo-hoo!).
> > Even better, gives the expected error if you comment out line 180 and
> > uncomment line 213, this latter corresponding to the error case I called
> > out a few days ago.
> > 
> > I will play with removing models of mb...
> 
> And commenting out the models of mb between the counter flips and the
> test for readers still passes validation, as expected, and as shown in
> the attached Promela code.
> 

Hrm, in the email I sent you about the memory barrier, I said that it
would not make the algorithm incorrect, but that it would cause
situations where it would be impossible for the writer to do any
progress as long as there are readers active. I think we would have to
enhance the model or at least express this through some LTL statement to
validate this specific behavior.

Mathieu

> 						Thanx, Paul

Content-Description: urcu.spin
> /*
>  * urcu.spin: Promela code to validate urcu.  See commit number
>  *	3a9e6e9df706b8d39af94d2f027210e2e7d4106e of Mathieu Desnoyer's
>  *      git archive at git://lttng.org/userspace-rcu.git
>  *
>  * This program is free software; you can redistribute it and/or modify
>  * it under the terms of the GNU General Public License as published by
>  * the Free Software Foundation; either version 2 of the License, or
>  * (at your option) any later version.
>  *
>  * This program is distributed in the hope that it will be useful,
>  * but WITHOUT ANY WARRANTY; without even the implied warranty of
>  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>  * GNU General Public License for more details.
>  *
>  * You should have received a copy of the GNU General Public License
>  * along with this program; if not, write to the Free Software
>  * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
>  *
>  * Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
>  */
> 
> /* Promela validation variables. */
> 
> bit removed = 0;  /* Has RCU removal happened, e.g., list_del_rcu()? */
> bit free = 0;     /* Has RCU reclamation happened, e.g., kfree()? */
> bit need_mb = 0;  /* =1 says need reader mb, =0 for reader response. */
> byte reader_progress[4];
> 		  /* Count of read-side statement executions. */
> 
> /* urcu definitions and variables, taken straight from the algorithm. */
> 
> #define RCU_GP_CTR_BIT (1 << 7)
> #define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BIT - 1)
> 
> byte urcu_gp_ctr = 1;
> byte urcu_active_readers = 0;
> 
> /* Model the RCU read-side critical section. */
> 
> proctype urcu_reader()
> {
> 	bit done = 0;
> 	bit mbok;
> 	byte tmp;
> 	byte tmp_removed;
> 	byte tmp_free;
> 
> 	/* Absorb any early requests for memory barriers. */
> 	do
> 	:: need_mb == 1 ->
> 		need_mb = 0;
> 	:: 1 -> skip;
> 	:: 1 -> break;
> 	od;
> 
> 	/*
> 	 * Each pass through this loop executes one read-side statement
> 	 * from the following code fragment:
> 	 *
> 	 *	rcu_read_lock(); [0a]
> 	 *	rcu_read_lock(); [0b]
> 	 *	p = rcu_dereference(global_p); [1]
> 	 *	x = p->data; [2]
> 	 *	rcu_read_unlock(); [3b]
> 	 *	rcu_read_unlock(); [3a]
> 	 *
> 	 * Because we are modeling a weak-memory machine, these statements
> 	 * can be seen in any order, the only restriction being that
> 	 * rcu_read_unlock() cannot precede the corresponding rcu_read_lock().
> 	 * The placement of the inner rcu_read_lock() and rcu_read_unlock()
> 	 * is non-deterministic, the above is but one possible placement.
> 	 * Intestingly enough, this model validates all possible placements
> 	 * of the inner rcu_read_lock() and rcu_read_unlock() statements,
> 	 * with the only constraint being that the rcu_read_lock() must
> 	 * precede the rcu_read_unlock().
> 	 *
> 	 * We also respond to memory-barrier requests, but only if our
> 	 * execution happens to be ordered.  If the current state is
> 	 * misordered, we ignore memory-barrier requests.
> 	 */
> 	do
> 	:: 1 ->
> 		if
> 		:: reader_progress[0] < 2 -> /* [0a and 0b] */
> 			tmp = urcu_active_readers;
> 			if
> 			:: (tmp & RCU_GP_CTR_NEST_MASK) == 0 ->
> 				tmp = urcu_gp_ctr;
> 				do
> 				:: (reader_progress[1] +
> 				    reader_progress[2] +
> 				    reader_progress[3] == 0) && need_mb == 1 ->
> 					need_mb = 0;
> 				:: 1 -> skip;
> 				:: 1 -> break;
> 				od;
> 				urcu_active_readers = tmp;
> 			 :: else ->
> 				urcu_active_readers = tmp + 1;
> 			fi;
> 			reader_progress[0] = reader_progress[0] + 1;
> 		:: reader_progress[1] == 0 -> /* [1] */
> 			tmp_removed = removed;
> 			reader_progress[1] = 1;
> 		:: reader_progress[2] == 0 -> /* [2] */
> 			tmp_free = free;
> 			reader_progress[2] = 1;
> 		:: ((reader_progress[0] > reader_progress[3]) &&
> 		    (reader_progress[3] < 2)) -> /* [3a and 3b] */
> 			tmp = urcu_active_readers - 1;
> 			urcu_active_readers = tmp;
> 			reader_progress[3] = reader_progress[3] + 1;
> 		:: else -> break;
> 		fi;
> 
> 		/* Process memory-barrier requests, if it is safe to do so. */
> 		atomic {
> 			mbok = 0;
> 			tmp = 0;
> 			do
> 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> 				tmp = tmp + 1;
> 				break;
> 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> 				tmp = tmp + 1;
> 			:: tmp >= 4 ->
> 				done = 1;
> 				break;
> 			od;
> 			do
> 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> 				tmp = tmp + 1;
> 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> 				break;
> 			:: tmp >= 4 ->
> 				mbok = 1;
> 				break;
> 			od
> 
> 		}
> 
> 		if
> 		:: mbok == 1 ->
> 			/* We get here if mb processing is safe. */
> 			do
> 			:: need_mb == 1 ->
> 				need_mb = 0;
> 			:: 1 -> skip;
> 			:: 1 -> break;
> 			od;
> 		:: else -> skip;
> 		fi;
> 
> 		/*
> 		 * Check to see if we have modeled the entire RCU read-side
> 		 * critical section, and leave if so.
> 		 */
> 		if
> 		:: done == 1 -> break;
> 		:: else -> skip;
> 		fi
> 	od;
> 	assert((tmp_free == 0) || (tmp_removed == 1));
> 
> 	/* Process any late-arriving memory-barrier requests. */
> 	do
> 	:: need_mb == 1 ->
> 		need_mb = 0;
> 	:: 1 -> skip;
> 	:: 1 -> break;
> 	od;
> }
> 
> /* Model the RCU update process. */
> 
> proctype urcu_updater()
> {
> 	/* Removal statement, e.g., list_del_rcu(). */
> 	removed = 1;
> 
> 	/* synchronize_rcu(), first counter flip. */
> 	need_mb = 1;
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od;
> 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> 	/* need_mb = 1;
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od; */
> 	do
> 	:: 1 ->
> 		if
> 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> 			skip;
> 		:: else -> break;
> 		fi
> 	od;
> 	need_mb = 1;
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od;
> 
> 	/* Erroneous removal statement, e.g., list_del_rcu(). */
> 	/* removed = 1; */
> 
> 	/* synchronize_rcu(), second counter flip. */
> 	need_mb = 1;
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od;
> 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> 	/* need_mb = 1;
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od; */
> 	do
> 	:: 1 ->
> 		if
> 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> 			skip;
> 		:: else -> break;
> 		fi;
> 	od;
> 	need_mb = 1;
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od;
> 
> 	/* free-up step, e.g., kfree(). */
> 	free = 1;
> }
> 
> /*
>  * Initialize the array, spawn a reader and an updater.  Because readers
>  * are independent of each other, only one reader is needed.
>  */
> 
> init {
> 	atomic {
> 		reader_progress[0] = 0;
> 		reader_progress[1] = 0;
> 		reader_progress[2] = 0;
> 		reader_progress[3] = 0;
> 		run urcu_reader();
> 		run urcu_updater();
> 	}
> }

> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12  4:08                                             ` Mathieu Desnoyers
@ 2009-02-12  5:01                                               ` Paul E. McKenney
  2009-02-12  7:05                                                 ` Mathieu Desnoyers
  0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-12  5:01 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

On Wed, Feb 11, 2009 at 11:08:24PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> > > On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:

[ . . . ]

> > And I had bugs in my model that allowed the rcu_read_lock() model
> > to nest indefinitely, which overflowed into the top bit, messing
> > things up.  :-/
> > 
> > Attached is a fixed model.  This model validates correctly (woo-hoo!).
> > Even better, gives the expected error if you comment out line 180 and
> > uncomment line 213, this latter corresponding to the error case I called
> > out a few days ago.
> > 
> 
> Great ! :) I added this version to the git repository, hopefully it's ok
> with you ?

Works for me!

> > I will play with removing models of mb...
> > 
> 
> OK, I see you already did..

I continued this, and surprisingly few are actually required, though
I don't fully trust the modeling of removed memory barriers.

							Thanx, Paul

> Mathieu
> 
> > 							Thanx, Paul
> 
> Content-Description: urcu.spin
> > /*
> >  * urcu.spin: Promela code to validate urcu.  See commit number
> >  *	3a9e6e9df706b8d39af94d2f027210e2e7d4106e of Mathieu Desnoyer's
> >  *      git archive at git://lttng.org/userspace-rcu.git
> >  *
> >  * This program is free software; you can redistribute it and/or modify
> >  * it under the terms of the GNU General Public License as published by
> >  * the Free Software Foundation; either version 2 of the License, or
> >  * (at your option) any later version.
> >  *
> >  * This program is distributed in the hope that it will be useful,
> >  * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >  * GNU General Public License for more details.
> >  *
> >  * You should have received a copy of the GNU General Public License
> >  * along with this program; if not, write to the Free Software
> >  * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> >  *
> >  * Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
> >  */
> > 
> > /* Promela validation variables. */
> > 
> > bit removed = 0;  /* Has RCU removal happened, e.g., list_del_rcu()? */
> > bit free = 0;     /* Has RCU reclamation happened, e.g., kfree()? */
> > bit need_mb = 0;  /* =1 says need reader mb, =0 for reader response. */
> > byte reader_progress[4];
> > 		  /* Count of read-side statement executions. */
> > 
> > /* urcu definitions and variables, taken straight from the algorithm. */
> > 
> > #define RCU_GP_CTR_BIT (1 << 7)
> > #define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BIT - 1)
> > 
> > byte urcu_gp_ctr = 1;
> > byte urcu_active_readers = 0;
> > 
> > /* Model the RCU read-side critical section. */
> > 
> > proctype urcu_reader()
> > {
> > 	bit done = 0;
> > 	bit mbok;
> > 	byte tmp;
> > 	byte tmp_removed;
> > 	byte tmp_free;
> > 
> > 	/* Absorb any early requests for memory barriers. */
> > 	do
> > 	:: need_mb == 1 ->
> > 		need_mb = 0;
> > 	:: 1 -> skip;
> > 	:: 1 -> break;
> > 	od;
> > 
> > 	/*
> > 	 * Each pass through this loop executes one read-side statement
> > 	 * from the following code fragment:
> > 	 *
> > 	 *	rcu_read_lock(); [0a]
> > 	 *	rcu_read_lock(); [0b]
> > 	 *	p = rcu_dereference(global_p); [1]
> > 	 *	x = p->data; [2]
> > 	 *	rcu_read_unlock(); [3b]
> > 	 *	rcu_read_unlock(); [3a]
> > 	 *
> > 	 * Because we are modeling a weak-memory machine, these statements
> > 	 * can be seen in any order, the only restriction being that
> > 	 * rcu_read_unlock() cannot precede the corresponding rcu_read_lock().
> > 	 * The placement of the inner rcu_read_lock() and rcu_read_unlock()
> > 	 * is non-deterministic, the above is but one possible placement.
> > 	 * Intestingly enough, this model validates all possible placements
> > 	 * of the inner rcu_read_lock() and rcu_read_unlock() statements,
> > 	 * with the only constraint being that the rcu_read_lock() must
> > 	 * precede the rcu_read_unlock().
> > 	 *
> > 	 * We also respond to memory-barrier requests, but only if our
> > 	 * execution happens to be ordered.  If the current state is
> > 	 * misordered, we ignore memory-barrier requests.
> > 	 */
> > 	do
> > 	:: 1 ->
> > 		if
> > 		:: reader_progress[0] < 2 -> /* [0a and 0b] */
> > 			tmp = urcu_active_readers;
> > 			if
> > 			:: (tmp & RCU_GP_CTR_NEST_MASK) == 0 ->
> > 				tmp = urcu_gp_ctr;
> > 				do
> > 				:: (reader_progress[1] +
> > 				    reader_progress[2] +
> > 				    reader_progress[3] == 0) && need_mb == 1 ->
> > 					need_mb = 0;
> > 				:: 1 -> skip;
> > 				:: 1 -> break;
> > 				od;
> > 				urcu_active_readers = tmp;
> > 			 :: else ->
> > 				urcu_active_readers = tmp + 1;
> > 			fi;
> > 			reader_progress[0] = reader_progress[0] + 1;
> > 		:: reader_progress[1] == 0 -> /* [1] */
> > 			tmp_removed = removed;
> > 			reader_progress[1] = 1;
> > 		:: reader_progress[2] == 0 -> /* [2] */
> > 			tmp_free = free;
> > 			reader_progress[2] = 1;
> > 		:: ((reader_progress[0] > reader_progress[3]) &&
> > 		    (reader_progress[3] < 2)) -> /* [3a and 3b] */
> > 			tmp = urcu_active_readers - 1;
> > 			urcu_active_readers = tmp;
> > 			reader_progress[3] = reader_progress[3] + 1;
> > 		:: else -> break;
> > 		fi;
> > 
> > 		/* Process memory-barrier requests, if it is safe to do so. */
> > 		atomic {
> > 			mbok = 0;
> > 			tmp = 0;
> > 			do
> > 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> > 				tmp = tmp + 1;
> > 				break;
> > 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> > 				tmp = tmp + 1;
> > 			:: tmp >= 4 ->
> > 				done = 1;
> > 				break;
> > 			od;
> > 			do
> > 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> > 				tmp = tmp + 1;
> > 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> > 				break;
> > 			:: tmp >= 4 ->
> > 				mbok = 1;
> > 				break;
> > 			od
> > 
> > 		}
> > 
> > 		if
> > 		:: mbok == 1 ->
> > 			/* We get here if mb processing is safe. */
> > 			do
> > 			:: need_mb == 1 ->
> > 				need_mb = 0;
> > 			:: 1 -> skip;
> > 			:: 1 -> break;
> > 			od;
> > 		:: else -> skip;
> > 		fi;
> > 
> > 		/*
> > 		 * Check to see if we have modeled the entire RCU read-side
> > 		 * critical section, and leave if so.
> > 		 */
> > 		if
> > 		:: done == 1 -> break;
> > 		:: else -> skip;
> > 		fi
> > 	od;
> > 	assert((tmp_free == 0) || (tmp_removed == 1));
> > 
> > 	/* Process any late-arriving memory-barrier requests. */
> > 	do
> > 	:: need_mb == 1 ->
> > 		need_mb = 0;
> > 	:: 1 -> skip;
> > 	:: 1 -> break;
> > 	od;
> > }
> > 
> > /* Model the RCU update process. */
> > 
> > proctype urcu_updater()
> > {
> > 	/* Removal statement, e.g., list_del_rcu(). */
> > 	removed = 1;
> > 
> > 	/* synchronize_rcu(), first counter flip. */
> > 	need_mb = 1;
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od;
> > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > 	need_mb = 1;
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od;
> > 	do
> > 	:: 1 ->
> > 		printf("urcu_gp_ctr=%x urcu_active_readers=%x\n", urcu_gp_ctr, urcu_active_readers);
> > 		printf("urcu_gp_ctr&0x7f=%x urcu_active_readers&0x7f=%x\n", urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK, urcu_active_readers & ~RCU_GP_CTR_NEST_MASK);
> > 		if
> > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > 			skip;
> > 		:: else -> break;
> > 		fi
> > 	od;
> > 	need_mb = 1;
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od;
> > 
> > 	/* Erroneous removal statement, e.g., list_del_rcu(). */
> > 	/* removed = 1; */
> > 
> > 	/* synchronize_rcu(), second counter flip. */
> > 	need_mb = 1;
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od;
> > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > 	need_mb = 1;
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od;
> > 	do
> > 	:: 1 ->
> > 		printf("urcu_gp_ctr=%x urcu_active_readers=%x\n", urcu_gp_ctr, urcu_active_readers);
> > 		printf("urcu_gp_ctr&0x7f=%x urcu_active_readers&0x7f=%x\n", urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK, urcu_active_readers & ~RCU_GP_CTR_NEST_MASK);
> > 		if
> > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > 			skip;
> > 		:: else -> break;
> > 		fi;
> > 	od;
> > 	need_mb = 1;
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od;
> > 
> > 	/* free-up step, e.g., kfree(). */
> > 	free = 1;
> > }
> > 
> > /*
> >  * Initialize the array, spawn a reader and an updater.  Because readers
> >  * are independent of each other, only one reader is needed.
> >  */
> > 
> > init {
> > 	atomic {
> > 		reader_progress[0] = 0;
> > 		reader_progress[1] = 0;
> > 		reader_progress[2] = 0;
> > 		reader_progress[3] = 0;
> > 		run urcu_reader();
> > 		run urcu_updater();
> > 	}
> > }
> 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12  4:10                                               ` Mathieu Desnoyers
@ 2009-02-12  5:09                                                 ` Paul E. McKenney
  2009-02-12  5:47                                                   ` Mathieu Desnoyers
  0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-12  5:09 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

On Wed, Feb 11, 2009 at 11:10:44PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Wed, Feb 11, 2009 at 06:33:08PM -0800, Paul E. McKenney wrote:
> > > On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> > > > On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > 
> > [ . . . ]
> > 
> > > > > > (BTW, I do not trust my model yet, as it currently cannot detect the
> > > > > > failure case I pointed out earlier.  :-/  Here and I thought that the
> > > > > > point of such models was to detect additional failure cases!!!)
> > > > > > 
> > > > > 
> > > > > Yes, I'll have to dig deeper into it.
> > > > 
> > > > Well, as I said, I attached the current model and the error trail.
> > > 
> > > And I had bugs in my model that allowed the rcu_read_lock() model
> > > to nest indefinitely, which overflowed into the top bit, messing
> > > things up.  :-/
> > > 
> > > Attached is a fixed model.  This model validates correctly (woo-hoo!).
> > > Even better, gives the expected error if you comment out line 180 and
> > > uncomment line 213, this latter corresponding to the error case I called
> > > out a few days ago.
> > > 
> > > I will play with removing models of mb...
> > 
> > And commenting out the models of mb between the counter flips and the
> > test for readers still passes validation, as expected, and as shown in
> > the attached Promela code.
> > 
> 
> Hrm, in the email I sent you about the memory barrier, I said that it
> would not make the algorithm incorrect, but that it would cause
> situations where it would be impossible for the writer to do any
> progress as long as there are readers active. I think we would have to
> enhance the model or at least express this through some LTL statement to
> validate this specific behavior.

But if the writer fails to make progress, then the counter remains at a
given value, which causes readers to drain, which allows the writer to
eventually make progress again.  Right?

						Thanx, Paul

> Mathieu
> 
> > 						Thanx, Paul
> 
> Content-Description: urcu.spin
> > /*
> >  * urcu.spin: Promela code to validate urcu.  See commit number
> >  *	3a9e6e9df706b8d39af94d2f027210e2e7d4106e of Mathieu Desnoyer's
> >  *      git archive at git://lttng.org/userspace-rcu.git
> >  *
> >  * This program is free software; you can redistribute it and/or modify
> >  * it under the terms of the GNU General Public License as published by
> >  * the Free Software Foundation; either version 2 of the License, or
> >  * (at your option) any later version.
> >  *
> >  * This program is distributed in the hope that it will be useful,
> >  * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >  * GNU General Public License for more details.
> >  *
> >  * You should have received a copy of the GNU General Public License
> >  * along with this program; if not, write to the Free Software
> >  * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> >  *
> >  * Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
> >  */
> > 
> > /* Promela validation variables. */
> > 
> > bit removed = 0;  /* Has RCU removal happened, e.g., list_del_rcu()? */
> > bit free = 0;     /* Has RCU reclamation happened, e.g., kfree()? */
> > bit need_mb = 0;  /* =1 says need reader mb, =0 for reader response. */
> > byte reader_progress[4];
> > 		  /* Count of read-side statement executions. */
> > 
> > /* urcu definitions and variables, taken straight from the algorithm. */
> > 
> > #define RCU_GP_CTR_BIT (1 << 7)
> > #define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BIT - 1)
> > 
> > byte urcu_gp_ctr = 1;
> > byte urcu_active_readers = 0;
> > 
> > /* Model the RCU read-side critical section. */
> > 
> > proctype urcu_reader()
> > {
> > 	bit done = 0;
> > 	bit mbok;
> > 	byte tmp;
> > 	byte tmp_removed;
> > 	byte tmp_free;
> > 
> > 	/* Absorb any early requests for memory barriers. */
> > 	do
> > 	:: need_mb == 1 ->
> > 		need_mb = 0;
> > 	:: 1 -> skip;
> > 	:: 1 -> break;
> > 	od;
> > 
> > 	/*
> > 	 * Each pass through this loop executes one read-side statement
> > 	 * from the following code fragment:
> > 	 *
> > 	 *	rcu_read_lock(); [0a]
> > 	 *	rcu_read_lock(); [0b]
> > 	 *	p = rcu_dereference(global_p); [1]
> > 	 *	x = p->data; [2]
> > 	 *	rcu_read_unlock(); [3b]
> > 	 *	rcu_read_unlock(); [3a]
> > 	 *
> > 	 * Because we are modeling a weak-memory machine, these statements
> > 	 * can be seen in any order, the only restriction being that
> > 	 * rcu_read_unlock() cannot precede the corresponding rcu_read_lock().
> > 	 * The placement of the inner rcu_read_lock() and rcu_read_unlock()
> > 	 * is non-deterministic, the above is but one possible placement.
> > 	 * Intestingly enough, this model validates all possible placements
> > 	 * of the inner rcu_read_lock() and rcu_read_unlock() statements,
> > 	 * with the only constraint being that the rcu_read_lock() must
> > 	 * precede the rcu_read_unlock().
> > 	 *
> > 	 * We also respond to memory-barrier requests, but only if our
> > 	 * execution happens to be ordered.  If the current state is
> > 	 * misordered, we ignore memory-barrier requests.
> > 	 */
> > 	do
> > 	:: 1 ->
> > 		if
> > 		:: reader_progress[0] < 2 -> /* [0a and 0b] */
> > 			tmp = urcu_active_readers;
> > 			if
> > 			:: (tmp & RCU_GP_CTR_NEST_MASK) == 0 ->
> > 				tmp = urcu_gp_ctr;
> > 				do
> > 				:: (reader_progress[1] +
> > 				    reader_progress[2] +
> > 				    reader_progress[3] == 0) && need_mb == 1 ->
> > 					need_mb = 0;
> > 				:: 1 -> skip;
> > 				:: 1 -> break;
> > 				od;
> > 				urcu_active_readers = tmp;
> > 			 :: else ->
> > 				urcu_active_readers = tmp + 1;
> > 			fi;
> > 			reader_progress[0] = reader_progress[0] + 1;
> > 		:: reader_progress[1] == 0 -> /* [1] */
> > 			tmp_removed = removed;
> > 			reader_progress[1] = 1;
> > 		:: reader_progress[2] == 0 -> /* [2] */
> > 			tmp_free = free;
> > 			reader_progress[2] = 1;
> > 		:: ((reader_progress[0] > reader_progress[3]) &&
> > 		    (reader_progress[3] < 2)) -> /* [3a and 3b] */
> > 			tmp = urcu_active_readers - 1;
> > 			urcu_active_readers = tmp;
> > 			reader_progress[3] = reader_progress[3] + 1;
> > 		:: else -> break;
> > 		fi;
> > 
> > 		/* Process memory-barrier requests, if it is safe to do so. */
> > 		atomic {
> > 			mbok = 0;
> > 			tmp = 0;
> > 			do
> > 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> > 				tmp = tmp + 1;
> > 				break;
> > 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> > 				tmp = tmp + 1;
> > 			:: tmp >= 4 ->
> > 				done = 1;
> > 				break;
> > 			od;
> > 			do
> > 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> > 				tmp = tmp + 1;
> > 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> > 				break;
> > 			:: tmp >= 4 ->
> > 				mbok = 1;
> > 				break;
> > 			od
> > 
> > 		}
> > 
> > 		if
> > 		:: mbok == 1 ->
> > 			/* We get here if mb processing is safe. */
> > 			do
> > 			:: need_mb == 1 ->
> > 				need_mb = 0;
> > 			:: 1 -> skip;
> > 			:: 1 -> break;
> > 			od;
> > 		:: else -> skip;
> > 		fi;
> > 
> > 		/*
> > 		 * Check to see if we have modeled the entire RCU read-side
> > 		 * critical section, and leave if so.
> > 		 */
> > 		if
> > 		:: done == 1 -> break;
> > 		:: else -> skip;
> > 		fi
> > 	od;
> > 	assert((tmp_free == 0) || (tmp_removed == 1));
> > 
> > 	/* Process any late-arriving memory-barrier requests. */
> > 	do
> > 	:: need_mb == 1 ->
> > 		need_mb = 0;
> > 	:: 1 -> skip;
> > 	:: 1 -> break;
> > 	od;
> > }
> > 
> > /* Model the RCU update process. */
> > 
> > proctype urcu_updater()
> > {
> > 	/* Removal statement, e.g., list_del_rcu(). */
> > 	removed = 1;
> > 
> > 	/* synchronize_rcu(), first counter flip. */
> > 	need_mb = 1;
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od;
> > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > 	/* need_mb = 1;
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od; */
> > 	do
> > 	:: 1 ->
> > 		if
> > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > 			skip;
> > 		:: else -> break;
> > 		fi
> > 	od;
> > 	need_mb = 1;
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od;
> > 
> > 	/* Erroneous removal statement, e.g., list_del_rcu(). */
> > 	/* removed = 1; */
> > 
> > 	/* synchronize_rcu(), second counter flip. */
> > 	need_mb = 1;
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od;
> > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > 	/* need_mb = 1;
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od; */
> > 	do
> > 	:: 1 ->
> > 		if
> > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > 			skip;
> > 		:: else -> break;
> > 		fi;
> > 	od;
> > 	need_mb = 1;
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od;
> > 
> > 	/* free-up step, e.g., kfree(). */
> > 	free = 1;
> > }
> > 
> > /*
> >  * Initialize the array, spawn a reader and an updater.  Because readers
> >  * are independent of each other, only one reader is needed.
> >  */
> > 
> > init {
> > 	atomic {
> > 		reader_progress[0] = 0;
> > 		reader_progress[1] = 0;
> > 		reader_progress[2] = 0;
> > 		reader_progress[3] = 0;
> > 		run urcu_reader();
> > 		run urcu_updater();
> > 	}
> > }
> 
> > _______________________________________________
> > ltt-dev mailing list
> > ltt-dev@lists.casi.polymtl.ca
> > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12  5:09                                                 ` Paul E. McKenney
@ 2009-02-12  5:47                                                   ` Mathieu Desnoyers
  2009-02-12 16:18                                                     ` Paul E. McKenney
  0 siblings, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-12  5:47 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Wed, Feb 11, 2009 at 11:10:44PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Wed, Feb 11, 2009 at 06:33:08PM -0800, Paul E. McKenney wrote:
> > > > On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> > > > > On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > 
> > > [ . . . ]
> > > 
> > > > > > > (BTW, I do not trust my model yet, as it currently cannot detect the
> > > > > > > failure case I pointed out earlier.  :-/  Here and I thought that the
> > > > > > > point of such models was to detect additional failure cases!!!)
> > > > > > > 
> > > > > > 
> > > > > > Yes, I'll have to dig deeper into it.
> > > > > 
> > > > > Well, as I said, I attached the current model and the error trail.
> > > > 
> > > > And I had bugs in my model that allowed the rcu_read_lock() model
> > > > to nest indefinitely, which overflowed into the top bit, messing
> > > > things up.  :-/
> > > > 
> > > > Attached is a fixed model.  This model validates correctly (woo-hoo!).
> > > > Even better, gives the expected error if you comment out line 180 and
> > > > uncomment line 213, this latter corresponding to the error case I called
> > > > out a few days ago.
> > > > 
> > > > I will play with removing models of mb...
> > > 
> > > And commenting out the models of mb between the counter flips and the
> > > test for readers still passes validation, as expected, and as shown in
> > > the attached Promela code.
> > > 
> > 
> > Hrm, in the email I sent you about the memory barrier, I said that it
> > would not make the algorithm incorrect, but that it would cause
> > situations where it would be impossible for the writer to do any
> > progress as long as there are readers active. I think we would have to
> > enhance the model or at least express this through some LTL statement to
> > validate this specific behavior.
> 
> But if the writer fails to make progress, then the counter remains at a
> given value, which causes readers to drain, which allows the writer to
> eventually make progress again.  Right?
> 

Not necessarily. If we don't have the proper memory barriers, we can
have the writer waiting on, say, parity 0 *before* it has performed the
parity switch. Therefore, even newly coming readers will add up to
parity 0.

In your model, this is not detected, because eventually all readers will
execute, and only then the writer will be able to update the data. But
in reality, if we run a very busy 4096-cores machines where there is
always at least one reader active, the the writer will be stuck forever,
and that's really bad.

Mathieu

> 						Thanx, Paul
> 
> > Mathieu
> > 
> > > 						Thanx, Paul
> > 
> > Content-Description: urcu.spin
> > > /*
> > >  * urcu.spin: Promela code to validate urcu.  See commit number
> > >  *	3a9e6e9df706b8d39af94d2f027210e2e7d4106e of Mathieu Desnoyer's
> > >  *      git archive at git://lttng.org/userspace-rcu.git
> > >  *
> > >  * This program is free software; you can redistribute it and/or modify
> > >  * it under the terms of the GNU General Public License as published by
> > >  * the Free Software Foundation; either version 2 of the License, or
> > >  * (at your option) any later version.
> > >  *
> > >  * This program is distributed in the hope that it will be useful,
> > >  * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > >  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > >  * GNU General Public License for more details.
> > >  *
> > >  * You should have received a copy of the GNU General Public License
> > >  * along with this program; if not, write to the Free Software
> > >  * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> > >  *
> > >  * Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
> > >  */
> > > 
> > > /* Promela validation variables. */
> > > 
> > > bit removed = 0;  /* Has RCU removal happened, e.g., list_del_rcu()? */
> > > bit free = 0;     /* Has RCU reclamation happened, e.g., kfree()? */
> > > bit need_mb = 0;  /* =1 says need reader mb, =0 for reader response. */
> > > byte reader_progress[4];
> > > 		  /* Count of read-side statement executions. */
> > > 
> > > /* urcu definitions and variables, taken straight from the algorithm. */
> > > 
> > > #define RCU_GP_CTR_BIT (1 << 7)
> > > #define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BIT - 1)
> > > 
> > > byte urcu_gp_ctr = 1;
> > > byte urcu_active_readers = 0;
> > > 
> > > /* Model the RCU read-side critical section. */
> > > 
> > > proctype urcu_reader()
> > > {
> > > 	bit done = 0;
> > > 	bit mbok;
> > > 	byte tmp;
> > > 	byte tmp_removed;
> > > 	byte tmp_free;
> > > 
> > > 	/* Absorb any early requests for memory barriers. */
> > > 	do
> > > 	:: need_mb == 1 ->
> > > 		need_mb = 0;
> > > 	:: 1 -> skip;
> > > 	:: 1 -> break;
> > > 	od;
> > > 
> > > 	/*
> > > 	 * Each pass through this loop executes one read-side statement
> > > 	 * from the following code fragment:
> > > 	 *
> > > 	 *	rcu_read_lock(); [0a]
> > > 	 *	rcu_read_lock(); [0b]
> > > 	 *	p = rcu_dereference(global_p); [1]
> > > 	 *	x = p->data; [2]
> > > 	 *	rcu_read_unlock(); [3b]
> > > 	 *	rcu_read_unlock(); [3a]
> > > 	 *
> > > 	 * Because we are modeling a weak-memory machine, these statements
> > > 	 * can be seen in any order, the only restriction being that
> > > 	 * rcu_read_unlock() cannot precede the corresponding rcu_read_lock().
> > > 	 * The placement of the inner rcu_read_lock() and rcu_read_unlock()
> > > 	 * is non-deterministic, the above is but one possible placement.
> > > 	 * Intestingly enough, this model validates all possible placements
> > > 	 * of the inner rcu_read_lock() and rcu_read_unlock() statements,
> > > 	 * with the only constraint being that the rcu_read_lock() must
> > > 	 * precede the rcu_read_unlock().
> > > 	 *
> > > 	 * We also respond to memory-barrier requests, but only if our
> > > 	 * execution happens to be ordered.  If the current state is
> > > 	 * misordered, we ignore memory-barrier requests.
> > > 	 */
> > > 	do
> > > 	:: 1 ->
> > > 		if
> > > 		:: reader_progress[0] < 2 -> /* [0a and 0b] */
> > > 			tmp = urcu_active_readers;
> > > 			if
> > > 			:: (tmp & RCU_GP_CTR_NEST_MASK) == 0 ->
> > > 				tmp = urcu_gp_ctr;
> > > 				do
> > > 				:: (reader_progress[1] +
> > > 				    reader_progress[2] +
> > > 				    reader_progress[3] == 0) && need_mb == 1 ->
> > > 					need_mb = 0;
> > > 				:: 1 -> skip;
> > > 				:: 1 -> break;
> > > 				od;
> > > 				urcu_active_readers = tmp;
> > > 			 :: else ->
> > > 				urcu_active_readers = tmp + 1;
> > > 			fi;
> > > 			reader_progress[0] = reader_progress[0] + 1;
> > > 		:: reader_progress[1] == 0 -> /* [1] */
> > > 			tmp_removed = removed;
> > > 			reader_progress[1] = 1;
> > > 		:: reader_progress[2] == 0 -> /* [2] */
> > > 			tmp_free = free;
> > > 			reader_progress[2] = 1;
> > > 		:: ((reader_progress[0] > reader_progress[3]) &&
> > > 		    (reader_progress[3] < 2)) -> /* [3a and 3b] */
> > > 			tmp = urcu_active_readers - 1;
> > > 			urcu_active_readers = tmp;
> > > 			reader_progress[3] = reader_progress[3] + 1;
> > > 		:: else -> break;
> > > 		fi;
> > > 
> > > 		/* Process memory-barrier requests, if it is safe to do so. */
> > > 		atomic {
> > > 			mbok = 0;
> > > 			tmp = 0;
> > > 			do
> > > 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> > > 				tmp = tmp + 1;
> > > 				break;
> > > 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> > > 				tmp = tmp + 1;
> > > 			:: tmp >= 4 ->
> > > 				done = 1;
> > > 				break;
> > > 			od;
> > > 			do
> > > 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> > > 				tmp = tmp + 1;
> > > 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> > > 				break;
> > > 			:: tmp >= 4 ->
> > > 				mbok = 1;
> > > 				break;
> > > 			od
> > > 
> > > 		}
> > > 
> > > 		if
> > > 		:: mbok == 1 ->
> > > 			/* We get here if mb processing is safe. */
> > > 			do
> > > 			:: need_mb == 1 ->
> > > 				need_mb = 0;
> > > 			:: 1 -> skip;
> > > 			:: 1 -> break;
> > > 			od;
> > > 		:: else -> skip;
> > > 		fi;
> > > 
> > > 		/*
> > > 		 * Check to see if we have modeled the entire RCU read-side
> > > 		 * critical section, and leave if so.
> > > 		 */
> > > 		if
> > > 		:: done == 1 -> break;
> > > 		:: else -> skip;
> > > 		fi
> > > 	od;
> > > 	assert((tmp_free == 0) || (tmp_removed == 1));
> > > 
> > > 	/* Process any late-arriving memory-barrier requests. */
> > > 	do
> > > 	:: need_mb == 1 ->
> > > 		need_mb = 0;
> > > 	:: 1 -> skip;
> > > 	:: 1 -> break;
> > > 	od;
> > > }
> > > 
> > > /* Model the RCU update process. */
> > > 
> > > proctype urcu_updater()
> > > {
> > > 	/* Removal statement, e.g., list_del_rcu(). */
> > > 	removed = 1;
> > > 
> > > 	/* synchronize_rcu(), first counter flip. */
> > > 	need_mb = 1;
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od;
> > > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > > 	/* need_mb = 1;
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od; */
> > > 	do
> > > 	:: 1 ->
> > > 		if
> > > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > 			skip;
> > > 		:: else -> break;
> > > 		fi
> > > 	od;
> > > 	need_mb = 1;
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od;
> > > 
> > > 	/* Erroneous removal statement, e.g., list_del_rcu(). */
> > > 	/* removed = 1; */
> > > 
> > > 	/* synchronize_rcu(), second counter flip. */
> > > 	need_mb = 1;
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od;
> > > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > > 	/* need_mb = 1;
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od; */
> > > 	do
> > > 	:: 1 ->
> > > 		if
> > > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > 			skip;
> > > 		:: else -> break;
> > > 		fi;
> > > 	od;
> > > 	need_mb = 1;
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od;
> > > 
> > > 	/* free-up step, e.g., kfree(). */
> > > 	free = 1;
> > > }
> > > 
> > > /*
> > >  * Initialize the array, spawn a reader and an updater.  Because readers
> > >  * are independent of each other, only one reader is needed.
> > >  */
> > > 
> > > init {
> > > 	atomic {
> > > 		reader_progress[0] = 0;
> > > 		reader_progress[1] = 0;
> > > 		reader_progress[2] = 0;
> > > 		reader_progress[3] = 0;
> > > 		run urcu_reader();
> > > 		run urcu_updater();
> > > 	}
> > > }
> > 
> > > _______________________________________________
> > > ltt-dev mailing list
> > > ltt-dev@lists.casi.polymtl.ca
> > > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > 
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12  5:01                                               ` Paul E. McKenney
@ 2009-02-12  7:05                                                 ` Mathieu Desnoyers
  2009-02-12 16:46                                                   ` Paul E. McKenney
  0 siblings, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-12  7:05 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Wed, Feb 11, 2009 at 11:08:24PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> > > > On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> 
> [ . . . ]
> 
> > > And I had bugs in my model that allowed the rcu_read_lock() model
> > > to nest indefinitely, which overflowed into the top bit, messing
> > > things up.  :-/
> > > 
> > > Attached is a fixed model.  This model validates correctly (woo-hoo!).
> > > Even better, gives the expected error if you comment out line 180 and
> > > uncomment line 213, this latter corresponding to the error case I called
> > > out a few days ago.
> > > 
> > 
> > Great ! :) I added this version to the git repository, hopefully it's ok
> > with you ?
> 
> Works for me!
> 
> > > I will play with removing models of mb...
> > > 
> > 
> > OK, I see you already did..
> 
> I continued this, and surprisingly few are actually required, though
> I don't fully trust the modeling of removed memory barriers.
> 

On my side I cleaned up the code a lot, and actually added some barriers
;) Especially in the busy loops, where we expect the other thread's
value to change eventually between iterations. A smp_rmb() seems more
appropriate that barrier(). I also added a lot of comments about
barriers in the code, and made the reader side much easier to review.

Please feel free to comment on my added code comments.

Mathieu

> 							Thanx, Paul
> 
> > Mathieu
> > 
> > > 							Thanx, Paul
> > 
> > Content-Description: urcu.spin
> > > /*
> > >  * urcu.spin: Promela code to validate urcu.  See commit number
> > >  *	3a9e6e9df706b8d39af94d2f027210e2e7d4106e of Mathieu Desnoyer's
> > >  *      git archive at git://lttng.org/userspace-rcu.git
> > >  *
> > >  * This program is free software; you can redistribute it and/or modify
> > >  * it under the terms of the GNU General Public License as published by
> > >  * the Free Software Foundation; either version 2 of the License, or
> > >  * (at your option) any later version.
> > >  *
> > >  * This program is distributed in the hope that it will be useful,
> > >  * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > >  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > >  * GNU General Public License for more details.
> > >  *
> > >  * You should have received a copy of the GNU General Public License
> > >  * along with this program; if not, write to the Free Software
> > >  * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> > >  *
> > >  * Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
> > >  */
> > > 
> > > /* Promela validation variables. */
> > > 
> > > bit removed = 0;  /* Has RCU removal happened, e.g., list_del_rcu()? */
> > > bit free = 0;     /* Has RCU reclamation happened, e.g., kfree()? */
> > > bit need_mb = 0;  /* =1 says need reader mb, =0 for reader response. */
> > > byte reader_progress[4];
> > > 		  /* Count of read-side statement executions. */
> > > 
> > > /* urcu definitions and variables, taken straight from the algorithm. */
> > > 
> > > #define RCU_GP_CTR_BIT (1 << 7)
> > > #define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BIT - 1)
> > > 
> > > byte urcu_gp_ctr = 1;
> > > byte urcu_active_readers = 0;
> > > 
> > > /* Model the RCU read-side critical section. */
> > > 
> > > proctype urcu_reader()
> > > {
> > > 	bit done = 0;
> > > 	bit mbok;
> > > 	byte tmp;
> > > 	byte tmp_removed;
> > > 	byte tmp_free;
> > > 
> > > 	/* Absorb any early requests for memory barriers. */
> > > 	do
> > > 	:: need_mb == 1 ->
> > > 		need_mb = 0;
> > > 	:: 1 -> skip;
> > > 	:: 1 -> break;
> > > 	od;
> > > 
> > > 	/*
> > > 	 * Each pass through this loop executes one read-side statement
> > > 	 * from the following code fragment:
> > > 	 *
> > > 	 *	rcu_read_lock(); [0a]
> > > 	 *	rcu_read_lock(); [0b]
> > > 	 *	p = rcu_dereference(global_p); [1]
> > > 	 *	x = p->data; [2]
> > > 	 *	rcu_read_unlock(); [3b]
> > > 	 *	rcu_read_unlock(); [3a]
> > > 	 *
> > > 	 * Because we are modeling a weak-memory machine, these statements
> > > 	 * can be seen in any order, the only restriction being that
> > > 	 * rcu_read_unlock() cannot precede the corresponding rcu_read_lock().
> > > 	 * The placement of the inner rcu_read_lock() and rcu_read_unlock()
> > > 	 * is non-deterministic, the above is but one possible placement.
> > > 	 * Intestingly enough, this model validates all possible placements
> > > 	 * of the inner rcu_read_lock() and rcu_read_unlock() statements,
> > > 	 * with the only constraint being that the rcu_read_lock() must
> > > 	 * precede the rcu_read_unlock().
> > > 	 *
> > > 	 * We also respond to memory-barrier requests, but only if our
> > > 	 * execution happens to be ordered.  If the current state is
> > > 	 * misordered, we ignore memory-barrier requests.
> > > 	 */
> > > 	do
> > > 	:: 1 ->
> > > 		if
> > > 		:: reader_progress[0] < 2 -> /* [0a and 0b] */
> > > 			tmp = urcu_active_readers;
> > > 			if
> > > 			:: (tmp & RCU_GP_CTR_NEST_MASK) == 0 ->
> > > 				tmp = urcu_gp_ctr;
> > > 				do
> > > 				:: (reader_progress[1] +
> > > 				    reader_progress[2] +
> > > 				    reader_progress[3] == 0) && need_mb == 1 ->
> > > 					need_mb = 0;
> > > 				:: 1 -> skip;
> > > 				:: 1 -> break;
> > > 				od;
> > > 				urcu_active_readers = tmp;
> > > 			 :: else ->
> > > 				urcu_active_readers = tmp + 1;
> > > 			fi;
> > > 			reader_progress[0] = reader_progress[0] + 1;
> > > 		:: reader_progress[1] == 0 -> /* [1] */
> > > 			tmp_removed = removed;
> > > 			reader_progress[1] = 1;
> > > 		:: reader_progress[2] == 0 -> /* [2] */
> > > 			tmp_free = free;
> > > 			reader_progress[2] = 1;
> > > 		:: ((reader_progress[0] > reader_progress[3]) &&
> > > 		    (reader_progress[3] < 2)) -> /* [3a and 3b] */
> > > 			tmp = urcu_active_readers - 1;
> > > 			urcu_active_readers = tmp;
> > > 			reader_progress[3] = reader_progress[3] + 1;
> > > 		:: else -> break;
> > > 		fi;
> > > 
> > > 		/* Process memory-barrier requests, if it is safe to do so. */
> > > 		atomic {
> > > 			mbok = 0;
> > > 			tmp = 0;
> > > 			do
> > > 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> > > 				tmp = tmp + 1;
> > > 				break;
> > > 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> > > 				tmp = tmp + 1;
> > > 			:: tmp >= 4 ->
> > > 				done = 1;
> > > 				break;
> > > 			od;
> > > 			do
> > > 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> > > 				tmp = tmp + 1;
> > > 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> > > 				break;
> > > 			:: tmp >= 4 ->
> > > 				mbok = 1;
> > > 				break;
> > > 			od
> > > 
> > > 		}
> > > 
> > > 		if
> > > 		:: mbok == 1 ->
> > > 			/* We get here if mb processing is safe. */
> > > 			do
> > > 			:: need_mb == 1 ->
> > > 				need_mb = 0;
> > > 			:: 1 -> skip;
> > > 			:: 1 -> break;
> > > 			od;
> > > 		:: else -> skip;
> > > 		fi;
> > > 
> > > 		/*
> > > 		 * Check to see if we have modeled the entire RCU read-side
> > > 		 * critical section, and leave if so.
> > > 		 */
> > > 		if
> > > 		:: done == 1 -> break;
> > > 		:: else -> skip;
> > > 		fi
> > > 	od;
> > > 	assert((tmp_free == 0) || (tmp_removed == 1));
> > > 
> > > 	/* Process any late-arriving memory-barrier requests. */
> > > 	do
> > > 	:: need_mb == 1 ->
> > > 		need_mb = 0;
> > > 	:: 1 -> skip;
> > > 	:: 1 -> break;
> > > 	od;
> > > }
> > > 
> > > /* Model the RCU update process. */
> > > 
> > > proctype urcu_updater()
> > > {
> > > 	/* Removal statement, e.g., list_del_rcu(). */
> > > 	removed = 1;
> > > 
> > > 	/* synchronize_rcu(), first counter flip. */
> > > 	need_mb = 1;
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od;
> > > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > > 	need_mb = 1;
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od;
> > > 	do
> > > 	:: 1 ->
> > > 		printf("urcu_gp_ctr=%x urcu_active_readers=%x\n", urcu_gp_ctr, urcu_active_readers);
> > > 		printf("urcu_gp_ctr&0x7f=%x urcu_active_readers&0x7f=%x\n", urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK, urcu_active_readers & ~RCU_GP_CTR_NEST_MASK);
> > > 		if
> > > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > 			skip;
> > > 		:: else -> break;
> > > 		fi
> > > 	od;
> > > 	need_mb = 1;
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od;
> > > 
> > > 	/* Erroneous removal statement, e.g., list_del_rcu(). */
> > > 	/* removed = 1; */
> > > 
> > > 	/* synchronize_rcu(), second counter flip. */
> > > 	need_mb = 1;
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od;
> > > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > > 	need_mb = 1;
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od;
> > > 	do
> > > 	:: 1 ->
> > > 		printf("urcu_gp_ctr=%x urcu_active_readers=%x\n", urcu_gp_ctr, urcu_active_readers);
> > > 		printf("urcu_gp_ctr&0x7f=%x urcu_active_readers&0x7f=%x\n", urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK, urcu_active_readers & ~RCU_GP_CTR_NEST_MASK);
> > > 		if
> > > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > 			skip;
> > > 		:: else -> break;
> > > 		fi;
> > > 	od;
> > > 	need_mb = 1;
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od;
> > > 
> > > 	/* free-up step, e.g., kfree(). */
> > > 	free = 1;
> > > }
> > > 
> > > /*
> > >  * Initialize the array, spawn a reader and an updater.  Because readers
> > >  * are independent of each other, only one reader is needed.
> > >  */
> > > 
> > > init {
> > > 	atomic {
> > > 		reader_progress[0] = 0;
> > > 		reader_progress[1] = 0;
> > > 		reader_progress[2] = 0;
> > > 		reader_progress[3] = 0;
> > > 		run urcu_reader();
> > > 		run urcu_updater();
> > > 	}
> > > }
> > 
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12  5:47                                                   ` Mathieu Desnoyers
@ 2009-02-12 16:18                                                     ` Paul E. McKenney
  2009-02-12 18:40                                                       ` Mathieu Desnoyers
  0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-12 16:18 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4658 bytes --]

On Thu, Feb 12, 2009 at 12:47:07AM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Wed, Feb 11, 2009 at 11:10:44PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Wed, Feb 11, 2009 at 06:33:08PM -0800, Paul E. McKenney wrote:
> > > > > On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> > > > > > On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > 
> > > > [ . . . ]
> > > > 
> > > > > > > > (BTW, I do not trust my model yet, as it currently cannot detect the
> > > > > > > > failure case I pointed out earlier.  :-/  Here and I thought that the
> > > > > > > > point of such models was to detect additional failure cases!!!)
> > > > > > > > 
> > > > > > > 
> > > > > > > Yes, I'll have to dig deeper into it.
> > > > > > 
> > > > > > Well, as I said, I attached the current model and the error trail.
> > > > > 
> > > > > And I had bugs in my model that allowed the rcu_read_lock() model
> > > > > to nest indefinitely, which overflowed into the top bit, messing
> > > > > things up.  :-/
> > > > > 
> > > > > Attached is a fixed model.  This model validates correctly (woo-hoo!).
> > > > > Even better, gives the expected error if you comment out line 180 and
> > > > > uncomment line 213, this latter corresponding to the error case I called
> > > > > out a few days ago.
> > > > > 
> > > > > I will play with removing models of mb...
> > > > 
> > > > And commenting out the models of mb between the counter flips and the
> > > > test for readers still passes validation, as expected, and as shown in
> > > > the attached Promela code.
> > > > 
> > > 
> > > Hrm, in the email I sent you about the memory barrier, I said that it
> > > would not make the algorithm incorrect, but that it would cause
> > > situations where it would be impossible for the writer to do any
> > > progress as long as there are readers active. I think we would have to
> > > enhance the model or at least express this through some LTL statement to
> > > validate this specific behavior.
> > 
> > But if the writer fails to make progress, then the counter remains at a
> > given value, which causes readers to drain, which allows the writer to
> > eventually make progress again.  Right?
> > 
> 
> Not necessarily. If we don't have the proper memory barriers, we can
> have the writer waiting on, say, parity 0 *before* it has performed the
> parity switch. Therefore, even newly coming readers will add up to
> parity 0.

But the write that changes the parity will eventually make it out.
OK, so your argument is that we at least need a compiler barrier?

Regardless, please see attached for a modified version of the Promela
model that fully models omitting out the memory barrier that my
rcu_nest32.[hc] implementation omits.  (It is possible to partially
model removal of other memory barriers via #if 0, but to fully model
would need to enumerate the permutations as shown on lines 231-257.)

> In your model, this is not detected, because eventually all readers will
> execute, and only then the writer will be able to update the data. But
> in reality, if we run a very busy 4096-cores machines where there is
> always at least one reader active, the the writer will be stuck forever,
> and that's really bad.

Assuming that the reordering is done by the CPU, the write will
eventually get out -- it is stuck in (say) the store buffer, and the
cache line will eventually arrive, and then the value will eventually
be seen by the readers.

We might need a -compiler- barrier, but then again, I am not sure that
we are talking about the same memory barrier -- again, please see
attached lines 231-257 to see which one that I eliminated.

Also, the original model I sent out has a minor bug that prevents it
from fully modeling the nested-read-side case.  The patch below fixes this.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 urcu.spin |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/formal-model/urcu.spin b/formal-model/urcu.spin
index e5bfff3..611464b 100644
--- a/formal-model/urcu.spin
+++ b/formal-model/urcu.spin
@@ -124,9 +124,13 @@ proctype urcu_reader()
 				break;
 			:: tmp < 4 && reader_progress[tmp] != 0 ->
 				tmp = tmp + 1;
-			:: tmp >= 4 ->
+			:: tmp >= 4 &&
+			   reader_progress[0] == reader_progress[3] ->
 				done = 1;
 				break;
+			:: tmp >= 4 &&
+			   reader_progress[0] != reader_progress[3] ->
+			   	break;
 			od;
 			do
 			:: tmp < 4 && reader_progress[tmp] == 0 ->

[-- Attachment #2: urcu_mbmin.spin --]
[-- Type: text/plain, Size: 7514 bytes --]

/*
 * urcu_mbmin.spin: Promela code to validate urcu.  See commit number
 *	3a9e6e9df706b8d39af94d2f027210e2e7d4106e of Mathieu Desnoyer's
 *      git archive at git://lttng.org/userspace-rcu.git, but with
 *	memory barriers removed.
 *
 * This program is free software; you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation; either version 2 of the License, or
 * (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, write to the Free Software
 * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
 *
 * Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
 */

/* Promela validation variables. */

bit removed = 0;  /* Has RCU removal happened, e.g., list_del_rcu()? */
bit free = 0;     /* Has RCU reclamation happened, e.g., kfree()? */
bit need_mb = 0;  /* =1 says need reader mb, =0 for reader response. */
byte reader_progress[4];
		  /* Count of read-side statement executions. */

/* urcu definitions and variables, taken straight from the algorithm. */

#define RCU_GP_CTR_BIT (1 << 7)
#define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BIT - 1)

byte urcu_gp_ctr = 1;
byte urcu_active_readers = 0;

/* Model the RCU read-side critical section. */

proctype urcu_reader()
{
	bit done = 0;
	bit mbok;
	byte tmp;
	byte tmp_removed;
	byte tmp_free;

	/* Absorb any early requests for memory barriers. */
	do
	:: need_mb == 1 ->
		need_mb = 0;
	:: 1 -> skip;
	:: 1 -> break;
	od;

	/*
	 * Each pass through this loop executes one read-side statement
	 * from the following code fragment:
	 *
	 *	rcu_read_lock(); [0a]
	 *	rcu_read_lock(); [0b]
	 *	p = rcu_dereference(global_p); [1]
	 *	x = p->data; [2]
	 *	rcu_read_unlock(); [3b]
	 *	rcu_read_unlock(); [3a]
	 *
	 * Because we are modeling a weak-memory machine, these statements
	 * can be seen in any order, the only restriction being that
	 * rcu_read_unlock() cannot precede the corresponding rcu_read_lock().
	 * The placement of the inner rcu_read_lock() and rcu_read_unlock()
	 * is non-deterministic, the above is but one possible placement.
	 * Intestingly enough, this model validates all possible placements
	 * of the inner rcu_read_lock() and rcu_read_unlock() statements,
	 * with the only constraint being that the rcu_read_lock() must
	 * precede the rcu_read_unlock().
	 *
	 * We also respond to memory-barrier requests, but only if our
	 * execution happens to be ordered.  If the current state is
	 * misordered, we ignore memory-barrier requests.
	 */
	do
	:: 1 ->
		if
		:: reader_progress[0] < 2 -> /* [0a and 0b] */
			tmp = urcu_active_readers;
			if
			:: (tmp & RCU_GP_CTR_NEST_MASK) == 0 ->
				tmp = urcu_gp_ctr;
				do
				:: (reader_progress[1] +
				    reader_progress[2] +
				    reader_progress[3] == 0) && need_mb == 1 ->
					need_mb = 0;
				:: 1 -> skip;
				:: 1 -> break;
				od;
				urcu_active_readers = tmp;
			 :: else ->
				urcu_active_readers = tmp + 1;
			fi;
			reader_progress[0] = reader_progress[0] + 1;
		:: reader_progress[1] == 0 -> /* [1] */
			tmp_removed = removed;
			reader_progress[1] = 1;
		:: reader_progress[2] == 0 -> /* [2] */
			tmp_free = free;
			reader_progress[2] = 1;
		:: ((reader_progress[0] > reader_progress[3]) &&
		    (reader_progress[3] < 2)) -> /* [3a and 3b] */
			tmp = urcu_active_readers - 1;
			urcu_active_readers = tmp;
			reader_progress[3] = reader_progress[3] + 1;
		:: else -> break;
		fi;

		/* Process memory-barrier requests, if it is safe to do so. */
		atomic {
			mbok = 0;
			tmp = 0;
			do
			:: tmp < 4 && reader_progress[tmp] == 0 ->
				tmp = tmp + 1;
				break;
			:: tmp < 4 && reader_progress[tmp] != 0 ->
				tmp = tmp + 1;
			:: tmp >= 4 &&
			   reader_progress[0] == reader_progress[3] ->
				done = 1;
				break;
			:: tmp >= 4 &&
			   reader_progress[0] != reader_progress[3] ->
			   	break;
			od;
			do
			:: tmp < 4 && reader_progress[tmp] == 0 ->
				tmp = tmp + 1;
			:: tmp < 4 && reader_progress[tmp] != 0 ->
				break;
			:: tmp >= 4 ->
				mbok = 1;
				break;
			od

		}

		if
		:: mbok == 1 ->
			/* We get here if mb processing is safe. */
			do
			:: need_mb == 1 ->
				need_mb = 0;
			:: 1 -> skip;
			:: 1 -> break;
			od;
		:: else -> skip;
		fi;

		/*
		 * Check to see if we have modeled the entire RCU read-side
		 * critical section, and leave if so.
		 */
		if
		:: done == 1 -> break;
		:: else -> skip;
		fi
	od;
	assert((tmp_free == 0) || (tmp_removed == 1));

	/* Process any late-arriving memory-barrier requests. */
	do
	:: need_mb == 1 ->
		need_mb = 0;
	:: 1 -> skip;
	:: 1 -> break;
	od;
}

/* Model the RCU update process. */

proctype urcu_updater()
{
	byte tmp;

	/* prior synchronize_rcu(), second counter flip. */
	need_mb = 1; /* mb() A */
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;
	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
	need_mb = 1; /* mb() B */
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;
	do
	:: 1 ->
		if
		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
			skip;
		:: else -> break;
		fi
	od;
	need_mb = 1; /* mb() C absolutely required by analogy with G */
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;

	/* Removal statement, e.g., list_del_rcu(). */
	removed = 1;

	/* current synchronize_rcu(), first counter flip. */
	need_mb = 1; /* mb() D suggested */
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;
	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
	need_mb = 1;  /* mb() E required if D not present */
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;

	/* current synchronize_rcu(), first-flip check plus second flip. */
	if
	:: 1 ->
		do
		:: 1 ->
			if
			:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
			   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
			   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
				skip;
			:: else -> break;
			fi;
		od;
		urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
	:: 1 ->
		tmp = urcu_gp_ctr;
		urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
		do
		:: 1 ->
			if
			:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
			   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
			   (tmp & ~RCU_GP_CTR_NEST_MASK) ->
				skip;
			:: else -> break;
			fi;
		od;
	fi;

	/* current synchronize_rcu(), second counter flip check. */
	need_mb = 1; /* mb() F not required */
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;
	do
	:: 1 ->
		if
		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
			skip;
		:: else -> break;
		fi;
	od;
	need_mb = 1; /* mb() G absolutely required */
	do
	:: need_mb == 1 -> skip;
	:: need_mb == 0 -> break;
	od;

	/* free-up step, e.g., kfree(). */
	free = 1;
}

/*
 * Initialize the array, spawn a reader and an updater.  Because readers
 * are independent of each other, only one reader is needed.
 */

init {
	atomic {
		reader_progress[0] = 0;
		reader_progress[1] = 0;
		reader_progress[2] = 0;
		reader_progress[3] = 0;
		run urcu_reader();
		run urcu_updater();
	}
}

[-- Attachment #3: urcu_mbmin.sh --]
[-- Type: application/x-sh, Size: 59 bytes --]

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12  7:05                                                 ` Mathieu Desnoyers
@ 2009-02-12 16:46                                                   ` Paul E. McKenney
  2009-02-12 19:29                                                     ` Mathieu Desnoyers
  2009-02-12 19:38                                                     ` Mathieu Desnoyers
  0 siblings, 2 replies; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-12 16:46 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

On Thu, Feb 12, 2009 at 02:05:39AM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Wed, Feb 11, 2009 at 11:08:24PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> > > > > On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > 
> > [ . . . ]
> > 
> > > > And I had bugs in my model that allowed the rcu_read_lock() model
> > > > to nest indefinitely, which overflowed into the top bit, messing
> > > > things up.  :-/
> > > > 
> > > > Attached is a fixed model.  This model validates correctly (woo-hoo!).
> > > > Even better, gives the expected error if you comment out line 180 and
> > > > uncomment line 213, this latter corresponding to the error case I called
> > > > out a few days ago.
> > > > 
> > > 
> > > Great ! :) I added this version to the git repository, hopefully it's ok
> > > with you ?
> > 
> > Works for me!
> > 
> > > > I will play with removing models of mb...
> > > 
> > > OK, I see you already did..
> > 
> > I continued this, and surprisingly few are actually required, though
> > I don't fully trust the modeling of removed memory barriers.
> 
> On my side I cleaned up the code a lot, and actually added some barriers
> ;) Especially in the busy loops, where we expect the other thread's
> value to change eventually between iterations. A smp_rmb() seems more
> appropriate that barrier(). I also added a lot of comments about
> barriers in the code, and made the reader side much easier to review.
> 
> Please feel free to comment on my added code comments.

The torture test now looks much more familiar.  ;-)

I fixed some compiler warnings (in my original, sad to say), added an
ACCESS_ONCE() to rcu_read_lock() (also in my original), and downgraded
a few of your memory barriers with comments as to why.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 rcutorture.h |   11 +++++------
 urcu.c       |   12 ++++++++----
 urcu.h       |    2 +-
 3 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/rcutorture.h b/rcutorture.h
index bda2ad5..8ba6763 100644
--- a/rcutorture.h
+++ b/rcutorture.h
@@ -112,7 +112,6 @@ void *rcu_read_perf_test(void *arg)
 {
 	int i;
 	int me = (long)arg;
-	cpu_set_t mask;
 	long long n_reads_local = 0;
 
 	urcu_register_thread();
@@ -150,6 +149,7 @@ void *rcu_update_perf_test(void *arg)
 		n_updates_local++;
 	}
 	__get_thread_var(n_updates_pt) += n_updates_local;
+	return NULL;
 }
 
 void perftestinit(void)
@@ -242,7 +242,7 @@ struct rcu_stress {
 	int mbtest;
 };
 
-struct rcu_stress rcu_stress_array[RCU_STRESS_PIPE_LEN] = { 0 };
+struct rcu_stress rcu_stress_array[RCU_STRESS_PIPE_LEN] = { { 0 } };
 struct rcu_stress *rcu_stress_current;
 int rcu_stress_idx = 0;
 
@@ -314,19 +314,18 @@ void *rcu_update_stress_test(void *arg)
 		synchronize_rcu();
 		n_updates++;
 	}
+	return NULL;
 }
 
 void *rcu_fake_update_stress_test(void *arg)
 {
-	int i;
-	struct rcu_stress *p;
-
 	while (goflag == GOFLAG_INIT)
 		poll(NULL, 0, 1);
 	while (goflag == GOFLAG_RUN) {
 		synchronize_rcu();
 		poll(NULL, 0, 1);
 	}
+	return NULL;
 }
 
 void stresstest(int nreaders)
@@ -360,7 +359,7 @@ void stresstest(int nreaders)
 	wait_all_threads();
 	for_each_thread(t)
 		n_reads += per_thread(n_reads_pt, t);
-	printf("n_reads: %lld  n_updates: %ld  n_mberror: %ld\n",
+	printf("n_reads: %lld  n_updates: %ld  n_mberror: %d\n",
 	       n_reads, n_updates, n_mberror);
 	printf("rcu_stress_count:");
 	for (i = 0; i <= RCU_STRESS_PIPE_LEN; i++) {
diff --git a/urcu.c b/urcu.c
index f2aae34..a696439 100644
--- a/urcu.c
+++ b/urcu.c
@@ -99,7 +99,8 @@ static void force_mb_single_thread(pthread_t tid)
 	 * BUSY-LOOP.
 	 */
 	while (sig_done < 1)
-		smp_rmb();	/* ensure we re-read sig-done */
+		barrier();	/* ensure compiler re-reads sig-done */
+				/* cache coherence guarantees CPU re-read. */
 	smp_mb();	/* read sig_done before ending the barrier */
 }
 
@@ -113,7 +114,8 @@ static void force_mb_all_threads(void)
 	if (!reader_data)
 		return;
 	sig_done = 0;
-	smp_mb();	/* write sig_done before sending the signals */
+	/* smp_mb();	write sig_done before sending the signals */
+			/* redundant with barriers in pthread_kill(). */
 	for (index = reader_data; index < reader_data + num_readers; index++)
 		pthread_kill(index->tid, SIGURCU);
 	/*
@@ -121,7 +123,8 @@ static void force_mb_all_threads(void)
 	 * BUSY-LOOP.
 	 */
 	while (sig_done < num_readers)
-		smp_rmb();	/* ensure we re-read sig-done */
+		barrier();	/* ensure compiler re-reads sig-done */
+				/* cache coherence guarantees CPU re-read. */
 	smp_mb();	/* read sig_done before ending the barrier */
 }
 #endif
@@ -181,7 +184,8 @@ void synchronize_rcu(void)
 	 * the writer waiting forever while new readers are always accessing
 	 * data (no progress).
 	 */
-	smp_mb();
+	/* smp_mb(); Don't need this one for CPU, only compiler. */
+	barrier();
 
 	switch_next_urcu_qparity();	/* 1 -> 0 */
 
diff --git a/urcu.h b/urcu.h
index 3eca5ea..79d9464 100644
--- a/urcu.h
+++ b/urcu.h
@@ -244,7 +244,7 @@ static inline void rcu_read_lock(void)
 	/* The data dependency "read urcu_gp_ctr, write urcu_active_readers",
 	 * serializes those two memory operations. */
 	if (likely(!(tmp & RCU_GP_CTR_NEST_MASK)))
-		urcu_active_readers = urcu_gp_ctr;
+		urcu_active_readers = ACCESS_ONCE(urcu_gp_ctr);
 	else
 		urcu_active_readers = tmp + RCU_GP_COUNT;
 	/*

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 16:18                                                     ` Paul E. McKenney
@ 2009-02-12 18:40                                                       ` Mathieu Desnoyers
  2009-02-12 20:28                                                         ` Paul E. McKenney
  0 siblings, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-12 18:40 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Thu, Feb 12, 2009 at 12:47:07AM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Wed, Feb 11, 2009 at 11:10:44PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Wed, Feb 11, 2009 at 06:33:08PM -0800, Paul E. McKenney wrote:
> > > > > > On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> > > > > > > On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > 
> > > > > [ . . . ]
> > > > > 
> > > > > > > > > (BTW, I do not trust my model yet, as it currently cannot detect the
> > > > > > > > > failure case I pointed out earlier.  :-/  Here and I thought that the
> > > > > > > > > point of such models was to detect additional failure cases!!!)
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Yes, I'll have to dig deeper into it.
> > > > > > > 
> > > > > > > Well, as I said, I attached the current model and the error trail.
> > > > > > 
> > > > > > And I had bugs in my model that allowed the rcu_read_lock() model
> > > > > > to nest indefinitely, which overflowed into the top bit, messing
> > > > > > things up.  :-/
> > > > > > 
> > > > > > Attached is a fixed model.  This model validates correctly (woo-hoo!).
> > > > > > Even better, gives the expected error if you comment out line 180 and
> > > > > > uncomment line 213, this latter corresponding to the error case I called
> > > > > > out a few days ago.
> > > > > > 
> > > > > > I will play with removing models of mb...
> > > > > 
> > > > > And commenting out the models of mb between the counter flips and the
> > > > > test for readers still passes validation, as expected, and as shown in
> > > > > the attached Promela code.
> > > > > 
> > > > 
> > > > Hrm, in the email I sent you about the memory barrier, I said that it
> > > > would not make the algorithm incorrect, but that it would cause
> > > > situations where it would be impossible for the writer to do any
> > > > progress as long as there are readers active. I think we would have to
> > > > enhance the model or at least express this through some LTL statement to
> > > > validate this specific behavior.
> > > 
> > > But if the writer fails to make progress, then the counter remains at a
> > > given value, which causes readers to drain, which allows the writer to
> > > eventually make progress again.  Right?
> > > 
> > 
> > Not necessarily. If we don't have the proper memory barriers, we can
> > have the writer waiting on, say, parity 0 *before* it has performed the
> > parity switch. Therefore, even newly coming readers will add up to
> > parity 0.
> 
> But the write that changes the parity will eventually make it out.
> OK, so your argument is that we at least need a compiler barrier?
> 

It all depends on the assumptions we make. I am currently trying to
assume the most aggressive memory ordering I can think of. The model I
think about to represent it is that memory reads/writes are kept local
to the CPU until a memory barrier is encountered. I doubt it exists in
practice, bacause the CPU will eventually have to commit the information
to memory (hrm, are sure about this ?), but if we use that as a starting
point, I think this would cover the entire spectrum of possible memory
barriers issues. Also, it would be easy to verify formally. But maybe am
I going too far ?

> Regardless, please see attached for a modified version of the Promela
> model that fully models omitting out the memory barrier that my
> rcu_nest32.[hc] implementation omits.  (It is possible to partially
> model removal of other memory barriers via #if 0, but to fully model
> would need to enumerate the permutations as shown on lines 231-257.)
> 
> > In your model, this is not detected, because eventually all readers will
> > execute, and only then the writer will be able to update the data. But
> > in reality, if we run a very busy 4096-cores machines where there is
> > always at least one reader active, the the writer will be stuck forever,
> > and that's really bad.
> 
> Assuming that the reordering is done by the CPU, the write will
> eventually get out -- it is stuck in (say) the store buffer, and the
> cache line will eventually arrive, and then the value will eventually
> be seen by the readers.

Do we have guarantees that the data *will necessarily* get out of the
cpu write buffer at some point ?

> 
> We might need a -compiler- barrier, but then again, I am not sure that
> we are talking about the same memory barrier -- again, please see
> attached lines 231-257 to see which one that I eliminated.
> 

As long as we don't have "progress" validation to check our model, the
fact that it passes the current test does not tell much.

> Also, the original model I sent out has a minor bug that prevents it
> from fully modeling the nested-read-side case.  The patch below fixes this.
> 

Ok, merging the fix, thanks,

Mathieu

> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
> 
>  urcu.spin |    6 +++++-
>  1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/formal-model/urcu.spin b/formal-model/urcu.spin
> index e5bfff3..611464b 100644
> --- a/formal-model/urcu.spin
> +++ b/formal-model/urcu.spin
> @@ -124,9 +124,13 @@ proctype urcu_reader()
>  				break;
>  			:: tmp < 4 && reader_progress[tmp] != 0 ->
>  				tmp = tmp + 1;
> -			:: tmp >= 4 ->
> +			:: tmp >= 4 &&
> +			   reader_progress[0] == reader_progress[3] ->
>  				done = 1;
>  				break;
> +			:: tmp >= 4 &&
> +			   reader_progress[0] != reader_progress[3] ->
> +			   	break;
>  			od;
>  			do
>  			:: tmp < 4 && reader_progress[tmp] == 0 ->

Content-Description: urcu_mbmin.spin
> /*
>  * urcu_mbmin.spin: Promela code to validate urcu.  See commit number
>  *	3a9e6e9df706b8d39af94d2f027210e2e7d4106e of Mathieu Desnoyer's
>  *      git archive at git://lttng.org/userspace-rcu.git, but with
>  *	memory barriers removed.
>  *
>  * This program is free software; you can redistribute it and/or modify
>  * it under the terms of the GNU General Public License as published by
>  * the Free Software Foundation; either version 2 of the License, or
>  * (at your option) any later version.
>  *
>  * This program is distributed in the hope that it will be useful,
>  * but WITHOUT ANY WARRANTY; without even the implied warranty of
>  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
>  * GNU General Public License for more details.
>  *
>  * You should have received a copy of the GNU General Public License
>  * along with this program; if not, write to the Free Software
>  * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
>  *
>  * Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
>  */
> 
> /* Promela validation variables. */
> 
> bit removed = 0;  /* Has RCU removal happened, e.g., list_del_rcu()? */
> bit free = 0;     /* Has RCU reclamation happened, e.g., kfree()? */
> bit need_mb = 0;  /* =1 says need reader mb, =0 for reader response. */
> byte reader_progress[4];
> 		  /* Count of read-side statement executions. */
> 
> /* urcu definitions and variables, taken straight from the algorithm. */
> 
> #define RCU_GP_CTR_BIT (1 << 7)
> #define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BIT - 1)
> 
> byte urcu_gp_ctr = 1;
> byte urcu_active_readers = 0;
> 
> /* Model the RCU read-side critical section. */
> 
> proctype urcu_reader()
> {
> 	bit done = 0;
> 	bit mbok;
> 	byte tmp;
> 	byte tmp_removed;
> 	byte tmp_free;
> 
> 	/* Absorb any early requests for memory barriers. */
> 	do
> 	:: need_mb == 1 ->
> 		need_mb = 0;
> 	:: 1 -> skip;
> 	:: 1 -> break;
> 	od;
> 
> 	/*
> 	 * Each pass through this loop executes one read-side statement
> 	 * from the following code fragment:
> 	 *
> 	 *	rcu_read_lock(); [0a]
> 	 *	rcu_read_lock(); [0b]
> 	 *	p = rcu_dereference(global_p); [1]
> 	 *	x = p->data; [2]
> 	 *	rcu_read_unlock(); [3b]
> 	 *	rcu_read_unlock(); [3a]
> 	 *
> 	 * Because we are modeling a weak-memory machine, these statements
> 	 * can be seen in any order, the only restriction being that
> 	 * rcu_read_unlock() cannot precede the corresponding rcu_read_lock().
> 	 * The placement of the inner rcu_read_lock() and rcu_read_unlock()
> 	 * is non-deterministic, the above is but one possible placement.
> 	 * Intestingly enough, this model validates all possible placements
> 	 * of the inner rcu_read_lock() and rcu_read_unlock() statements,
> 	 * with the only constraint being that the rcu_read_lock() must
> 	 * precede the rcu_read_unlock().
> 	 *
> 	 * We also respond to memory-barrier requests, but only if our
> 	 * execution happens to be ordered.  If the current state is
> 	 * misordered, we ignore memory-barrier requests.
> 	 */
> 	do
> 	:: 1 ->
> 		if
> 		:: reader_progress[0] < 2 -> /* [0a and 0b] */
> 			tmp = urcu_active_readers;
> 			if
> 			:: (tmp & RCU_GP_CTR_NEST_MASK) == 0 ->
> 				tmp = urcu_gp_ctr;
> 				do
> 				:: (reader_progress[1] +
> 				    reader_progress[2] +
> 				    reader_progress[3] == 0) && need_mb == 1 ->
> 					need_mb = 0;
> 				:: 1 -> skip;
> 				:: 1 -> break;
> 				od;
> 				urcu_active_readers = tmp;
> 			 :: else ->
> 				urcu_active_readers = tmp + 1;
> 			fi;
> 			reader_progress[0] = reader_progress[0] + 1;
> 		:: reader_progress[1] == 0 -> /* [1] */
> 			tmp_removed = removed;
> 			reader_progress[1] = 1;
> 		:: reader_progress[2] == 0 -> /* [2] */
> 			tmp_free = free;
> 			reader_progress[2] = 1;
> 		:: ((reader_progress[0] > reader_progress[3]) &&
> 		    (reader_progress[3] < 2)) -> /* [3a and 3b] */
> 			tmp = urcu_active_readers - 1;
> 			urcu_active_readers = tmp;
> 			reader_progress[3] = reader_progress[3] + 1;
> 		:: else -> break;
> 		fi;
> 
> 		/* Process memory-barrier requests, if it is safe to do so. */
> 		atomic {
> 			mbok = 0;
> 			tmp = 0;
> 			do
> 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> 				tmp = tmp + 1;
> 				break;
> 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> 				tmp = tmp + 1;
> 			:: tmp >= 4 &&
> 			   reader_progress[0] == reader_progress[3] ->
> 				done = 1;
> 				break;
> 			:: tmp >= 4 &&
> 			   reader_progress[0] != reader_progress[3] ->
> 			   	break;
> 			od;
> 			do
> 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> 				tmp = tmp + 1;
> 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> 				break;
> 			:: tmp >= 4 ->
> 				mbok = 1;
> 				break;
> 			od
> 
> 		}
> 
> 		if
> 		:: mbok == 1 ->
> 			/* We get here if mb processing is safe. */
> 			do
> 			:: need_mb == 1 ->
> 				need_mb = 0;
> 			:: 1 -> skip;
> 			:: 1 -> break;
> 			od;
> 		:: else -> skip;
> 		fi;
> 
> 		/*
> 		 * Check to see if we have modeled the entire RCU read-side
> 		 * critical section, and leave if so.
> 		 */
> 		if
> 		:: done == 1 -> break;
> 		:: else -> skip;
> 		fi
> 	od;
> 	assert((tmp_free == 0) || (tmp_removed == 1));
> 
> 	/* Process any late-arriving memory-barrier requests. */
> 	do
> 	:: need_mb == 1 ->
> 		need_mb = 0;
> 	:: 1 -> skip;
> 	:: 1 -> break;
> 	od;
> }
> 
> /* Model the RCU update process. */
> 
> proctype urcu_updater()
> {
> 	byte tmp;
> 
> 	/* prior synchronize_rcu(), second counter flip. */
> 	need_mb = 1; /* mb() A */
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od;
> 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> 	need_mb = 1; /* mb() B */
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od;
> 	do
> 	:: 1 ->
> 		if
> 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> 			skip;
> 		:: else -> break;
> 		fi
> 	od;
> 	need_mb = 1; /* mb() C absolutely required by analogy with G */
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od;
> 
> 	/* Removal statement, e.g., list_del_rcu(). */
> 	removed = 1;
> 
> 	/* current synchronize_rcu(), first counter flip. */
> 	need_mb = 1; /* mb() D suggested */
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od;
> 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> 	need_mb = 1;  /* mb() E required if D not present */
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od;
> 
> 	/* current synchronize_rcu(), first-flip check plus second flip. */
> 	if
> 	:: 1 ->
> 		do
> 		:: 1 ->
> 			if
> 			:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> 			   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> 			   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> 				skip;
> 			:: else -> break;
> 			fi;
> 		od;
> 		urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> 	:: 1 ->
> 		tmp = urcu_gp_ctr;
> 		urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> 		do
> 		:: 1 ->
> 			if
> 			:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> 			   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> 			   (tmp & ~RCU_GP_CTR_NEST_MASK) ->
> 				skip;
> 			:: else -> break;
> 			fi;
> 		od;
> 	fi;
> 
> 	/* current synchronize_rcu(), second counter flip check. */
> 	need_mb = 1; /* mb() F not required */
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od;
> 	do
> 	:: 1 ->
> 		if
> 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> 			skip;
> 		:: else -> break;
> 		fi;
> 	od;
> 	need_mb = 1; /* mb() G absolutely required */
> 	do
> 	:: need_mb == 1 -> skip;
> 	:: need_mb == 0 -> break;
> 	od;
> 
> 	/* free-up step, e.g., kfree(). */
> 	free = 1;
> }
> 
> /*
>  * Initialize the array, spawn a reader and an updater.  Because readers
>  * are independent of each other, only one reader is needed.
>  */
> 
> init {
> 	atomic {
> 		reader_progress[0] = 0;
> 		reader_progress[1] = 0;
> 		reader_progress[2] = 0;
> 		reader_progress[3] = 0;
> 		run urcu_reader();
> 		run urcu_updater();
> 	}
> }



-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 16:46                                                   ` Paul E. McKenney
@ 2009-02-12 19:29                                                     ` Mathieu Desnoyers
  2009-02-12 20:02                                                       ` Paul E. McKenney
  2009-02-12 19:38                                                     ` Mathieu Desnoyers
  1 sibling, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-12 19:29 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: ltt-dev, linux-kernel, Linus Torvalds, Bryan Wu, uclinux-dist-devel


* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
[...]
> diff --git a/urcu.c b/urcu.c
> index f2aae34..a696439 100644
> --- a/urcu.c
> +++ b/urcu.c
> @@ -99,7 +99,8 @@ static void force_mb_single_thread(pthread_t tid)
>  	 * BUSY-LOOP.
>  	 */
>  	while (sig_done < 1)
> -		smp_rmb();	/* ensure we re-read sig-done */
> +		barrier();	/* ensure compiler re-reads sig-done */
> +				/* cache coherence guarantees CPU re-read. */

OK, this is where I think our points of view differ. Please refer to
http://lkml.org/lkml/2007/6/18/299.

Basically, cpu_relax() used in the Linux kernel has an
architecture-specific implementation which *could* include a smp_rmb()
if the architecture doesn't notice writes done by other CPUs. I think
Blackfin is the only architecture currently supported by the Linux
kernel which defines cpu_relax() as a smp_mb(), because it does not have
cache coherency.

Therefore, I propose that we create a memory barrier macro which is
defined as a 
  barrier()   when the cpu has cache coherency
  cache flush when the cpu does not have cache coherency and is
              compiled with smp support.

We could call that

  smp_wmc() (for memory-coherency or memory commit)
  smp_rmc()
  smp_mc()

It would be a good way to identify the location where data exchange
between memory and the local cache are is required in the algorithm.
What do you think ?

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 16:46                                                   ` Paul E. McKenney
  2009-02-12 19:29                                                     ` Mathieu Desnoyers
@ 2009-02-12 19:38                                                     ` Mathieu Desnoyers
  2009-02-12 20:17                                                       ` Paul E. McKenney
  1 sibling, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-12 19:38 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

Replying to a separate portion of the mail with less CC :


> On Thu, Feb 12, 2009 at 02:05:39AM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Wed, Feb 11, 2009 at 11:08:24PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> > > > > > On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > 
> > > [ . . . ]
> > > 
> > > > > And I had bugs in my model that allowed the rcu_read_lock() model
> > > > > to nest indefinitely, which overflowed into the top bit, messing
> > > > > things up.  :-/
> > > > > 
> > > > > Attached is a fixed model.  This model validates correctly (woo-hoo!).
> > > > > Even better, gives the expected error if you comment out line 180 and
> > > > > uncomment line 213, this latter corresponding to the error case I called
> > > > > out a few days ago.
> > > > > 
> > > > 
> > > > Great ! :) I added this version to the git repository, hopefully it's ok
> > > > with you ?
> > > 
> > > Works for me!
> > > 
> > > > > I will play with removing models of mb...
> > > > 
> > > > OK, I see you already did..
> > > 
> > > I continued this, and surprisingly few are actually required, though
> > > I don't fully trust the modeling of removed memory barriers.
> > 
> > On my side I cleaned up the code a lot, and actually added some barriers
> > ;) Especially in the busy loops, where we expect the other thread's
> > value to change eventually between iterations. A smp_rmb() seems more
> > appropriate that barrier(). I also added a lot of comments about
> > barriers in the code, and made the reader side much easier to review.
> > 
> > Please feel free to comment on my added code comments.
> 
> The torture test now looks much more familiar.  ;-)
> 
> I fixed some compiler warnings (in my original, sad to say), added an
> ACCESS_ONCE() to rcu_read_lock() (also in my original),

Yes, I thought about this ACCESS_ONCE during my sleep.. just did not
have to to update the source yet. :)

Merged. Thanks !

[...]

> --- a/urcu.c
> +++ b/urcu.c
> @@ -99,7 +99,8 @@ static void force_mb_single_thread(pthread_t tid)
>  	 * BUSY-LOOP.
>  	 */
>  	while (sig_done < 1)
> -		smp_rmb();	/* ensure we re-read sig-done */
> +		barrier();	/* ensure compiler re-reads sig-done */
> +				/* cache coherence guarantees CPU re-read. */

That could be a smp_rmc() ? (see other mail)

>  	smp_mb();	/* read sig_done before ending the barrier */
>  }
>  
> @@ -113,7 +114,8 @@ static void force_mb_all_threads(void)
>  	if (!reader_data)
>  		return;
>  	sig_done = 0;
> -	smp_mb();	/* write sig_done before sending the signals */
> +	/* smp_mb();	write sig_done before sending the signals */
> +			/* redundant with barriers in pthread_kill(). */

Absolutely not. pthread_kill does not send a signal to self in every
case because the writer thread has not requirement to register itself.
It *could* be registered as a reader too, but does not have to.

>  	for (index = reader_data; index < reader_data + num_readers; index++)
>  		pthread_kill(index->tid, SIGURCU);
>  	/*
> @@ -121,7 +123,8 @@ static void force_mb_all_threads(void)
>  	 * BUSY-LOOP.
>  	 */
>  	while (sig_done < num_readers)
> -		smp_rmb();	/* ensure we re-read sig-done */
> +		barrier();	/* ensure compiler re-reads sig-done */
> +				/* cache coherence guarantees CPU re-read. */

That could be a smp_rmc() ?

>  	smp_mb();	/* read sig_done before ending the barrier */
>  }
>  #endif
> @@ -181,7 +184,8 @@ void synchronize_rcu(void)
>  	 * the writer waiting forever while new readers are always accessing
>  	 * data (no progress).
>  	 */
> -	smp_mb();
> +	/* smp_mb(); Don't need this one for CPU, only compiler. */
> +	barrier();

smp_mc() ?

>  
>  	switch_next_urcu_qparity();	/* 1 -> 0 */
>  

Side-note :
on archs without cache coherency, all smp_[rw ]mb would turn into a
cache flush.

Mathieu


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 19:29                                                     ` Mathieu Desnoyers
@ 2009-02-12 20:02                                                       ` Paul E. McKenney
  2009-02-12 20:09                                                         ` Mathieu Desnoyers
  2009-02-12 20:13                                                         ` Linus Torvalds
  0 siblings, 2 replies; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-12 20:02 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: ltt-dev, linux-kernel, Linus Torvalds, Bryan Wu, uclinux-dist-devel

On Thu, Feb 12, 2009 at 02:29:41PM -0500, Mathieu Desnoyers wrote:
> 
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> [...]
> > diff --git a/urcu.c b/urcu.c
> > index f2aae34..a696439 100644
> > --- a/urcu.c
> > +++ b/urcu.c
> > @@ -99,7 +99,8 @@ static void force_mb_single_thread(pthread_t tid)
> >  	 * BUSY-LOOP.
> >  	 */
> >  	while (sig_done < 1)
> > -		smp_rmb();	/* ensure we re-read sig-done */
> > +		barrier();	/* ensure compiler re-reads sig-done */
> > +				/* cache coherence guarantees CPU re-read. */
> 
> OK, this is where I think our points of view differ. Please refer to
> http://lkml.org/lkml/2007/6/18/299.
> 
> Basically, cpu_relax() used in the Linux kernel has an
> architecture-specific implementation which *could* include a smp_rmb()
> if the architecture doesn't notice writes done by other CPUs. I think
> Blackfin is the only architecture currently supported by the Linux
> kernel which defines cpu_relax() as a smp_mb(), because it does not have
> cache coherency.
> 
> Therefore, I propose that we create a memory barrier macro which is
> defined as a 
>   barrier()   when the cpu has cache coherency
>   cache flush when the cpu does not have cache coherency and is
>               compiled with smp support.
> 
> We could call that
> 
>   smp_wmc() (for memory-coherency or memory commit)
>   smp_rmc()
>   smp_mc()
> 
> It would be a good way to identify the location where data exchange
> between memory and the local cache are is required in the algorithm.
> What do you think ?

Actually the best way to do this would be:

	while (ACCESS_ONCE(sig_done) < 1)
		continue;

If ACCESS_ONCE() needs to be made architecture-specific to make this
really work on Blackfin, we should make that change.  And, now that
you mention it, I have heard rumors that other CPU families can violate
cache coherence in some circumstances.

So perhaps ACCESS_ONCE() becomes:

#ifdef CONFIG_ARCH_CACHE_COHERENT
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
#else /* #ifdef CONFIG_ARCH_CACHE_COHERENT */
#define ACCESS_ONCE(x)     ({ \
				typeof(x) _________x1; \
				_________x1 = (*(volatile typeof(x) *)&(x)); \
				cpu_relax(); \
				(_________x1); \
				})
#endif /* #else #ifdef CONFIG_ARCH_CACHE_COHERENT */

Seem reasonable?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 20:02                                                       ` Paul E. McKenney
@ 2009-02-12 20:09                                                         ` Mathieu Desnoyers
  2009-02-12 20:35                                                           ` Paul E. McKenney
  2009-02-12 20:13                                                         ` Linus Torvalds
  1 sibling, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-12 20:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: ltt-dev, linux-kernel, Linus Torvalds, Bryan Wu, uclinux-dist-devel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Thu, Feb 12, 2009 at 02:29:41PM -0500, Mathieu Desnoyers wrote:
> > 
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > [...]
> > > diff --git a/urcu.c b/urcu.c
> > > index f2aae34..a696439 100644
> > > --- a/urcu.c
> > > +++ b/urcu.c
> > > @@ -99,7 +99,8 @@ static void force_mb_single_thread(pthread_t tid)
> > >  	 * BUSY-LOOP.
> > >  	 */
> > >  	while (sig_done < 1)
> > > -		smp_rmb();	/* ensure we re-read sig-done */
> > > +		barrier();	/* ensure compiler re-reads sig-done */
> > > +				/* cache coherence guarantees CPU re-read. */
> > 
> > OK, this is where I think our points of view differ. Please refer to
> > http://lkml.org/lkml/2007/6/18/299.
> > 
> > Basically, cpu_relax() used in the Linux kernel has an
> > architecture-specific implementation which *could* include a smp_rmb()
> > if the architecture doesn't notice writes done by other CPUs. I think
> > Blackfin is the only architecture currently supported by the Linux
> > kernel which defines cpu_relax() as a smp_mb(), because it does not have
> > cache coherency.
> > 
> > Therefore, I propose that we create a memory barrier macro which is
> > defined as a 
> >   barrier()   when the cpu has cache coherency
> >   cache flush when the cpu does not have cache coherency and is
> >               compiled with smp support.
> > 
> > We could call that
> > 
> >   smp_wmc() (for memory-coherency or memory commit)
> >   smp_rmc()
> >   smp_mc()
> > 
> > It would be a good way to identify the location where data exchange
> > between memory and the local cache are is required in the algorithm.
> > What do you think ?
> 
> Actually the best way to do this would be:
> 
> 	while (ACCESS_ONCE(sig_done) < 1)
> 		continue;
> 

Interesting idea. Maybe we should define an accessor for the data write
too ?

But I suspect that in a lot of situations, what we will really want is
to do a bunch of read/writes, and only at a particular point do the
cache flush.

> If ACCESS_ONCE() needs to be made architecture-specific to make this
> really work on Blackfin, we should make that change.  And, now that
> you mention it, I have heard rumors that other CPU families can violate
> cache coherence in some circumstances.
> 
> So perhaps ACCESS_ONCE() becomes:
> 
> #ifdef CONFIG_ARCH_CACHE_COHERENT
> #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
> #else /* #ifdef CONFIG_ARCH_CACHE_COHERENT */
> #define ACCESS_ONCE(x)     ({ \
> 				typeof(x) _________x1; \
> 				_________x1 = (*(volatile typeof(x) *)&(x)); \
> 				cpu_relax(); \

I don't think cpu_relax would be the correct primitive to use here. We
definitely don't want a "rep; nop;" or anything like this which _slows
down_ the access. It's just a different goal we are pursuing. So using
something like smp_rmc within the ACCESS_ONCE() macro in this case as I
proposed in the other mail still seems to make sense.

Mathieu

> 				(_________x1); \
> 				})
> #endif /* #else #ifdef CONFIG_ARCH_CACHE_COHERENT */
> 
> Seem reasonable?
> 
> 							Thanx, Paul
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 20:02                                                       ` Paul E. McKenney
  2009-02-12 20:09                                                         ` Mathieu Desnoyers
@ 2009-02-12 20:13                                                         ` Linus Torvalds
  2009-02-12 20:39                                                           ` Paul E. McKenney
  2009-02-14  4:58                                                           ` Robin Getz
  1 sibling, 2 replies; 116+ messages in thread
From: Linus Torvalds @ 2009-02-12 20:13 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mathieu Desnoyers, ltt-dev, linux-kernel, Bryan Wu, uclinux-dist-devel



On Thu, 12 Feb 2009, Paul E. McKenney wrote:
> 
> Actually the best way to do this would be:
> 
> 	while (ACCESS_ONCE(sig_done) < 1)
> 		continue;
> 
> If ACCESS_ONCE() needs to be made architecture-specific to make this
> really work on Blackfin, we should make that change.

I really wouldn't want to mix up compiler barriers and cache barriers this 
way. 

I think "cpu_relax()" is likely the right thing to piggy-back on for 
broken cache-coherency.

> And, now that you mention it, I have heard rumors that other CPU 
> families can violate cache coherence in some circumstances.

I personally suspect that the BF pseudo-SMP code is just broken, and that 
it likely has tons of subtle bugs and races - because we _do_ depend on 
cache coherency at least for accessing objects next to each other. I just 
never personally felt like I had the energy to care deeply enough.

But I draw the line at making ACCESS_ONCE() imply anything but a compiler 
optimization issue.

		Linus

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 19:38                                                     ` Mathieu Desnoyers
@ 2009-02-12 20:17                                                       ` Paul E. McKenney
  2009-02-12 21:53                                                         ` Mathieu Desnoyers
  0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-12 20:17 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

On Thu, Feb 12, 2009 at 02:38:26PM -0500, Mathieu Desnoyers wrote:
> Replying to a separate portion of the mail with less CC :
> 
> 
> > On Thu, Feb 12, 2009 at 02:05:39AM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Wed, Feb 11, 2009 at 11:08:24PM -0500, Mathieu Desnoyers wrote:
> > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> > > > > > > On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > 
> > > > [ . . . ]
> > > > 
> > > > > > And I had bugs in my model that allowed the rcu_read_lock() model
> > > > > > to nest indefinitely, which overflowed into the top bit, messing
> > > > > > things up.  :-/
> > > > > > 
> > > > > > Attached is a fixed model.  This model validates correctly (woo-hoo!).
> > > > > > Even better, gives the expected error if you comment out line 180 and
> > > > > > uncomment line 213, this latter corresponding to the error case I called
> > > > > > out a few days ago.
> > > > > > 
> > > > > 
> > > > > Great ! :) I added this version to the git repository, hopefully it's ok
> > > > > with you ?
> > > > 
> > > > Works for me!
> > > > 
> > > > > > I will play with removing models of mb...
> > > > > 
> > > > > OK, I see you already did..
> > > > 
> > > > I continued this, and surprisingly few are actually required, though
> > > > I don't fully trust the modeling of removed memory barriers.
> > > 
> > > On my side I cleaned up the code a lot, and actually added some barriers
> > > ;) Especially in the busy loops, where we expect the other thread's
> > > value to change eventually between iterations. A smp_rmb() seems more
> > > appropriate that barrier(). I also added a lot of comments about
> > > barriers in the code, and made the reader side much easier to review.
> > > 
> > > Please feel free to comment on my added code comments.
> > 
> > The torture test now looks much more familiar.  ;-)
> > 
> > I fixed some compiler warnings (in my original, sad to say), added an
> > ACCESS_ONCE() to rcu_read_lock() (also in my original),
> 
> Yes, I thought about this ACCESS_ONCE during my sleep.. just did not
> have to to update the source yet. :)
> 
> Merged. Thanks !
> 
> [...]
> 
> > --- a/urcu.c
> > +++ b/urcu.c
> > @@ -99,7 +99,8 @@ static void force_mb_single_thread(pthread_t tid)
> >  	 * BUSY-LOOP.
> >  	 */
> >  	while (sig_done < 1)
> > -		smp_rmb();	/* ensure we re-read sig-done */
> > +		barrier();	/* ensure compiler re-reads sig-done */
> > +				/* cache coherence guarantees CPU re-read. */
> 
> That could be a smp_rmc() ? (see other mail)

I prefer making ACCESS_ONCE() actually having the full semantics implied
by its name.  ;-)

See patch at end of this email.

> >  	smp_mb();	/* read sig_done before ending the barrier */
> >  }
> >  
> > @@ -113,7 +114,8 @@ static void force_mb_all_threads(void)
> >  	if (!reader_data)
> >  		return;
> >  	sig_done = 0;
> > -	smp_mb();	/* write sig_done before sending the signals */
> > +	/* smp_mb();	write sig_done before sending the signals */
> > +			/* redundant with barriers in pthread_kill(). */
> 
> Absolutely not. pthread_kill does not send a signal to self in every
> case because the writer thread has not requirement to register itself.
> It *could* be registered as a reader too, but does not have to.

No, not the barrier in the signal handler, but rather the barriers in
the system call invoked by pthread_kill().

> >  	for (index = reader_data; index < reader_data + num_readers; index++)
> >  		pthread_kill(index->tid, SIGURCU);
> >  	/*
> > @@ -121,7 +123,8 @@ static void force_mb_all_threads(void)
> >  	 * BUSY-LOOP.
> >  	 */
> >  	while (sig_done < num_readers)
> > -		smp_rmb();	/* ensure we re-read sig-done */
> > +		barrier();	/* ensure compiler re-reads sig-done */
> > +				/* cache coherence guarantees CPU re-read. */
> 
> That could be a smp_rmc() ?

Again, prefer:

	while (ACCESS_ONCE() < num_readers)

after upgrading ACCESS_ONCE() to provide the full semantics.

I will send a patch.

> >  	smp_mb();	/* read sig_done before ending the barrier */
> >  }
> >  #endif
> > @@ -181,7 +184,8 @@ void synchronize_rcu(void)
> >  	 * the writer waiting forever while new readers are always accessing
> >  	 * data (no progress).
> >  	 */
> > -	smp_mb();
> > +	/* smp_mb(); Don't need this one for CPU, only compiler. */
> > +	barrier();
> 
> smp_mc() ?

ACCESS_ONCE().

> >  
> >  	switch_next_urcu_qparity();	/* 1 -> 0 */
> >  
> 
> Side-note :
> on archs without cache coherency, all smp_[rw ]mb would turn into a
> cache flush.

So I might need more in my ACCESS_ONCE() below.

Add .gitignore files, and redefine accesses in terms of a new
ACCESS_ONCE().

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 .gitignore              |    9 +++++++++
 formal-model/.gitignore |    3 +++
 urcu.c                  |   10 ++++------
 urcu.h                  |   12 ++++++++++++
 4 files changed, 28 insertions(+), 6 deletions(-)

diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..29aa7e5
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,9 @@
+test_rwlock_timing
+test_urcu
+test_urcu_timing
+test_urcu_yield
+urcu-asm.o
+urcu.o
+urcutorture
+urcutorture-yield
+urcu-yield.o
diff --git a/formal-model/.gitignore b/formal-model/.gitignore
new file mode 100644
index 0000000..49fdd8a
--- /dev/null
+++ b/formal-model/.gitignore
@@ -0,0 +1,3 @@
+pan
+pan.*
+urcu.spin.trail
diff --git a/urcu.c b/urcu.c
index a696439..f61d4c3 100644
--- a/urcu.c
+++ b/urcu.c
@@ -98,9 +98,8 @@ static void force_mb_single_thread(pthread_t tid)
 	 * Wait for sighandler (and thus mb()) to execute on every thread.
 	 * BUSY-LOOP.
 	 */
-	while (sig_done < 1)
-		barrier();	/* ensure compiler re-reads sig-done */
-				/* cache coherence guarantees CPU re-read. */
+	while (ACCESS_ONCE(sig_done) < 1)
+		continue;
 	smp_mb();	/* read sig_done before ending the barrier */
 }
 
@@ -122,9 +121,8 @@ static void force_mb_all_threads(void)
 	 * Wait for sighandler (and thus mb()) to execute on every thread.
 	 * BUSY-LOOP.
 	 */
-	while (sig_done < num_readers)
-		barrier();	/* ensure compiler re-reads sig-done */
-				/* cache coherence guarantees CPU re-read. */
+	while (ACCESS_ONCE(sig_done) < num_readers)
+		continue;
 	smp_mb();	/* read sig_done before ending the barrier */
 }
 #endif
diff --git a/urcu.h b/urcu.h
index 79d9464..dd040a5 100644
--- a/urcu.h
+++ b/urcu.h
@@ -98,6 +98,9 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr,
 /* Nop everywhere except on alpha. */
 #define smp_read_barrier_depends()
 
+#define CONFIG_ARCH_CACHE_COHERENT
+#define cpu_relax barrier
+
 /*
  * Prevent the compiler from merging or refetching accesses.  The compiler
  * is also forbidden from reordering successive instances of ACCESS_ONCE(),
@@ -110,7 +113,16 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr,
  * use is to mediate communication between process-level code and irq/NMI
  * handlers, all running on the same CPU.
  */
+#ifdef CONFIG_ARCH_CACHE_COHERENT
 #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
+#else /* #ifdef CONFIG_ARCH_CACHE_COHERENT */
+#define ACCESS_ONCE(x)     ({ \
+				typeof(x) _________x1; \
+				_________x1 = (*(volatile typeof(x) *)&(x)); \
+				cpu_relax(); \
+				(_________x1); \
+				})
+#endif /* #else #ifdef CONFIG_ARCH_CACHE_COHERENT */
 
 /**
  * rcu_dereference - fetch an RCU-protected pointer in an

^ permalink raw reply related	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 18:40                                                       ` Mathieu Desnoyers
@ 2009-02-12 20:28                                                         ` Paul E. McKenney
  2009-02-12 21:27                                                           ` Mathieu Desnoyers
  0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-12 20:28 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

On Thu, Feb 12, 2009 at 01:40:30PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Thu, Feb 12, 2009 at 12:47:07AM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Wed, Feb 11, 2009 at 11:10:44PM -0500, Mathieu Desnoyers wrote:
> > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > On Wed, Feb 11, 2009 at 06:33:08PM -0800, Paul E. McKenney wrote:
> > > > > > > On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> > > > > > > > On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > 
> > > > > > [ . . . ]
> > > > > > 
> > > > > > > > > > (BTW, I do not trust my model yet, as it currently cannot detect the
> > > > > > > > > > failure case I pointed out earlier.  :-/  Here and I thought that the
> > > > > > > > > > point of such models was to detect additional failure cases!!!)
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Yes, I'll have to dig deeper into it.
> > > > > > > > 
> > > > > > > > Well, as I said, I attached the current model and the error trail.
> > > > > > > 
> > > > > > > And I had bugs in my model that allowed the rcu_read_lock() model
> > > > > > > to nest indefinitely, which overflowed into the top bit, messing
> > > > > > > things up.  :-/
> > > > > > > 
> > > > > > > Attached is a fixed model.  This model validates correctly (woo-hoo!).
> > > > > > > Even better, gives the expected error if you comment out line 180 and
> > > > > > > uncomment line 213, this latter corresponding to the error case I called
> > > > > > > out a few days ago.
> > > > > > > 
> > > > > > > I will play with removing models of mb...
> > > > > > 
> > > > > > And commenting out the models of mb between the counter flips and the
> > > > > > test for readers still passes validation, as expected, and as shown in
> > > > > > the attached Promela code.
> > > > > > 
> > > > > 
> > > > > Hrm, in the email I sent you about the memory barrier, I said that it
> > > > > would not make the algorithm incorrect, but that it would cause
> > > > > situations where it would be impossible for the writer to do any
> > > > > progress as long as there are readers active. I think we would have to
> > > > > enhance the model or at least express this through some LTL statement to
> > > > > validate this specific behavior.
> > > > 
> > > > But if the writer fails to make progress, then the counter remains at a
> > > > given value, which causes readers to drain, which allows the writer to
> > > > eventually make progress again.  Right?
> > > > 
> > > 
> > > Not necessarily. If we don't have the proper memory barriers, we can
> > > have the writer waiting on, say, parity 0 *before* it has performed the
> > > parity switch. Therefore, even newly coming readers will add up to
> > > parity 0.
> > 
> > But the write that changes the parity will eventually make it out.
> > OK, so your argument is that we at least need a compiler barrier?
> 
> It all depends on the assumptions we make. I am currently trying to
> assume the most aggressive memory ordering I can think of. The model I
> think about to represent it is that memory reads/writes are kept local
> to the CPU until a memory barrier is encountered. I doubt it exists in
> practice, bacause the CPU will eventually have to commit the information
> to memory (hrm, are sure about this ?), but if we use that as a starting
> point, I think this would cover the entire spectrum of possible memory
> barriers issues. Also, it would be easy to verify formally. But maybe am
> I going too far ?

I believe that you are going a bit too far.  After all, if you make that
assumption, the CPU could just never make anything visible.  After all,
the memory barrier doesn't say "make the previous stuff visible now",
it instead says "if you make anything after the barrier visible to a
given other CPU, then you must also make everything before the barrier
visible to that CPU".

> > Regardless, please see attached for a modified version of the Promela
> > model that fully models omitting out the memory barrier that my
> > rcu_nest32.[hc] implementation omits.  (It is possible to partially
> > model removal of other memory barriers via #if 0, but to fully model
> > would need to enumerate the permutations as shown on lines 231-257.)
> > 
> > > In your model, this is not detected, because eventually all readers will
> > > execute, and only then the writer will be able to update the data. But
> > > in reality, if we run a very busy 4096-cores machines where there is
> > > always at least one reader active, the the writer will be stuck forever,
> > > and that's really bad.
> > 
> > Assuming that the reordering is done by the CPU, the write will
> > eventually get out -- it is stuck in (say) the store buffer, and the
> > cache line will eventually arrive, and then the value will eventually
> > be seen by the readers.
> 
> Do we have guarantees that the data *will necessarily* get out of the
> cpu write buffer at some point ?

It has to, given a finite CPU write buffer, interrupts, and the like.
The actual CPU designs interact with a cache-coherence protocol, so
the stuff lives in the store buffer only for as long as it takes for
the corresponding cache line to be owned by this CPU.

> > We might need a -compiler- barrier, but then again, I am not sure that
> > we are talking about the same memory barrier -- again, please see
> > attached lines 231-257 to see which one that I eliminated.
> 
> As long as we don't have "progress" validation to check our model, the
> fact that it passes the current test does not tell much.

Without agreeing or disagreeing with this statement for the moment,
would you be willing to tell me whether or not the memory barrier
eliminated by lines 231-257 of the model was the one that you were
talking about?  ;-)

I might consider eventually adding progress validation to the model,
but am currently a bit overdosed on Promela...

> > Also, the original model I sent out has a minor bug that prevents it
> > from fully modeling the nested-read-side case.  The patch below fixes this.
> 
> Ok, merging the fix, thanks,

Thank you!

							Thanx, Paul

> Mathieu
> 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > ---
> > 
> >  urcu.spin |    6 +++++-
> >  1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/formal-model/urcu.spin b/formal-model/urcu.spin
> > index e5bfff3..611464b 100644
> > --- a/formal-model/urcu.spin
> > +++ b/formal-model/urcu.spin
> > @@ -124,9 +124,13 @@ proctype urcu_reader()
> >  				break;
> >  			:: tmp < 4 && reader_progress[tmp] != 0 ->
> >  				tmp = tmp + 1;
> > -			:: tmp >= 4 ->
> > +			:: tmp >= 4 &&
> > +			   reader_progress[0] == reader_progress[3] ->
> >  				done = 1;
> >  				break;
> > +			:: tmp >= 4 &&
> > +			   reader_progress[0] != reader_progress[3] ->
> > +			   	break;
> >  			od;
> >  			do
> >  			:: tmp < 4 && reader_progress[tmp] == 0 ->
> 
> Content-Description: urcu_mbmin.spin
> > /*
> >  * urcu_mbmin.spin: Promela code to validate urcu.  See commit number
> >  *	3a9e6e9df706b8d39af94d2f027210e2e7d4106e of Mathieu Desnoyer's
> >  *      git archive at git://lttng.org/userspace-rcu.git, but with
> >  *	memory barriers removed.
> >  *
> >  * This program is free software; you can redistribute it and/or modify
> >  * it under the terms of the GNU General Public License as published by
> >  * the Free Software Foundation; either version 2 of the License, or
> >  * (at your option) any later version.
> >  *
> >  * This program is distributed in the hope that it will be useful,
> >  * but WITHOUT ANY WARRANTY; without even the implied warranty of
> >  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> >  * GNU General Public License for more details.
> >  *
> >  * You should have received a copy of the GNU General Public License
> >  * along with this program; if not, write to the Free Software
> >  * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> >  *
> >  * Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
> >  */
> > 
> > /* Promela validation variables. */
> > 
> > bit removed = 0;  /* Has RCU removal happened, e.g., list_del_rcu()? */
> > bit free = 0;     /* Has RCU reclamation happened, e.g., kfree()? */
> > bit need_mb = 0;  /* =1 says need reader mb, =0 for reader response. */
> > byte reader_progress[4];
> > 		  /* Count of read-side statement executions. */
> > 
> > /* urcu definitions and variables, taken straight from the algorithm. */
> > 
> > #define RCU_GP_CTR_BIT (1 << 7)
> > #define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BIT - 1)
> > 
> > byte urcu_gp_ctr = 1;
> > byte urcu_active_readers = 0;
> > 
> > /* Model the RCU read-side critical section. */
> > 
> > proctype urcu_reader()
> > {
> > 	bit done = 0;
> > 	bit mbok;
> > 	byte tmp;
> > 	byte tmp_removed;
> > 	byte tmp_free;
> > 
> > 	/* Absorb any early requests for memory barriers. */
> > 	do
> > 	:: need_mb == 1 ->
> > 		need_mb = 0;
> > 	:: 1 -> skip;
> > 	:: 1 -> break;
> > 	od;
> > 
> > 	/*
> > 	 * Each pass through this loop executes one read-side statement
> > 	 * from the following code fragment:
> > 	 *
> > 	 *	rcu_read_lock(); [0a]
> > 	 *	rcu_read_lock(); [0b]
> > 	 *	p = rcu_dereference(global_p); [1]
> > 	 *	x = p->data; [2]
> > 	 *	rcu_read_unlock(); [3b]
> > 	 *	rcu_read_unlock(); [3a]
> > 	 *
> > 	 * Because we are modeling a weak-memory machine, these statements
> > 	 * can be seen in any order, the only restriction being that
> > 	 * rcu_read_unlock() cannot precede the corresponding rcu_read_lock().
> > 	 * The placement of the inner rcu_read_lock() and rcu_read_unlock()
> > 	 * is non-deterministic, the above is but one possible placement.
> > 	 * Intestingly enough, this model validates all possible placements
> > 	 * of the inner rcu_read_lock() and rcu_read_unlock() statements,
> > 	 * with the only constraint being that the rcu_read_lock() must
> > 	 * precede the rcu_read_unlock().
> > 	 *
> > 	 * We also respond to memory-barrier requests, but only if our
> > 	 * execution happens to be ordered.  If the current state is
> > 	 * misordered, we ignore memory-barrier requests.
> > 	 */
> > 	do
> > 	:: 1 ->
> > 		if
> > 		:: reader_progress[0] < 2 -> /* [0a and 0b] */
> > 			tmp = urcu_active_readers;
> > 			if
> > 			:: (tmp & RCU_GP_CTR_NEST_MASK) == 0 ->
> > 				tmp = urcu_gp_ctr;
> > 				do
> > 				:: (reader_progress[1] +
> > 				    reader_progress[2] +
> > 				    reader_progress[3] == 0) && need_mb == 1 ->
> > 					need_mb = 0;
> > 				:: 1 -> skip;
> > 				:: 1 -> break;
> > 				od;
> > 				urcu_active_readers = tmp;
> > 			 :: else ->
> > 				urcu_active_readers = tmp + 1;
> > 			fi;
> > 			reader_progress[0] = reader_progress[0] + 1;
> > 		:: reader_progress[1] == 0 -> /* [1] */
> > 			tmp_removed = removed;
> > 			reader_progress[1] = 1;
> > 		:: reader_progress[2] == 0 -> /* [2] */
> > 			tmp_free = free;
> > 			reader_progress[2] = 1;
> > 		:: ((reader_progress[0] > reader_progress[3]) &&
> > 		    (reader_progress[3] < 2)) -> /* [3a and 3b] */
> > 			tmp = urcu_active_readers - 1;
> > 			urcu_active_readers = tmp;
> > 			reader_progress[3] = reader_progress[3] + 1;
> > 		:: else -> break;
> > 		fi;
> > 
> > 		/* Process memory-barrier requests, if it is safe to do so. */
> > 		atomic {
> > 			mbok = 0;
> > 			tmp = 0;
> > 			do
> > 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> > 				tmp = tmp + 1;
> > 				break;
> > 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> > 				tmp = tmp + 1;
> > 			:: tmp >= 4 &&
> > 			   reader_progress[0] == reader_progress[3] ->
> > 				done = 1;
> > 				break;
> > 			:: tmp >= 4 &&
> > 			   reader_progress[0] != reader_progress[3] ->
> > 			   	break;
> > 			od;
> > 			do
> > 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> > 				tmp = tmp + 1;
> > 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> > 				break;
> > 			:: tmp >= 4 ->
> > 				mbok = 1;
> > 				break;
> > 			od
> > 
> > 		}
> > 
> > 		if
> > 		:: mbok == 1 ->
> > 			/* We get here if mb processing is safe. */
> > 			do
> > 			:: need_mb == 1 ->
> > 				need_mb = 0;
> > 			:: 1 -> skip;
> > 			:: 1 -> break;
> > 			od;
> > 		:: else -> skip;
> > 		fi;
> > 
> > 		/*
> > 		 * Check to see if we have modeled the entire RCU read-side
> > 		 * critical section, and leave if so.
> > 		 */
> > 		if
> > 		:: done == 1 -> break;
> > 		:: else -> skip;
> > 		fi
> > 	od;
> > 	assert((tmp_free == 0) || (tmp_removed == 1));
> > 
> > 	/* Process any late-arriving memory-barrier requests. */
> > 	do
> > 	:: need_mb == 1 ->
> > 		need_mb = 0;
> > 	:: 1 -> skip;
> > 	:: 1 -> break;
> > 	od;
> > }
> > 
> > /* Model the RCU update process. */
> > 
> > proctype urcu_updater()
> > {
> > 	byte tmp;
> > 
> > 	/* prior synchronize_rcu(), second counter flip. */
> > 	need_mb = 1; /* mb() A */
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od;
> > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > 	need_mb = 1; /* mb() B */
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od;
> > 	do
> > 	:: 1 ->
> > 		if
> > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > 			skip;
> > 		:: else -> break;
> > 		fi
> > 	od;
> > 	need_mb = 1; /* mb() C absolutely required by analogy with G */
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od;
> > 
> > 	/* Removal statement, e.g., list_del_rcu(). */
> > 	removed = 1;
> > 
> > 	/* current synchronize_rcu(), first counter flip. */
> > 	need_mb = 1; /* mb() D suggested */
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od;
> > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > 	need_mb = 1;  /* mb() E required if D not present */
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od;
> > 
> > 	/* current synchronize_rcu(), first-flip check plus second flip. */
> > 	if
> > 	:: 1 ->
> > 		do
> > 		:: 1 ->
> > 			if
> > 			:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > 			   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > 			   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > 				skip;
> > 			:: else -> break;
> > 			fi;
> > 		od;
> > 		urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > 	:: 1 ->
> > 		tmp = urcu_gp_ctr;
> > 		urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > 		do
> > 		:: 1 ->
> > 			if
> > 			:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > 			   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > 			   (tmp & ~RCU_GP_CTR_NEST_MASK) ->
> > 				skip;
> > 			:: else -> break;
> > 			fi;
> > 		od;
> > 	fi;
> > 
> > 	/* current synchronize_rcu(), second counter flip check. */
> > 	need_mb = 1; /* mb() F not required */
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od;
> > 	do
> > 	:: 1 ->
> > 		if
> > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > 			skip;
> > 		:: else -> break;
> > 		fi;
> > 	od;
> > 	need_mb = 1; /* mb() G absolutely required */
> > 	do
> > 	:: need_mb == 1 -> skip;
> > 	:: need_mb == 0 -> break;
> > 	od;
> > 
> > 	/* free-up step, e.g., kfree(). */
> > 	free = 1;
> > }
> > 
> > /*
> >  * Initialize the array, spawn a reader and an updater.  Because readers
> >  * are independent of each other, only one reader is needed.
> >  */
> > 
> > init {
> > 	atomic {
> > 		reader_progress[0] = 0;
> > 		reader_progress[1] = 0;
> > 		reader_progress[2] = 0;
> > 		reader_progress[3] = 0;
> > 		run urcu_reader();
> > 		run urcu_updater();
> > 	}
> > }
> 
> 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 20:09                                                         ` Mathieu Desnoyers
@ 2009-02-12 20:35                                                           ` Paul E. McKenney
  2009-02-12 21:15                                                             ` Mathieu Desnoyers
  0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-12 20:35 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: ltt-dev, linux-kernel, Linus Torvalds, Bryan Wu, uclinux-dist-devel

On Thu, Feb 12, 2009 at 03:09:37PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Thu, Feb 12, 2009 at 02:29:41PM -0500, Mathieu Desnoyers wrote:
> > > 
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > [...]
> > > > diff --git a/urcu.c b/urcu.c
> > > > index f2aae34..a696439 100644
> > > > --- a/urcu.c
> > > > +++ b/urcu.c
> > > > @@ -99,7 +99,8 @@ static void force_mb_single_thread(pthread_t tid)
> > > >  	 * BUSY-LOOP.
> > > >  	 */
> > > >  	while (sig_done < 1)
> > > > -		smp_rmb();	/* ensure we re-read sig-done */
> > > > +		barrier();	/* ensure compiler re-reads sig-done */
> > > > +				/* cache coherence guarantees CPU re-read. */
> > > 
> > > OK, this is where I think our points of view differ. Please refer to
> > > http://lkml.org/lkml/2007/6/18/299.
> > > 
> > > Basically, cpu_relax() used in the Linux kernel has an
> > > architecture-specific implementation which *could* include a smp_rmb()
> > > if the architecture doesn't notice writes done by other CPUs. I think
> > > Blackfin is the only architecture currently supported by the Linux
> > > kernel which defines cpu_relax() as a smp_mb(), because it does not have
> > > cache coherency.
> > > 
> > > Therefore, I propose that we create a memory barrier macro which is
> > > defined as a 
> > >   barrier()   when the cpu has cache coherency
> > >   cache flush when the cpu does not have cache coherency and is
> > >               compiled with smp support.
> > > 
> > > We could call that
> > > 
> > >   smp_wmc() (for memory-coherency or memory commit)
> > >   smp_rmc()
> > >   smp_mc()
> > > 
> > > It would be a good way to identify the location where data exchange
> > > between memory and the local cache are is required in the algorithm.
> > > What do you think ?
> > 
> > Actually the best way to do this would be:
> > 
> > 	while (ACCESS_ONCE(sig_done) < 1)
> > 		continue;
> > 
> 
> Interesting idea. Maybe we should define an accessor for the data write
> too ?

I like having just ACCESS_ONCE(), but I suppose I could live with
separate LOAD_ONCE() and STORE_ONCE() primitives.

But I am not yet convinced that this is needed, as I am not aware of any
architecture that would buffer writes forever.  (Doesn't mean that there
isn't one, but it does not make sense to complicate the API just on
speculation.)

> But I suspect that in a lot of situations, what we will really want is
> to do a bunch of read/writes, and only at a particular point do the
> cache flush.

That could happen, and in fact is why the separate
smp_read_barrier_depends() primitive exists.  But I -strongly-
discourage its use -- code using rcu_dereference() is -much- easier to
read and understand.  So if the series of reads/writes was short, I
would say to just bite the bullet and take the multiple primitives.

If nothing else, this might encourage hardware manufacturers to do the
right thing.  ;-)

> > If ACCESS_ONCE() needs to be made architecture-specific to make this
> > really work on Blackfin, we should make that change.  And, now that
> > you mention it, I have heard rumors that other CPU families can violate
> > cache coherence in some circumstances.
> > 
> > So perhaps ACCESS_ONCE() becomes:
> > 
> > #ifdef CONFIG_ARCH_CACHE_COHERENT
> > #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
> > #else /* #ifdef CONFIG_ARCH_CACHE_COHERENT */
> > #define ACCESS_ONCE(x)     ({ \
> > 				typeof(x) _________x1; \
> > 				_________x1 = (*(volatile typeof(x) *)&(x)); \
> > 				cpu_relax(); \
> 
> I don't think cpu_relax would be the correct primitive to use here. We
> definitely don't want a "rep; nop;" or anything like this which _slows
> down_ the access. It's just a different goal we are pursuing. So using
> something like smp_rmc within the ACCESS_ONCE() macro in this case as I
> proposed in the other mail still seems to make sense.

Well, x86 would have CONFIG_ARCH_CACHE_COHERENT, so it would instead
use the old definition -- so no "rep; nop;" in any case.

Probably whatever takes the place of cpu_relax() is arch-dependent
anyway.

							Thanx, Paul

> Mathieu
> 
> > 				(_________x1); \
> > 				})
> > #endif /* #else #ifdef CONFIG_ARCH_CACHE_COHERENT */
> > 
> > Seem reasonable?
> > 
> > 							Thanx, Paul
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 20:13                                                         ` Linus Torvalds
@ 2009-02-12 20:39                                                           ` Paul E. McKenney
  2009-02-12 21:15                                                             ` Linus Torvalds
  2009-02-14  4:58                                                           ` Robin Getz
  1 sibling, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-12 20:39 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, ltt-dev, linux-kernel, Bryan Wu, uclinux-dist-devel

On Thu, Feb 12, 2009 at 12:13:29PM -0800, Linus Torvalds wrote:
> 
> 
> On Thu, 12 Feb 2009, Paul E. McKenney wrote:
> > 
> > Actually the best way to do this would be:
> > 
> > 	while (ACCESS_ONCE(sig_done) < 1)
> > 		continue;
> > 
> > If ACCESS_ONCE() needs to be made architecture-specific to make this
> > really work on Blackfin, we should make that change.
> 
> I really wouldn't want to mix up compiler barriers and cache barriers this 
> way. 
> 
> I think "cpu_relax()" is likely the right thing to piggy-back on for 
> broken cache-coherency.
> 
> > And, now that you mention it, I have heard rumors that other CPU 
> > families can violate cache coherence in some circumstances.
> 
> I personally suspect that the BF pseudo-SMP code is just broken, and that 
> it likely has tons of subtle bugs and races - because we _do_ depend on 
> cache coherency at least for accessing objects next to each other. I just 
> never personally felt like I had the energy to care deeply enough.
> 
> But I draw the line at making ACCESS_ONCE() imply anything but a compiler 
> optimization issue.

In other words, you are arguing for using ACCESS_ONCE() in the loops,
but keeping the old ACCESS_ONCE() definition, and declaring BF hardware
broken?

I am OK with that, just wanting to make sure I understand what you are
asking us to do.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 20:39                                                           ` Paul E. McKenney
@ 2009-02-12 21:15                                                             ` Linus Torvalds
  2009-02-12 21:59                                                               ` Paul E. McKenney
  0 siblings, 1 reply; 116+ messages in thread
From: Linus Torvalds @ 2009-02-12 21:15 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mathieu Desnoyers, ltt-dev, linux-kernel, Bryan Wu, uclinux-dist-devel



On Thu, 12 Feb 2009, Paul E. McKenney wrote:
> 
> In other words, you are arguing for using ACCESS_ONCE() in the loops,
> but keeping the old ACCESS_ONCE() definition, and declaring BF hardware
> broken?

Well, I _also_ argue that if you have a busy loop, you'd better have a 
cpu_relax() in there somewhere anyway. If you don't, you have a bug.

So I think the BF approach is "borderline broken", but I think it should 
work, if BF just has whatever appropriate cache flush in its cpu_relax.

			Linus

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 20:35                                                           ` Paul E. McKenney
@ 2009-02-12 21:15                                                             ` Mathieu Desnoyers
  0 siblings, 0 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-12 21:15 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: ltt-dev, linux-kernel, Linus Torvalds, Bryan Wu, uclinux-dist-devel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Thu, Feb 12, 2009 at 03:09:37PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Thu, Feb 12, 2009 at 02:29:41PM -0500, Mathieu Desnoyers wrote:
> > > > 
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > [...]
> > > > > diff --git a/urcu.c b/urcu.c
> > > > > index f2aae34..a696439 100644
> > > > > --- a/urcu.c
> > > > > +++ b/urcu.c
> > > > > @@ -99,7 +99,8 @@ static void force_mb_single_thread(pthread_t tid)
> > > > >  	 * BUSY-LOOP.
> > > > >  	 */
> > > > >  	while (sig_done < 1)
> > > > > -		smp_rmb();	/* ensure we re-read sig-done */
> > > > > +		barrier();	/* ensure compiler re-reads sig-done */
> > > > > +				/* cache coherence guarantees CPU re-read. */
> > > > 
> > > > OK, this is where I think our points of view differ. Please refer to
> > > > http://lkml.org/lkml/2007/6/18/299.
> > > > 
> > > > Basically, cpu_relax() used in the Linux kernel has an
> > > > architecture-specific implementation which *could* include a smp_rmb()
> > > > if the architecture doesn't notice writes done by other CPUs. I think
> > > > Blackfin is the only architecture currently supported by the Linux
> > > > kernel which defines cpu_relax() as a smp_mb(), because it does not have
> > > > cache coherency.
> > > > 
> > > > Therefore, I propose that we create a memory barrier macro which is
> > > > defined as a 
> > > >   barrier()   when the cpu has cache coherency
> > > >   cache flush when the cpu does not have cache coherency and is
> > > >               compiled with smp support.
> > > > 
> > > > We could call that
> > > > 
> > > >   smp_wmc() (for memory-coherency or memory commit)
> > > >   smp_rmc()
> > > >   smp_mc()
> > > > 
> > > > It would be a good way to identify the location where data exchange
> > > > between memory and the local cache are is required in the algorithm.
> > > > What do you think ?
> > > 
> > > Actually the best way to do this would be:
> > > 
> > > 	while (ACCESS_ONCE(sig_done) < 1)
> > > 		continue;
> > > 
> > 
> > Interesting idea. Maybe we should define an accessor for the data write
> > too ?
> 
> I like having just ACCESS_ONCE(), but I suppose I could live with
> separate LOAD_ONCE() and STORE_ONCE() primitives.
> 
> But I am not yet convinced that this is needed, as I am not aware of any
> architecture that would buffer writes forever.  (Doesn't mean that there
> isn't one, but it does not make sense to complicate the API just on
> speculation.)
> 

Blackfin has a non-coherent cache, which is the equivalent of buffering
writes forever and never reader data from memory unless it's required
to. Given the increasing amount of cores we currently see and the always
lower power consumption, I foresee that we might have to deal with such
model in a future closer than what we expect.

> > But I suspect that in a lot of situations, what we will really want is
> > to do a bunch of read/writes, and only at a particular point do the
> > cache flush.
> 
> That could happen, and in fact is why the separate
> smp_read_barrier_depends() primitive exists.  But I -strongly-
> discourage its use -- code using rcu_dereference() is -much- easier to
> read and understand.  So if the series of reads/writes was short, I
> would say to just bite the bullet and take the multiple primitives.
> 

Hrm, how about :

LOAD_REMOTE(), STORE_REMOTE()

So this leaves ACCESS_ONCE() as a simple compiler optimization
restriction, which is the meaning it always had.



#ifdef CONFIG_HAVE_MEM_COHERENCY
/*
 * Caches are coherent, no need to flush them.
 */
#define mc()	barrier()
#define rmc()	barrier()
#define wmc()	barrier()
#else
#error "The architecture must create its own cache flush primitives"
#define mc()	arch_cache_flush()
#define rmc()	arch_cache_flush_read()
#define wmc()	arch_cache_flush_write()
#endif


#ifdef CONFIG_HAVE_MEM_COHERENCY

/* x86 32/64 specific */
#ifdef CONFIG_HAVE_FENCE
#define mb()    asm volatile("mfence":::"memory")
#define rmb()   asm volatile("lfence":::"memory")
#define wmb()   asm volatile("sfence"::: "memory")
#else
/*
 * Some non-Intel clones support out of order store. wmb() ceases to be a
 * nop for these.
 */
#define mb()    asm volatile("lock; addl $0,0(%%esp)":::"memory")
#define rmb()   asm volatile("lock; addl $0,0(%%esp)":::"memory")
#define wmb()   asm volatile("lock; addl $0,0(%%esp)"::: "memory")
#endif

#else /* !CONFIG_HAVE_MEM_COHERENCY */

/*
 * Without cache coherency, the memory barriers become cache flushes.
 */
#define mb()    mc()
#define rmb()   rmc()
#define wmb()   wmc()

#endif /* !CONFIG_HAVE_MEM_COHERENCY */


#ifdef CONFIG_SMP
#define smp_mb()	mb()
#define smp_rmb()	rmb()
#define smp_wmb()	wmb()
#define smp_mc()	mc()
#define smp_rmc()	rmc()
#define smp_wmc()	wmc()
#else
#define smp_mb()	barrier()
#define smp_rmb()	barrier()
#define smp_wmb()	barrier()
#define smp_mc()	barrier()
#define smp_rmc()	barrier()
#define smp_wmc()	barrier()
#endif


#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

/*
 * Load a data from remote memory, doing a cache flush if required.
 */
#define LOAD_REMOTE(p)	       ({ \
				smp_rmc(); \
				typeof(p) _________p1 = ACCESS_ONCE(p); \
				(_________p1); \
				})

/*
 * Store v into x, where x is located in remote memory. Performs the required
 * cache flush after writing.
 */
#define STORE_REMOTE(x, v) \
	do { \
		(x) = (v); \
		smp_wmc; \
	} while (0)

/**
 * rcu_dereference - fetch an RCU-protected pointer in an
 * RCU read-side critical section.  This pointer may later
 * be safely dereferenced.
 *
 * Inserts memory barriers on architectures that require them
 * (currently only the Alpha), and, more importantly, documents
 * exactly which pointers are protected by RCU.
 */

#define rcu_dereference(p)     ({ \
				typeof(p) _________p1 = LOAD_REMOTE(p); \
				smp_read_barrier_depends(); \
				(_________p1); \
				})


So we would use "LOAD_REMOTE" instead of ACCESS_ONCE when we want to
read the remote pointer atomically. STORE_REMOTE would be used to write
to it.

Reading can be done in batch with :

smp_rmc();
... read many values ...


Writing :
... write many values ...
smp_wmc();


> If nothing else, this might encourage hardware manufacturers to do the
> right thing.  ;-)
> 

The "right thing" has a cost and, sadly, the tradeoff is different in
multi-cpu low-power architectures.

> > > If ACCESS_ONCE() needs to be made architecture-specific to make this
> > > really work on Blackfin, we should make that change.  And, now that
> > > you mention it, I have heard rumors that other CPU families can violate
> > > cache coherence in some circumstances.
> > > 
> > > So perhaps ACCESS_ONCE() becomes:
> > > 
> > > #ifdef CONFIG_ARCH_CACHE_COHERENT
> > > #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
> > > #else /* #ifdef CONFIG_ARCH_CACHE_COHERENT */
> > > #define ACCESS_ONCE(x)     ({ \
> > > 				typeof(x) _________x1; \
> > > 				_________x1 = (*(volatile typeof(x) *)&(x)); \
> > > 				cpu_relax(); \
> > 

Hrm, and also we would need a smp_rmc() *before* reading x here... so
the ACCESS_ONCE implementation above only works in loops, which is bad.

> > I don't think cpu_relax would be the correct primitive to use here. We
> > definitely don't want a "rep; nop;" or anything like this which _slows
> > down_ the access. It's just a different goal we are pursuing. So using
> > something like smp_rmc within the ACCESS_ONCE() macro in this case as I
> > proposed in the other mail still seems to make sense.
> 
> Well, x86 would have CONFIG_ARCH_CACHE_COHERENT, so it would instead
> use the old definition -- so no "rep; nop;" in any case.

Which has a performance impact, and is very wrong. We are not only using
ACCESS_ONCE in busy-loops....

Mathieu

> 
> Probably whatever takes the place of cpu_relax() is arch-dependent
> anyway.
> 
> 							Thanx, Paul
> 
> > Mathieu
> > 
> > > 				(_________x1); \
> > > 				})
> > > #endif /* #else #ifdef CONFIG_ARCH_CACHE_COHERENT */
> > > 
> > > Seem reasonable?
> > > 
> > > 							Thanx, Paul
> > > 
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 20:28                                                         ` Paul E. McKenney
@ 2009-02-12 21:27                                                           ` Mathieu Desnoyers
  2009-02-12 23:26                                                             ` Paul E. McKenney
  0 siblings, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-12 21:27 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Thu, Feb 12, 2009 at 01:40:30PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Thu, Feb 12, 2009 at 12:47:07AM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Wed, Feb 11, 2009 at 11:10:44PM -0500, Mathieu Desnoyers wrote:
> > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > On Wed, Feb 11, 2009 at 06:33:08PM -0800, Paul E. McKenney wrote:
> > > > > > > > On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> > > > > > > > > On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > 
> > > > > > > [ . . . ]
> > > > > > > 
> > > > > > > > > > > (BTW, I do not trust my model yet, as it currently cannot detect the
> > > > > > > > > > > failure case I pointed out earlier.  :-/  Here and I thought that the
> > > > > > > > > > > point of such models was to detect additional failure cases!!!)
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > Yes, I'll have to dig deeper into it.
> > > > > > > > > 
> > > > > > > > > Well, as I said, I attached the current model and the error trail.
> > > > > > > > 
> > > > > > > > And I had bugs in my model that allowed the rcu_read_lock() model
> > > > > > > > to nest indefinitely, which overflowed into the top bit, messing
> > > > > > > > things up.  :-/
> > > > > > > > 
> > > > > > > > Attached is a fixed model.  This model validates correctly (woo-hoo!).
> > > > > > > > Even better, gives the expected error if you comment out line 180 and
> > > > > > > > uncomment line 213, this latter corresponding to the error case I called
> > > > > > > > out a few days ago.
> > > > > > > > 
> > > > > > > > I will play with removing models of mb...
> > > > > > > 
> > > > > > > And commenting out the models of mb between the counter flips and the
> > > > > > > test for readers still passes validation, as expected, and as shown in
> > > > > > > the attached Promela code.
> > > > > > > 
> > > > > > 
> > > > > > Hrm, in the email I sent you about the memory barrier, I said that it
> > > > > > would not make the algorithm incorrect, but that it would cause
> > > > > > situations where it would be impossible for the writer to do any
> > > > > > progress as long as there are readers active. I think we would have to
> > > > > > enhance the model or at least express this through some LTL statement to
> > > > > > validate this specific behavior.
> > > > > 
> > > > > But if the writer fails to make progress, then the counter remains at a
> > > > > given value, which causes readers to drain, which allows the writer to
> > > > > eventually make progress again.  Right?
> > > > > 
> > > > 
> > > > Not necessarily. If we don't have the proper memory barriers, we can
> > > > have the writer waiting on, say, parity 0 *before* it has performed the
> > > > parity switch. Therefore, even newly coming readers will add up to
> > > > parity 0.
> > > 
> > > But the write that changes the parity will eventually make it out.
> > > OK, so your argument is that we at least need a compiler barrier?
> > 
> > It all depends on the assumptions we make. I am currently trying to
> > assume the most aggressive memory ordering I can think of. The model I
> > think about to represent it is that memory reads/writes are kept local
> > to the CPU until a memory barrier is encountered. I doubt it exists in
> > practice, bacause the CPU will eventually have to commit the information
> > to memory (hrm, are sure about this ?), but if we use that as a starting
> > point, I think this would cover the entire spectrum of possible memory
> > barriers issues. Also, it would be easy to verify formally. But maybe am
> > I going too far ?
> 
> I believe that you are going a bit too far.  After all, if you make that
> assumption, the CPU could just never make anything visible.  After all,
> the memory barrier doesn't say "make the previous stuff visible now",
> it instead says "if you make anything after the barrier visible to a
> given other CPU, then you must also make everything before the barrier
> visible to that CPU".
> 
> > > Regardless, please see attached for a modified version of the Promela
> > > model that fully models omitting out the memory barrier that my
> > > rcu_nest32.[hc] implementation omits.  (It is possible to partially
> > > model removal of other memory barriers via #if 0, but to fully model
> > > would need to enumerate the permutations as shown on lines 231-257.)
> > > 
> > > > In your model, this is not detected, because eventually all readers will
> > > > execute, and only then the writer will be able to update the data. But
> > > > in reality, if we run a very busy 4096-cores machines where there is
> > > > always at least one reader active, the the writer will be stuck forever,
> > > > and that's really bad.
> > > 
> > > Assuming that the reordering is done by the CPU, the write will
> > > eventually get out -- it is stuck in (say) the store buffer, and the
> > > cache line will eventually arrive, and then the value will eventually
> > > be seen by the readers.
> > 
> > Do we have guarantees that the data *will necessarily* get out of the
> > cpu write buffer at some point ?
> 
> It has to, given a finite CPU write buffer, interrupts, and the like.
> The actual CPU designs interact with a cache-coherence protocol, so
> the stuff lives in the store buffer only for as long as it takes for
> the corresponding cache line to be owned by this CPU.
> 
> > > We might need a -compiler- barrier, but then again, I am not sure that
> > > we are talking about the same memory barrier -- again, please see
> > > attached lines 231-257 to see which one that I eliminated.
> > 
> > As long as we don't have "progress" validation to check our model, the
> > fact that it passes the current test does not tell much.
> 
> Without agreeing or disagreeing with this statement for the moment,
> would you be willing to tell me whether or not the memory barrier
> eliminated by lines 231-257 of the model was the one that you were
> talking about?  ;-)
> 

So we are taking about :

/* current synchronize_rcu(), first-flip check plus second flip. */

which does not have any memory barrier anymore. This corresponds to my
current :

       /*
         * Wait for previous parity to be empty of readers.
         */
        wait_for_quiescent_state();     /* Wait readers in parity 0 */

        /*
         * Must finish waiting for quiescent state for parity 0 before
         * committing qparity update to memory. Failure to do so could result in
         * the writer waiting forever while new readers are always accessing
         * data (no progress).
         */
        smp_mc();

        switch_next_urcu_qparity();     /* 1 -> 0 */

So the memory barrier is not needed, but a compiler barrier is needed on
arch with cache coherency, and a cache flush is needed on architectures
without cache coherency.

BTW, I think all the three smp_mb() that were in this function can be
turned into smp_mc().

Therefore, if we assume memory coherency, only barrier()s would be
needed between the switch/q.s. wait/switch/q.s. wait.

Mathieu


> I might consider eventually adding progress validation to the model,
> but am currently a bit overdosed on Promela...
> 
> > > Also, the original model I sent out has a minor bug that prevents it
> > > from fully modeling the nested-read-side case.  The patch below fixes this.
> > 
> > Ok, merging the fix, thanks,
> 
> Thank you!
> 
> 							Thanx, Paul
> 
> > Mathieu
> > 
> > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > ---
> > > 
> > >  urcu.spin |    6 +++++-
> > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/formal-model/urcu.spin b/formal-model/urcu.spin
> > > index e5bfff3..611464b 100644
> > > --- a/formal-model/urcu.spin
> > > +++ b/formal-model/urcu.spin
> > > @@ -124,9 +124,13 @@ proctype urcu_reader()
> > >  				break;
> > >  			:: tmp < 4 && reader_progress[tmp] != 0 ->
> > >  				tmp = tmp + 1;
> > > -			:: tmp >= 4 ->
> > > +			:: tmp >= 4 &&
> > > +			   reader_progress[0] == reader_progress[3] ->
> > >  				done = 1;
> > >  				break;
> > > +			:: tmp >= 4 &&
> > > +			   reader_progress[0] != reader_progress[3] ->
> > > +			   	break;
> > >  			od;
> > >  			do
> > >  			:: tmp < 4 && reader_progress[tmp] == 0 ->
> > 
> > Content-Description: urcu_mbmin.spin
> > > /*
> > >  * urcu_mbmin.spin: Promela code to validate urcu.  See commit number
> > >  *	3a9e6e9df706b8d39af94d2f027210e2e7d4106e of Mathieu Desnoyer's
> > >  *      git archive at git://lttng.org/userspace-rcu.git, but with
> > >  *	memory barriers removed.
> > >  *
> > >  * This program is free software; you can redistribute it and/or modify
> > >  * it under the terms of the GNU General Public License as published by
> > >  * the Free Software Foundation; either version 2 of the License, or
> > >  * (at your option) any later version.
> > >  *
> > >  * This program is distributed in the hope that it will be useful,
> > >  * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > >  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > >  * GNU General Public License for more details.
> > >  *
> > >  * You should have received a copy of the GNU General Public License
> > >  * along with this program; if not, write to the Free Software
> > >  * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> > >  *
> > >  * Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
> > >  */
> > > 
> > > /* Promela validation variables. */
> > > 
> > > bit removed = 0;  /* Has RCU removal happened, e.g., list_del_rcu()? */
> > > bit free = 0;     /* Has RCU reclamation happened, e.g., kfree()? */
> > > bit need_mb = 0;  /* =1 says need reader mb, =0 for reader response. */
> > > byte reader_progress[4];
> > > 		  /* Count of read-side statement executions. */
> > > 
> > > /* urcu definitions and variables, taken straight from the algorithm. */
> > > 
> > > #define RCU_GP_CTR_BIT (1 << 7)
> > > #define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BIT - 1)
> > > 
> > > byte urcu_gp_ctr = 1;
> > > byte urcu_active_readers = 0;
> > > 
> > > /* Model the RCU read-side critical section. */
> > > 
> > > proctype urcu_reader()
> > > {
> > > 	bit done = 0;
> > > 	bit mbok;
> > > 	byte tmp;
> > > 	byte tmp_removed;
> > > 	byte tmp_free;
> > > 
> > > 	/* Absorb any early requests for memory barriers. */
> > > 	do
> > > 	:: need_mb == 1 ->
> > > 		need_mb = 0;
> > > 	:: 1 -> skip;
> > > 	:: 1 -> break;
> > > 	od;
> > > 
> > > 	/*
> > > 	 * Each pass through this loop executes one read-side statement
> > > 	 * from the following code fragment:
> > > 	 *
> > > 	 *	rcu_read_lock(); [0a]
> > > 	 *	rcu_read_lock(); [0b]
> > > 	 *	p = rcu_dereference(global_p); [1]
> > > 	 *	x = p->data; [2]
> > > 	 *	rcu_read_unlock(); [3b]
> > > 	 *	rcu_read_unlock(); [3a]
> > > 	 *
> > > 	 * Because we are modeling a weak-memory machine, these statements
> > > 	 * can be seen in any order, the only restriction being that
> > > 	 * rcu_read_unlock() cannot precede the corresponding rcu_read_lock().
> > > 	 * The placement of the inner rcu_read_lock() and rcu_read_unlock()
> > > 	 * is non-deterministic, the above is but one possible placement.
> > > 	 * Intestingly enough, this model validates all possible placements
> > > 	 * of the inner rcu_read_lock() and rcu_read_unlock() statements,
> > > 	 * with the only constraint being that the rcu_read_lock() must
> > > 	 * precede the rcu_read_unlock().
> > > 	 *
> > > 	 * We also respond to memory-barrier requests, but only if our
> > > 	 * execution happens to be ordered.  If the current state is
> > > 	 * misordered, we ignore memory-barrier requests.
> > > 	 */
> > > 	do
> > > 	:: 1 ->
> > > 		if
> > > 		:: reader_progress[0] < 2 -> /* [0a and 0b] */
> > > 			tmp = urcu_active_readers;
> > > 			if
> > > 			:: (tmp & RCU_GP_CTR_NEST_MASK) == 0 ->
> > > 				tmp = urcu_gp_ctr;
> > > 				do
> > > 				:: (reader_progress[1] +
> > > 				    reader_progress[2] +
> > > 				    reader_progress[3] == 0) && need_mb == 1 ->
> > > 					need_mb = 0;
> > > 				:: 1 -> skip;
> > > 				:: 1 -> break;
> > > 				od;
> > > 				urcu_active_readers = tmp;
> > > 			 :: else ->
> > > 				urcu_active_readers = tmp + 1;
> > > 			fi;
> > > 			reader_progress[0] = reader_progress[0] + 1;
> > > 		:: reader_progress[1] == 0 -> /* [1] */
> > > 			tmp_removed = removed;
> > > 			reader_progress[1] = 1;
> > > 		:: reader_progress[2] == 0 -> /* [2] */
> > > 			tmp_free = free;
> > > 			reader_progress[2] = 1;
> > > 		:: ((reader_progress[0] > reader_progress[3]) &&
> > > 		    (reader_progress[3] < 2)) -> /* [3a and 3b] */
> > > 			tmp = urcu_active_readers - 1;
> > > 			urcu_active_readers = tmp;
> > > 			reader_progress[3] = reader_progress[3] + 1;
> > > 		:: else -> break;
> > > 		fi;
> > > 
> > > 		/* Process memory-barrier requests, if it is safe to do so. */
> > > 		atomic {
> > > 			mbok = 0;
> > > 			tmp = 0;
> > > 			do
> > > 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> > > 				tmp = tmp + 1;
> > > 				break;
> > > 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> > > 				tmp = tmp + 1;
> > > 			:: tmp >= 4 &&
> > > 			   reader_progress[0] == reader_progress[3] ->
> > > 				done = 1;
> > > 				break;
> > > 			:: tmp >= 4 &&
> > > 			   reader_progress[0] != reader_progress[3] ->
> > > 			   	break;
> > > 			od;
> > > 			do
> > > 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> > > 				tmp = tmp + 1;
> > > 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> > > 				break;
> > > 			:: tmp >= 4 ->
> > > 				mbok = 1;
> > > 				break;
> > > 			od
> > > 
> > > 		}
> > > 
> > > 		if
> > > 		:: mbok == 1 ->
> > > 			/* We get here if mb processing is safe. */
> > > 			do
> > > 			:: need_mb == 1 ->
> > > 				need_mb = 0;
> > > 			:: 1 -> skip;
> > > 			:: 1 -> break;
> > > 			od;
> > > 		:: else -> skip;
> > > 		fi;
> > > 
> > > 		/*
> > > 		 * Check to see if we have modeled the entire RCU read-side
> > > 		 * critical section, and leave if so.
> > > 		 */
> > > 		if
> > > 		:: done == 1 -> break;
> > > 		:: else -> skip;
> > > 		fi
> > > 	od;
> > > 	assert((tmp_free == 0) || (tmp_removed == 1));
> > > 
> > > 	/* Process any late-arriving memory-barrier requests. */
> > > 	do
> > > 	:: need_mb == 1 ->
> > > 		need_mb = 0;
> > > 	:: 1 -> skip;
> > > 	:: 1 -> break;
> > > 	od;
> > > }
> > > 
> > > /* Model the RCU update process. */
> > > 
> > > proctype urcu_updater()
> > > {
> > > 	byte tmp;
> > > 
> > > 	/* prior synchronize_rcu(), second counter flip. */
> > > 	need_mb = 1; /* mb() A */
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od;
> > > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > > 	need_mb = 1; /* mb() B */
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od;
> > > 	do
> > > 	:: 1 ->
> > > 		if
> > > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > 			skip;
> > > 		:: else -> break;
> > > 		fi
> > > 	od;
> > > 	need_mb = 1; /* mb() C absolutely required by analogy with G */
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od;
> > > 
> > > 	/* Removal statement, e.g., list_del_rcu(). */
> > > 	removed = 1;
> > > 
> > > 	/* current synchronize_rcu(), first counter flip. */
> > > 	need_mb = 1; /* mb() D suggested */
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od;
> > > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > > 	need_mb = 1;  /* mb() E required if D not present */
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od;
> > > 
> > > 	/* current synchronize_rcu(), first-flip check plus second flip. */
> > > 	if
> > > 	:: 1 ->
> > > 		do
> > > 		:: 1 ->
> > > 			if
> > > 			:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > 			   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > 			   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > 				skip;
> > > 			:: else -> break;
> > > 			fi;
> > > 		od;
> > > 		urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > > 	:: 1 ->
> > > 		tmp = urcu_gp_ctr;
> > > 		urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > > 		do
> > > 		:: 1 ->
> > > 			if
> > > 			:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > 			   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > 			   (tmp & ~RCU_GP_CTR_NEST_MASK) ->
> > > 				skip;
> > > 			:: else -> break;
> > > 			fi;
> > > 		od;
> > > 	fi;
> > > 
> > > 	/* current synchronize_rcu(), second counter flip check. */
> > > 	need_mb = 1; /* mb() F not required */
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od;
> > > 	do
> > > 	:: 1 ->
> > > 		if
> > > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > 			skip;
> > > 		:: else -> break;
> > > 		fi;
> > > 	od;
> > > 	need_mb = 1; /* mb() G absolutely required */
> > > 	do
> > > 	:: need_mb == 1 -> skip;
> > > 	:: need_mb == 0 -> break;
> > > 	od;
> > > 
> > > 	/* free-up step, e.g., kfree(). */
> > > 	free = 1;
> > > }
> > > 
> > > /*
> > >  * Initialize the array, spawn a reader and an updater.  Because readers
> > >  * are independent of each other, only one reader is needed.
> > >  */
> > > 
> > > init {
> > > 	atomic {
> > > 		reader_progress[0] = 0;
> > > 		reader_progress[1] = 0;
> > > 		reader_progress[2] = 0;
> > > 		reader_progress[3] = 0;
> > > 		run urcu_reader();
> > > 		run urcu_updater();
> > > 	}
> > > }
> > 
> > 
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 20:17                                                       ` Paul E. McKenney
@ 2009-02-12 21:53                                                         ` Mathieu Desnoyers
  2009-02-12 23:04                                                           ` Paul E. McKenney
  0 siblings, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-12 21:53 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Thu, Feb 12, 2009 at 02:38:26PM -0500, Mathieu Desnoyers wrote:
> > Replying to a separate portion of the mail with less CC :
> > 
> > 
> > > On Thu, Feb 12, 2009 at 02:05:39AM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Wed, Feb 11, 2009 at 11:08:24PM -0500, Mathieu Desnoyers wrote:
> > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> > > > > > > > On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > 
> > > > > [ . . . ]
> > > > > 
> > > > > > > And I had bugs in my model that allowed the rcu_read_lock() model
> > > > > > > to nest indefinitely, which overflowed into the top bit, messing
> > > > > > > things up.  :-/
> > > > > > > 
> > > > > > > Attached is a fixed model.  This model validates correctly (woo-hoo!).
> > > > > > > Even better, gives the expected error if you comment out line 180 and
> > > > > > > uncomment line 213, this latter corresponding to the error case I called
> > > > > > > out a few days ago.
> > > > > > > 
> > > > > > 
> > > > > > Great ! :) I added this version to the git repository, hopefully it's ok
> > > > > > with you ?
> > > > > 
> > > > > Works for me!
> > > > > 
> > > > > > > I will play with removing models of mb...
> > > > > > 
> > > > > > OK, I see you already did..
> > > > > 
> > > > > I continued this, and surprisingly few are actually required, though
> > > > > I don't fully trust the modeling of removed memory barriers.
> > > > 
> > > > On my side I cleaned up the code a lot, and actually added some barriers
> > > > ;) Especially in the busy loops, where we expect the other thread's
> > > > value to change eventually between iterations. A smp_rmb() seems more
> > > > appropriate that barrier(). I also added a lot of comments about
> > > > barriers in the code, and made the reader side much easier to review.
> > > > 
> > > > Please feel free to comment on my added code comments.
> > > 
> > > The torture test now looks much more familiar.  ;-)
> > > 
> > > I fixed some compiler warnings (in my original, sad to say), added an
> > > ACCESS_ONCE() to rcu_read_lock() (also in my original),
> > 
> > Yes, I thought about this ACCESS_ONCE during my sleep.. just did not
> > have to to update the source yet. :)
> > 
> > Merged. Thanks !
> > 
> > [...]
> > 
> > > --- a/urcu.c
> > > +++ b/urcu.c
> > > @@ -99,7 +99,8 @@ static void force_mb_single_thread(pthread_t tid)
> > >  	 * BUSY-LOOP.
> > >  	 */
> > >  	while (sig_done < 1)
> > > -		smp_rmb();	/* ensure we re-read sig-done */
> > > +		barrier();	/* ensure compiler re-reads sig-done */
> > > +				/* cache coherence guarantees CPU re-read. */
> > 
> > That could be a smp_rmc() ? (see other mail)
> 
> I prefer making ACCESS_ONCE() actually having the full semantics implied
> by its name.  ;-)
> 
> See patch at end of this email.
> 

See my email about LOAD_REMOTE/STORE_REMOTE :)

> > >  	smp_mb();	/* read sig_done before ending the barrier */
> > >  }
> > >  
> > > @@ -113,7 +114,8 @@ static void force_mb_all_threads(void)
> > >  	if (!reader_data)
> > >  		return;
> > >  	sig_done = 0;
> > > -	smp_mb();	/* write sig_done before sending the signals */
> > > +	/* smp_mb();	write sig_done before sending the signals */
> > > +			/* redundant with barriers in pthread_kill(). */
> > 
> > Absolutely not. pthread_kill does not send a signal to self in every
> > case because the writer thread has not requirement to register itself.
> > It *could* be registered as a reader too, but does not have to.
> 
> No, not the barrier in the signal handler, but rather the barriers in
> the system call invoked by pthread_kill().
> 

The barrier implied by going through a system call does not imply cache
flushing AFAIK. So we would have to at least leave a big comment here
saying that the kernel has to provide such guarantee. So under that
comment I would leave a smp_mc();.

> > >  	for (index = reader_data; index < reader_data + num_readers; index++)
> > >  		pthread_kill(index->tid, SIGURCU);
> > >  	/*
> > > @@ -121,7 +123,8 @@ static void force_mb_all_threads(void)
> > >  	 * BUSY-LOOP.
> > >  	 */
> > >  	while (sig_done < num_readers)
> > > -		smp_rmb();	/* ensure we re-read sig-done */
> > > +		barrier();	/* ensure compiler re-reads sig-done */
> > > +				/* cache coherence guarantees CPU re-read. */
> > 
> > That could be a smp_rmc() ?
> 
> Again, prefer:
> 
> 	while (ACCESS_ONCE() < num_readers)
> 
> after upgrading ACCESS_ONCE() to provide the full semantics.
> 
> I will send a patch.
> 

I'll use a variation :

        while (LOAD_REMOTE(sig_done) < num_readers)
                cpu_relax();


> > >  	smp_mb();	/* read sig_done before ending the barrier */
> > >  }
> > >  #endif
> > > @@ -181,7 +184,8 @@ void synchronize_rcu(void)
> > >  	 * the writer waiting forever while new readers are always accessing
> > >  	 * data (no progress).
> > >  	 */
> > > -	smp_mb();
> > > +	/* smp_mb(); Don't need this one for CPU, only compiler. */
> > > +	barrier();
> > 
> > smp_mc() ?
> 
> ACCESS_ONCE().
> 

Ah, this is what I dislike about using :

  STORE_REMOTE(x, v);
...
  if (LOAD_REMOTE(y) ...)
rather than
  x = v;
  smp_mc();
  if (y ...)

We will end up in a situation where we do 2 cache flushes rather than a
single one. So wherever possible, I would be tempted to leave the
smp_mc().


> > >  
> > >  	switch_next_urcu_qparity();	/* 1 -> 0 */
> > >  
> > 
> > Side-note :
> > on archs without cache coherency, all smp_[rw ]mb would turn into a
> > cache flush.
> 
> So I might need more in my ACCESS_ONCE() below.
> 
> Add .gitignore files, and redefine accesses in terms of a new
> ACCESS_ONCE().
> 

I'll merge the .gitignore file, thanks,

Please see my updated git tree.

Mathieu

> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> ---
> 
>  .gitignore              |    9 +++++++++
>  formal-model/.gitignore |    3 +++
>  urcu.c                  |   10 ++++------
>  urcu.h                  |   12 ++++++++++++
>  4 files changed, 28 insertions(+), 6 deletions(-)
> 
> diff --git a/.gitignore b/.gitignore
> new file mode 100644
> index 0000000..29aa7e5
> --- /dev/null
> +++ b/.gitignore
> @@ -0,0 +1,9 @@
> +test_rwlock_timing
> +test_urcu
> +test_urcu_timing
> +test_urcu_yield
> +urcu-asm.o
> +urcu.o
> +urcutorture
> +urcutorture-yield
> +urcu-yield.o
> diff --git a/formal-model/.gitignore b/formal-model/.gitignore
> new file mode 100644
> index 0000000..49fdd8a
> --- /dev/null
> +++ b/formal-model/.gitignore
> @@ -0,0 +1,3 @@
> +pan
> +pan.*
> +urcu.spin.trail
> diff --git a/urcu.c b/urcu.c
> index a696439..f61d4c3 100644
> --- a/urcu.c
> +++ b/urcu.c
> @@ -98,9 +98,8 @@ static void force_mb_single_thread(pthread_t tid)
>  	 * Wait for sighandler (and thus mb()) to execute on every thread.
>  	 * BUSY-LOOP.
>  	 */
> -	while (sig_done < 1)
> -		barrier();	/* ensure compiler re-reads sig-done */
> -				/* cache coherence guarantees CPU re-read. */
> +	while (ACCESS_ONCE(sig_done) < 1)
> +		continue;
>  	smp_mb();	/* read sig_done before ending the barrier */
>  }
>  
> @@ -122,9 +121,8 @@ static void force_mb_all_threads(void)
>  	 * Wait for sighandler (and thus mb()) to execute on every thread.
>  	 * BUSY-LOOP.
>  	 */
> -	while (sig_done < num_readers)
> -		barrier();	/* ensure compiler re-reads sig-done */
> -				/* cache coherence guarantees CPU re-read. */
> +	while (ACCESS_ONCE(sig_done) < num_readers)
> +		continue;
>  	smp_mb();	/* read sig_done before ending the barrier */
>  }
>  #endif
> diff --git a/urcu.h b/urcu.h
> index 79d9464..dd040a5 100644
> --- a/urcu.h
> +++ b/urcu.h
> @@ -98,6 +98,9 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr,
>  /* Nop everywhere except on alpha. */
>  #define smp_read_barrier_depends()
>  
> +#define CONFIG_ARCH_CACHE_COHERENT
> +#define cpu_relax barrier
> +
>  /*
>   * Prevent the compiler from merging or refetching accesses.  The compiler
>   * is also forbidden from reordering successive instances of ACCESS_ONCE(),
> @@ -110,7 +113,16 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr,
>   * use is to mediate communication between process-level code and irq/NMI
>   * handlers, all running on the same CPU.
>   */
> +#ifdef CONFIG_ARCH_CACHE_COHERENT
>  #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
> +#else /* #ifdef CONFIG_ARCH_CACHE_COHERENT */
> +#define ACCESS_ONCE(x)     ({ \
> +				typeof(x) _________x1; \
> +				_________x1 = (*(volatile typeof(x) *)&(x)); \
> +				cpu_relax(); \
> +				(_________x1); \
> +				})
> +#endif /* #else #ifdef CONFIG_ARCH_CACHE_COHERENT */
>  
>  /**
>   * rcu_dereference - fetch an RCU-protected pointer in an
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 21:15                                                             ` Linus Torvalds
@ 2009-02-12 21:59                                                               ` Paul E. McKenney
  2009-02-13 13:50                                                                 ` Nick Piggin
  0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-12 21:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, ltt-dev, linux-kernel, Bryan Wu, uclinux-dist-devel

On Thu, Feb 12, 2009 at 01:15:08PM -0800, Linus Torvalds wrote:
> 
> 
> On Thu, 12 Feb 2009, Paul E. McKenney wrote:
> > 
> > In other words, you are arguing for using ACCESS_ONCE() in the loops,
> > but keeping the old ACCESS_ONCE() definition, and declaring BF hardware
> > broken?
> 
> Well, I _also_ argue that if you have a busy loop, you'd better have a 
> cpu_relax() in there somewhere anyway. If you don't, you have a bug.
> 
> So I think the BF approach is "borderline broken", but I think it should 
> work, if BF just has whatever appropriate cache flush in its cpu_relax.

OK, got it.  Keep ACCESS_ONCE() as is, make sure any busy-wait
loops contain a cpu_relax().  A given busy loop might or might not
need ACCESS_ONCE(), but that decision is independent of hardware
considerations.

Ah, and blackfin's cpu_relax() does seem to have migrated from barrier()
to smp_mb() recently, so sounds good to me!!!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 21:53                                                         ` Mathieu Desnoyers
@ 2009-02-12 23:04                                                           ` Paul E. McKenney
  2009-02-13 12:49                                                             ` Mathieu Desnoyers
  0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-12 23:04 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

On Thu, Feb 12, 2009 at 04:53:41PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Thu, Feb 12, 2009 at 02:38:26PM -0500, Mathieu Desnoyers wrote:
> > > Replying to a separate portion of the mail with less CC :
> > > 
> > > 
> > > > On Thu, Feb 12, 2009 at 02:05:39AM -0500, Mathieu Desnoyers wrote:
> > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > On Wed, Feb 11, 2009 at 11:08:24PM -0500, Mathieu Desnoyers wrote:
> > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > > On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> > > > > > > > > On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > 
> > > > > > [ . . . ]
> > > > > > 
> > > > > > > > And I had bugs in my model that allowed the rcu_read_lock() model
> > > > > > > > to nest indefinitely, which overflowed into the top bit, messing
> > > > > > > > things up.  :-/
> > > > > > > > 
> > > > > > > > Attached is a fixed model.  This model validates correctly (woo-hoo!).
> > > > > > > > Even better, gives the expected error if you comment out line 180 and
> > > > > > > > uncomment line 213, this latter corresponding to the error case I called
> > > > > > > > out a few days ago.
> > > > > > > > 
> > > > > > > 
> > > > > > > Great ! :) I added this version to the git repository, hopefully it's ok
> > > > > > > with you ?
> > > > > > 
> > > > > > Works for me!
> > > > > > 
> > > > > > > > I will play with removing models of mb...
> > > > > > > 
> > > > > > > OK, I see you already did..
> > > > > > 
> > > > > > I continued this, and surprisingly few are actually required, though
> > > > > > I don't fully trust the modeling of removed memory barriers.
> > > > > 
> > > > > On my side I cleaned up the code a lot, and actually added some barriers
> > > > > ;) Especially in the busy loops, where we expect the other thread's
> > > > > value to change eventually between iterations. A smp_rmb() seems more
> > > > > appropriate that barrier(). I also added a lot of comments about
> > > > > barriers in the code, and made the reader side much easier to review.
> > > > > 
> > > > > Please feel free to comment on my added code comments.
> > > > 
> > > > The torture test now looks much more familiar.  ;-)
> > > > 
> > > > I fixed some compiler warnings (in my original, sad to say), added an
> > > > ACCESS_ONCE() to rcu_read_lock() (also in my original),
> > > 
> > > Yes, I thought about this ACCESS_ONCE during my sleep.. just did not
> > > have to to update the source yet. :)
> > > 
> > > Merged. Thanks !
> > > 
> > > [...]
> > > 
> > > > --- a/urcu.c
> > > > +++ b/urcu.c
> > > > @@ -99,7 +99,8 @@ static void force_mb_single_thread(pthread_t tid)
> > > >  	 * BUSY-LOOP.
> > > >  	 */
> > > >  	while (sig_done < 1)
> > > > -		smp_rmb();	/* ensure we re-read sig-done */
> > > > +		barrier();	/* ensure compiler re-reads sig-done */
> > > > +				/* cache coherence guarantees CPU re-read. */
> > > 
> > > That could be a smp_rmc() ? (see other mail)
> > 
> > I prefer making ACCESS_ONCE() actually having the full semantics implied
> > by its name.  ;-)
> > 
> > See patch at end of this email.
> > 
> 
> See my email about LOAD_REMOTE/STORE_REMOTE :)
> 
> > > >  	smp_mb();	/* read sig_done before ending the barrier */
> > > >  }
> > > >  
> > > > @@ -113,7 +114,8 @@ static void force_mb_all_threads(void)
> > > >  	if (!reader_data)
> > > >  		return;
> > > >  	sig_done = 0;
> > > > -	smp_mb();	/* write sig_done before sending the signals */
> > > > +	/* smp_mb();	write sig_done before sending the signals */
> > > > +			/* redundant with barriers in pthread_kill(). */
> > > 
> > > Absolutely not. pthread_kill does not send a signal to self in every
> > > case because the writer thread has not requirement to register itself.
> > > It *could* be registered as a reader too, but does not have to.
> > 
> > No, not the barrier in the signal handler, but rather the barriers in
> > the system call invoked by pthread_kill().
> > 
> 
> The barrier implied by going through a system call does not imply cache
> flushing AFAIK. So we would have to at least leave a big comment here
> saying that the kernel has to provide such guarantee. So under that
> comment I would leave a smp_mc();.
> 
> > > >  	for (index = reader_data; index < reader_data + num_readers; index++)
> > > >  		pthread_kill(index->tid, SIGURCU);
> > > >  	/*
> > > > @@ -121,7 +123,8 @@ static void force_mb_all_threads(void)
> > > >  	 * BUSY-LOOP.
> > > >  	 */
> > > >  	while (sig_done < num_readers)
> > > > -		smp_rmb();	/* ensure we re-read sig-done */
> > > > +		barrier();	/* ensure compiler re-reads sig-done */
> > > > +				/* cache coherence guarantees CPU re-read. */
> > > 
> > > That could be a smp_rmc() ?
> > 
> > Again, prefer:
> > 
> > 	while (ACCESS_ONCE() < num_readers)
> > 
> > after upgrading ACCESS_ONCE() to provide the full semantics.
> > 
> > I will send a patch.
> 
> I'll use a variation :
> 
>         while (LOAD_REMOTE(sig_done) < num_readers)
>                 cpu_relax();

I suspect that LOAD_SHARED() and STORE_SHARED() would be more intuitive.
But shouldn't we align with the Linux-kernel usage where reasonable?
(Yes, this can be a moving target, but there isn't much else that
currently supports this level of SMP function on quite the variety of
CPU architectures.)

> > > >  	smp_mb();	/* read sig_done before ending the barrier */
> > > >  }
> > > >  #endif
> > > > @@ -181,7 +184,8 @@ void synchronize_rcu(void)
> > > >  	 * the writer waiting forever while new readers are always accessing
> > > >  	 * data (no progress).
> > > >  	 */
> > > > -	smp_mb();
> > > > +	/* smp_mb(); Don't need this one for CPU, only compiler. */
> > > > +	barrier();
> > > 
> > > smp_mc() ?
> > 
> > ACCESS_ONCE().
> > 
> 
> Ah, this is what I dislike about using :
> 
>   STORE_REMOTE(x, v);
> ...
>   if (LOAD_REMOTE(y) ...)
> rather than
>   x = v;
>   smp_mc();
>   if (y ...)
> 
> We will end up in a situation where we do 2 cache flushes rather than a
> single one. So wherever possible, I would be tempted to leave the
> smp_mc().

Ummm...  There is a very real reason why I moved from bare
smp_read_barrier_depends() calls to rcu_dereference().  Code with an
rcu_dereference() style is -much- easier to read.

So I would flip that -- use the per-variable API unless you see
measureable system-level pain.  Because the variable-free API will
inflict very real readability pain!

The problem is that the relationship of the variable-free API to the
variables it is supposed to constrain gets lost.  With the per-variable
APIs, the relationship is obvious and explicit.

> > > >  
> > > >  	switch_next_urcu_qparity();	/* 1 -> 0 */
> > > >  
> > > 
> > > Side-note :
> > > on archs without cache coherency, all smp_[rw ]mb would turn into a
> > > cache flush.
> > 
> > So I might need more in my ACCESS_ONCE() below.
> > 
> > Add .gitignore files, and redefine accesses in terms of a new
> > ACCESS_ONCE().
> 
> I'll merge the .gitignore file, thanks,

Sounds good!

> Please see my updated git tree.

Will do!

							Thanx, Paul

> Mathieu
> 
> > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > ---
> > 
> >  .gitignore              |    9 +++++++++
> >  formal-model/.gitignore |    3 +++
> >  urcu.c                  |   10 ++++------
> >  urcu.h                  |   12 ++++++++++++
> >  4 files changed, 28 insertions(+), 6 deletions(-)
> > 
> > diff --git a/.gitignore b/.gitignore
> > new file mode 100644
> > index 0000000..29aa7e5
> > --- /dev/null
> > +++ b/.gitignore
> > @@ -0,0 +1,9 @@
> > +test_rwlock_timing
> > +test_urcu
> > +test_urcu_timing
> > +test_urcu_yield
> > +urcu-asm.o
> > +urcu.o
> > +urcutorture
> > +urcutorture-yield
> > +urcu-yield.o
> > diff --git a/formal-model/.gitignore b/formal-model/.gitignore
> > new file mode 100644
> > index 0000000..49fdd8a
> > --- /dev/null
> > +++ b/formal-model/.gitignore
> > @@ -0,0 +1,3 @@
> > +pan
> > +pan.*
> > +urcu.spin.trail
> > diff --git a/urcu.c b/urcu.c
> > index a696439..f61d4c3 100644
> > --- a/urcu.c
> > +++ b/urcu.c
> > @@ -98,9 +98,8 @@ static void force_mb_single_thread(pthread_t tid)
> >  	 * Wait for sighandler (and thus mb()) to execute on every thread.
> >  	 * BUSY-LOOP.
> >  	 */
> > -	while (sig_done < 1)
> > -		barrier();	/* ensure compiler re-reads sig-done */
> > -				/* cache coherence guarantees CPU re-read. */
> > +	while (ACCESS_ONCE(sig_done) < 1)
> > +		continue;
> >  	smp_mb();	/* read sig_done before ending the barrier */
> >  }
> >  
> > @@ -122,9 +121,8 @@ static void force_mb_all_threads(void)
> >  	 * Wait for sighandler (and thus mb()) to execute on every thread.
> >  	 * BUSY-LOOP.
> >  	 */
> > -	while (sig_done < num_readers)
> > -		barrier();	/* ensure compiler re-reads sig-done */
> > -				/* cache coherence guarantees CPU re-read. */
> > +	while (ACCESS_ONCE(sig_done) < num_readers)
> > +		continue;
> >  	smp_mb();	/* read sig_done before ending the barrier */
> >  }
> >  #endif
> > diff --git a/urcu.h b/urcu.h
> > index 79d9464..dd040a5 100644
> > --- a/urcu.h
> > +++ b/urcu.h
> > @@ -98,6 +98,9 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr,
> >  /* Nop everywhere except on alpha. */
> >  #define smp_read_barrier_depends()
> >  
> > +#define CONFIG_ARCH_CACHE_COHERENT
> > +#define cpu_relax barrier
> > +
> >  /*
> >   * Prevent the compiler from merging or refetching accesses.  The compiler
> >   * is also forbidden from reordering successive instances of ACCESS_ONCE(),
> > @@ -110,7 +113,16 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr,
> >   * use is to mediate communication between process-level code and irq/NMI
> >   * handlers, all running on the same CPU.
> >   */
> > +#ifdef CONFIG_ARCH_CACHE_COHERENT
> >  #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
> > +#else /* #ifdef CONFIG_ARCH_CACHE_COHERENT */
> > +#define ACCESS_ONCE(x)     ({ \
> > +				typeof(x) _________x1; \
> > +				_________x1 = (*(volatile typeof(x) *)&(x)); \
> > +				cpu_relax(); \
> > +				(_________x1); \
> > +				})
> > +#endif /* #else #ifdef CONFIG_ARCH_CACHE_COHERENT */
> >  
> >  /**
> >   * rcu_dereference - fetch an RCU-protected pointer in an
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 21:27                                                           ` Mathieu Desnoyers
@ 2009-02-12 23:26                                                             ` Paul E. McKenney
  2009-02-13 13:12                                                               ` Mathieu Desnoyers
  0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-12 23:26 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: ltt-dev, linux-kernel

On Thu, Feb 12, 2009 at 04:27:12PM -0500, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Thu, Feb 12, 2009 at 01:40:30PM -0500, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Thu, Feb 12, 2009 at 12:47:07AM -0500, Mathieu Desnoyers wrote:
> > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > On Wed, Feb 11, 2009 at 11:10:44PM -0500, Mathieu Desnoyers wrote:
> > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > > On Wed, Feb 11, 2009 at 06:33:08PM -0800, Paul E. McKenney wrote:
> > > > > > > > > On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> > > > > > > > > > On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > > 
> > > > > > > > [ . . . ]
> > > > > > > > 
> > > > > > > > > > > > (BTW, I do not trust my model yet, as it currently cannot detect the
> > > > > > > > > > > > failure case I pointed out earlier.  :-/  Here and I thought that the
> > > > > > > > > > > > point of such models was to detect additional failure cases!!!)
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Yes, I'll have to dig deeper into it.
> > > > > > > > > > 
> > > > > > > > > > Well, as I said, I attached the current model and the error trail.
> > > > > > > > > 
> > > > > > > > > And I had bugs in my model that allowed the rcu_read_lock() model
> > > > > > > > > to nest indefinitely, which overflowed into the top bit, messing
> > > > > > > > > things up.  :-/
> > > > > > > > > 
> > > > > > > > > Attached is a fixed model.  This model validates correctly (woo-hoo!).
> > > > > > > > > Even better, gives the expected error if you comment out line 180 and
> > > > > > > > > uncomment line 213, this latter corresponding to the error case I called
> > > > > > > > > out a few days ago.
> > > > > > > > > 
> > > > > > > > > I will play with removing models of mb...
> > > > > > > > 
> > > > > > > > And commenting out the models of mb between the counter flips and the
> > > > > > > > test for readers still passes validation, as expected, and as shown in
> > > > > > > > the attached Promela code.
> > > > > > > > 
> > > > > > > 
> > > > > > > Hrm, in the email I sent you about the memory barrier, I said that it
> > > > > > > would not make the algorithm incorrect, but that it would cause
> > > > > > > situations where it would be impossible for the writer to do any
> > > > > > > progress as long as there are readers active. I think we would have to
> > > > > > > enhance the model or at least express this through some LTL statement to
> > > > > > > validate this specific behavior.
> > > > > > 
> > > > > > But if the writer fails to make progress, then the counter remains at a
> > > > > > given value, which causes readers to drain, which allows the writer to
> > > > > > eventually make progress again.  Right?
> > > > > > 
> > > > > 
> > > > > Not necessarily. If we don't have the proper memory barriers, we can
> > > > > have the writer waiting on, say, parity 0 *before* it has performed the
> > > > > parity switch. Therefore, even newly coming readers will add up to
> > > > > parity 0.
> > > > 
> > > > But the write that changes the parity will eventually make it out.
> > > > OK, so your argument is that we at least need a compiler barrier?
> > > 
> > > It all depends on the assumptions we make. I am currently trying to
> > > assume the most aggressive memory ordering I can think of. The model I
> > > think about to represent it is that memory reads/writes are kept local
> > > to the CPU until a memory barrier is encountered. I doubt it exists in
> > > practice, bacause the CPU will eventually have to commit the information
> > > to memory (hrm, are sure about this ?), but if we use that as a starting
> > > point, I think this would cover the entire spectrum of possible memory
> > > barriers issues. Also, it would be easy to verify formally. But maybe am
> > > I going too far ?
> > 
> > I believe that you are going a bit too far.  After all, if you make that
> > assumption, the CPU could just never make anything visible.  After all,
> > the memory barrier doesn't say "make the previous stuff visible now",
> > it instead says "if you make anything after the barrier visible to a
> > given other CPU, then you must also make everything before the barrier
> > visible to that CPU".
> > 
> > > > Regardless, please see attached for a modified version of the Promela
> > > > model that fully models omitting out the memory barrier that my
> > > > rcu_nest32.[hc] implementation omits.  (It is possible to partially
> > > > model removal of other memory barriers via #if 0, but to fully model
> > > > would need to enumerate the permutations as shown on lines 231-257.)
> > > > 
> > > > > In your model, this is not detected, because eventually all readers will
> > > > > execute, and only then the writer will be able to update the data. But
> > > > > in reality, if we run a very busy 4096-cores machines where there is
> > > > > always at least one reader active, the the writer will be stuck forever,
> > > > > and that's really bad.
> > > > 
> > > > Assuming that the reordering is done by the CPU, the write will
> > > > eventually get out -- it is stuck in (say) the store buffer, and the
> > > > cache line will eventually arrive, and then the value will eventually
> > > > be seen by the readers.
> > > 
> > > Do we have guarantees that the data *will necessarily* get out of the
> > > cpu write buffer at some point ?
> > 
> > It has to, given a finite CPU write buffer, interrupts, and the like.
> > The actual CPU designs interact with a cache-coherence protocol, so
> > the stuff lives in the store buffer only for as long as it takes for
> > the corresponding cache line to be owned by this CPU.
> > 
> > > > We might need a -compiler- barrier, but then again, I am not sure that
> > > > we are talking about the same memory barrier -- again, please see
> > > > attached lines 231-257 to see which one that I eliminated.
> > > 
> > > As long as we don't have "progress" validation to check our model, the
> > > fact that it passes the current test does not tell much.
> > 
> > Without agreeing or disagreeing with this statement for the moment,
> > would you be willing to tell me whether or not the memory barrier
> > eliminated by lines 231-257 of the model was the one that you were
> > talking about?  ;-)
> > 
> 
> So we are taking about :
> 
> /* current synchronize_rcu(), first-flip check plus second flip. */
> 
> which does not have any memory barrier anymore. This corresponds to my
> current :
> 
>        /*
>          * Wait for previous parity to be empty of readers.
>          */
>         wait_for_quiescent_state();     /* Wait readers in parity 0 */
> 
>         /*
>          * Must finish waiting for quiescent state for parity 0 before
>          * committing qparity update to memory. Failure to do so could result in
>          * the writer waiting forever while new readers are always accessing
>          * data (no progress).
>          */
>         smp_mc();
> 
>         switch_next_urcu_qparity();     /* 1 -> 0 */
> 
> So the memory barrier is not needed, but a compiler barrier is needed on
> arch with cache coherency, and a cache flush is needed on architectures
> without cache coherency.
> 
> BTW, I think all the three smp_mb() that were in this function can be
> turned into smp_mc().

Verifying this requires merging more code into the interleaving -- it
is necessary to model all permutations of the statements.  Even that
isn't always quite right, as Promela treats each statement as atomic.
(I might be able to pull a trick like I did on the read side, but the
data dependencies are a bit uglier on the update side.)

That said, I did do a crude check by #if-ing out the individual barriers
on the update side.  This is semi-plausible, because the read side is
primarily unordered.  The results are that the final memory barrier
(just before exiting synchronize_rcu()) is absolutely required, as is
at least one of the first two memory barriers.

But I don't trust this analysis -- it is an approximation to an
approximation, which is not what you want for this sort of job.

> Therefore, if we assume memory coherency, only barrier()s would be
> needed between the switch/q.s. wait/switch/q.s. wait.

I must admit that the need to assume that some platforms fail to
implement cache coherence comes as a bit of a nasty shock...

							Thanx, Paul

> Mathieu
> 
> 
> > I might consider eventually adding progress validation to the model,
> > but am currently a bit overdosed on Promela...
> > 
> > > > Also, the original model I sent out has a minor bug that prevents it
> > > > from fully modeling the nested-read-side case.  The patch below fixes this.
> > > 
> > > Ok, merging the fix, thanks,
> > 
> > Thank you!
> > 
> > 							Thanx, Paul
> > 
> > > Mathieu
> > > 
> > > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > ---
> > > > 
> > > >  urcu.spin |    6 +++++-
> > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/formal-model/urcu.spin b/formal-model/urcu.spin
> > > > index e5bfff3..611464b 100644
> > > > --- a/formal-model/urcu.spin
> > > > +++ b/formal-model/urcu.spin
> > > > @@ -124,9 +124,13 @@ proctype urcu_reader()
> > > >  				break;
> > > >  			:: tmp < 4 && reader_progress[tmp] != 0 ->
> > > >  				tmp = tmp + 1;
> > > > -			:: tmp >= 4 ->
> > > > +			:: tmp >= 4 &&
> > > > +			   reader_progress[0] == reader_progress[3] ->
> > > >  				done = 1;
> > > >  				break;
> > > > +			:: tmp >= 4 &&
> > > > +			   reader_progress[0] != reader_progress[3] ->
> > > > +			   	break;
> > > >  			od;
> > > >  			do
> > > >  			:: tmp < 4 && reader_progress[tmp] == 0 ->
> > > 
> > > Content-Description: urcu_mbmin.spin
> > > > /*
> > > >  * urcu_mbmin.spin: Promela code to validate urcu.  See commit number
> > > >  *	3a9e6e9df706b8d39af94d2f027210e2e7d4106e of Mathieu Desnoyer's
> > > >  *      git archive at git://lttng.org/userspace-rcu.git, but with
> > > >  *	memory barriers removed.
> > > >  *
> > > >  * This program is free software; you can redistribute it and/or modify
> > > >  * it under the terms of the GNU General Public License as published by
> > > >  * the Free Software Foundation; either version 2 of the License, or
> > > >  * (at your option) any later version.
> > > >  *
> > > >  * This program is distributed in the hope that it will be useful,
> > > >  * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > >  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > >  * GNU General Public License for more details.
> > > >  *
> > > >  * You should have received a copy of the GNU General Public License
> > > >  * along with this program; if not, write to the Free Software
> > > >  * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> > > >  *
> > > >  * Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
> > > >  */
> > > > 
> > > > /* Promela validation variables. */
> > > > 
> > > > bit removed = 0;  /* Has RCU removal happened, e.g., list_del_rcu()? */
> > > > bit free = 0;     /* Has RCU reclamation happened, e.g., kfree()? */
> > > > bit need_mb = 0;  /* =1 says need reader mb, =0 for reader response. */
> > > > byte reader_progress[4];
> > > > 		  /* Count of read-side statement executions. */
> > > > 
> > > > /* urcu definitions and variables, taken straight from the algorithm. */
> > > > 
> > > > #define RCU_GP_CTR_BIT (1 << 7)
> > > > #define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BIT - 1)
> > > > 
> > > > byte urcu_gp_ctr = 1;
> > > > byte urcu_active_readers = 0;
> > > > 
> > > > /* Model the RCU read-side critical section. */
> > > > 
> > > > proctype urcu_reader()
> > > > {
> > > > 	bit done = 0;
> > > > 	bit mbok;
> > > > 	byte tmp;
> > > > 	byte tmp_removed;
> > > > 	byte tmp_free;
> > > > 
> > > > 	/* Absorb any early requests for memory barriers. */
> > > > 	do
> > > > 	:: need_mb == 1 ->
> > > > 		need_mb = 0;
> > > > 	:: 1 -> skip;
> > > > 	:: 1 -> break;
> > > > 	od;
> > > > 
> > > > 	/*
> > > > 	 * Each pass through this loop executes one read-side statement
> > > > 	 * from the following code fragment:
> > > > 	 *
> > > > 	 *	rcu_read_lock(); [0a]
> > > > 	 *	rcu_read_lock(); [0b]
> > > > 	 *	p = rcu_dereference(global_p); [1]
> > > > 	 *	x = p->data; [2]
> > > > 	 *	rcu_read_unlock(); [3b]
> > > > 	 *	rcu_read_unlock(); [3a]
> > > > 	 *
> > > > 	 * Because we are modeling a weak-memory machine, these statements
> > > > 	 * can be seen in any order, the only restriction being that
> > > > 	 * rcu_read_unlock() cannot precede the corresponding rcu_read_lock().
> > > > 	 * The placement of the inner rcu_read_lock() and rcu_read_unlock()
> > > > 	 * is non-deterministic, the above is but one possible placement.
> > > > 	 * Intestingly enough, this model validates all possible placements
> > > > 	 * of the inner rcu_read_lock() and rcu_read_unlock() statements,
> > > > 	 * with the only constraint being that the rcu_read_lock() must
> > > > 	 * precede the rcu_read_unlock().
> > > > 	 *
> > > > 	 * We also respond to memory-barrier requests, but only if our
> > > > 	 * execution happens to be ordered.  If the current state is
> > > > 	 * misordered, we ignore memory-barrier requests.
> > > > 	 */
> > > > 	do
> > > > 	:: 1 ->
> > > > 		if
> > > > 		:: reader_progress[0] < 2 -> /* [0a and 0b] */
> > > > 			tmp = urcu_active_readers;
> > > > 			if
> > > > 			:: (tmp & RCU_GP_CTR_NEST_MASK) == 0 ->
> > > > 				tmp = urcu_gp_ctr;
> > > > 				do
> > > > 				:: (reader_progress[1] +
> > > > 				    reader_progress[2] +
> > > > 				    reader_progress[3] == 0) && need_mb == 1 ->
> > > > 					need_mb = 0;
> > > > 				:: 1 -> skip;
> > > > 				:: 1 -> break;
> > > > 				od;
> > > > 				urcu_active_readers = tmp;
> > > > 			 :: else ->
> > > > 				urcu_active_readers = tmp + 1;
> > > > 			fi;
> > > > 			reader_progress[0] = reader_progress[0] + 1;
> > > > 		:: reader_progress[1] == 0 -> /* [1] */
> > > > 			tmp_removed = removed;
> > > > 			reader_progress[1] = 1;
> > > > 		:: reader_progress[2] == 0 -> /* [2] */
> > > > 			tmp_free = free;
> > > > 			reader_progress[2] = 1;
> > > > 		:: ((reader_progress[0] > reader_progress[3]) &&
> > > > 		    (reader_progress[3] < 2)) -> /* [3a and 3b] */
> > > > 			tmp = urcu_active_readers - 1;
> > > > 			urcu_active_readers = tmp;
> > > > 			reader_progress[3] = reader_progress[3] + 1;
> > > > 		:: else -> break;
> > > > 		fi;
> > > > 
> > > > 		/* Process memory-barrier requests, if it is safe to do so. */
> > > > 		atomic {
> > > > 			mbok = 0;
> > > > 			tmp = 0;
> > > > 			do
> > > > 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> > > > 				tmp = tmp + 1;
> > > > 				break;
> > > > 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> > > > 				tmp = tmp + 1;
> > > > 			:: tmp >= 4 &&
> > > > 			   reader_progress[0] == reader_progress[3] ->
> > > > 				done = 1;
> > > > 				break;
> > > > 			:: tmp >= 4 &&
> > > > 			   reader_progress[0] != reader_progress[3] ->
> > > > 			   	break;
> > > > 			od;
> > > > 			do
> > > > 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> > > > 				tmp = tmp + 1;
> > > > 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> > > > 				break;
> > > > 			:: tmp >= 4 ->
> > > > 				mbok = 1;
> > > > 				break;
> > > > 			od
> > > > 
> > > > 		}
> > > > 
> > > > 		if
> > > > 		:: mbok == 1 ->
> > > > 			/* We get here if mb processing is safe. */
> > > > 			do
> > > > 			:: need_mb == 1 ->
> > > > 				need_mb = 0;
> > > > 			:: 1 -> skip;
> > > > 			:: 1 -> break;
> > > > 			od;
> > > > 		:: else -> skip;
> > > > 		fi;
> > > > 
> > > > 		/*
> > > > 		 * Check to see if we have modeled the entire RCU read-side
> > > > 		 * critical section, and leave if so.
> > > > 		 */
> > > > 		if
> > > > 		:: done == 1 -> break;
> > > > 		:: else -> skip;
> > > > 		fi
> > > > 	od;
> > > > 	assert((tmp_free == 0) || (tmp_removed == 1));
> > > > 
> > > > 	/* Process any late-arriving memory-barrier requests. */
> > > > 	do
> > > > 	:: need_mb == 1 ->
> > > > 		need_mb = 0;
> > > > 	:: 1 -> skip;
> > > > 	:: 1 -> break;
> > > > 	od;
> > > > }
> > > > 
> > > > /* Model the RCU update process. */
> > > > 
> > > > proctype urcu_updater()
> > > > {
> > > > 	byte tmp;
> > > > 
> > > > 	/* prior synchronize_rcu(), second counter flip. */
> > > > 	need_mb = 1; /* mb() A */
> > > > 	do
> > > > 	:: need_mb == 1 -> skip;
> > > > 	:: need_mb == 0 -> break;
> > > > 	od;
> > > > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > > > 	need_mb = 1; /* mb() B */
> > > > 	do
> > > > 	:: need_mb == 1 -> skip;
> > > > 	:: need_mb == 0 -> break;
> > > > 	od;
> > > > 	do
> > > > 	:: 1 ->
> > > > 		if
> > > > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > > 			skip;
> > > > 		:: else -> break;
> > > > 		fi
> > > > 	od;
> > > > 	need_mb = 1; /* mb() C absolutely required by analogy with G */
> > > > 	do
> > > > 	:: need_mb == 1 -> skip;
> > > > 	:: need_mb == 0 -> break;
> > > > 	od;
> > > > 
> > > > 	/* Removal statement, e.g., list_del_rcu(). */
> > > > 	removed = 1;
> > > > 
> > > > 	/* current synchronize_rcu(), first counter flip. */
> > > > 	need_mb = 1; /* mb() D suggested */
> > > > 	do
> > > > 	:: need_mb == 1 -> skip;
> > > > 	:: need_mb == 0 -> break;
> > > > 	od;
> > > > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > > > 	need_mb = 1;  /* mb() E required if D not present */
> > > > 	do
> > > > 	:: need_mb == 1 -> skip;
> > > > 	:: need_mb == 0 -> break;
> > > > 	od;
> > > > 
> > > > 	/* current synchronize_rcu(), first-flip check plus second flip. */
> > > > 	if
> > > > 	:: 1 ->
> > > > 		do
> > > > 		:: 1 ->
> > > > 			if
> > > > 			:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > > 			   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > > 			   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > > 				skip;
> > > > 			:: else -> break;
> > > > 			fi;
> > > > 		od;
> > > > 		urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > > > 	:: 1 ->
> > > > 		tmp = urcu_gp_ctr;
> > > > 		urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > > > 		do
> > > > 		:: 1 ->
> > > > 			if
> > > > 			:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > > 			   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > > 			   (tmp & ~RCU_GP_CTR_NEST_MASK) ->
> > > > 				skip;
> > > > 			:: else -> break;
> > > > 			fi;
> > > > 		od;
> > > > 	fi;
> > > > 
> > > > 	/* current synchronize_rcu(), second counter flip check. */
> > > > 	need_mb = 1; /* mb() F not required */
> > > > 	do
> > > > 	:: need_mb == 1 -> skip;
> > > > 	:: need_mb == 0 -> break;
> > > > 	od;
> > > > 	do
> > > > 	:: 1 ->
> > > > 		if
> > > > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > > 			skip;
> > > > 		:: else -> break;
> > > > 		fi;
> > > > 	od;
> > > > 	need_mb = 1; /* mb() G absolutely required */
> > > > 	do
> > > > 	:: need_mb == 1 -> skip;
> > > > 	:: need_mb == 0 -> break;
> > > > 	od;
> > > > 
> > > > 	/* free-up step, e.g., kfree(). */
> > > > 	free = 1;
> > > > }
> > > > 
> > > > /*
> > > >  * Initialize the array, spawn a reader and an updater.  Because readers
> > > >  * are independent of each other, only one reader is needed.
> > > >  */
> > > > 
> > > > init {
> > > > 	atomic {
> > > > 		reader_progress[0] = 0;
> > > > 		reader_progress[1] = 0;
> > > > 		reader_progress[2] = 0;
> > > > 		reader_progress[3] = 0;
> > > > 		run urcu_reader();
> > > > 		run urcu_updater();
> > > > 	}
> > > > }
> > > 
> > > 
> > > 
> > > -- 
> > > Mathieu Desnoyers
> > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > 
> > _______________________________________________
> > ltt-dev mailing list
> > ltt-dev@lists.casi.polymtl.ca
> > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 23:04                                                           ` Paul E. McKenney
@ 2009-02-13 12:49                                                             ` Mathieu Desnoyers
  0 siblings, 0 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-13 12:49 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Thu, Feb 12, 2009 at 04:53:41PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Thu, Feb 12, 2009 at 02:38:26PM -0500, Mathieu Desnoyers wrote:
> > > > Replying to a separate portion of the mail with less CC :
> > > > 
> > > > 
> > > > > On Thu, Feb 12, 2009 at 02:05:39AM -0500, Mathieu Desnoyers wrote:
> > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > On Wed, Feb 11, 2009 at 11:08:24PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > > > On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> > > > > > > > > > On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > 
> > > > > > > [ . . . ]
> > > > > > > 
> > > > > > > > > And I had bugs in my model that allowed the rcu_read_lock() model
> > > > > > > > > to nest indefinitely, which overflowed into the top bit, messing
> > > > > > > > > things up.  :-/
> > > > > > > > > 
> > > > > > > > > Attached is a fixed model.  This model validates correctly (woo-hoo!).
> > > > > > > > > Even better, gives the expected error if you comment out line 180 and
> > > > > > > > > uncomment line 213, this latter corresponding to the error case I called
> > > > > > > > > out a few days ago.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Great ! :) I added this version to the git repository, hopefully it's ok
> > > > > > > > with you ?
> > > > > > > 
> > > > > > > Works for me!
> > > > > > > 
> > > > > > > > > I will play with removing models of mb...
> > > > > > > > 
> > > > > > > > OK, I see you already did..
> > > > > > > 
> > > > > > > I continued this, and surprisingly few are actually required, though
> > > > > > > I don't fully trust the modeling of removed memory barriers.
> > > > > > 
> > > > > > On my side I cleaned up the code a lot, and actually added some barriers
> > > > > > ;) Especially in the busy loops, where we expect the other thread's
> > > > > > value to change eventually between iterations. A smp_rmb() seems more
> > > > > > appropriate that barrier(). I also added a lot of comments about
> > > > > > barriers in the code, and made the reader side much easier to review.
> > > > > > 
> > > > > > Please feel free to comment on my added code comments.
> > > > > 
> > > > > The torture test now looks much more familiar.  ;-)
> > > > > 
> > > > > I fixed some compiler warnings (in my original, sad to say), added an
> > > > > ACCESS_ONCE() to rcu_read_lock() (also in my original),
> > > > 
> > > > Yes, I thought about this ACCESS_ONCE during my sleep.. just did not
> > > > have to to update the source yet. :)
> > > > 
> > > > Merged. Thanks !
> > > > 
> > > > [...]
> > > > 
> > > > > --- a/urcu.c
> > > > > +++ b/urcu.c
> > > > > @@ -99,7 +99,8 @@ static void force_mb_single_thread(pthread_t tid)
> > > > >  	 * BUSY-LOOP.
> > > > >  	 */
> > > > >  	while (sig_done < 1)
> > > > > -		smp_rmb();	/* ensure we re-read sig-done */
> > > > > +		barrier();	/* ensure compiler re-reads sig-done */
> > > > > +				/* cache coherence guarantees CPU re-read. */
> > > > 
> > > > That could be a smp_rmc() ? (see other mail)
> > > 
> > > I prefer making ACCESS_ONCE() actually having the full semantics implied
> > > by its name.  ;-)
> > > 
> > > See patch at end of this email.
> > > 
> > 
> > See my email about LOAD_REMOTE/STORE_REMOTE :)
> > 
> > > > >  	smp_mb();	/* read sig_done before ending the barrier */
> > > > >  }
> > > > >  
> > > > > @@ -113,7 +114,8 @@ static void force_mb_all_threads(void)
> > > > >  	if (!reader_data)
> > > > >  		return;
> > > > >  	sig_done = 0;
> > > > > -	smp_mb();	/* write sig_done before sending the signals */
> > > > > +	/* smp_mb();	write sig_done before sending the signals */
> > > > > +			/* redundant with barriers in pthread_kill(). */
> > > > 
> > > > Absolutely not. pthread_kill does not send a signal to self in every
> > > > case because the writer thread has not requirement to register itself.
> > > > It *could* be registered as a reader too, but does not have to.
> > > 
> > > No, not the barrier in the signal handler, but rather the barriers in
> > > the system call invoked by pthread_kill().
> > > 
> > 
> > The barrier implied by going through a system call does not imply cache
> > flushing AFAIK. So we would have to at least leave a big comment here
> > saying that the kernel has to provide such guarantee. So under that
> > comment I would leave a smp_mc();.
> > 
> > > > >  	for (index = reader_data; index < reader_data + num_readers; index++)
> > > > >  		pthread_kill(index->tid, SIGURCU);
> > > > >  	/*
> > > > > @@ -121,7 +123,8 @@ static void force_mb_all_threads(void)
> > > > >  	 * BUSY-LOOP.
> > > > >  	 */
> > > > >  	while (sig_done < num_readers)
> > > > > -		smp_rmb();	/* ensure we re-read sig-done */
> > > > > +		barrier();	/* ensure compiler re-reads sig-done */
> > > > > +				/* cache coherence guarantees CPU re-read. */
> > > > 
> > > > That could be a smp_rmc() ?
> > > 
> > > Again, prefer:
> > > 
> > > 	while (ACCESS_ONCE() < num_readers)
> > > 
> > > after upgrading ACCESS_ONCE() to provide the full semantics.
> > > 
> > > I will send a patch.
> > 
> > I'll use a variation :
> > 
> >         while (LOAD_REMOTE(sig_done) < num_readers)
> >                 cpu_relax();
> 
> I suspect that LOAD_SHARED() and STORE_SHARED() would be more intuitive.
> But shouldn't we align with the Linux-kernel usage where reasonable?
> (Yes, this can be a moving target, but there isn't much else that
> currently supports this level of SMP function on quite the variety of
> CPU architectures.)
> 

Agreed. This is partly why I decided to CC Linus and the Blackfin
maintainers on this. I think it would be a shame to add such support in
a low-level userland RCU library and not to push it at the kernel level.
I really like the LOAD_SHARED and STORE_SHARED and the smp_*mc() macros,
because I think they help modeling very well what is done to local vs
shared data.

> > > > >  	smp_mb();	/* read sig_done before ending the barrier */
> > > > >  }
> > > > >  #endif
> > > > > @@ -181,7 +184,8 @@ void synchronize_rcu(void)
> > > > >  	 * the writer waiting forever while new readers are always accessing
> > > > >  	 * data (no progress).
> > > > >  	 */
> > > > > -	smp_mb();
> > > > > +	/* smp_mb(); Don't need this one for CPU, only compiler. */
> > > > > +	barrier();
> > > > 
> > > > smp_mc() ?
> > > 
> > > ACCESS_ONCE().
> > > 
> > 
> > Ah, this is what I dislike about using :
> > 
> >   STORE_REMOTE(x, v);
> > ...
> >   if (LOAD_REMOTE(y) ...)
> > rather than
> >   x = v;
> >   smp_mc();
> >   if (y ...)
> > 
> > We will end up in a situation where we do 2 cache flushes rather than a
> > single one. So wherever possible, I would be tempted to leave the
> > smp_mc().
> 
> Ummm...  There is a very real reason why I moved from bare
> smp_read_barrier_depends() calls to rcu_dereference().  Code with an
> rcu_dereference() style is -much- easier to read.
> 
> So I would flip that -- use the per-variable API unless you see
> measureable system-level pain.  Because the variable-free API will
> inflict very real readability pain!
> 
> The problem is that the relationship of the variable-free API to the
> variables it is supposed to constrain gets lost.  With the per-variable
> APIs, the relationship is obvious and explicit.
> 

That's why comments on memory barriers are strictly mandatory. :-) But
yes, I agree that we should use STORE_REMOTE/LOAD_REMOTE when where we
cannot possibly flush more than one read/write at once.

I updated the git tree to use STORE_REMOTE/LOAD_REMOTE.

Thanks,

Mathieu

> > > > >  
> > > > >  	switch_next_urcu_qparity();	/* 1 -> 0 */
> > > > >  
> > > > 
> > > > Side-note :
> > > > on archs without cache coherency, all smp_[rw ]mb would turn into a
> > > > cache flush.
> > > 
> > > So I might need more in my ACCESS_ONCE() below.
> > > 
> > > Add .gitignore files, and redefine accesses in terms of a new
> > > ACCESS_ONCE().
> > 
> > I'll merge the .gitignore file, thanks,
> 
> Sounds good!
> 
> > Please see my updated git tree.
> 
> Will do!
> 
> 							Thanx, Paul
> 
> > Mathieu
> > 
> > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > ---
> > > 
> > >  .gitignore              |    9 +++++++++
> > >  formal-model/.gitignore |    3 +++
> > >  urcu.c                  |   10 ++++------
> > >  urcu.h                  |   12 ++++++++++++
> > >  4 files changed, 28 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/.gitignore b/.gitignore
> > > new file mode 100644
> > > index 0000000..29aa7e5
> > > --- /dev/null
> > > +++ b/.gitignore
> > > @@ -0,0 +1,9 @@
> > > +test_rwlock_timing
> > > +test_urcu
> > > +test_urcu_timing
> > > +test_urcu_yield
> > > +urcu-asm.o
> > > +urcu.o
> > > +urcutorture
> > > +urcutorture-yield
> > > +urcu-yield.o
> > > diff --git a/formal-model/.gitignore b/formal-model/.gitignore
> > > new file mode 100644
> > > index 0000000..49fdd8a
> > > --- /dev/null
> > > +++ b/formal-model/.gitignore
> > > @@ -0,0 +1,3 @@
> > > +pan
> > > +pan.*
> > > +urcu.spin.trail
> > > diff --git a/urcu.c b/urcu.c
> > > index a696439..f61d4c3 100644
> > > --- a/urcu.c
> > > +++ b/urcu.c
> > > @@ -98,9 +98,8 @@ static void force_mb_single_thread(pthread_t tid)
> > >  	 * Wait for sighandler (and thus mb()) to execute on every thread.
> > >  	 * BUSY-LOOP.
> > >  	 */
> > > -	while (sig_done < 1)
> > > -		barrier();	/* ensure compiler re-reads sig-done */
> > > -				/* cache coherence guarantees CPU re-read. */
> > > +	while (ACCESS_ONCE(sig_done) < 1)
> > > +		continue;
> > >  	smp_mb();	/* read sig_done before ending the barrier */
> > >  }
> > >  
> > > @@ -122,9 +121,8 @@ static void force_mb_all_threads(void)
> > >  	 * Wait for sighandler (and thus mb()) to execute on every thread.
> > >  	 * BUSY-LOOP.
> > >  	 */
> > > -	while (sig_done < num_readers)
> > > -		barrier();	/* ensure compiler re-reads sig-done */
> > > -				/* cache coherence guarantees CPU re-read. */
> > > +	while (ACCESS_ONCE(sig_done) < num_readers)
> > > +		continue;
> > >  	smp_mb();	/* read sig_done before ending the barrier */
> > >  }
> > >  #endif
> > > diff --git a/urcu.h b/urcu.h
> > > index 79d9464..dd040a5 100644
> > > --- a/urcu.h
> > > +++ b/urcu.h
> > > @@ -98,6 +98,9 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr,
> > >  /* Nop everywhere except on alpha. */
> > >  #define smp_read_barrier_depends()
> > >  
> > > +#define CONFIG_ARCH_CACHE_COHERENT
> > > +#define cpu_relax barrier
> > > +
> > >  /*
> > >   * Prevent the compiler from merging or refetching accesses.  The compiler
> > >   * is also forbidden from reordering successive instances of ACCESS_ONCE(),
> > > @@ -110,7 +113,16 @@ static inline unsigned long __xchg(unsigned long x, volatile void *ptr,
> > >   * use is to mediate communication between process-level code and irq/NMI
> > >   * handlers, all running on the same CPU.
> > >   */
> > > +#ifdef CONFIG_ARCH_CACHE_COHERENT
> > >  #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
> > > +#else /* #ifdef CONFIG_ARCH_CACHE_COHERENT */
> > > +#define ACCESS_ONCE(x)     ({ \
> > > +				typeof(x) _________x1; \
> > > +				_________x1 = (*(volatile typeof(x) *)&(x)); \
> > > +				cpu_relax(); \
> > > +				(_________x1); \
> > > +				})
> > > +#endif /* #else #ifdef CONFIG_ARCH_CACHE_COHERENT */
> > >  
> > >  /**
> > >   * rcu_dereference - fetch an RCU-protected pointer in an
> > > 
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at  http://www.tux.org/lkml/
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 23:26                                                             ` Paul E. McKenney
@ 2009-02-13 13:12                                                               ` Mathieu Desnoyers
  0 siblings, 0 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-13 13:12 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: ltt-dev, linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Thu, Feb 12, 2009 at 04:27:12PM -0500, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Thu, Feb 12, 2009 at 01:40:30PM -0500, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Thu, Feb 12, 2009 at 12:47:07AM -0500, Mathieu Desnoyers wrote:
> > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > On Wed, Feb 11, 2009 at 11:10:44PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > > > On Wed, Feb 11, 2009 at 06:33:08PM -0800, Paul E. McKenney wrote:
> > > > > > > > > > On Wed, Feb 11, 2009 at 04:35:49PM -0800, Paul E. McKenney wrote:
> > > > > > > > > > > On Wed, Feb 11, 2009 at 04:42:58PM -0500, Mathieu Desnoyers wrote:
> > > > > > > > > > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > > > > > 
> > > > > > > > > [ . . . ]
> > > > > > > > > 
> > > > > > > > > > > > > (BTW, I do not trust my model yet, as it currently cannot detect the
> > > > > > > > > > > > > failure case I pointed out earlier.  :-/  Here and I thought that the
> > > > > > > > > > > > > point of such models was to detect additional failure cases!!!)
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > Yes, I'll have to dig deeper into it.
> > > > > > > > > > > 
> > > > > > > > > > > Well, as I said, I attached the current model and the error trail.
> > > > > > > > > > 
> > > > > > > > > > And I had bugs in my model that allowed the rcu_read_lock() model
> > > > > > > > > > to nest indefinitely, which overflowed into the top bit, messing
> > > > > > > > > > things up.  :-/
> > > > > > > > > > 
> > > > > > > > > > Attached is a fixed model.  This model validates correctly (woo-hoo!).
> > > > > > > > > > Even better, gives the expected error if you comment out line 180 and
> > > > > > > > > > uncomment line 213, this latter corresponding to the error case I called
> > > > > > > > > > out a few days ago.
> > > > > > > > > > 
> > > > > > > > > > I will play with removing models of mb...
> > > > > > > > > 
> > > > > > > > > And commenting out the models of mb between the counter flips and the
> > > > > > > > > test for readers still passes validation, as expected, and as shown in
> > > > > > > > > the attached Promela code.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > Hrm, in the email I sent you about the memory barrier, I said that it
> > > > > > > > would not make the algorithm incorrect, but that it would cause
> > > > > > > > situations where it would be impossible for the writer to do any
> > > > > > > > progress as long as there are readers active. I think we would have to
> > > > > > > > enhance the model or at least express this through some LTL statement to
> > > > > > > > validate this specific behavior.
> > > > > > > 
> > > > > > > But if the writer fails to make progress, then the counter remains at a
> > > > > > > given value, which causes readers to drain, which allows the writer to
> > > > > > > eventually make progress again.  Right?
> > > > > > > 
> > > > > > 
> > > > > > Not necessarily. If we don't have the proper memory barriers, we can
> > > > > > have the writer waiting on, say, parity 0 *before* it has performed the
> > > > > > parity switch. Therefore, even newly coming readers will add up to
> > > > > > parity 0.
> > > > > 
> > > > > But the write that changes the parity will eventually make it out.
> > > > > OK, so your argument is that we at least need a compiler barrier?
> > > > 
> > > > It all depends on the assumptions we make. I am currently trying to
> > > > assume the most aggressive memory ordering I can think of. The model I
> > > > think about to represent it is that memory reads/writes are kept local
> > > > to the CPU until a memory barrier is encountered. I doubt it exists in
> > > > practice, bacause the CPU will eventually have to commit the information
> > > > to memory (hrm, are sure about this ?), but if we use that as a starting
> > > > point, I think this would cover the entire spectrum of possible memory
> > > > barriers issues. Also, it would be easy to verify formally. But maybe am
> > > > I going too far ?
> > > 
> > > I believe that you are going a bit too far.  After all, if you make that
> > > assumption, the CPU could just never make anything visible.  After all,
> > > the memory barrier doesn't say "make the previous stuff visible now",
> > > it instead says "if you make anything after the barrier visible to a
> > > given other CPU, then you must also make everything before the barrier
> > > visible to that CPU".
> > > 
> > > > > Regardless, please see attached for a modified version of the Promela
> > > > > model that fully models omitting out the memory barrier that my
> > > > > rcu_nest32.[hc] implementation omits.  (It is possible to partially
> > > > > model removal of other memory barriers via #if 0, but to fully model
> > > > > would need to enumerate the permutations as shown on lines 231-257.)
> > > > > 
> > > > > > In your model, this is not detected, because eventually all readers will
> > > > > > execute, and only then the writer will be able to update the data. But
> > > > > > in reality, if we run a very busy 4096-cores machines where there is
> > > > > > always at least one reader active, the the writer will be stuck forever,
> > > > > > and that's really bad.
> > > > > 
> > > > > Assuming that the reordering is done by the CPU, the write will
> > > > > eventually get out -- it is stuck in (say) the store buffer, and the
> > > > > cache line will eventually arrive, and then the value will eventually
> > > > > be seen by the readers.
> > > > 
> > > > Do we have guarantees that the data *will necessarily* get out of the
> > > > cpu write buffer at some point ?
> > > 
> > > It has to, given a finite CPU write buffer, interrupts, and the like.
> > > The actual CPU designs interact with a cache-coherence protocol, so
> > > the stuff lives in the store buffer only for as long as it takes for
> > > the corresponding cache line to be owned by this CPU.
> > > 
> > > > > We might need a -compiler- barrier, but then again, I am not sure that
> > > > > we are talking about the same memory barrier -- again, please see
> > > > > attached lines 231-257 to see which one that I eliminated.
> > > > 
> > > > As long as we don't have "progress" validation to check our model, the
> > > > fact that it passes the current test does not tell much.
> > > 
> > > Without agreeing or disagreeing with this statement for the moment,
> > > would you be willing to tell me whether or not the memory barrier
> > > eliminated by lines 231-257 of the model was the one that you were
> > > talking about?  ;-)
> > > 
> > 
> > So we are taking about :
> > 
> > /* current synchronize_rcu(), first-flip check plus second flip. */
> > 
> > which does not have any memory barrier anymore. This corresponds to my
> > current :
> > 
> >        /*
> >          * Wait for previous parity to be empty of readers.
> >          */
> >         wait_for_quiescent_state();     /* Wait readers in parity 0 */
> > 
> >         /*
> >          * Must finish waiting for quiescent state for parity 0 before
> >          * committing qparity update to memory. Failure to do so could result in
> >          * the writer waiting forever while new readers are always accessing
> >          * data (no progress).
> >          */
> >         smp_mc();
> > 
> >         switch_next_urcu_qparity();     /* 1 -> 0 */
> > 
> > So the memory barrier is not needed, but a compiler barrier is needed on
> > arch with cache coherency, and a cache flush is needed on architectures
> > without cache coherency.
> > 
> > BTW, I think all the three smp_mb() that were in this function can be
> > turned into smp_mc().
> 
> Verifying this requires merging more code into the interleaving -- it
> is necessary to model all permutations of the statements.  Even that
> isn't always quite right, as Promela treats each statement as atomic.
> (I might be able to pull a trick like I did on the read side, but the
> data dependencies are a bit uglier on the update side.)
> 

One way to do this is to model 2 memories (cache and memory) with 3
commit processes :

mem_to_cache()
cache_to_mem()
two_way_mem_cache_sync()


> That said, I did do a crude check by #if-ing out the individual barriers
> on the update side.  This is semi-plausible, because the read side is
> primarily unordered.  The results are that the final memory barrier
> (just before exiting synchronize_rcu()) is absolutely required, as is
> at least one of the first two memory barriers.
> 

Yes, this is what I envisionned, I'm glad it seems to be true. Actually,
I thought that if we could remove one memory barrier between one qparity
update and one reader wait, we could as well remote them all.

> But I don't trust this analysis -- it is an approximation to an
> approximation, which is not what you want for this sort of job.
> 

True.

> > Therefore, if we assume memory coherency, only barrier()s would be
> > needed between the switch/q.s. wait/switch/q.s. wait.
> 
> I must admit that the need to assume that some platforms fail to
> implement cache coherence comes as a bit of a nasty shock...
> 


Hehe, yes, but if we model this, I think the algorithm would become
rock-solid, which is the kind of result I think we want.

Mathieu


> 							Thanx, Paul
> 
> > Mathieu
> > 
> > 
> > > I might consider eventually adding progress validation to the model,
> > > but am currently a bit overdosed on Promela...
> > > 
> > > > > Also, the original model I sent out has a minor bug that prevents it
> > > > > from fully modeling the nested-read-side case.  The patch below fixes this.
> > > > 
> > > > Ok, merging the fix, thanks,
> > > 
> > > Thank you!
> > > 
> > > 							Thanx, Paul
> > > 
> > > > Mathieu
> > > > 
> > > > > Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > > > ---
> > > > > 
> > > > >  urcu.spin |    6 +++++-
> > > > >  1 file changed, 5 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/formal-model/urcu.spin b/formal-model/urcu.spin
> > > > > index e5bfff3..611464b 100644
> > > > > --- a/formal-model/urcu.spin
> > > > > +++ b/formal-model/urcu.spin
> > > > > @@ -124,9 +124,13 @@ proctype urcu_reader()
> > > > >  				break;
> > > > >  			:: tmp < 4 && reader_progress[tmp] != 0 ->
> > > > >  				tmp = tmp + 1;
> > > > > -			:: tmp >= 4 ->
> > > > > +			:: tmp >= 4 &&
> > > > > +			   reader_progress[0] == reader_progress[3] ->
> > > > >  				done = 1;
> > > > >  				break;
> > > > > +			:: tmp >= 4 &&
> > > > > +			   reader_progress[0] != reader_progress[3] ->
> > > > > +			   	break;
> > > > >  			od;
> > > > >  			do
> > > > >  			:: tmp < 4 && reader_progress[tmp] == 0 ->
> > > > 
> > > > Content-Description: urcu_mbmin.spin
> > > > > /*
> > > > >  * urcu_mbmin.spin: Promela code to validate urcu.  See commit number
> > > > >  *	3a9e6e9df706b8d39af94d2f027210e2e7d4106e of Mathieu Desnoyer's
> > > > >  *      git archive at git://lttng.org/userspace-rcu.git, but with
> > > > >  *	memory barriers removed.
> > > > >  *
> > > > >  * This program is free software; you can redistribute it and/or modify
> > > > >  * it under the terms of the GNU General Public License as published by
> > > > >  * the Free Software Foundation; either version 2 of the License, or
> > > > >  * (at your option) any later version.
> > > > >  *
> > > > >  * This program is distributed in the hope that it will be useful,
> > > > >  * but WITHOUT ANY WARRANTY; without even the implied warranty of
> > > > >  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> > > > >  * GNU General Public License for more details.
> > > > >  *
> > > > >  * You should have received a copy of the GNU General Public License
> > > > >  * along with this program; if not, write to the Free Software
> > > > >  * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> > > > >  *
> > > > >  * Copyright (c) 2009 Paul E. McKenney, IBM Corporation.
> > > > >  */
> > > > > 
> > > > > /* Promela validation variables. */
> > > > > 
> > > > > bit removed = 0;  /* Has RCU removal happened, e.g., list_del_rcu()? */
> > > > > bit free = 0;     /* Has RCU reclamation happened, e.g., kfree()? */
> > > > > bit need_mb = 0;  /* =1 says need reader mb, =0 for reader response. */
> > > > > byte reader_progress[4];
> > > > > 		  /* Count of read-side statement executions. */
> > > > > 
> > > > > /* urcu definitions and variables, taken straight from the algorithm. */
> > > > > 
> > > > > #define RCU_GP_CTR_BIT (1 << 7)
> > > > > #define RCU_GP_CTR_NEST_MASK (RCU_GP_CTR_BIT - 1)
> > > > > 
> > > > > byte urcu_gp_ctr = 1;
> > > > > byte urcu_active_readers = 0;
> > > > > 
> > > > > /* Model the RCU read-side critical section. */
> > > > > 
> > > > > proctype urcu_reader()
> > > > > {
> > > > > 	bit done = 0;
> > > > > 	bit mbok;
> > > > > 	byte tmp;
> > > > > 	byte tmp_removed;
> > > > > 	byte tmp_free;
> > > > > 
> > > > > 	/* Absorb any early requests for memory barriers. */
> > > > > 	do
> > > > > 	:: need_mb == 1 ->
> > > > > 		need_mb = 0;
> > > > > 	:: 1 -> skip;
> > > > > 	:: 1 -> break;
> > > > > 	od;
> > > > > 
> > > > > 	/*
> > > > > 	 * Each pass through this loop executes one read-side statement
> > > > > 	 * from the following code fragment:
> > > > > 	 *
> > > > > 	 *	rcu_read_lock(); [0a]
> > > > > 	 *	rcu_read_lock(); [0b]
> > > > > 	 *	p = rcu_dereference(global_p); [1]
> > > > > 	 *	x = p->data; [2]
> > > > > 	 *	rcu_read_unlock(); [3b]
> > > > > 	 *	rcu_read_unlock(); [3a]
> > > > > 	 *
> > > > > 	 * Because we are modeling a weak-memory machine, these statements
> > > > > 	 * can be seen in any order, the only restriction being that
> > > > > 	 * rcu_read_unlock() cannot precede the corresponding rcu_read_lock().
> > > > > 	 * The placement of the inner rcu_read_lock() and rcu_read_unlock()
> > > > > 	 * is non-deterministic, the above is but one possible placement.
> > > > > 	 * Intestingly enough, this model validates all possible placements
> > > > > 	 * of the inner rcu_read_lock() and rcu_read_unlock() statements,
> > > > > 	 * with the only constraint being that the rcu_read_lock() must
> > > > > 	 * precede the rcu_read_unlock().
> > > > > 	 *
> > > > > 	 * We also respond to memory-barrier requests, but only if our
> > > > > 	 * execution happens to be ordered.  If the current state is
> > > > > 	 * misordered, we ignore memory-barrier requests.
> > > > > 	 */
> > > > > 	do
> > > > > 	:: 1 ->
> > > > > 		if
> > > > > 		:: reader_progress[0] < 2 -> /* [0a and 0b] */
> > > > > 			tmp = urcu_active_readers;
> > > > > 			if
> > > > > 			:: (tmp & RCU_GP_CTR_NEST_MASK) == 0 ->
> > > > > 				tmp = urcu_gp_ctr;
> > > > > 				do
> > > > > 				:: (reader_progress[1] +
> > > > > 				    reader_progress[2] +
> > > > > 				    reader_progress[3] == 0) && need_mb == 1 ->
> > > > > 					need_mb = 0;
> > > > > 				:: 1 -> skip;
> > > > > 				:: 1 -> break;
> > > > > 				od;
> > > > > 				urcu_active_readers = tmp;
> > > > > 			 :: else ->
> > > > > 				urcu_active_readers = tmp + 1;
> > > > > 			fi;
> > > > > 			reader_progress[0] = reader_progress[0] + 1;
> > > > > 		:: reader_progress[1] == 0 -> /* [1] */
> > > > > 			tmp_removed = removed;
> > > > > 			reader_progress[1] = 1;
> > > > > 		:: reader_progress[2] == 0 -> /* [2] */
> > > > > 			tmp_free = free;
> > > > > 			reader_progress[2] = 1;
> > > > > 		:: ((reader_progress[0] > reader_progress[3]) &&
> > > > > 		    (reader_progress[3] < 2)) -> /* [3a and 3b] */
> > > > > 			tmp = urcu_active_readers - 1;
> > > > > 			urcu_active_readers = tmp;
> > > > > 			reader_progress[3] = reader_progress[3] + 1;
> > > > > 		:: else -> break;
> > > > > 		fi;
> > > > > 
> > > > > 		/* Process memory-barrier requests, if it is safe to do so. */
> > > > > 		atomic {
> > > > > 			mbok = 0;
> > > > > 			tmp = 0;
> > > > > 			do
> > > > > 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> > > > > 				tmp = tmp + 1;
> > > > > 				break;
> > > > > 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> > > > > 				tmp = tmp + 1;
> > > > > 			:: tmp >= 4 &&
> > > > > 			   reader_progress[0] == reader_progress[3] ->
> > > > > 				done = 1;
> > > > > 				break;
> > > > > 			:: tmp >= 4 &&
> > > > > 			   reader_progress[0] != reader_progress[3] ->
> > > > > 			   	break;
> > > > > 			od;
> > > > > 			do
> > > > > 			:: tmp < 4 && reader_progress[tmp] == 0 ->
> > > > > 				tmp = tmp + 1;
> > > > > 			:: tmp < 4 && reader_progress[tmp] != 0 ->
> > > > > 				break;
> > > > > 			:: tmp >= 4 ->
> > > > > 				mbok = 1;
> > > > > 				break;
> > > > > 			od
> > > > > 
> > > > > 		}
> > > > > 
> > > > > 		if
> > > > > 		:: mbok == 1 ->
> > > > > 			/* We get here if mb processing is safe. */
> > > > > 			do
> > > > > 			:: need_mb == 1 ->
> > > > > 				need_mb = 0;
> > > > > 			:: 1 -> skip;
> > > > > 			:: 1 -> break;
> > > > > 			od;
> > > > > 		:: else -> skip;
> > > > > 		fi;
> > > > > 
> > > > > 		/*
> > > > > 		 * Check to see if we have modeled the entire RCU read-side
> > > > > 		 * critical section, and leave if so.
> > > > > 		 */
> > > > > 		if
> > > > > 		:: done == 1 -> break;
> > > > > 		:: else -> skip;
> > > > > 		fi
> > > > > 	od;
> > > > > 	assert((tmp_free == 0) || (tmp_removed == 1));
> > > > > 
> > > > > 	/* Process any late-arriving memory-barrier requests. */
> > > > > 	do
> > > > > 	:: need_mb == 1 ->
> > > > > 		need_mb = 0;
> > > > > 	:: 1 -> skip;
> > > > > 	:: 1 -> break;
> > > > > 	od;
> > > > > }
> > > > > 
> > > > > /* Model the RCU update process. */
> > > > > 
> > > > > proctype urcu_updater()
> > > > > {
> > > > > 	byte tmp;
> > > > > 
> > > > > 	/* prior synchronize_rcu(), second counter flip. */
> > > > > 	need_mb = 1; /* mb() A */
> > > > > 	do
> > > > > 	:: need_mb == 1 -> skip;
> > > > > 	:: need_mb == 0 -> break;
> > > > > 	od;
> > > > > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > > > > 	need_mb = 1; /* mb() B */
> > > > > 	do
> > > > > 	:: need_mb == 1 -> skip;
> > > > > 	:: need_mb == 0 -> break;
> > > > > 	od;
> > > > > 	do
> > > > > 	:: 1 ->
> > > > > 		if
> > > > > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > > > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > > > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > > > 			skip;
> > > > > 		:: else -> break;
> > > > > 		fi
> > > > > 	od;
> > > > > 	need_mb = 1; /* mb() C absolutely required by analogy with G */
> > > > > 	do
> > > > > 	:: need_mb == 1 -> skip;
> > > > > 	:: need_mb == 0 -> break;
> > > > > 	od;
> > > > > 
> > > > > 	/* Removal statement, e.g., list_del_rcu(). */
> > > > > 	removed = 1;
> > > > > 
> > > > > 	/* current synchronize_rcu(), first counter flip. */
> > > > > 	need_mb = 1; /* mb() D suggested */
> > > > > 	do
> > > > > 	:: need_mb == 1 -> skip;
> > > > > 	:: need_mb == 0 -> break;
> > > > > 	od;
> > > > > 	urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > > > > 	need_mb = 1;  /* mb() E required if D not present */
> > > > > 	do
> > > > > 	:: need_mb == 1 -> skip;
> > > > > 	:: need_mb == 0 -> break;
> > > > > 	od;
> > > > > 
> > > > > 	/* current synchronize_rcu(), first-flip check plus second flip. */
> > > > > 	if
> > > > > 	:: 1 ->
> > > > > 		do
> > > > > 		:: 1 ->
> > > > > 			if
> > > > > 			:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > > > 			   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > > > 			   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > > > 				skip;
> > > > > 			:: else -> break;
> > > > > 			fi;
> > > > > 		od;
> > > > > 		urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > > > > 	:: 1 ->
> > > > > 		tmp = urcu_gp_ctr;
> > > > > 		urcu_gp_ctr = urcu_gp_ctr + RCU_GP_CTR_BIT;
> > > > > 		do
> > > > > 		:: 1 ->
> > > > > 			if
> > > > > 			:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > > > 			   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > > > 			   (tmp & ~RCU_GP_CTR_NEST_MASK) ->
> > > > > 				skip;
> > > > > 			:: else -> break;
> > > > > 			fi;
> > > > > 		od;
> > > > > 	fi;
> > > > > 
> > > > > 	/* current synchronize_rcu(), second counter flip check. */
> > > > > 	need_mb = 1; /* mb() F not required */
> > > > > 	do
> > > > > 	:: need_mb == 1 -> skip;
> > > > > 	:: need_mb == 0 -> break;
> > > > > 	od;
> > > > > 	do
> > > > > 	:: 1 ->
> > > > > 		if
> > > > > 		:: (urcu_active_readers & RCU_GP_CTR_NEST_MASK) != 0 &&
> > > > > 		   (urcu_active_readers & ~RCU_GP_CTR_NEST_MASK) !=
> > > > > 		   (urcu_gp_ctr & ~RCU_GP_CTR_NEST_MASK) ->
> > > > > 			skip;
> > > > > 		:: else -> break;
> > > > > 		fi;
> > > > > 	od;
> > > > > 	need_mb = 1; /* mb() G absolutely required */
> > > > > 	do
> > > > > 	:: need_mb == 1 -> skip;
> > > > > 	:: need_mb == 0 -> break;
> > > > > 	od;
> > > > > 
> > > > > 	/* free-up step, e.g., kfree(). */
> > > > > 	free = 1;
> > > > > }
> > > > > 
> > > > > /*
> > > > >  * Initialize the array, spawn a reader and an updater.  Because readers
> > > > >  * are independent of each other, only one reader is needed.
> > > > >  */
> > > > > 
> > > > > init {
> > > > > 	atomic {
> > > > > 		reader_progress[0] = 0;
> > > > > 		reader_progress[1] = 0;
> > > > > 		reader_progress[2] = 0;
> > > > > 		reader_progress[3] = 0;
> > > > > 		run urcu_reader();
> > > > > 		run urcu_updater();
> > > > > 	}
> > > > > }
> > > > 
> > > > 
> > > > 
> > > > -- 
> > > > Mathieu Desnoyers
> > > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > > 
> > > _______________________________________________
> > > ltt-dev mailing list
> > > ltt-dev@lists.casi.polymtl.ca
> > > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > > 
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 21:59                                                               ` Paul E. McKenney
@ 2009-02-13 13:50                                                                 ` Nick Piggin
  2009-02-13 14:56                                                                   ` Paul E. McKenney
  2009-02-13 16:05                                                                   ` Linus Torvalds
  0 siblings, 2 replies; 116+ messages in thread
From: Nick Piggin @ 2009-02-13 13:50 UTC (permalink / raw)
  To: paulmck
  Cc: Linus Torvalds, Mathieu Desnoyers, ltt-dev, linux-kernel,
	Bryan Wu, uclinux-dist-devel

On Friday 13 February 2009 08:59:59 Paul E. McKenney wrote:
> On Thu, Feb 12, 2009 at 01:15:08PM -0800, Linus Torvalds wrote:
> > On Thu, 12 Feb 2009, Paul E. McKenney wrote:
> > > In other words, you are arguing for using ACCESS_ONCE() in the loops,
> > > but keeping the old ACCESS_ONCE() definition, and declaring BF hardware
> > > broken?
> >
> > Well, I _also_ argue that if you have a busy loop, you'd better have a
> > cpu_relax() in there somewhere anyway. If you don't, you have a bug.
> >
> > So I think the BF approach is "borderline broken", but I think it should
> > work, if BF just has whatever appropriate cache flush in its cpu_relax.
>
> OK, got it.  Keep ACCESS_ONCE() as is, make sure any busy-wait
> loops contain a cpu_relax().  A given busy loop might or might not
> need ACCESS_ONCE(), but that decision is independent of hardware
> considerations.
>
> Ah, and blackfin's cpu_relax() does seem to have migrated from barrier()
> to smp_mb() recently, so sounds good to me!!!


Interesting. I don't know if you would say it is not cache coherent.
Does anything in cache coherency definition require timeliness? Only
causality I think.

However I think "infinite write buffering delay", or requiring "cache
barriers" is insane to teach any generic code about. BF would be free
to optimise arch functions, but for correctness surely it must also
have a periodic interrupt that will expose stores to other CPUs.


^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-13 13:50                                                                 ` Nick Piggin
@ 2009-02-13 14:56                                                                   ` Paul E. McKenney
  2009-02-13 15:10                                                                     ` Mathieu Desnoyers
  2009-02-13 16:05                                                                   ` Linus Torvalds
  1 sibling, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-13 14:56 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Linus Torvalds, Mathieu Desnoyers, ltt-dev, linux-kernel,
	Bryan Wu, uclinux-dist-devel

On Sat, Feb 14, 2009 at 12:50:43AM +1100, Nick Piggin wrote:
> On Friday 13 February 2009 08:59:59 Paul E. McKenney wrote:
> > On Thu, Feb 12, 2009 at 01:15:08PM -0800, Linus Torvalds wrote:
> > > On Thu, 12 Feb 2009, Paul E. McKenney wrote:
> > > > In other words, you are arguing for using ACCESS_ONCE() in the loops,
> > > > but keeping the old ACCESS_ONCE() definition, and declaring BF hardware
> > > > broken?
> > >
> > > Well, I _also_ argue that if you have a busy loop, you'd better have a
> > > cpu_relax() in there somewhere anyway. If you don't, you have a bug.
> > >
> > > So I think the BF approach is "borderline broken", but I think it should
> > > work, if BF just has whatever appropriate cache flush in its cpu_relax.
> >
> > OK, got it.  Keep ACCESS_ONCE() as is, make sure any busy-wait
> > loops contain a cpu_relax().  A given busy loop might or might not
> > need ACCESS_ONCE(), but that decision is independent of hardware
> > considerations.
> >
> > Ah, and blackfin's cpu_relax() does seem to have migrated from barrier()
> > to smp_mb() recently, so sounds good to me!!!
> 
> 
> Interesting. I don't know if you would say it is not cache coherent.
> Does anything in cache coherency definition require timeliness? Only
> causality I think.
> 
> However I think "infinite write buffering delay", or requiring "cache
> barriers" is insane to teach any generic code about. BF would be free
> to optimise arch functions, but for correctness surely it must also
> have a periodic interrupt that will expose stores to other CPUs.

I have great sympathy for this point of view!!!  So why not have the
blackfin folks get the appropriate instructions added in the gcc port
to their architecture?  (Yeah, I know, gcc has no way of knowing which
variables are shared and not...)

But perhaps we could decorate the affected variable declarations with
a macro that expands to some sort of gcc attribute in the blackfin case?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-13 14:56                                                                   ` Paul E. McKenney
@ 2009-02-13 15:10                                                                     ` Mathieu Desnoyers
  2009-02-13 15:55                                                                       ` Mathieu Desnoyers
  0 siblings, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-13 15:10 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Nick Piggin, Bryan Wu, linux-kernel, ltt-dev, uclinux-dist-devel,
	Linus Torvalds

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sat, Feb 14, 2009 at 12:50:43AM +1100, Nick Piggin wrote:
> > On Friday 13 February 2009 08:59:59 Paul E. McKenney wrote:
> > > On Thu, Feb 12, 2009 at 01:15:08PM -0800, Linus Torvalds wrote:
> > > > On Thu, 12 Feb 2009, Paul E. McKenney wrote:
> > > > > In other words, you are arguing for using ACCESS_ONCE() in the loops,
> > > > > but keeping the old ACCESS_ONCE() definition, and declaring BF hardware
> > > > > broken?
> > > >
> > > > Well, I _also_ argue that if you have a busy loop, you'd better have a
> > > > cpu_relax() in there somewhere anyway. If you don't, you have a bug.
> > > >
> > > > So I think the BF approach is "borderline broken", but I think it should
> > > > work, if BF just has whatever appropriate cache flush in its cpu_relax.
> > >
> > > OK, got it.  Keep ACCESS_ONCE() as is, make sure any busy-wait
> > > loops contain a cpu_relax().  A given busy loop might or might not
> > > need ACCESS_ONCE(), but that decision is independent of hardware
> > > considerations.
> > >
> > > Ah, and blackfin's cpu_relax() does seem to have migrated from barrier()
> > > to smp_mb() recently, so sounds good to me!!!
> > 
> > 
> > Interesting. I don't know if you would say it is not cache coherent.
> > Does anything in cache coherency definition require timeliness? Only
> > causality I think.
> > 
> > However I think "infinite write buffering delay", or requiring "cache
> > barriers" is insane to teach any generic code about. BF would be free
> > to optimise arch functions, but for correctness surely it must also
> > have a periodic interrupt that will expose stores to other CPUs.
> 
> I have great sympathy for this point of view!!!  So why not have the
> blackfin folks get the appropriate instructions added in the gcc port
> to their architecture?  (Yeah, I know, gcc has no way of knowing which
> variables are shared and not...)
> 
> But perhaps we could decorate the affected variable declarations with
> a macro that expands to some sort of gcc attribute in the blackfin case?
> 

I think that just for the fact that it help identifying such variable
accesses which are :

- performed atomically
- unprotected by any form of locking

This seems like a good things to wrap such accesses into a macro which
permits easy identification of those sites. A bit like rcu_dereference()
does. Gradual use of this new macro could come incrementally too.

Mathieu


> 							Thanx, Paul
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-13 15:10                                                                     ` Mathieu Desnoyers
@ 2009-02-13 15:55                                                                       ` Mathieu Desnoyers
  2009-02-13 16:18                                                                         ` Linus Torvalds
  0 siblings, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-13 15:55 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Nick Piggin, Bryan Wu, linux-kernel, ltt-dev, uclinux-dist-devel,
	Linus Torvalds

* Mathieu Desnoyers (compudj@krystal.dyndns.org) wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Sat, Feb 14, 2009 at 12:50:43AM +1100, Nick Piggin wrote:
> > > On Friday 13 February 2009 08:59:59 Paul E. McKenney wrote:
> > > > On Thu, Feb 12, 2009 at 01:15:08PM -0800, Linus Torvalds wrote:
> > > > > On Thu, 12 Feb 2009, Paul E. McKenney wrote:
> > > > > > In other words, you are arguing for using ACCESS_ONCE() in the loops,
> > > > > > but keeping the old ACCESS_ONCE() definition, and declaring BF hardware
> > > > > > broken?
> > > > >
> > > > > Well, I _also_ argue that if you have a busy loop, you'd better have a
> > > > > cpu_relax() in there somewhere anyway. If you don't, you have a bug.
> > > > >
> > > > > So I think the BF approach is "borderline broken", but I think it should
> > > > > work, if BF just has whatever appropriate cache flush in its cpu_relax.
> > > >
> > > > OK, got it.  Keep ACCESS_ONCE() as is, make sure any busy-wait
> > > > loops contain a cpu_relax().  A given busy loop might or might not
> > > > need ACCESS_ONCE(), but that decision is independent of hardware
> > > > considerations.
> > > >
> > > > Ah, and blackfin's cpu_relax() does seem to have migrated from barrier()
> > > > to smp_mb() recently, so sounds good to me!!!
> > > 
> > > 
> > > Interesting. I don't know if you would say it is not cache coherent.
> > > Does anything in cache coherency definition require timeliness? Only
> > > causality I think.
> > > 
> > > However I think "infinite write buffering delay", or requiring "cache
> > > barriers" is insane to teach any generic code about. BF would be free
> > > to optimise arch functions, but for correctness surely it must also
> > > have a periodic interrupt that will expose stores to other CPUs.
> > 
> > I have great sympathy for this point of view!!!  So why not have the
> > blackfin folks get the appropriate instructions added in the gcc port
> > to their architecture?  (Yeah, I know, gcc has no way of knowing which
> > variables are shared and not...)
> > 
> > But perhaps we could decorate the affected variable declarations with
> > a macro that expands to some sort of gcc attribute in the blackfin case?
> > 
> 
> I think that just for the fact that it help identifying such variable
> accesses which are :
> 
> - performed atomically
> - unprotected by any form of locking
> 
> This seems like a good things to wrap such accesses into a macro which
> permits easy identification of those sites. A bit like rcu_dereference()
> does. Gradual use of this new macro could come incrementally too.
> 

I created also

_STORE_SHARED()
_LOAD_SHARED()

which identify the variables which need to have cache flush done before
(load) or after (store). So we get both speed and identification when
needed (if we need to do batch updates linked with a single cache flush).
e.g.


/*
 * Identify a shared load. A smp_rmc() or smp_mc() should come before the load.
 */
#define _LOAD_SHARED(p)	       ACCESS_ONCE(p)

/*
 * Load a data from shared memory, doing a cache flush if required.
 */
#define LOAD_SHARED(p) \
	({ \
		smp_rmc(); \
		_LOAD_SHARED(p); \
	})


/*
 * Identify a shared store. A smp_wmc() or smp_mc() should follow the store.
 */
#define _STORE_SHARED(x, v) \
	do { \
		(x) = (v); \
	} while (0)

/*
 * Store v into x, where x is located in shared memory. Performs the required
 * cache flush after writing.
 */
#define STORE_SHARED(x, v) \
	do { \
		_STORE_SHARED(x, v); \
		smp_wmc(); \
	} while (0)

Mathieu

> Mathieu
> 
> 
> > 							Thanx, Paul
> > 
> > _______________________________________________
> > ltt-dev mailing list
> > ltt-dev@lists.casi.polymtl.ca
> > http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> > 
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-13 13:50                                                                 ` Nick Piggin
  2009-02-13 14:56                                                                   ` Paul E. McKenney
@ 2009-02-13 16:05                                                                   ` Linus Torvalds
  2009-02-14  3:11                                                                     ` [Uclinux-dist-devel] " Mike Frysinger
  1 sibling, 1 reply; 116+ messages in thread
From: Linus Torvalds @ 2009-02-13 16:05 UTC (permalink / raw)
  To: Nick Piggin
  Cc: paulmck, Mathieu Desnoyers, ltt-dev, linux-kernel, Bryan Wu,
	uclinux-dist-devel



On Sat, 14 Feb 2009, Nick Piggin wrote:
> 
> Interesting. I don't know if you would say it is not cache coherent.
> Does anything in cache coherency definition require timeliness? Only
> causality I think.

Nick, afaik, BF _really_ isn't cache coherent.

It's not about timeliness. It's literally non-coherent.

Blackfin L1 caches are
 (a) write-through
 (b) per-cpu
 (c) non-coherent
so the way that BF implements "cache coherency" is by literally

 - use a magic test-and-set instruction that works on L2 memory (shared)
 - keep track of which core has done that test-and-set last
 - *flush* the L1 when it was the other core.

Note that because it's a write-through cache, _writes_ are basically 
"coherent". But since the cache isn't actually _updated_ ont he other CPU, 
you can have two CPU's doing writes, and they'll both continue to see 
their own write, not necessarily the one that made it to memory. So I 
would not call that a "timeliness" issue, I would just say that the caches 
simply aren't coherent.

But because it's write-through, flushing the cache always makes things 
coherent again (well, on _that_ CPU), of course.

			Linus

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-13 15:55                                                                       ` Mathieu Desnoyers
@ 2009-02-13 16:18                                                                         ` Linus Torvalds
  2009-02-13 17:33                                                                           ` Mathieu Desnoyers
  0 siblings, 1 reply; 116+ messages in thread
From: Linus Torvalds @ 2009-02-13 16:18 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Paul E. McKenney, Nick Piggin, Bryan Wu, linux-kernel, ltt-dev,
	uclinux-dist-devel



On Fri, 13 Feb 2009, Mathieu Desnoyers wrote:
> 
> I created also
> 
> _STORE_SHARED()
> _LOAD_SHARED()
> 
> which identify the variables which need to have cache flush done before
> (load) or after (store). So we get both speed and identification when
> needed (if we need to do batch updates linked with a single cache flush).
> e.g.

The thing is, THAT JUST ABSOLUTELY SUCKS.

Lookie here - we don't want to flush the cache at every load of a shared 
variable. There's no reason to. If you don't care about the orderign, you 
might as well get the old values. That's what memory ordering _means_, for 
chissake! In the absense of locks, loads may get stale values. It's that 
easy.

A lot of code wants to access multiple variables, and they are potentially 
nearby, and in the same cacheline. Making them all use _LOAD_SHARED() adds 
absolutely no value - and makes it MUCH MUCH SLOWER.

So what's the answer?

I already outlined it: either you use locks (which will do the magic for 
you), or you use memory barriers. In no case do you make the access magic, 
unless you have a compiler issue where you are afraid that the compiler 
would turn it into _multiple_ accesses and potentially get inconsistent 
results.

So the point about ACCESS_ONCE() is not, and never has been, about 
re-ordering. We know that the CPU may re-order the accesses and give us 
stale values (or values from the "future" wrt the other accesses around 
it). That's not the point. The point of ACCESS_ONCE() is that we get 
exactly _one_ value, and not two different ones (or none at all) because 
of the compiler either re-loading it several times or not re-loading it at 
all.

Anybody who confuses ACCESS_ONCE() with ordering is simply confused.

And we don't want to make any "load with cache flush" either. Which side 
should the cache flush be on? Before? After? Both? Atomically? There is no 
sane semantics for that.

The only remaining sane semantics is to depend on memory barriers, and 
then make a magic memory barrier that is extra weak and doesn't order 
anythign at all, but just says "syncronize very weakly".

And I think we have that in "cpu_relax()". Because if you have somebody 
doing shared memory accesses in a loop without any memory barriers or 
locks or anything (ie the _ordering_ doesn't matter, only that some value 
has been seen), then dang it, I can't see how you can _possibly_ use 
anything else than that "cpu_relax()" somewhere in that loop.

			Linus

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-13 16:18                                                                         ` Linus Torvalds
@ 2009-02-13 17:33                                                                           ` Mathieu Desnoyers
  2009-02-13 17:53                                                                             ` Linus Torvalds
  0 siblings, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-13 17:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Bryan Wu, linux-kernel, ltt-dev, uclinux-dist-devel,
	Paul E. McKenney

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Fri, 13 Feb 2009, Mathieu Desnoyers wrote:
> > 
> > I created also
> > 
> > _STORE_SHARED()
> > _LOAD_SHARED()
> > 
> > which identify the variables which need to have cache flush done before
> > (load) or after (store). So we get both speed and identification when
> > needed (if we need to do batch updates linked with a single cache flush).
> > e.g.
> 
> The thing is, THAT JUST ABSOLUTELY SUCKS.
> 
> Lookie here - we don't want to flush the cache at every load of a shared 
> variable. There's no reason to. If you don't care about the orderign, you 
> might as well get the old values. That's what memory ordering _means_, for 
> chissake! In the absense of locks, loads may get stale values. It's that 
> easy.
> 
> A lot of code wants to access multiple variables, and they are potentially 
> nearby, and in the same cacheline. Making them all use _LOAD_SHARED() adds 
> absolutely no value - and makes it MUCH MUCH SLOWER.
> 

Hrm, I think there is a misunderstanding here, because _LOAD_SHARED() is
not much more than a simple comment.

The whole idea behind _LOAD_SHARED() is that it does not translate in
any different assembly output than a standard load. So no, it cannot be
possibly slower. It has no more side-effect than a simple comment in the
code, and that's its purpose : to identify those variables. So if we
find a code path doing

  _STORE_SHARED(x, v);
  smp_mc();
  while (_LOAD_SHARED(z) != val)
    cpu_relax();

We can verify very easily the code correctness :

A write cache flush is required after _STORE_SHARED
A read cache flush is required before _LOAD_SHARED
Read cache flushes are required to happen eventually between
  _LOAD_SHARED in the loop.

It's basically the same as having something like an eventual :

_STORE_ORDERED(x, v);
smp_mb();
_LOAD_ORDERED(z);

Instead of relying on a comment around smp_mb(); stating which variables
it orders. I would understand if you dislike it, but I find it rather
useful to have this information in the source code around the variable
access rather than formulated as a comment around the barrier. Actually
having both the barrier comment *and* this identification seems rather
good for code review.

> So what's the answer?
> 
> I already outlined it: either you use locks (which will do the magic for 
> you), or you use memory barriers. In no case do you make the access magic, 
> unless you have a compiler issue where you are afraid that the compiler 
> would turn it into _multiple_ accesses and potentially get inconsistent 
> results.
> 
> So the point about ACCESS_ONCE() is not, and never has been, about 
> re-ordering. We know that the CPU may re-order the accesses and give us 
> stale values (or values from the "future" wrt the other accesses around 
> it). That's not the point. The point of ACCESS_ONCE() is that we get 
> exactly _one_ value, and not two different ones (or none at all) because 
> of the compiler either re-loading it several times or not re-loading it at 
> all.
> 
> Anybody who confuses ACCESS_ONCE() with ordering is simply confused.
> 
> And we don't want to make any "load with cache flush" either. Which side 
> should the cache flush be on? Before? After? Both? Atomically? There is no 
> sane semantics for that.
> 

We might want to simply scrap the "safe and slow" version without
underscores (LOAD_SHARED, STORE_SHARED) which contain smp_rmc and
smp_wmc statements within the macro. But Paul insisted that he likes
having the proper memory ordering/cache coherency enforced within the
accessor macros. Personnally, I see much more value in the simple
"comment-only" versions _LOAD_SHARED/_STORE_SHARED matched with an
explicit cache flush statement because in a lot of cases, we will want
to do a batch of read/writes between cache flushes. Note that memory
barriers are already implicit in a lot of kernel primitives, namely
rcu_dereference, cmpxchg, spinlock, ... so this is debatable I guess.

> The only remaining sane semantics is to depend on memory barriers, and 
> then make a magic memory barrier that is extra weak and doesn't order 
> anythign at all, but just says "syncronize very weakly".

I agree completely. What I am proposing here is just to add syntaxic
sugar to better identify the variables related to those extra weak
barriers.

> 
> And I think we have that in "cpu_relax()". Because if you have somebody 
> doing shared memory accesses in a loop without any memory barriers or 
> locks or anything (ie the _ordering_ doesn't matter, only that some value 
> has been seen), then dang it, I can't see how you can _possibly_ use 
> anything else than that "cpu_relax()" somewhere in that loop.
> 

It must also be matched with the equivalent write flush barrier at the
write side, so hiding this deep within cpu_relax() only at the read-side
seems to hide a lot of what must be performed by the cores to exchange
the data properly. (ok we don't care about write cache flush for
Blackfin particularly, but I don't see why we should not start thinking
about what non-coherent caches small embedded devices can bring)

It's also worth noting that Paul and I have no agenda to push anything
into the mainline kernel to enforce anything like "wmc"-type cache flush
barriers. We are barely trying to find the best semantic to express our
userspace RCU algorithm, and I happen to have noticed this loophole
about non-coherent cache architectures. But ideally it would be good to
stay in sync with the Linux kernel's primitives, so this is why your
criticism is much appreciated.

Thanks,

Mathieu

> 			Linus
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-13 17:33                                                                           ` Mathieu Desnoyers
@ 2009-02-13 17:53                                                                             ` Linus Torvalds
  2009-02-13 18:09                                                                               ` Linus Torvalds
  2009-02-13 18:40                                                                               ` Mathieu Desnoyers
  0 siblings, 2 replies; 116+ messages in thread
From: Linus Torvalds @ 2009-02-13 17:53 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Nick Piggin, Bryan Wu, linux-kernel, ltt-dev, uclinux-dist-devel,
	Paul E. McKenney



On Fri, 13 Feb 2009, Mathieu Desnoyers wrote:
> 
> The whole idea behind _LOAD_SHARED() is that it does not translate in
> any different assembly output than a standard load. So no, it cannot be
> possibly slower. It has no more side-effect than a simple comment in the
> code, and that's its purpose : to identify those variables. So if we
> find a code path doing
> 
>   _STORE_SHARED(x, v);
>   smp_mc();
>   while (_LOAD_SHARED(z) != val)
>     cpu_relax();
> 
> We can verify very easily the code correctness :
> 
> A write cache flush is required after _STORE_SHARED
> A read cache flush is required before _LOAD_SHARED
> Read cache flushes are required to happen eventually between
>   _LOAD_SHARED in the loop.

That makes no sense.

First off, you had the comment that LOAD_SHARED() would flush caches, so 
your argument that it's just a load, nothing else, is in violation with 
your own statements. And I told you why such a thing is INSANE.

As to the underscore-version, what can it do? Nothing. It's perfectly fine 
to have something like this:

	while (_LOAD_SHARED(x) && _LOAD_SHARED(y)) {
		cpu_relax();
	}

and the thing is, there is no reason to do read-cache flushes between 
those two _LOAD_SHARED. So warning about it would be incorrect, and all it 
can do is be purely ugly "documentation" about the fact that it's doing a 
shared load, because it's not really allowed to warn about the fact that 
shared loads should have a cache flush in between, because THEY SHOULD 
NOT.

But it is also _ugly_.

And more importantly - if you see it as a documentation thing, then it's 
broken in the first place - you're documenting the places that you already 
know about, and already know are important, rather than finding places 
that might be buggy. So what does it help us? Nothing.

You might as well just document the cpu_relax(). Which it implicitly does: 
it's a barrier in a tight loop.

In other words, I see no real point. Your [_][LOAD|STORE]_SHARED is ugly 
and doesn't add value, or adds value (the cache flush) in really really 
bad ways that aren't even very well-defined. 

			Linus

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-13 17:53                                                                             ` Linus Torvalds
@ 2009-02-13 18:09                                                                               ` Linus Torvalds
  2009-02-13 18:54                                                                                 ` Mathieu Desnoyers
  2009-02-14  3:15                                                                                 ` [Uclinux-dist-devel] " Mike Frysinger
  2009-02-13 18:40                                                                               ` Mathieu Desnoyers
  1 sibling, 2 replies; 116+ messages in thread
From: Linus Torvalds @ 2009-02-13 18:09 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Nick Piggin, Bryan Wu, linux-kernel, ltt-dev, uclinux-dist-devel,
	Paul E. McKenney



Btw, for user space, if you want to do this all right for something like 
BF. I think the only _correct_ thing to do (in the sense that the end 
result will actually be debuggable) is to essentially give full SMP 
coherency in user space.

It's doable, but rather complicated, and I'm not 100% sure it really ends 
up making sense. The way to do it is to just simply say:

 - never map the same page writably on two different cores, and always 
   flush the cache (on the receiving side) when you switch a page from one 
   core to another.

Now, the kernel can't really do that reasonably, but user space possibly could.

Now, I realize that blackfin doesn't actually even have a MMU or a TLB, so 
by "mapping the same page" in that case we end up really meaning "having a 
shared mapping or thread". I think that _should_ be doable. The most 
trivial approach might be to simply limit all processes with shared 
mappings or CLONE_VM to core 0, and letting core 1 run everything else 
(but you could do it differently: mapping something with MAP_SHARED would 
force you to core 0, but threads would just force the thread group to 
stay on _one_ core, rather than necessarily a fixed one).

Yeah, because of the lack of real memory protection, the kernel can't 
_know_ that processes don't behave badly and access things that they 
didn't explicitly map, but I'm hoping that that is rare.

And yes, if you really want to use threads as a way to do something 
across cores, you'd be screwed - the kenrel would only schedule the 
threads on one CPU. But considering the undefined nature of threading on 
such a cpu, wouldn't that still be preferable? Wouldn't it be nice to have 
the knowledge that user space _looks_ cache-coherent by virtue of the 
kernel just limiting cores appropriately?

And then user space would simply not need to worry as much. Code written 
for another architecture will "just work" on BF SMP too. With the normal 
uclinux limitations, of course.

			Linus

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-13 17:53                                                                             ` Linus Torvalds
  2009-02-13 18:09                                                                               ` Linus Torvalds
@ 2009-02-13 18:40                                                                               ` Mathieu Desnoyers
  1 sibling, 0 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-13 18:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Bryan Wu, linux-kernel, ltt-dev, uclinux-dist-devel,
	Paul E. McKenney

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> On Fri, 13 Feb 2009, Mathieu Desnoyers wrote:
> > 
> > The whole idea behind _LOAD_SHARED() is that it does not translate in
> > any different assembly output than a standard load. So no, it cannot be
> > possibly slower. It has no more side-effect than a simple comment in the
> > code, and that's its purpose : to identify those variables. So if we
> > find a code path doing
> > 
> >   _STORE_SHARED(x, v);
> >   smp_mc();
> >   while (_LOAD_SHARED(z) != val)
> >     cpu_relax();
> > 
> > We can verify very easily the code correctness :
> > 
> > A write cache flush is required after _STORE_SHARED
> > A read cache flush is required before _LOAD_SHARED
> > Read cache flushes are required to happen eventually between
> >   _LOAD_SHARED in the loop.
> 
> That makes no sense.
> 
> First off, you had the comment that LOAD_SHARED() would flush caches, so 
> your argument that it's just a load, nothing else, is in violation with 
> your own statements. And I told you why such a thing is INSANE.
> 

LOAD_SHARED -> cache flush + load
_LOAD_SHARED -> simple load

There is no contradiction here. And I agree that LOAD_SHARED will
generally produce slow code and that it is inappropriate for multiple
accesses between cache-line flushes.

> As to the underscore-version, what can it do? Nothing. It's perfectly fine 
> to have something like this:
> 
> 	while (_LOAD_SHARED(x) && _LOAD_SHARED(y)) {
> 		cpu_relax();
> 	}
> 
> and the thing is, there is no reason to do read-cache flushes between 
> those two _LOAD_SHARED. So warning about it would be incorrect, and all it 
> can do is be purely ugly "documentation" about the fact that it's doing a 
> shared load, because it's not really allowed to warn about the fact that 
> shared loads should have a cache flush in between, because THEY SHOULD 
> NOT.

Doing two _LOAD_SHARED of different variables without cache flush is
perfectly fine, but we could trigger the warning if some code reads
_the same_ variable twice without any flush in between. The only sane
way I can foresee this being correct is if an interrupt (or signal in
userspace) handler is expected to execute the cache flush for us.

> 
> But it is also _ugly_.
> 

Agreed. I'm under the impression the code is yelling at me when I read
it. :-) Probably that the capital lettering is ill-chosen.

> And more importantly - if you see it as a documentation thing, then it's 
> broken in the first place - you're documenting the places that you already 
> know about, and already know are important, rather than finding places 
> that might be buggy. So what does it help us? Nothing.

Given we have to to match the LOADs and the STOREs within the source
would possibly lead to interesting findings, as learning that such STORE
is really expected to be propagated to the other CPU just because we
notice its associated LOAD. So I think having such in-place
documentation might actually help finding bugs.

> 
> You might as well just document the cpu_relax(). Which it implicitly does: 
> it's a barrier in a tight loop.
> 
> In other words, I see no real point. Your [_][LOAD|STORE]_SHARED is ugly 
> and doesn't add value, or adds value (the cache flush) in really really 
> bad ways that aren't even very well-defined. 
> 

I guess we will have to wait until someone really want to port Linux to
a SMP architecture with non-coherent caches before we can measure the
value of such macro-ish documentation. Now is probably way too soon.

Thanks,

Mathieu

> 			Linus
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-13 18:09                                                                               ` Linus Torvalds
@ 2009-02-13 18:54                                                                                 ` Mathieu Desnoyers
  2009-02-13 19:36                                                                                   ` Paul E. McKenney
  2009-02-14  3:15                                                                                 ` [Uclinux-dist-devel] " Mike Frysinger
  1 sibling, 1 reply; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-13 18:54 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, Bryan Wu, linux-kernel, ltt-dev, uclinux-dist-devel,
	Paul E. McKenney

* Linus Torvalds (torvalds@linux-foundation.org) wrote:
> 
> 
> Btw, for user space, if you want to do this all right for something like 
> BF. I think the only _correct_ thing to do (in the sense that the end 
> result will actually be debuggable) is to essentially give full SMP 
> coherency in user space.
> 
> It's doable, but rather complicated, and I'm not 100% sure it really ends 
> up making sense. The way to do it is to just simply say:
> 
>  - never map the same page writably on two different cores, and always 
>    flush the cache (on the receiving side) when you switch a page from one 
>    core to another.
> 
> Now, the kernel can't really do that reasonably, but user space possibly could.
> 
> Now, I realize that blackfin doesn't actually even have a MMU or a TLB, so 
> by "mapping the same page" in that case we end up really meaning "having a 
> shared mapping or thread". I think that _should_ be doable. The most 
> trivial approach might be to simply limit all processes with shared 
> mappings or CLONE_VM to core 0, and letting core 1 run everything else 
> (but you could do it differently: mapping something with MAP_SHARED would 
> force you to core 0, but threads would just force the thread group to 
> stay on _one_ core, rather than necessarily a fixed one).
> 
> Yeah, because of the lack of real memory protection, the kernel can't 
> _know_ that processes don't behave badly and access things that they 
> didn't explicitly map, but I'm hoping that that is rare.
> 
> And yes, if you really want to use threads as a way to do something 
> across cores, you'd be screwed - the kenrel would only schedule the 
> threads on one CPU. But considering the undefined nature of threading on 
> such a cpu, wouldn't that still be preferable? Wouldn't it be nice to have 
> the knowledge that user space _looks_ cache-coherent by virtue of the 
> kernel just limiting cores appropriately?
> 
> And then user space would simply not need to worry as much. Code written 
> for another architecture will "just work" on BF SMP too. With the normal 
> uclinux limitations, of course.
> 
> 			Linus
> 

I don't know enough about BF to tell for sure, but the other way around
I see that would still permit running threads with shared memory space
on different CPUs is to call a cache flush each time a userspace lock is
taken/released (at the synchronization points where the "magic
test-and-set instruction" is used) _from_ userspace.

If some more elaborate userspace MT code uses something else than those
basic locks provided by core libraries to synchronize data exchange,
then it would be on its own and have to ensure cache flushing itself.

And yes, that would be incredibly costly/slow. This is why RCU-style
reader-sides are good : they have much more relaxed synchronization
constraints.

I am just thinking that the single-process to a single core solution you
propose above will be somewhat limiting if we end up with a 64-cores
non-cache-coherent architecture. They tend to be especially used for
stuff like video decoding, which is very easy to parallelize when shared
memory is available. But I guess we are not there yet.

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-13 18:54                                                                                 ` Mathieu Desnoyers
@ 2009-02-13 19:36                                                                                   ` Paul E. McKenney
  2009-02-14  5:07                                                                                     ` Mike Frysinger
  0 siblings, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-13 19:36 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Linus Torvalds, Nick Piggin, Bryan Wu, linux-kernel, ltt-dev,
	uclinux-dist-devel

On Fri, Feb 13, 2009 at 01:54:11PM -0500, Mathieu Desnoyers wrote:
> * Linus Torvalds (torvalds@linux-foundation.org) wrote:
> > 
> > 
> > Btw, for user space, if you want to do this all right for something like 
> > BF. I think the only _correct_ thing to do (in the sense that the end 
> > result will actually be debuggable) is to essentially give full SMP 
> > coherency in user space.
> > 
> > It's doable, but rather complicated, and I'm not 100% sure it really ends 
> > up making sense. The way to do it is to just simply say:
> > 
> >  - never map the same page writably on two different cores, and always 
> >    flush the cache (on the receiving side) when you switch a page from one 
> >    core to another.
> > 
> > Now, the kernel can't really do that reasonably, but user space possibly could.
> > 
> > Now, I realize that blackfin doesn't actually even have a MMU or a TLB, so 
> > by "mapping the same page" in that case we end up really meaning "having a 
> > shared mapping or thread". I think that _should_ be doable. The most 
> > trivial approach might be to simply limit all processes with shared 
> > mappings or CLONE_VM to core 0, and letting core 1 run everything else 
> > (but you could do it differently: mapping something with MAP_SHARED would 
> > force you to core 0, but threads would just force the thread group to 
> > stay on _one_ core, rather than necessarily a fixed one).
> > 
> > Yeah, because of the lack of real memory protection, the kernel can't 
> > _know_ that processes don't behave badly and access things that they 
> > didn't explicitly map, but I'm hoping that that is rare.
> > 
> > And yes, if you really want to use threads as a way to do something 
> > across cores, you'd be screwed - the kenrel would only schedule the 
> > threads on one CPU. But considering the undefined nature of threading on 
> > such a cpu, wouldn't that still be preferable? Wouldn't it be nice to have 
> > the knowledge that user space _looks_ cache-coherent by virtue of the 
> > kernel just limiting cores appropriately?
> > 
> > And then user space would simply not need to worry as much. Code written 
> > for another architecture will "just work" on BF SMP too. With the normal 
> > uclinux limitations, of course.
> > 
> > 			Linus
> > 
> 
> I don't know enough about BF to tell for sure, but the other way around
> I see that would still permit running threads with shared memory space
> on different CPUs is to call a cache flush each time a userspace lock is
> taken/released (at the synchronization points where the "magic
> test-and-set instruction" is used) _from_ userspace.
> 
> If some more elaborate userspace MT code uses something else than those
> basic locks provided by core libraries to synchronize data exchange,
> then it would be on its own and have to ensure cache flushing itself.

How about just doing a sched_setaffinity() in the BF case?  Sounds
like an easy way to implement Linus's suggestion of restricting the
multithreaded processes to a single core.  I have a hard time losing
sleep over the lack of parallelism in the case where the SMP support is
at best rudimentary...

> And yes, that would be incredibly costly/slow. This is why RCU-style
> reader-sides are good : they have much more relaxed synchronization
> constraints.
> 
> I am just thinking that the single-process to a single core solution you
> propose above will be somewhat limiting if we end up with a 64-cores
> non-cache-coherent architecture. They tend to be especially used for
> stuff like video decoding, which is very easy to parallelize when shared
> memory is available. But I guess we are not there yet.

If someone invests the silicon for 64 cores, but doesn't provide some
semblance of cache coherence, I have to question their sanity.  As a
kludgey quick fix to get to a dual-proc solution I can understand it,
but there is a limit!  ;-)

						Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Uclinux-dist-devel] [ltt-dev] [RFC git tree] Userspace RCU  (urcu) for Linux (repost)
  2009-02-13 16:05                                                                   ` Linus Torvalds
@ 2009-02-14  3:11                                                                     ` Mike Frysinger
  0 siblings, 0 replies; 116+ messages in thread
From: Mike Frysinger @ 2009-02-14  3:11 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Nick Piggin, linux-kernel, ltt-dev, Mathieu Desnoyers,
	uclinux-dist-devel, paulmck

On Fri, Feb 13, 2009 at 11:05, Linus Torvalds wrote:
> On Sat, 14 Feb 2009, Nick Piggin wrote:
>> Interesting. I don't know if you would say it is not cache coherent.
>> Does anything in cache coherency definition require timeliness? Only
>> causality I think.
>
> Nick, afaik, BF _really_ isn't cache coherent.
>
> It's not about timeliness. It's literally non-coherent.
>
> Blackfin L1 caches are
>  (a) write-through
>  (b) per-cpu
>  (c) non-coherent
> so the way that BF implements "cache coherency" is by literally
>
>  - use a magic test-and-set instruction that works on L2 memory (shared)
>  - keep track of which core has done that test-and-set last
>  - *flush* the L1 when it was the other core.
>
> Note that because it's a write-through cache, _writes_ are basically
> "coherent". But since the cache isn't actually _updated_ ont he other CPU,
> you can have two CPU's doing writes, and they'll both continue to see
> their own write, not necessarily the one that made it to memory. So I
> would not call that a "timeliness" issue, I would just say that the caches
> simply aren't coherent.
>
> But because it's write-through, flushing the cache always makes things
> coherent again (well, on _that_ CPU), of course.

it invalidates, not flushes, the cache when the lock changes hands.
and since the caches are forced to write through mode, the new core
should pick up all the correct data fresh from external memory.
-mike

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Uclinux-dist-devel] [ltt-dev] [RFC git tree] Userspace RCU  (urcu) for Linux (repost)
  2009-02-13 18:09                                                                               ` Linus Torvalds
  2009-02-13 18:54                                                                                 ` Mathieu Desnoyers
@ 2009-02-14  3:15                                                                                 ` Mike Frysinger
  1 sibling, 0 replies; 116+ messages in thread
From: Mike Frysinger @ 2009-02-14  3:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Mathieu Desnoyers, Nick Piggin, linux-kernel, ltt-dev,
	uclinux-dist-devel, Paul E. McKenney

On Fri, Feb 13, 2009 at 13:09, Linus Torvalds wrote:
> Btw, for user space, if you want to do this all right for something like
> BF. I think the only _correct_ thing to do (in the sense that the end
> result will actually be debuggable) is to essentially give full SMP
> coherency in user space.
>
> It's doable, but rather complicated, and I'm not 100% sure it really ends
> up making sense. The way to do it is to just simply say:
>
>  - never map the same page writably on two different cores, and always
>   flush the cache (on the receiving side) when you switch a page from one
>   core to another.
>
> Now, the kernel can't really do that reasonably, but user space possibly could.
>
> Now, I realize that blackfin doesn't actually even have a MMU or a TLB, so
> by "mapping the same page" in that case we end up really meaning "having a
> shared mapping or thread". I think that _should_ be doable. The most
> trivial approach might be to simply limit all processes with shared
> mappings or CLONE_VM to core 0, and letting core 1 run everything else
> (but you could do it differently: mapping something with MAP_SHARED would
> force you to core 0, but threads would just force the thread group to
> stay on _one_ core, rather than necessarily a fixed one).
>
> Yeah, because of the lack of real memory protection, the kernel can't
> _know_ that processes don't behave badly and access things that they
> didn't explicitly map, but I'm hoping that that is rare.
>
> And yes, if you really want to use threads as a way to do something
> across cores, you'd be screwed - the kenrel would only schedule the
> threads on one CPU. But considering the undefined nature of threading on
> such a cpu, wouldn't that still be preferable? Wouldn't it be nice to have
> the knowledge that user space _looks_ cache-coherent by virtue of the
> kernel just limiting cores appropriately?

the BF pseudo SMP does not allow threaded processes to run on multiple
cores simultaneously because of this desync crap.

well, the BF hardware does have optional memory protection, but the
overhead is often too great for most people.  it performs well enough
to do debugging, but not for people trying to multimedia and other fun
stuff.
-mike

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [Uclinux-dist-devel] [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-12 20:13                                                         ` Linus Torvalds
  2009-02-12 20:39                                                           ` Paul E. McKenney
@ 2009-02-14  4:58                                                           ` Robin Getz
  1 sibling, 0 replies; 116+ messages in thread
From: Robin Getz @ 2009-02-14  4:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: uclinux-dist-devel, Paul E. McKenney, Mathieu Desnoyers, ltt-dev,
	linux-kernel

On Thu 12 Feb 2009 15:13, Linus Torvalds pondered:
> On Thu, 12 Feb 2009, Paul E. McKenney wrote:
> > And, now that you mention it, I have heard rumors that other CPU 
> > families can violate cache coherence in some circumstances.
> 
> I personally suspect that the BF pseudo-SMP code is just broken, and
> that it likely has tons of subtle bugs and races - because we _do_ depend
> on cache coherency at least for accessing objects next to each other.

I felt similarly, however after using it, testing it, beating the crap out of 
it for while, and only finding a few niggly bugs which were corrected - it 
appears to be as stable as any other Linux kernel I have used on embedded 
hardware (which means rock solid).

There are a few people shipping it in their products today - so at least the 
stablily was also good enough for them.

If you have any other test suggestions which might expose the corner cases 
that normal use / LTP /  would not - I'm happy to try it out. The problem 
is - as Mike stated - since we do limit applications to one core at a time - 
multi-threading userspace isn't really interesting a problem.

As for the claim of "broken" hardware - I don't think that is true -- the 
hardware works as designed, as advertised. It was just never architected to 
run a SMP operating system. While you could claim that we are trying to force 
fit (SMP) something where it doesn't belong (non-cache coherence system), it 
is us that are "broken" - not the hardware :)

-Robin

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-13 19:36                                                                                   ` Paul E. McKenney
@ 2009-02-14  5:07                                                                                     ` Mike Frysinger
  2009-02-14  5:20                                                                                       ` Paul E. McKenney
  0 siblings, 1 reply; 116+ messages in thread
From: Mike Frysinger @ 2009-02-14  5:07 UTC (permalink / raw)
  To: paulmck
  Cc: Mathieu Desnoyers, Linus Torvalds, Nick Piggin, Bryan Wu,
	linux-kernel, ltt-dev, uclinux-dist-devel

On Fri, Feb 13, 2009 at 14:36, Paul E. McKenney wrote:
> On Fri, Feb 13, 2009 at 01:54:11PM -0500, Mathieu Desnoyers wrote:
>> * Linus Torvalds (torvalds@linux-foundation.org) wrote:
>> > Btw, for user space, if you want to do this all right for something like
>> > BF. I think the only _correct_ thing to do (in the sense that the end
>> > result will actually be debuggable) is to essentially give full SMP
>> > coherency in user space.
>> >
>> > It's doable, but rather complicated, and I'm not 100% sure it really ends
>> > up making sense. The way to do it is to just simply say:
>> >
>> >  - never map the same page writably on two different cores, and always
>> >    flush the cache (on the receiving side) when you switch a page from one
>> >    core to another.
>> >
>> > Now, the kernel can't really do that reasonably, but user space possibly could.
>> >
>> > Now, I realize that blackfin doesn't actually even have a MMU or a TLB, so
>> > by "mapping the same page" in that case we end up really meaning "having a
>> > shared mapping or thread". I think that _should_ be doable. The most
>> > trivial approach might be to simply limit all processes with shared
>> > mappings or CLONE_VM to core 0, and letting core 1 run everything else
>> > (but you could do it differently: mapping something with MAP_SHARED would
>> > force you to core 0, but threads would just force the thread group to
>> > stay on _one_ core, rather than necessarily a fixed one).
>> >
>> > Yeah, because of the lack of real memory protection, the kernel can't
>> > _know_ that processes don't behave badly and access things that they
>> > didn't explicitly map, but I'm hoping that that is rare.
>> >
>> > And yes, if you really want to use threads as a way to do something
>> > across cores, you'd be screwed - the kenrel would only schedule the
>> > threads on one CPU. But considering the undefined nature of threading on
>> > such a cpu, wouldn't that still be preferable? Wouldn't it be nice to have
>> > the knowledge that user space _looks_ cache-coherent by virtue of the
>> > kernel just limiting cores appropriately?
>> >
>> > And then user space would simply not need to worry as much. Code written
>> > for another architecture will "just work" on BF SMP too. With the normal
>> > uclinux limitations, of course.
>>
>> I don't know enough about BF to tell for sure, but the other way around
>> I see that would still permit running threads with shared memory space
>> on different CPUs is to call a cache flush each time a userspace lock is
>> taken/released (at the synchronization points where the "magic
>> test-and-set instruction" is used) _from_ userspace.
>>
>> If some more elaborate userspace MT code uses something else than those
>> basic locks provided by core libraries to synchronize data exchange,
>> then it would be on its own and have to ensure cache flushing itself.
>
> How about just doing a sched_setaffinity() in the BF case?  Sounds
> like an easy way to implement Linus's suggestion of restricting the
> multithreaded processes to a single core.  I have a hard time losing
> sleep over the lack of parallelism in the case where the SMP support is
> at best rudimentary...

the quick way is to tell people to run their program through `taskset`
(which is what we're doing now).

the next step up (or down depending on how you look at it) would be to
hook the clone function to do this automatically.  i havent gotten
around to testing this yet which is why there isnt anything in there
yet though.

asmlinkage int bfin_clone(struct pt_regs....
       unsigned long clone_flags;
       unsigned long newsp;

+#ifdef CONFIG_SMP
+       if (current->rt.nr_cpus_allowed == NR_CPUS) {
+               current->cpus_allowed = cpumask_of_cpu(smp_processor_id());
+               current->rt.nr_cpus_allowed = 1;
+       }
+#endif
+
       /* syscall2 puts clone_flags in r0 and usp in r1 */
       clone_flags = regs->r0;
       newsp = regs->r1;
-mike

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-14  5:07                                                                                     ` Mike Frysinger
@ 2009-02-14  5:20                                                                                       ` Paul E. McKenney
  2009-02-14  5:46                                                                                         ` Mike Frysinger
  2009-02-14  6:42                                                                                         ` Mathieu Desnoyers
  0 siblings, 2 replies; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-14  5:20 UTC (permalink / raw)
  To: Mike Frysinger
  Cc: Mathieu Desnoyers, Linus Torvalds, Nick Piggin, Bryan Wu,
	linux-kernel, ltt-dev, uclinux-dist-devel

On Sat, Feb 14, 2009 at 12:07:46AM -0500, Mike Frysinger wrote:
> On Fri, Feb 13, 2009 at 14:36, Paul E. McKenney wrote:
> > On Fri, Feb 13, 2009 at 01:54:11PM -0500, Mathieu Desnoyers wrote:
> >> * Linus Torvalds (torvalds@linux-foundation.org) wrote:
> >> > Btw, for user space, if you want to do this all right for something like
> >> > BF. I think the only _correct_ thing to do (in the sense that the end
> >> > result will actually be debuggable) is to essentially give full SMP
> >> > coherency in user space.
> >> >
> >> > It's doable, but rather complicated, and I'm not 100% sure it really ends
> >> > up making sense. The way to do it is to just simply say:
> >> >
> >> >  - never map the same page writably on two different cores, and always
> >> >    flush the cache (on the receiving side) when you switch a page from one
> >> >    core to another.
> >> >
> >> > Now, the kernel can't really do that reasonably, but user space possibly could.
> >> >
> >> > Now, I realize that blackfin doesn't actually even have a MMU or a TLB, so
> >> > by "mapping the same page" in that case we end up really meaning "having a
> >> > shared mapping or thread". I think that _should_ be doable. The most
> >> > trivial approach might be to simply limit all processes with shared
> >> > mappings or CLONE_VM to core 0, and letting core 1 run everything else
> >> > (but you could do it differently: mapping something with MAP_SHARED would
> >> > force you to core 0, but threads would just force the thread group to
> >> > stay on _one_ core, rather than necessarily a fixed one).
> >> >
> >> > Yeah, because of the lack of real memory protection, the kernel can't
> >> > _know_ that processes don't behave badly and access things that they
> >> > didn't explicitly map, but I'm hoping that that is rare.
> >> >
> >> > And yes, if you really want to use threads as a way to do something
> >> > across cores, you'd be screwed - the kenrel would only schedule the
> >> > threads on one CPU. But considering the undefined nature of threading on
> >> > such a cpu, wouldn't that still be preferable? Wouldn't it be nice to have
> >> > the knowledge that user space _looks_ cache-coherent by virtue of the
> >> > kernel just limiting cores appropriately?
> >> >
> >> > And then user space would simply not need to worry as much. Code written
> >> > for another architecture will "just work" on BF SMP too. With the normal
> >> > uclinux limitations, of course.
> >>
> >> I don't know enough about BF to tell for sure, but the other way around
> >> I see that would still permit running threads with shared memory space
> >> on different CPUs is to call a cache flush each time a userspace lock is
> >> taken/released (at the synchronization points where the "magic
> >> test-and-set instruction" is used) _from_ userspace.
> >>
> >> If some more elaborate userspace MT code uses something else than those
> >> basic locks provided by core libraries to synchronize data exchange,
> >> then it would be on its own and have to ensure cache flushing itself.
> >
> > How about just doing a sched_setaffinity() in the BF case?  Sounds
> > like an easy way to implement Linus's suggestion of restricting the
> > multithreaded processes to a single core.  I have a hard time losing
> > sleep over the lack of parallelism in the case where the SMP support is
> > at best rudimentary...
> 
> the quick way is to tell people to run their program through `taskset`
> (which is what we're doing now).

Not sure what environment Mathieu is looking to run his program from,
but he would need to run it on multiple architectures.

> the next step up (or down depending on how you look at it) would be to
> hook the clone function to do this automatically.  i havent gotten
> around to testing this yet which is why there isnt anything in there
> yet though.
> 
> asmlinkage int bfin_clone(struct pt_regs....
>        unsigned long clone_flags;
>        unsigned long newsp;
> 
> +#ifdef CONFIG_SMP
> +       if (current->rt.nr_cpus_allowed == NR_CPUS) {
> +               current->cpus_allowed = cpumask_of_cpu(smp_processor_id());
> +               current->rt.nr_cpus_allowed = 1;
> +       }
> +#endif
> +
>        /* syscall2 puts clone_flags in r0 and usp in r1 */
>        clone_flags = regs->r0;
>        newsp = regs->r1;

Wouldn't you also have to make sched_setaffinity() cut back to only one
CPU if more are specified?  If Blackfin handles hotplug CPU, that may
need attention as well, since tasks affinitied to the CPU being removed
can end up with their affinity set to all CPUs.  And there are probably
other issues.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-14  5:20                                                                                       ` Paul E. McKenney
@ 2009-02-14  5:46                                                                                         ` Mike Frysinger
  2009-02-14 15:06                                                                                           ` Paul E. McKenney
  2009-02-22 14:23                                                                                           ` Pavel Machek
  2009-02-14  6:42                                                                                         ` Mathieu Desnoyers
  1 sibling, 2 replies; 116+ messages in thread
From: Mike Frysinger @ 2009-02-14  5:46 UTC (permalink / raw)
  To: paulmck
  Cc: Mathieu Desnoyers, Linus Torvalds, Nick Piggin, Bryan Wu,
	linux-kernel, ltt-dev, uclinux-dist-devel

On Sat, Feb 14, 2009 at 00:20, Paul E. McKenney wrote:
> On Sat, Feb 14, 2009 at 12:07:46AM -0500, Mike Frysinger wrote:
>> On Fri, Feb 13, 2009 at 14:36, Paul E. McKenney wrote:
>> > On Fri, Feb 13, 2009 at 01:54:11PM -0500, Mathieu Desnoyers wrote:
>> >> * Linus Torvalds (torvalds@linux-foundation.org) wrote:
>> >> > Btw, for user space, if you want to do this all right for something like
>> >> > BF. I think the only _correct_ thing to do (in the sense that the end
>> >> > result will actually be debuggable) is to essentially give full SMP
>> >> > coherency in user space.
>> >> >
>> >> > It's doable, but rather complicated, and I'm not 100% sure it really ends
>> >> > up making sense. The way to do it is to just simply say:
>> >> >
>> >> >  - never map the same page writably on two different cores, and always
>> >> >    flush the cache (on the receiving side) when you switch a page from one
>> >> >    core to another.
>> >> >
>> >> > Now, the kernel can't really do that reasonably, but user space possibly could.
>> >> >
>> >> > Now, I realize that blackfin doesn't actually even have a MMU or a TLB, so
>> >> > by "mapping the same page" in that case we end up really meaning "having a
>> >> > shared mapping or thread". I think that _should_ be doable. The most
>> >> > trivial approach might be to simply limit all processes with shared
>> >> > mappings or CLONE_VM to core 0, and letting core 1 run everything else
>> >> > (but you could do it differently: mapping something with MAP_SHARED would
>> >> > force you to core 0, but threads would just force the thread group to
>> >> > stay on _one_ core, rather than necessarily a fixed one).
>> >> >
>> >> > Yeah, because of the lack of real memory protection, the kernel can't
>> >> > _know_ that processes don't behave badly and access things that they
>> >> > didn't explicitly map, but I'm hoping that that is rare.
>> >> >
>> >> > And yes, if you really want to use threads as a way to do something
>> >> > across cores, you'd be screwed - the kenrel would only schedule the
>> >> > threads on one CPU. But considering the undefined nature of threading on
>> >> > such a cpu, wouldn't that still be preferable? Wouldn't it be nice to have
>> >> > the knowledge that user space _looks_ cache-coherent by virtue of the
>> >> > kernel just limiting cores appropriately?
>> >> >
>> >> > And then user space would simply not need to worry as much. Code written
>> >> > for another architecture will "just work" on BF SMP too. With the normal
>> >> > uclinux limitations, of course.
>> >>
>> >> I don't know enough about BF to tell for sure, but the other way around
>> >> I see that would still permit running threads with shared memory space
>> >> on different CPUs is to call a cache flush each time a userspace lock is
>> >> taken/released (at the synchronization points where the "magic
>> >> test-and-set instruction" is used) _from_ userspace.
>> >>
>> >> If some more elaborate userspace MT code uses something else than those
>> >> basic locks provided by core libraries to synchronize data exchange,
>> >> then it would be on its own and have to ensure cache flushing itself.
>> >
>> > How about just doing a sched_setaffinity() in the BF case?  Sounds
>> > like an easy way to implement Linus's suggestion of restricting the
>> > multithreaded processes to a single core.  I have a hard time losing
>> > sleep over the lack of parallelism in the case where the SMP support is
>> > at best rudimentary...
>>
>> the quick way is to tell people to run their program through `taskset`
>> (which is what we're doing now).
>
> Not sure what environment Mathieu is looking to run his program from,
> but he would need to run it on multiple architectures.

right, that is exactly the kind of thing we strive to avoid on our
(the Blackfin) side of things

>> the next step up (or down depending on how you look at it) would be to
>> hook the clone function to do this automatically.  i havent gotten
>> around to testing this yet which is why there isnt anything in there
>> yet though.
>>
>> asmlinkage int bfin_clone(struct pt_regs....
>>        unsigned long clone_flags;
>>        unsigned long newsp;
>>
>> +#ifdef CONFIG_SMP
>> +       if (current->rt.nr_cpus_allowed == NR_CPUS) {
>> +               current->cpus_allowed = cpumask_of_cpu(smp_processor_id());
>> +               current->rt.nr_cpus_allowed = 1;
>> +       }
>> +#endif
>> +
>>        /* syscall2 puts clone_flags in r0 and usp in r1 */
>>        clone_flags = regs->r0;
>>        newsp = regs->r1;
>
> Wouldn't you also have to make sched_setaffinity() cut back to only one
> CPU if more are specified?

mmm, yes and no.  if we wanted to keep the transparency thing going,
then adding a check to the affinity functions to make sure threaded
apps dont span cpus would be needed.  but i would think we'd want to
have it return an error rather (EINVAL prob) than attempting to make
any automatic selections.  the only real blocker here would be
figuring out how to detect the application in question is threaded
with 100% accuracy.  the Blackfin port does not yet have TLS support
which means we're using linuxthreads rather than NPTL ...

hooking clone gives us the biggest bang for the buck: majority of
stuff today are threaded applications that dont look at affinity.

> If Blackfin handles hotplug CPU, that may
> need attention as well, since tasks affinitied to the CPU being removed
> can end up with their affinity set to all CPUs.  And there are probably
> other issues.

no, we dont support hotplugging of CPUs.  there is no hardware support
for it, so i think the only thing you'd gain is perhaps power savings
?  not sure it would even work in our case though as the hardware does
not support restarting or shutdown of one core ... they both have to
restart/shutdown.  putting one core into a constant idle loop would
save power, but that can already be accomplished by reducing the apps
that go onto a specific core.
-mike

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-14  5:20                                                                                       ` Paul E. McKenney
  2009-02-14  5:46                                                                                         ` Mike Frysinger
@ 2009-02-14  6:42                                                                                         ` Mathieu Desnoyers
  1 sibling, 0 replies; 116+ messages in thread
From: Mathieu Desnoyers @ 2009-02-14  6:42 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Mike Frysinger, Nick Piggin, Bryan Wu, linux-kernel, ltt-dev,
	uclinux-dist-devel, Linus Torvalds

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sat, Feb 14, 2009 at 12:07:46AM -0500, Mike Frysinger wrote:
> > On Fri, Feb 13, 2009 at 14:36, Paul E. McKenney wrote:
> > > On Fri, Feb 13, 2009 at 01:54:11PM -0500, Mathieu Desnoyers wrote:
> > >> * Linus Torvalds (torvalds@linux-foundation.org) wrote:
> > >> > Btw, for user space, if you want to do this all right for something like
> > >> > BF. I think the only _correct_ thing to do (in the sense that the end
> > >> > result will actually be debuggable) is to essentially give full SMP
> > >> > coherency in user space.
> > >> >
> > >> > It's doable, but rather complicated, and I'm not 100% sure it really ends
> > >> > up making sense. The way to do it is to just simply say:
> > >> >
> > >> >  - never map the same page writably on two different cores, and always
> > >> >    flush the cache (on the receiving side) when you switch a page from one
> > >> >    core to another.
> > >> >
> > >> > Now, the kernel can't really do that reasonably, but user space possibly could.
> > >> >
> > >> > Now, I realize that blackfin doesn't actually even have a MMU or a TLB, so
> > >> > by "mapping the same page" in that case we end up really meaning "having a
> > >> > shared mapping or thread". I think that _should_ be doable. The most
> > >> > trivial approach might be to simply limit all processes with shared
> > >> > mappings or CLONE_VM to core 0, and letting core 1 run everything else
> > >> > (but you could do it differently: mapping something with MAP_SHARED would
> > >> > force you to core 0, but threads would just force the thread group to
> > >> > stay on _one_ core, rather than necessarily a fixed one).
> > >> >
> > >> > Yeah, because of the lack of real memory protection, the kernel can't
> > >> > _know_ that processes don't behave badly and access things that they
> > >> > didn't explicitly map, but I'm hoping that that is rare.
> > >> >
> > >> > And yes, if you really want to use threads as a way to do something
> > >> > across cores, you'd be screwed - the kenrel would only schedule the
> > >> > threads on one CPU. But considering the undefined nature of threading on
> > >> > such a cpu, wouldn't that still be preferable? Wouldn't it be nice to have
> > >> > the knowledge that user space _looks_ cache-coherent by virtue of the
> > >> > kernel just limiting cores appropriately?
> > >> >
> > >> > And then user space would simply not need to worry as much. Code written
> > >> > for another architecture will "just work" on BF SMP too. With the normal
> > >> > uclinux limitations, of course.
> > >>
> > >> I don't know enough about BF to tell for sure, but the other way around
> > >> I see that would still permit running threads with shared memory space
> > >> on different CPUs is to call a cache flush each time a userspace lock is
> > >> taken/released (at the synchronization points where the "magic
> > >> test-and-set instruction" is used) _from_ userspace.
> > >>
> > >> If some more elaborate userspace MT code uses something else than those
> > >> basic locks provided by core libraries to synchronize data exchange,
> > >> then it would be on its own and have to ensure cache flushing itself.
> > >
> > > How about just doing a sched_setaffinity() in the BF case?  Sounds
> > > like an easy way to implement Linus's suggestion of restricting the
> > > multithreaded processes to a single core.  I have a hard time losing
> > > sleep over the lack of parallelism in the case where the SMP support is
> > > at best rudimentary...
> > 
> > the quick way is to tell people to run their program through `taskset`
> > (which is what we're doing now).
> 
> Not sure what environment Mathieu is looking to run his program from,
> but he would need to run it on multiple architectures.
> 

Given I plan to use this userspace rcu mechanism to ensure coherency of
the LTTng userspace tracing control data structures, and given I plan to
deploy it on a large set of architectures (ideally all architectures
supported by Linux), I need to understand the limitations linked to the
design choice we make. If we make the assumption that the caches are
coherent, that's fine, but we have to document it, because otherwise
people might think we would have taken that into account when it is not
the case. Knowing the limitation of the cache coherency model and memory
ordering will help us designing what I hope to be a rock-solid userspace
RCU library.

And if we document clearly the points of data exchange within our
implementation, it could possibly become an efficient way of supporting
SMP on such architectures given RCU need very few synchronization or to
better identify, for instances, remote vs local NUMA accesses. It leaves
room for exploration.

Mathieu


> > the next step up (or down depending on how you look at it) would be to
> > hook the clone function to do this automatically.  i havent gotten
> > around to testing this yet which is why there isnt anything in there
> > yet though.
> > 
> > asmlinkage int bfin_clone(struct pt_regs....
> >        unsigned long clone_flags;
> >        unsigned long newsp;
> > 
> > +#ifdef CONFIG_SMP
> > +       if (current->rt.nr_cpus_allowed == NR_CPUS) {
> > +               current->cpus_allowed = cpumask_of_cpu(smp_processor_id());
> > +               current->rt.nr_cpus_allowed = 1;
> > +       }
> > +#endif
> > +
> >        /* syscall2 puts clone_flags in r0 and usp in r1 */
> >        clone_flags = regs->r0;
> >        newsp = regs->r1;
> 
> Wouldn't you also have to make sched_setaffinity() cut back to only one
> CPU if more are specified?  If Blackfin handles hotplug CPU, that may
> need attention as well, since tasks affinitied to the CPU being removed
> can end up with their affinity set to all CPUs.  And there are probably
> other issues.
> 
> 							Thanx, Paul
> 
> _______________________________________________
> ltt-dev mailing list
> ltt-dev@lists.casi.polymtl.ca
> http://lists.casi.polymtl.ca/cgi-bin/mailman/listinfo/ltt-dev
> 

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-14  5:46                                                                                         ` Mike Frysinger
@ 2009-02-14 15:06                                                                                           ` Paul E. McKenney
  2009-02-14 17:37                                                                                             ` Mike Frysinger
  2009-02-22 14:23                                                                                           ` Pavel Machek
  1 sibling, 1 reply; 116+ messages in thread
From: Paul E. McKenney @ 2009-02-14 15:06 UTC (permalink / raw)
  To: Mike Frysinger
  Cc: Mathieu Desnoyers, Linus Torvalds, Nick Piggin, Bryan Wu,
	linux-kernel, ltt-dev, uclinux-dist-devel

On Sat, Feb 14, 2009 at 12:46:02AM -0500, Mike Frysinger wrote:
> On Sat, Feb 14, 2009 at 00:20, Paul E. McKenney wrote:
> > On Sat, Feb 14, 2009 at 12:07:46AM -0500, Mike Frysinger wrote:
> >> On Fri, Feb 13, 2009 at 14:36, Paul E. McKenney wrote:
> >> > On Fri, Feb 13, 2009 at 01:54:11PM -0500, Mathieu Desnoyers wrote:
> >> >> * Linus Torvalds (torvalds@linux-foundation.org) wrote:
> >> >> > Btw, for user space, if you want to do this all right for something like
> >> >> > BF. I think the only _correct_ thing to do (in the sense that the end
> >> >> > result will actually be debuggable) is to essentially give full SMP
> >> >> > coherency in user space.
> >> >> >
> >> >> > It's doable, but rather complicated, and I'm not 100% sure it really ends
> >> >> > up making sense. The way to do it is to just simply say:
> >> >> >
> >> >> >  - never map the same page writably on two different cores, and always
> >> >> >    flush the cache (on the receiving side) when you switch a page from one
> >> >> >    core to another.
> >> >> >
> >> >> > Now, the kernel can't really do that reasonably, but user space possibly could.
> >> >> >
> >> >> > Now, I realize that blackfin doesn't actually even have a MMU or a TLB, so
> >> >> > by "mapping the same page" in that case we end up really meaning "having a
> >> >> > shared mapping or thread". I think that _should_ be doable. The most
> >> >> > trivial approach might be to simply limit all processes with shared
> >> >> > mappings or CLONE_VM to core 0, and letting core 1 run everything else
> >> >> > (but you could do it differently: mapping something with MAP_SHARED would
> >> >> > force you to core 0, but threads would just force the thread group to
> >> >> > stay on _one_ core, rather than necessarily a fixed one).
> >> >> >
> >> >> > Yeah, because of the lack of real memory protection, the kernel can't
> >> >> > _know_ that processes don't behave badly and access things that they
> >> >> > didn't explicitly map, but I'm hoping that that is rare.
> >> >> >
> >> >> > And yes, if you really want to use threads as a way to do something
> >> >> > across cores, you'd be screwed - the kenrel would only schedule the
> >> >> > threads on one CPU. But considering the undefined nature of threading on
> >> >> > such a cpu, wouldn't that still be preferable? Wouldn't it be nice to have
> >> >> > the knowledge that user space _looks_ cache-coherent by virtue of the
> >> >> > kernel just limiting cores appropriately?
> >> >> >
> >> >> > And then user space would simply not need to worry as much. Code written
> >> >> > for another architecture will "just work" on BF SMP too. With the normal
> >> >> > uclinux limitations, of course.
> >> >>
> >> >> I don't know enough about BF to tell for sure, but the other way around
> >> >> I see that would still permit running threads with shared memory space
> >> >> on different CPUs is to call a cache flush each time a userspace lock is
> >> >> taken/released (at the synchronization points where the "magic
> >> >> test-and-set instruction" is used) _from_ userspace.
> >> >>
> >> >> If some more elaborate userspace MT code uses something else than those
> >> >> basic locks provided by core libraries to synchronize data exchange,
> >> >> then it would be on its own and have to ensure cache flushing itself.
> >> >
> >> > How about just doing a sched_setaffinity() in the BF case?  Sounds
> >> > like an easy way to implement Linus's suggestion of restricting the
> >> > multithreaded processes to a single core.  I have a hard time losing
> >> > sleep over the lack of parallelism in the case where the SMP support is
> >> > at best rudimentary...
> >>
> >> the quick way is to tell people to run their program through `taskset`
> >> (which is what we're doing now).
> >
> > Not sure what environment Mathieu is looking to run his program from,
> > but he would need to run it on multiple architectures.
> 
> right, that is exactly the kind of thing we strive to avoid on our
> (the Blackfin) side of things
> 
> >> the next step up (or down depending on how you look at it) would be to
> >> hook the clone function to do this automatically.  i havent gotten
> >> around to testing this yet which is why there isnt anything in there
> >> yet though.
> >>
> >> asmlinkage int bfin_clone(struct pt_regs....
> >>        unsigned long clone_flags;
> >>        unsigned long newsp;
> >>
> >> +#ifdef CONFIG_SMP
> >> +       if (current->rt.nr_cpus_allowed == NR_CPUS) {
> >> +               current->cpus_allowed = cpumask_of_cpu(smp_processor_id());
> >> +               current->rt.nr_cpus_allowed = 1;
> >> +       }
> >> +#endif
> >> +
> >>        /* syscall2 puts clone_flags in r0 and usp in r1 */
> >>        clone_flags = regs->r0;
> >>        newsp = regs->r1;
> >
> > Wouldn't you also have to make sched_setaffinity() cut back to only one
> > CPU if more are specified?
> 
> mmm, yes and no.  if we wanted to keep the transparency thing going,
> then adding a check to the affinity functions to make sure threaded
> apps dont span cpus would be needed.  but i would think we'd want to
> have it return an error rather (EINVAL prob) than attempting to make
> any automatic selections.  the only real blocker here would be
> figuring out how to detect the application in question is threaded
> with 100% accuracy.  the Blackfin port does not yet have TLS support
> which means we're using linuxthreads rather than NPTL ...
> 
> hooking clone gives us the biggest bang for the buck: majority of
> stuff today are threaded applications that dont look at affinity.
> 
> > If Blackfin handles hotplug CPU, that may
> > need attention as well, since tasks affinitied to the CPU being removed
> > can end up with their affinity set to all CPUs.  And there are probably
> > other issues.
> 
> no, we dont support hotplugging of CPUs.  there is no hardware support
> for it, so i think the only thing you'd gain is perhaps power savings
> ?  not sure it would even work in our case though as the hardware does
> not support restarting or shutdown of one core ... they both have to
> restart/shutdown.  putting one core into a constant idle loop would
> save power, but that can already be accomplished by reducing the apps
> that go onto a specific core.

OK, that removes that issue, at least aside from any people who will
take the software approach to CPU hotplug (leaving the unplugged CPU
spinning with irqs disabled or some such).

Other potential issues include unrelated processes that share memory via
shmget() or mmap() -- presumably groups of such processes would need to
be bound to a single CPU?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-14 15:06                                                                                           ` Paul E. McKenney
@ 2009-02-14 17:37                                                                                             ` Mike Frysinger
  0 siblings, 0 replies; 116+ messages in thread
From: Mike Frysinger @ 2009-02-14 17:37 UTC (permalink / raw)
  To: paulmck
  Cc: Mathieu Desnoyers, Linus Torvalds, Nick Piggin, Bryan Wu,
	linux-kernel, ltt-dev, uclinux-dist-devel

On Sat, Feb 14, 2009 at 10:06, Paul E. McKenney wrote:
> Other potential issues include unrelated processes that share memory via
> shmget() or mmap() -- presumably groups of such processes would need to
> be bound to a single CPU?

using PROT_WRITE with MAP_SHARED is a mess with no-mmu already ... in
other words, it isnt in use in practice as most of the time, the
kernel wont even grant it

i dont believe we've looked at shm/ipc ... i'll open a tracker item on
our side to review it
-mike

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-14  5:46                                                                                         ` Mike Frysinger
  2009-02-14 15:06                                                                                           ` Paul E. McKenney
@ 2009-02-22 14:23                                                                                           ` Pavel Machek
  2009-02-22 18:28                                                                                             ` Mike Frysinger
  1 sibling, 1 reply; 116+ messages in thread
From: Pavel Machek @ 2009-02-22 14:23 UTC (permalink / raw)
  To: Mike Frysinger
  Cc: paulmck, Mathieu Desnoyers, Linus Torvalds, Nick Piggin,
	Bryan Wu, linux-kernel, ltt-dev, uclinux-dist-devel


Hi!

> > If Blackfin handles hotplug CPU, that may
> > need attention as well, since tasks affinitied to the CPU being removed
> > can end up with their affinity set to all CPUs.  And there are probably
> > other issues.
> 
> no, we dont support hotplugging of CPUs.  there is no hardware support
> for it, so i think the only thing you'd gain is perhaps power savings
> ?  not sure it would even work in our case though as the hardware does
> not support restarting or shutdown of one core ... they both have to
> restart/shutdown.  putting one core into a constant idle loop would
> save power, but that can already be accomplished by reducing the apps
> that go onto a specific core.

Well, cpu hotplug is needed for suspend and hibernation...

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 116+ messages in thread

* Re: [ltt-dev] [RFC git tree] Userspace RCU (urcu) for Linux (repost)
  2009-02-22 14:23                                                                                           ` Pavel Machek
@ 2009-02-22 18:28                                                                                             ` Mike Frysinger
  0 siblings, 0 replies; 116+ messages in thread
From: Mike Frysinger @ 2009-02-22 18:28 UTC (permalink / raw)
  To: Pavel Machek
  Cc: paulmck, Mathieu Desnoyers, Linus Torvalds, Nick Piggin,
	Bryan Wu, linux-kernel, ltt-dev, uclinux-dist-devel

On Sun, Feb 22, 2009 at 09:23, Pavel Machek wrote:
>> > If Blackfin handles hotplug CPU, that may
>> > need attention as well, since tasks affinitied to the CPU being removed
>> > can end up with their affinity set to all CPUs.  And there are probably
>> > other issues.
>>
>> no, we dont support hotplugging of CPUs.  there is no hardware support
>> for it, so i think the only thing you'd gain is perhaps power savings
>> ?  not sure it would even work in our case though as the hardware does
>> not support restarting or shutdown of one core ... they both have to
>> restart/shutdown.  putting one core into a constant idle loop would
>> save power, but that can already be accomplished by reducing the apps
>> that go onto a specific core.
>
> Well, cpu hotplug is needed for suspend and hibernation...

we dont currently support hibernation (or at least we've never tested
it), but suspend to ram works fine on UP systems.  i'll open a tracker
item for someone to check suspend to ram on SMP systems.
-mike

^ permalink raw reply	[flat|nested] 116+ messages in thread

end of thread, other threads:[~2009-02-22 18:28 UTC | newest]

Thread overview: 116+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-02-06  3:05 [RFC git tree] Userspace RCU (urcu) for Linux Mathieu Desnoyers
2009-02-06  4:58 ` [RFC git tree] Userspace RCU (urcu) for Linux (repost) Mathieu Desnoyers
2009-02-06 13:06   ` Paul E. McKenney
2009-02-06 16:34     ` Paul E. McKenney
2009-02-07 15:10       ` Paul E. McKenney
2009-02-07 22:16         ` Paul E. McKenney
2009-02-08  0:19           ` Mathieu Desnoyers
2009-02-07 23:38         ` Mathieu Desnoyers
2009-02-08  0:44           ` Paul E. McKenney
2009-02-08 21:46             ` Mathieu Desnoyers
2009-02-08 22:36               ` Paul E. McKenney
2009-02-09  0:24                 ` Paul E. McKenney
2009-02-09  0:54                   ` Mathieu Desnoyers
2009-02-09  1:08                     ` [ltt-dev] " Mathieu Desnoyers
2009-02-09  3:47                       ` Paul E. McKenney
2009-02-09  3:42                     ` Paul E. McKenney
2009-02-09  0:40                 ` [ltt-dev] " Mathieu Desnoyers
2009-02-08 22:44       ` Mathieu Desnoyers
2009-02-09  4:11         ` Paul E. McKenney
2009-02-09  4:53           ` Mathieu Desnoyers
2009-02-09  5:17             ` [ltt-dev] " Mathieu Desnoyers
2009-02-09  7:03               ` Mathieu Desnoyers
2009-02-09 15:33                 ` Paul E. McKenney
2009-02-10 19:17                   ` Mathieu Desnoyers
2009-02-10 21:16                     ` Paul E. McKenney
2009-02-10 21:28                       ` Mathieu Desnoyers
2009-02-10 22:21                         ` Paul E. McKenney
2009-02-10 22:58                           ` Paul E. McKenney
2009-02-10 23:01                             ` Paul E. McKenney
2009-02-11  0:57                           ` Mathieu Desnoyers
2009-02-11  5:28                             ` Paul E. McKenney
2009-02-11  6:35                               ` Mathieu Desnoyers
2009-02-11 15:32                                 ` Paul E. McKenney
2009-02-11 18:52                                   ` Mathieu Desnoyers
2009-02-11 20:09                                     ` Paul E. McKenney
2009-02-11 21:42                                       ` Mathieu Desnoyers
2009-02-11 22:08                                         ` Mathieu Desnoyers
     [not found]                                         ` <20090212003549.GU6694@linux.vnet.ibm.com>
2009-02-12  2:33                                           ` Paul E. McKenney
2009-02-12  2:37                                             ` Paul E. McKenney
2009-02-12  4:10                                               ` Mathieu Desnoyers
2009-02-12  5:09                                                 ` Paul E. McKenney
2009-02-12  5:47                                                   ` Mathieu Desnoyers
2009-02-12 16:18                                                     ` Paul E. McKenney
2009-02-12 18:40                                                       ` Mathieu Desnoyers
2009-02-12 20:28                                                         ` Paul E. McKenney
2009-02-12 21:27                                                           ` Mathieu Desnoyers
2009-02-12 23:26                                                             ` Paul E. McKenney
2009-02-13 13:12                                                               ` Mathieu Desnoyers
2009-02-12  4:08                                             ` Mathieu Desnoyers
2009-02-12  5:01                                               ` Paul E. McKenney
2009-02-12  7:05                                                 ` Mathieu Desnoyers
2009-02-12 16:46                                                   ` Paul E. McKenney
2009-02-12 19:29                                                     ` Mathieu Desnoyers
2009-02-12 20:02                                                       ` Paul E. McKenney
2009-02-12 20:09                                                         ` Mathieu Desnoyers
2009-02-12 20:35                                                           ` Paul E. McKenney
2009-02-12 21:15                                                             ` Mathieu Desnoyers
2009-02-12 20:13                                                         ` Linus Torvalds
2009-02-12 20:39                                                           ` Paul E. McKenney
2009-02-12 21:15                                                             ` Linus Torvalds
2009-02-12 21:59                                                               ` Paul E. McKenney
2009-02-13 13:50                                                                 ` Nick Piggin
2009-02-13 14:56                                                                   ` Paul E. McKenney
2009-02-13 15:10                                                                     ` Mathieu Desnoyers
2009-02-13 15:55                                                                       ` Mathieu Desnoyers
2009-02-13 16:18                                                                         ` Linus Torvalds
2009-02-13 17:33                                                                           ` Mathieu Desnoyers
2009-02-13 17:53                                                                             ` Linus Torvalds
2009-02-13 18:09                                                                               ` Linus Torvalds
2009-02-13 18:54                                                                                 ` Mathieu Desnoyers
2009-02-13 19:36                                                                                   ` Paul E. McKenney
2009-02-14  5:07                                                                                     ` Mike Frysinger
2009-02-14  5:20                                                                                       ` Paul E. McKenney
2009-02-14  5:46                                                                                         ` Mike Frysinger
2009-02-14 15:06                                                                                           ` Paul E. McKenney
2009-02-14 17:37                                                                                             ` Mike Frysinger
2009-02-22 14:23                                                                                           ` Pavel Machek
2009-02-22 18:28                                                                                             ` Mike Frysinger
2009-02-14  6:42                                                                                         ` Mathieu Desnoyers
2009-02-14  3:15                                                                                 ` [Uclinux-dist-devel] " Mike Frysinger
2009-02-13 18:40                                                                               ` Mathieu Desnoyers
2009-02-13 16:05                                                                   ` Linus Torvalds
2009-02-14  3:11                                                                     ` [Uclinux-dist-devel] " Mike Frysinger
2009-02-14  4:58                                                           ` Robin Getz
2009-02-12 19:38                                                     ` Mathieu Desnoyers
2009-02-12 20:17                                                       ` Paul E. McKenney
2009-02-12 21:53                                                         ` Mathieu Desnoyers
2009-02-12 23:04                                                           ` Paul E. McKenney
2009-02-13 12:49                                                             ` Mathieu Desnoyers
2009-02-11  5:08                     ` Lai Jiangshan
2009-02-11  8:58                       ` Mathieu Desnoyers
2009-02-09 13:23               ` Paul E. McKenney
2009-02-09 17:28                 ` Mathieu Desnoyers
2009-02-09 17:47                   ` Paul E. McKenney
2009-02-09 18:13                     ` Mathieu Desnoyers
2009-02-09 18:19                       ` Mathieu Desnoyers
2009-02-09 18:37                       ` Paul E. McKenney
2009-02-09 18:49                         ` Paul E. McKenney
2009-02-09 19:05                           ` Mathieu Desnoyers
2009-02-09 19:15                             ` Mathieu Desnoyers
2009-02-09 19:35                               ` Paul E. McKenney
2009-02-09 19:23                             ` Paul E. McKenney
2009-02-09 13:16             ` Paul E. McKenney
2009-02-09 17:19               ` Bert Wesarg
2009-02-09 17:34                 ` Paul E. McKenney
2009-02-09 17:35                   ` Bert Wesarg
2009-02-09 17:40                     ` Paul E. McKenney
2009-02-09 17:42                       ` Mathieu Desnoyers
2009-02-09 18:00                         ` Paul E. McKenney
2009-02-09 17:45                       ` Bert Wesarg
2009-02-09 17:59                         ` Paul E. McKenney
2009-02-07 22:56   ` Kyle Moffett
2009-02-07 23:50     ` Mathieu Desnoyers
2009-02-08  0:13     ` Paul E. McKenney
2009-02-06  8:55 ` [RFC git tree] Userspace RCU (urcu) for Linux Bert Wesarg
2009-02-06 11:36   ` Mathieu Desnoyers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).