* dyntick-hpc and RCU
@ 2010-11-04 23:21 Paul E. McKenney
  2010-11-05  5:27 ` Frederic Weisbecker
  2010-11-05 21:00 ` [PATCH] a local-timer-free version of RCU Joe Korty
From: Paul E. McKenney @ 2010-11-04 23:21 UTC (permalink / raw)
  To: fweisbec, mathieu.desnoyers, dhowells, loic.minier, dhaval.giani
  Cc: tglx, peterz, linux-kernel, josh

Hello!

Just wanted some written record of our discussion this Wednesday.
I don't have an email address for Jim Houston, and I am not sure I have
all of the attendees, but here goes anyway.  Please don't hesitate to
reply with any corrections!

The goal is to be able to turn off scheduling-clock interrupts for
long-running user-mode execution when there is but one runnable task
on a given CPU, but while still allowing RCU to function correctly.
In particular, we need to minimize (or better, eliminate) any source
of interruption to such a CPU.  We discussed these approaches, along
with their advantages and disadvantages:

1.	If a user task is executing in dyntick-hpc mode, inform RCU
	of all kernel/user transitions, calling rcu_enter_nohz()
	on each transition to user-mode execution and calling
	rcu_exit_nohz() on each transition to kernel-mode execution.
	(A minimal sketch of these hooks appears just after this list.)

	+	Transitions due to interrupts and NMIs are already
		handled by the existing dyntick-idle code.

	+	RCU works without changes.

	-	-Every- exception path must be located and instrumented.

	-	Every system call must be instrumented.

	-	The system-call return fastpath is disabled by this
		approach, increasing the overhead of system calls.

	--	The scheduling-clock timer must be restarted on each
		transition to kernel-mode execution.  This is thought
		to be difficult on some of the exception code paths,
		and has high overhead regardless.

2.	Like #1 above, but instead of starting up the scheduling-clock
	timer on the CPU transitioning into the kernel, wake up a
	kthread that IPIs this CPU.  This has roughly the same
	advantages and disadvantages as #1 above, but substitutes
	a less-ugly kthread-wakeup operation for starting the
	scheduling-clock timer.

	There are a number of variations on this approach, but the
	rest of them are infeasible due to the fact that irq-disable
	and preempt-disable code sections are implicit read-side
	critical sections for RCU-sched.

3.	Substitute an RCU implementation similar to Jim Houston's
	real-time RCU implementation used by Concurrent.  (Jim posted
	this in 2004: http://lkml.org/lkml/2004/8/30/87 against
	2.6.1.1-mm4.)  In this implementation, the RCU grace periods
	are driven out of rcu_read_unlock(), so that there is no
	dependency on the scheduler-clock interrupt.

	+	Allows dyntick-hpc to simply require this alternative
		RCU implementation, without the need to interact
		with it.

	0	This implementation disables preemption across
		RCU read-side critical sections, which might be
		unacceptable for some users.  Or it might be OK,
		we were unable to determine this.

	0	This implementation increases the overhead of
		rcu_read_lock() and rcu_read_unlock().  However,
		this is probably acceptable, especially given that
		the workloads in question execute almost entirely
		in user space.

	---	Implicit RCU-sched and RCU-bh read-side critical
		sections would need to be explicitly marked with
		rcu_read_lock_sched() and rcu_read_lock_bh(),
		respectively.  Implicit critical sections include
		disabled preemption, disabled interrupts, hardirq
		handlers, and NMI handlers.  This change would
		require a large, intrusive, high-regression-risk patch.
		In addition, the hardirq-handler portion has been proposed
		and rejected in the past.

4.	Substitute an RCU implementation based on one of the
	user-level RCU implementations.  This has roughly the same
	advantages and disadvantages as does #3 above.

5.	Don't tell RCU about dyntick-hpc mode, but instead make RCU
	push processing through via some processor that is kept out
	of dyntick-hpc mode.  This requires that the rcutree RCU
	priority boosting be pushed further along so that RCU grace period
	and callback processing is done in kthread context, permitting
	remote forcing of grace periods.  The RCU_JIFFIES_TILL_FORCE_QS
	macro is promoted to a config variable, retaining its value
	of 3 in the absence of dyntick-hpc, but getting a value of HZ
	(or thereabouts) for dyntick-hpc builds.  In dyntick-hpc
	builds, force_quiescent_state() would push grace periods
	for CPUs lacking a scheduling-clock interrupt.

	+	Relatively small changes to RCU, some of which are
		coming with RCU priority boosting anyway.

	+	No need to inform RCU of user/kernel transitions.

	+	No need to turn scheduling-clock interrupts on
		at each user/kernel transition.

	-	Some IPIs to dyntick-hpc CPUs remain, but these
		are down in the every-second-or-so frequency,
		so hopefully are not a real problem.

6.	Your idea here!
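
For concreteness, here is a minimal sketch of the hooks that approach #1
would need.  The helper names and the per-task dyntick_hpc flag are
invented for illustration; rcu_enter_nohz() and rcu_exit_nohz() are the
existing dyntick-idle interfaces:

	#include <linux/rcupdate.h>
	#include <linux/sched.h>

	/* Called just before returning to user-mode execution. */
	static inline void dyntick_hpc_user_enter(void)
	{
		if (current->dyntick_hpc)	/* hypothetical per-task flag */
			rcu_enter_nohz();	/* RCU may now ignore this CPU */
	}

	/* Called on each kernel entry: system calls, exceptions, ... */
	static inline void dyntick_hpc_user_exit(void)
	{
		if (current->dyntick_hpc)
			rcu_exit_nohz();	/* RCU must watch this CPU again */
	}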

The general consensus at the end of the meeting was that #5 was most
likely to work out the best.

							Thanx, Paul

PS.  If anyone knows Jim Houston's email address, please feel free
     to forward to him.


* Re: dyntick-hpc and RCU
  2010-11-04 23:21 dyntick-hpc and RCU Paul E. McKenney
@ 2010-11-05  5:27 ` Frederic Weisbecker
  2010-11-05  5:38   ` Frederic Weisbecker
  2010-11-05 15:04   ` Paul E. McKenney
  2010-11-05 21:00 ` [PATCH] a local-timer-free version of RCU Joe Korty
From: Frederic Weisbecker @ 2010-11-05  5:27 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	peterz, linux-kernel, josh

On Thu, Nov 04, 2010 at 04:21:48PM -0700, Paul E. McKenney wrote:
> Hello!
> 
> Just wanted some written record of our discussion this Wednesday.
> I don't have an email address for Jim Houston, and I am not sure I have
> all of the attendees, but here goes anyway.  Please don't hesitate to
> reply with any corrections!



Thanks a lot for doing this.  I was about to send you an email
to get such a summary, especially for the 5th proposal, which
was actually not clear to me.




> 
> The goal is to be able to turn of scheduling-clock interrupts for
> long-running user-mode execution when there is but one runnable task
> on a given CPU, but while still allowing RCU to function correctly.
> In particular, we need to minimize (or better, eliminate) any source
> of interruption to such a CPU.  We discussed these approaches, along
> with their advantages and disadvantages:
> 
> 1.	If a user task is executing in dyntick-hpc mode, inform RCU
> 	of all kernel/user transitions, calling rcu_enter_nohz()
> 	on each transition to user-mode execution and calling
> 	rcu_exit_nohz() on each transition to kernel-mode execution.
> 
> 	+	Transitions due to interrupts and NMIs are already
> 		handled by the existing dyntick-idle code.
> 
> 	+	RCU works without changes.
> 
> 	-	-Every- exception path must be located and instrumented.


Yeah, that's bad.



> 
> 	-	Every system call must be instrumented.




Not really, we just need to enter the syscall slow-path mode (which
is still a "-" point, but at least we don't need to inspect every syscall).



> 
> 	-	The system-call return fastpath is disabled by this
> 		approach, increasing the overhead of system calls.


Yep.



> 
> 	--	The scheduling-clock timer must be restarted on each
> 		transition to kernel-mode execution.  This is thought
> 		to be difficult on some of the exception code paths,
> 		and has high overhead regardless.



Right.



> 
> 2.	Like #1 above, but instead of starting up the scheduling-clock
> 	timer on the CPU transitioning into the kernel, instead wake
> 	up a kthread that IPIs this CPU.  This has roughly the same
> 	advantages and disadvantages as #1 above, but substitutes
> 	a less-ugly kthread-wakeup operation in place of starting
> 	the scheduling-clock timer.
> 
> 	There are a number of variations on this approach, but the
> 	rest of them are infeasible due to the fact that irq-disable
> 	and preempt-disable code sections are implicit read-side
> 	critical sections for RCU-sched.




Yep, that approach is a bit better than 1.




> 3.	Substitute an RCU implementation similar to Jim Houston's
> 	real-time RCU implementation used by Concurrent.  (Jim posted
> 	this in 2004: http://lkml.org/lkml/2004/8/30/87 against
> 	2.6.1.1-mm4.)  In this implementation, the RCU grace periods
> 	are driven out of rcu_read_unlock(), so that there is no
> 	dependency on the scheduler-clock interrupt.
> 
> 	+	Allows dyntick-hpc to simply require this alternative
> 		RCU implementation, without the need to interact
> 		with it.
> 
> 	0	This implementation disables preemption across
> 		RCU read-side critical sections, which might be
> 		unacceptable for some users.  Or it might be OK,
> 		we were unable to determine this.



(Probably because of my misunderstanding of the question at that time.)

Requiring a preemption-disabled style of RCU read-side critical section
is probably not acceptable for our goals.  This CPU isolation thing is
targeted at HPC purposes (in which case I suspect it's perfectly fine to
have preemption disabled in rcu_read_lock()) but also at real-time
purposes (in which case we need rcu_read_lock() to be preemptable).

So this is rather a drawback.




> 
> 	0	This implementation increases the overhead of
> 		rcu_read_lock() and rcu_read_unlock().  However,
> 		this is probably acceptable, especially given that
> 		the workloads in question execute almost entirely
> 		in user space.



This overhead might need to be measured (if it's actually measurable),
but yeah.



> 
> 	---	Implicit RCU-sched and RCU-bh read-side critical
> 		sections would need to be explicitly marked with
> 		rcu_read_lock_sched() and rcu_read_lock_bh(),
> 		respectively.  Implicit critical sections include
> 		disabled preemption, disabled interrupts, hardirq
> 		handlers, and NMI handlers.  This change would
> 		require a large, intrusive, high-regression-risk patch.
> 		In addition, the hardirq-handler portion has been proposed
> 		and rejected in the past.



Now an alternative is to find who is really concerned by this
by looking at the users of rcu_dereference_sched() and
rcu_dereference_bh() (there are very few), convert them to use
rcu_read_lock(), and then get rid of the sched and bh RCU flavours.
Not sure we want that, though.  But it's worth noting that removing
the call to rcu_bh_qs() after each softirq handler, or to
rcu_check_callbacks() from the timer, could somehow cancel the overhead
from the rcu_read_unlock() calls.

OTOH, on traditional RCU configs, this adds the overhead of calling
rcu_read_lock() in sched/bh critical sections that would usually have
relied on the implicit grace period.

I guess this is probably a loss in the final picture.

Yet another solution is to require users of the bh and sched RCU flavours
to call a specific rcu_read_lock_sched()/_bh(), or something similar, that
would be implemented only in this new RCU config.  We would only need to
touch the existing users and the future ones, instead of adding an explicit
call to every implicit path.
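
To make that concrete, here is a rough sketch of what the two mappings
might look like, assuming a hypothetical config symbol (the name
CONFIG_RCU_UNIFIED is invented here); the #else branch is essentially
what mainline's rcu_read_lock_sched() already does:

	#ifdef CONFIG_RCU_UNIFIED
	static inline void rcu_read_lock_sched(void)
	{
		rcu_read_lock();	/* single flavour, unlock drives the GP */
	}
	static inline void rcu_read_unlock_sched(void)
	{
		rcu_read_unlock();
	}
	#else
	static inline void rcu_read_lock_sched(void)
	{
		preempt_disable();	/* classic RCU-sched read side */
	}
	static inline void rcu_read_unlock_sched(void)
	{
		preempt_enable();
	}
	#endif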



> 
> 4.	Substitute an RCU implementation based on one of the
> 	user-level RCU implementations.  This has roughly the same
> 	advantages and disadvantages as does #3 above.
> 
> 5.	Don't tell RCU about dyntick-hpc mode, but instead make RCU
> 	push processing through via some processor that is kept out
> 	of dyntick-hpc mode.



I don't understand what you mean.
Do you mean that dyntick-hpc cpu would enqueue rcu callbacks to
another CPU? But how does that protect rcu critical sections
in our dyntick-hpc CPU?




>       This requires that the rcutree RCU
> 	priority boosting be pushed further along so that RCU grace period
> 	and callback processing is done in kthread context, permitting
> 	remote forcing of grace periods.



I should have a look at the rcu priority boosting to understand what you
mean here.



>       The RCU_JIFFIES_TILL_FORCE_QS
> 	macro is promoted to a config variable, retaining its value
> 	of 3 in absence of dyntick-hpc, but getting value of HZ
> 	(or thereabouts) for dyntick-hpc builds.  In dyntick-hpc
> 	builds, force_quiescent_state() would push grace periods
> 	for CPUs lacking a scheduling-clock interrupt.
> 
> 	+	Relatively small changes to RCU, some of which is
> 		coming with RCU priority boosting anyway.
> 
> 	+	No need to inform RCU of user/kernel transitions.
> 
> 	+	No need to turn scheduling-clock interrupts on
> 		at each user/kernel transition.
> 
> 	-	Some IPIs to dyntick-hpc CPUs remain, but these
> 		are down in the every-second-or-so frequency,
> 		so hopefully are not a real problem.


Hmm, I hope we can avoid that; ideally the task in userspace shouldn't be
interrupted at all.

I wonder if we shouldn't go back to #3 eventually.



> 
> 6.	Your idea here!
> 
> The general consensus at the end of the meeting was that #5 was most
> likely to work out the best.


At that time yeah.

But now I don't know; I really need to dig deeper into it and really
understand how #5 works before picking that orientation :)

For now #3 seems to me more viable (with one of the adds I proposed).



> 							Thanx, Paul
> 
> PS.  If anyone knows Jim Houston's email address, please feel free
>      to forward to him.


I'll try to find him tomorrow and ask him his mail address :)

Thanks a lot!



* Re: dyntick-hpc and RCU
  2010-11-05  5:27 ` Frederic Weisbecker
@ 2010-11-05  5:38   ` Frederic Weisbecker
  2010-11-05 15:06     ` Paul E. McKenney
  2010-11-05 15:04   ` Paul E. McKenney
From: Frederic Weisbecker @ 2010-11-05  5:38 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	peterz, linux-kernel, josh

2010/11/5 Frederic Weisbecker <fweisbec@gmail.com>:
> For now #3 seems to me more viable (with one of the adds I proposed).


Doh, but I forgot it's not preemptable!  So #2 then looks more viable,
unless we can tweak #3 into getting it preemptable.


* Re: dyntick-hpc and RCU
  2010-11-05  5:27 ` Frederic Weisbecker
  2010-11-05  5:38   ` Frederic Weisbecker
@ 2010-11-05 15:04   ` Paul E. McKenney
  2010-11-08 14:10     ` Frederic Weisbecker
From: Paul E. McKenney @ 2010-11-05 15:04 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	peterz, linux-kernel, josh

On Fri, Nov 05, 2010 at 06:27:46AM +0100, Frederic Weisbecker wrote:
> On Thu, Nov 04, 2010 at 04:21:48PM -0700, Paul E. McKenney wrote:
> > Hello!
> > 
> > Just wanted some written record of our discussion this Wednesday.
> > I don't have an email address for Jim Houston, and I am not sure I have
> > all of the attendees, but here goes anyway.  Please don't hesitate to
> > reply with any corrections!
> 
> 
> 
> Thanks a lot for doing this. I was about to send you an email
> to get such a summarize. Especially for the 5th proposition that
> was actually not clear to me.
> 
> 
> 
> 
> > 
> > The goal is to be able to turn of scheduling-clock interrupts for
> > long-running user-mode execution when there is but one runnable task
> > on a given CPU, but while still allowing RCU to function correctly.
> > In particular, we need to minimize (or better, eliminate) any source
> > of interruption to such a CPU.  We discussed these approaches, along
> > with their advantages and disadvantages:
> > 
> > 1.	If a user task is executing in dyntick-hpc mode, inform RCU
> > 	of all kernel/user transitions, calling rcu_enter_nohz()
> > 	on each transition to user-mode execution and calling
> > 	rcu_exit_nohz() on each transition to kernel-mode execution.
> > 
> > 	+	Transitions due to interrupts and NMIs are already
> > 		handled by the existing dyntick-idle code.
> > 
> > 	+	RCU works without changes.
> > 
> > 	-	-Every- exception path must be located and instrumented.
> 
> 
> Yeah, that's bad.
> 
> 
> 
> > 
> > 	-	Every system call must be instrumented.
> 
> 
> 
> 
> Not really, we just need to enter into the syscall slow path mode (which
> is still a "-" point, but at least we don't need to inspect every syscalls).

OK, so either each system-call path must be instrumented or the
system-call return fastpath must be disabled.  ;-)

I have combined these two, and noted that disabling the system-call
fastpath seems to be the best choice.
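
For the record, one way to disable the fastpath only for affected tasks
would be a per-task thread flag wired into each arch's syscall-entry/exit
work masks.  The flag name below is invented; nothing like it exists yet:

	#include <linux/sched.h>

	/* Force this task's system calls through the slow path, where
	 * the dyntick-hpc/RCU hooks can run.  Each arch would need to
	 * add TIF_DYNTICK_HPC to its syscall-work masks. */
	static void dyntick_hpc_mark_task(struct task_struct *t)
	{
		set_tsk_thread_flag(t, TIF_DYNTICK_HPC);
	}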


> > 	-	The system-call return fastpath is disabled by this
> > 		approach, increasing the overhead of system calls.
> 
> 
> Yep.
> 
> 
> 
> > 
> > 	--	The scheduling-clock timer must be restarted on each
> > 		transition to kernel-mode execution.  This is thought
> > 		to be difficult on some of the exception code paths,
> > 		and has high overhead regardless.
> 
> 
> 
> Right.
> 
> 
> 
> > 
> > 2.	Like #1 above, but instead of starting up the scheduling-clock
> > 	timer on the CPU transitioning into the kernel, instead wake
> > 	up a kthread that IPIs this CPU.  This has roughly the same
> > 	advantages and disadvantages as #1 above, but substitutes
> > 	a less-ugly kthread-wakeup operation in place of starting
> > 	the scheduling-clock timer.
> > 
> > 	There are a number of variations on this approach, but the
> > 	rest of them are infeasible due to the fact that irq-disable
> > 	and preempt-disable code sections are implicit read-side
> > 	critical sections for RCU-sched.
> 
> 
> 
> 
> Yep, that approach is a bit better than 1.
> 
> 
> 
> 
> > 3.	Substitute an RCU implementation similar to Jim Houston's
> > 	real-time RCU implementation used by Concurrent.  (Jim posted
> > 	this in 2004: http://lkml.org/lkml/2004/8/30/87 against
> > 	2.6.1.1-mm4.)  In this implementation, the RCU grace periods
> > 	are driven out of rcu_read_unlock(), so that there is no
> > 	dependency on the scheduler-clock interrupt.
> > 
> > 	+	Allows dyntick-hpc to simply require this alternative
> > 		RCU implementation, without the need to interact
> > 		with it.
> > 
> > 	0	This implementation disables preemption across
> > 		RCU read-side critical sections, which might be
> > 		unacceptable for some users.  Or it might be OK,
> > 		we were unable to determine this.
> 
> 
> 
> (Probably because of my misunderstanding of the question at that time)
> 
> Requiring a preemption disabled style rcu read side critical section
> is probably not acceptable for our goals. This cpu isolation thing
> is targeted for HPC purpose (in which case I suspect it's perfectly
> fine to have preemption disabled in rcu_read_lock()) but also for real
> time purposes (in which case we need rcu_read_lock() to be preemptable).
> 
> So this is rather a drawback.

OK, I have marked it as a negative ("-").

> > 	0	This implementation increases the overhead of
> > 		rcu_read_lock() and rcu_read_unlock().  However,
> > 		this is probably acceptable, especially given that
> > 		the workloads in question execute almost entirely
> > 		in user space.
> 
> 
> 
> This overhead might need to be measured, if it's actually measurable),
> but yeah.
> 
> 
> 
> > 
> > 	---	Implicit RCU-sched and RCU-bh read-side critical
> > 		sections would need to be explicitly marked with
> > 		rcu_read_lock_sched() and rcu_read_lock_bh(),
> > 		respectively.  Implicit critical sections include
> > 		disabled preemption, disabled interrupts, hardirq
> > 		handlers, and NMI handlers.  This change would
> > 		require a large, intrusive, high-regression-risk patch.
> > 		In addition, the hardirq-handler portion has been proposed
> > 		and rejected in the past.
> 
> 
> 
> Now an alternative is to find who is really concerned by this
> by looking at the users of rcu_dereference_sched() and
> rcu_derefence_bh() (there are very few), and then convert them to use
> rcu_read_lock(), and then get rid of the sched and bh rcu flavours.
> Not sure we want that though. But it's just to notice that removing
> the call to rcu_bh_qs() after each softirq handler or rcu_check_callbacks()
> from the timer could somehow cancel the overhead from the rcu_read_unlock()
> calls.
> 
> OTOH, on traditional rcu configs, this requires the overhead of calling
> rcu_read_lock() in sched/bh critical section that usually would have relied
> on the implicit grace period.
> 
> I guess this is probably a loss in the final picture.
> 
> Yet another solution is to require users of bh and sched rcu flavours to
> call a specific rcu_read_lock_sched()/bh, or something similar, that would
> be only implemented in this new rcu config. We would only need to touch the
> existing users and the future ones instead of adding an explicit call
> to every implicit paths.

This approach would be a much nicer solution, and I do wish I had required
this to start with.  Unfortunately, at that time, there was no preemptible
RCU, CONFIG_PREEMPT, nor any RCU-bh, so there was no way to enforce this.
Besides which, I was thinking in terms of maybe 100 occurrences of the RCU
API in the kernel.  ;-)

> > 4.	Substitute an RCU implementation based on one of the
> > 	user-level RCU implementations.  This has roughly the same
> > 	advantages and disadvantages as does #3 above.
> > 
> > 5.	Don't tell RCU about dyntick-hpc mode, but instead make RCU
> > 	push processing through via some processor that is kept out
> > 	of dyntick-hpc mode.
> 
> I don't understand what you mean.
> Do you mean that dyntick-hpc cpu would enqueue rcu callbacks to
> another CPU? But how does that protect rcu critical sections
> in our dyntick-hpc CPU?

There is a large range of possible solutions, but any solution will need
to check for RCU read-side critical sections on the dyntick-hpc CPU.  I
was thinking in terms of IPIing the dyntick-hpc CPUs, but very infrequently,
say once per second.
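
Very roughly, and with invented names, the per-second poke could be as
simple as the following, driven from force_quiescent_state() or from a
kthread on a housekeeping CPU:

	#include <linux/smp.h>

	/* Runs on the target CPU.  It would sample that CPU's RCU state,
	 * for example the running task's rcu_read_lock_nesting in the
	 * preemptible-RCU case, and report a quiescent state if no
	 * read-side critical section is in progress. */
	static void dyntick_hpc_check(void *unused)
	{
	}

	static void dyntick_hpc_poll(int cpu)
	{
		smp_call_function_single(cpu, dyntick_hpc_check, NULL, 0);
	}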

> >       This requires that the rcutree RCU
> > 	priority boosting be pushed further along so that RCU grace period
> > 	and callback processing is done in kthread context, permitting
> > 	remote forcing of grace periods.
> 
> 
> 
> I should have a look at the rcu priority boosting to understand what you
> mean here.

The only thing that you really need to know about it is that I will be
moving the current softirq processing to kthread context.  The key point
here is that we can wake up a kthread on some other CPU.
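
The kthread-wakeup part is the easy bit.  A minimal sketch, assuming the
per-CPU task pointer below is set up by the priority-boosting patches:

	#include <linux/percpu.h>
	#include <linux/sched.h>

	static DEFINE_PER_CPU(struct task_struct *, rcu_cpu_kthread_task);

	/* Wake up CPU "cpu"'s RCU kthread; safe to call from any CPU. */
	static void invoke_rcu_cpu_kthread_on(int cpu)
	{
		struct task_struct *t = per_cpu(rcu_cpu_kthread_task, cpu);

		if (t)
			wake_up_process(t);
	}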

> >       The RCU_JIFFIES_TILL_FORCE_QS
> > 	macro is promoted to a config variable, retaining its value
> > 	of 3 in absence of dyntick-hpc, but getting value of HZ
> > 	(or thereabouts) for dyntick-hpc builds.  In dyntick-hpc
> > 	builds, force_quiescent_state() would push grace periods
> > 	for CPUs lacking a scheduling-clock interrupt.
> > 
> > 	+	Relatively small changes to RCU, some of which is
> > 		coming with RCU priority boosting anyway.
> > 
> > 	+	No need to inform RCU of user/kernel transitions.
> > 
> > 	+	No need to turn scheduling-clock interrupts on
> > 		at each user/kernel transition.
> > 
> > 	-	Some IPIs to dyntick-hpc CPUs remain, but these
> > 		are down in the every-second-or-so frequency,
> > 		so hopefully are not a real problem.
> 
> 
> Hmm, I hope we could avoid that, ideally the task in userspace shouldn't be
> interrupted at all.

Yep.  But if we do need to interrupt it, let's do it as infrequently as
we can!

> I wonder if we shouldn't go back to #3 eventually.

And there are variants of #3 that permit preemption of RCU read-side
critical sections.

> > 6.	Your idea here!
> > 
> > The general consensus at the end of the meeting was that #5 was most
> > likely to work out the best.

And I believe Dave Howells is working up something.

> At that time yeah.
> 
> But now I don't know, I really need to dig deeper into it and really
> understand how #5 works before picking that orientation :)

This is probably true for all of us for all of the options.  ;-)

> For now #3 seems to me more viable (with one of the adds I proposed).

The difficulty here is convincing everyone to change their code to add
RCU markers around all of the implicit IRQ disabling.  We can of course
add code to the existing IRQ enable/disable and preempt enable/disable
primitives, which would leave the enable/disable in hardware and in
random arch-dependent assembly code.
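
In other words, something along these lines, whether done by wrapping as
below (macro names invented) or by folding the calls into
local_irq_disable()/local_irq_enable() themselves.  And as noted, irq
disabling done directly in hardware or in arch assembly would still be
missed:

	#define local_irq_disable_marked()		\
	do {						\
		local_irq_disable();			\
		rcu_read_lock_sched();			\
	} while (0)

	#define local_irq_enable_marked()		\
	do {						\
		rcu_read_unlock_sched();		\
		local_irq_enable();			\
	} while (0)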

> > PS.  If anyone knows Jim Houston's email address, please feel free
> >      to forward to him.
> 
> 
> I'll try to find him tomorrow and ask him his mail address :)

Please let me know!

							Thanx, Paul

> Thanks a lot!
> 


* Re: dyntick-hpc and RCU
  2010-11-05  5:38   ` Frederic Weisbecker
@ 2010-11-05 15:06     ` Paul E. McKenney
  2010-11-05 20:06       ` Dhaval Giani
From: Paul E. McKenney @ 2010-11-05 15:06 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	peterz, linux-kernel, josh

On Fri, Nov 05, 2010 at 06:38:17AM +0100, Frederic Weisbecker wrote:
> 2010/11/5 Frederic Weisbecker <fweisbec@gmail.com>:
> > For now #3 seems to me more viable (with one of the adds I proposed).
> 
> Doh, but I forgot it's not preemptable! So #2 looks then more viable.
> Unless we can tweak #3 into getting it preemptable.

There are ways of getting preemptibility working.  This might not be
all that big a modification to TREE_PREEMPT_RCU, now that I think
of it.

							Thanx, Paul


* Re: dyntick-hpc and RCU
  2010-11-05 15:06     ` Paul E. McKenney
@ 2010-11-05 20:06       ` Dhaval Giani
From: Dhaval Giani @ 2010-11-05 20:06 UTC (permalink / raw)
  To: paulmck
  Cc: Frederic Weisbecker, mathieu.desnoyers, dhowells, loic.minier,
	tglx, peterz, linux-kernel, josh, jim.houston

[Adding Jim to the cc]

On Fri, Nov 5, 2010 at 11:06 AM, Paul E. McKenney
<paulmck@linux.vnet.ibm.com> wrote:
> On Fri, Nov 05, 2010 at 06:38:17AM +0100, Frederic Weisbecker wrote:
>> 2010/11/5 Frederic Weisbecker <fweisbec@gmail.com>:
>> > For now #3 seems to me more viable (with one of the adds I proposed).
>>
>> Doh, but I forgot it's not preemptable! So #2 looks then more viable.
>> Unless we can tweak #3 into getting it preemptable.
>
> There are ways of getting preemptiblity working.  This might not be
> all that big a modification to TREE_PREEMPT_RCU, now that I think
> of it.
>
>                                                        Thanx, Paul
>


* [PATCH] a local-timer-free version of RCU
  2010-11-04 23:21 dyntick-hpc and RCU Paul E. McKenney
  2010-11-05  5:27 ` Frederic Weisbecker
@ 2010-11-05 21:00 ` Joe Korty
  2010-11-06 19:28   ` Paul E. McKenney
                     ` (2 more replies)
From: Joe Korty @ 2010-11-05 21:00 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: fweisbec, mathieu.desnoyers, dhowells, loic.minier, dhaval.giani,
	tglx, peterz, linux-kernel, josh

On Thu, Nov 04, 2010 at 04:21:48PM -0700, Paul E. McKenney wrote:
> Just wanted some written record of our discussion this Wednesday.
> I don't have an email address for Jim Houston, and I am not sure I have
> all of the attendees, but here goes anyway.  Please don't hesitate to
> reply with any corrections!
> 
> The goal is to be able to turn of scheduling-clock interrupts for
> long-running user-mode execution when there is but one runnable task
> on a given CPU, but while still allowing RCU to function correctly.
> In particular, we need to minimize (or better, eliminate) any source
> of interruption to such a CPU.  We discussed these approaches, along
> with their advantages and disadvantages:




Jim Houston's timer-less version of RCU.
	
This rather ancient version of RCU handles RCU garbage
collection in the absence of a per-cpu local timer
interrupt.

This is a minimal forward port to 2.6.36.  It works,
but it is not yet a complete implementation of RCU.

Developed-by: Jim Houston <jim.houston@ccur.com>
Signed-off-by: Joe Korty <joe.korty@ccur.com>
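
For reference, the externally visible API is unchanged, so the usual
reader/updater pattern still applies.  A sketch (not part of the patch);
with this implementation it is rcu_read_unlock() that eventually drives
grace-period completion:

	#include <linux/kernel.h>
	#include <linux/rcupdate.h>
	#include <linux/slab.h>

	struct foo {
		int a;
		struct rcu_head rcu;
	};
	static struct foo *gp;

	static void foo_reclaim(struct rcu_head *head)
	{
		kfree(container_of(head, struct foo, rcu));
	}

	static int foo_read(void)
	{
		struct foo *p;
		int val = -1;

		rcu_read_lock();	/* sets IN_RCU_READ_LOCK in r->flags */
		p = rcu_dereference(gp);
		if (p)
			val = p->a;
		rcu_read_unlock();	/* may call rcu_quiescent() */
		return val;
	}

	/* Updaters are serialized by the caller, e.g. with a spinlock. */
	static void foo_update(struct foo *newp)
	{
		struct foo *old = gp;

		rcu_assign_pointer(gp, newp);
		if (old)
			call_rcu(&old->rcu, foo_reclaim);
	}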

Index: b/arch/x86/kernel/cpu/mcheck/mce.c
===================================================================
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -167,7 +167,8 @@ void mce_log(struct mce *mce)
 	mce->finished = 0;
 	wmb();
 	for (;;) {
-		entry = rcu_dereference_check_mce(mcelog.next);
+		entry = mcelog.next;
+		smp_read_barrier_depends();
 		for (;;) {
 			/*
 			 * If edac_mce is enabled, it will check the error type
@@ -1558,7 +1559,8 @@ static ssize_t mce_read(struct file *fil
 			goto out;
 	}
 
-	next = rcu_dereference_check_mce(mcelog.next);
+	next = mcelog.next;
+	smp_read_barrier_depends();
 
 	/* Only supports full reads right now */
 	err = -EINVAL;
Index: b/include/linux/rcushield.h
===================================================================
--- /dev/null
+++ b/include/linux/rcushield.h
@@ -0,0 +1,361 @@
+/*
+ * Read-Copy Update mechanism for mutual exclusion
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2001
+ *
+ * Author: Dipankar Sarma <dipankar@in.ibm.com>
+ *
+ * Based on the original work by Paul McKenney <paul.mckenney@us.ibm.com>
+ * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
+ * Papers:
+ * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
+ * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
+ *
+ * For detailed explanation of Read-Copy Update mechanism see -
+ * 		http://lse.sourceforge.net/locking/rcupdate.html
+ *
+ */
+
+#ifndef __LINUX_RCUPDATE_H
+#define __LINUX_RCUPDATE_H
+
+#ifdef __KERNEL__
+
+#include <linux/cache.h>
+#include <linux/spinlock.h>
+#include <linux/threads.h>
+#include <linux/smp.h>
+#include <linux/cpumask.h>
+
+/*
+ * These #includes are not used by shielded RCUs; they are here
+ * to match the #includes made by the other rcu implementations.
+ */
+#include <linux/seqlock.h>
+#include <linux/lockdep.h>
+#include <linux/completion.h>
+
+/**
+ * struct rcu_head - callback structure for use with RCU
+ * @next: next update requests in a list
+ * @func: actual update function to call after the grace period.
+ */
+struct rcu_head {
+	struct rcu_head *next;
+	void (*func)(struct rcu_head *head);
+};
+
+#define RCU_HEAD_INIT 	{ .next = NULL, .func = NULL }
+#define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT
+#define INIT_RCU_HEAD(ptr) do { \
+       (ptr)->next = NULL; (ptr)->func = NULL; \
+} while (0)
+
+/*
+ * The rcu_batch variable contains the current batch number
+ * and the following flags.  The RCU_NEXT_PENDING bit requests that
+ * a new batch should start when the current batch completes.  The
+ * RCU_COMPLETE bit indicates that the most recent batch has completed
+ * and RCU processing has stopped.
+ */
+extern long rcu_batch;
+#define RCU_BATCH_MASK		(~3)
+#define RCU_INCREMENT		4
+#define RCU_COMPLETE		2
+#define RCU_NEXT_PENDING	1
+
+/* Is batch a before batch b ? */
+static inline int rcu_batch_before(long a, long b)
+{
+	return (a - b) < 0;
+}
+
+/* Is batch a after batch b ? */
+static inline int rcu_batch_after(long a, long b)
+{
+	return (a - b) > 0;
+}
+
+static inline int rcu_batch_complete(long batch)
+{
+	return !rcu_batch_before((rcu_batch & ~RCU_NEXT_PENDING), batch);
+}
+
+struct rcu_list {
+	struct rcu_head *head;
+	struct rcu_head **tail;
+};
+
+static inline void rcu_list_init(struct rcu_list *l)
+{
+	l->head = NULL;
+	l->tail = &l->head;
+}
+
+static inline void rcu_list_add(struct rcu_list *l, struct rcu_head *h)
+{
+	*l->tail = h;
+	l->tail = &h->next;
+}
+
+static inline void rcu_list_move(struct rcu_list *to, struct rcu_list *from)
+{
+	if (from->head) {
+		*to->tail = from->head;
+		to->tail = from->tail;
+		rcu_list_init(from);
+	}
+}
+
+/*
+ * Per-CPU data for Read-Copy Update.
+ * nxt - new callbacks are added here
+ * cur - current batch for which a quiescent cycle has started, if any
+ */
+struct rcu_data {
+	/* 1) batch handling */
+	long  	       	batch;		/* batch # for current RCU batch */
+	unsigned long	nxtbatch;	/* batch # for next queue */
+	struct rcu_list nxt;
+	struct rcu_list cur;
+	struct rcu_list done;
+	long		nxtcount;	/* number of callbacks queued */
+	struct task_struct *krcud;
+	struct rcu_head barrier;
+
+	/* 2) synchronization between rcu_read_lock and rcu_start_batch. */
+	int		nest_count;	/* count of rcu_read_lock nesting */
+	unsigned int	flags;
+	unsigned int	sequence;	/* count of read locks. */
+};
+
+/*
+ * Flags values used to synchronize between rcu_read_lock/rcu_read_unlock
+ * and the rcu_start_batch.  Only processors executing rcu_read_lock
+ * protected code get invited to the rendezvous.
+ */
+#define	IN_RCU_READ_LOCK	1
+#define	DO_RCU_COMPLETION	2
+
+DECLARE_PER_CPU(struct rcu_data, rcu_data);
+
+/**
+ * rcu_assign_pointer - assign (publicize) a pointer to a newly
+ * initialized structure that will be dereferenced by RCU read-side
+ * critical sections.  Returns the value assigned.
+ *
+ * Inserts memory barriers on architectures that require them
+ * (pretty much all of them other than x86), and also prevents
+ * the compiler from reordering the code that initializes the
+ * structure after the pointer assignment.  More importantly, this
+ * call documents which pointers will be dereferenced by RCU read-side
+ * code.
+ */
+
+#define rcu_assign_pointer(p, v)	({ \
+						smp_wmb(); \
+						(p) = (v); \
+					})
+
+extern void rcu_init(void);
+extern void rcu_restart_cpu(int cpu);
+extern void rcu_quiescent(int cpu);
+extern void rcu_poll(int cpu);
+
+/* stubs for mainline rcu features we do not need */
+static inline void rcu_sched_qs(int cpu) { }
+static inline void rcu_bh_qs(int cpu) { }
+static inline int rcu_needs_cpu(int cpu) { return 0; }
+static inline void rcu_enter_nohz(void) { }
+static inline void rcu_exit_nohz(void) { }
+static inline void rcu_init_sched(void) { }
+
+extern void __rcu_read_lock(void);
+extern void __rcu_read_unlock(void);
+
+static inline void rcu_read_lock(void)
+{
+	preempt_disable();
+	__rcu_read_lock();
+}
+
+static inline void rcu_read_unlock(void)
+{
+	__rcu_read_unlock();
+	preempt_enable();
+}
+
+#define rcu_read_lock_sched() rcu_read_lock()
+#define rcu_read_unlock_sched() rcu_read_unlock()
+
+static inline void rcu_read_lock_sched_notrace(void)
+{
+	preempt_disable_notrace();
+	__rcu_read_lock();
+}
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+#error need DEBUG_LOCK_ALLOC definitions for rcu_read_lock_*_held
+#else
+static inline int rcu_read_lock_held(void)
+{
+	return 1;
+}
+
+static inline int rcu_read_lock_bh_held(void)
+{
+	return 1;
+}
+#endif /* CONFIG_DEBUG_LOCK_ALLOC */
+
+static inline int rcu_preempt_depth(void)
+{
+	return 0;
+}
+
+static inline void exit_rcu(void)
+{
+}
+
+static inline void rcu_read_unlock_sched_notrace(void)
+{
+	__rcu_read_unlock();
+	preempt_enable_notrace();
+}
+
+#ifdef CONFIG_DEBUG_KERNEL
+/*
+ * Try to catch code which depends on RCU but doesn't
+ * hold the rcu_read_lock.
+ */
+static inline void rcu_read_lock_assert(void)
+{
+#ifdef NOTYET
+	/* 2.6.13 has _lots_ of panics here.  Must fix up. */
+	struct rcu_data *r;
+
+	r = &per_cpu(rcu_data, smp_processor_id());
+	BUG_ON(r->nest_count == 0);
+#endif
+}
+#else
+static inline void rcu_read_lock_assert(void) {}
+#endif
+
+/*
+ * So where is rcu_write_lock()?  It does not exist, as there is no
+ * way for writers to lock out RCU readers.  This is a feature, not
+ * a bug -- this property is what provides RCU's performance benefits.
+ * Of course, writers must coordinate with each other.  The normal
+ * spinlock primitives work well for this, but any other technique may be
+ * used as well.  RCU does not care how the writers keep out of each
+ * others' way, as long as they do so.
+ */
+
+/**
+ * rcu_read_lock_bh - mark the beginning of a softirq-only RCU critical section
+ *
+ * This is equivalent of rcu_read_lock(), but to be used when updates
+ * are being done using call_rcu_bh(). Since call_rcu_bh() callbacks
+ * consider completion of a softirq handler to be a quiescent state,
+ * a process in RCU read-side critical section must be protected by
+ * disabling softirqs. Read-side critical sections in interrupt context
+ * can use just rcu_read_lock().
+ *
+ * Hack alert.  I'm not sure if I understand the reason this interface
+ * is needed and if it is still needed with my implementation of RCU.
+ */
+static inline void rcu_read_lock_bh(void)
+{
+	local_bh_disable();
+	rcu_read_lock();
+}
+
+/*
+ * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
+ *
+ * See rcu_read_lock_bh() for more information.
+ */
+static inline void rcu_read_unlock_bh(void)
+{
+	rcu_read_unlock();
+	local_bh_enable();
+}
+
+/**
+ * rcu_dereference - fetch an RCU-protected pointer in an
+ * RCU read-side critical section.  This pointer may later
+ * be safely dereferenced.
+ *
+ * Inserts memory barriers on architectures that require them
+ * (currently only the Alpha), and, more importantly, documents
+ * exactly which pointers are protected by RCU.
+ */
+
+#define rcu_dereference(p)     ({ \
+				typeof(p) _________p1 = p; \
+				rcu_read_lock_assert(); \
+				smp_read_barrier_depends(); \
+				(_________p1); \
+				})
+
+#define rcu_dereference_raw(p)     ({ \
+				typeof(p) _________p1 = p; \
+				smp_read_barrier_depends(); \
+				(_________p1); \
+				})
+
+#define rcu_dereference_sched(p) rcu_dereference(p)
+#define rcu_dereference_check(p, c) rcu_dereference(p)
+#define rcu_dereference_index_check(p, c) rcu_dereference(p)
+#define rcu_dereference_protected(p, c) rcu_dereference(p)
+#define rcu_dereference_bh(p) rcu_dereference(p)
+
+static inline void rcu_note_context_switch(int cpu) {}
+
+/**
+ * synchronize_sched - block until all CPUs have exited any non-preemptive
+ * kernel code sequences.
+ *
+ * This means that all preempt_disable code sequences, including NMI and
+ * hardware-interrupt handlers, in progress on entry will have completed
+ * before this primitive returns.  However, this does not guarantee that
+ * softirq handlers will have completed, since in some kernels these
+ * handlers can run in process context and can block.
+ *
+ * This primitive provides the guarantees made by the (deprecated)
+ * synchronize_kernel() API.  In contrast, synchronize_rcu() only
+ * guarantees that rcu_read_lock() sections will have completed.
+ */
+#define synchronize_sched synchronize_rcu
+#define synchronize_sched_expedited synchronize_rcu
+
+/* Exported interfaces */
+#define call_rcu_sched(head, func) call_rcu(head, func)
+extern void call_rcu(struct rcu_head *head,
+		void (*func)(struct rcu_head *head));
+extern void call_rcu_bh(struct rcu_head *head,
+		void (*func)(struct rcu_head *head));
+extern __deprecated_for_modules void synchronize_kernel(void);
+extern void synchronize_rcu(void);
+extern void rcu_barrier(void);
+#define rcu_barrier_sched rcu_barrier
+#define rcu_barrier_bh rcu_barrier
+static inline void rcu_scheduler_starting(void) {}
+extern void do_delayed_rcu_daemon_wakeups(void);
+
+#endif /* __KERNEL__ */
+#endif /* __LINUX_RCUPDATE_H */
Index: b/include/linux/rcupdate.h
===================================================================
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -30,6 +30,10 @@
  *
  */
 
+#ifdef CONFIG_SHIELDING_RCU
+#include <linux/rcushield.h>
+#else
+
 #ifndef __LINUX_RCUPDATE_H
 #define __LINUX_RCUPDATE_H
 
@@ -600,3 +604,4 @@ static inline void debug_rcu_head_unqueu
 	__rcu_dereference_index_check((p), (c))
 
 #endif /* __LINUX_RCUPDATE_H */
+#endif /* CONFIG_SHIELDING_RCU */
Index: b/include/linux/sysctl.h
===================================================================
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -153,6 +153,7 @@ enum
 	KERN_MAX_LOCK_DEPTH=74, /* int: rtmutex's maximum lock depth */
 	KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
 	KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
+	KERN_RCU=77,	/* make rcu variables available for debug */
 };
 
 
@@ -235,6 +236,11 @@ enum
 	RANDOM_UUID=6
 };
 
+/* /proc/sys/kernel/rcu */
+enum {
+	RCU_BATCH=1
+};
+
 /* /proc/sys/kernel/pty */
 enum
 {
Index: b/init/main.c
===================================================================
--- a/init/main.c
+++ b/init/main.c
@@ -606,13 +606,13 @@ asmlinkage void __init start_kernel(void
 				"enabled *very* early, fixing it\n");
 		local_irq_disable();
 	}
-	rcu_init();
 	radix_tree_init();
 	/* init some links before init_ISA_irqs() */
 	early_irq_init();
 	init_IRQ();
 	prio_tree_init();
 	init_timers();
+	rcu_init();  /* must appear after init_timers for shielded rcu */
 	hrtimers_init();
 	softirq_init();
 	timekeeping_init();
Index: b/kernel/Makefile
===================================================================
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -6,13 +6,16 @@ obj-y     = sched.o fork.o exec_domain.o
 	    cpu.o exit.o itimer.o time.o softirq.o resource.o \
 	    sysctl.o sysctl_binary.o capability.o ptrace.o timer.o user.o \
 	    signal.o sys.o kmod.o workqueue.o pid.o \
-	    rcupdate.o extable.o params.o posix-timers.o \
+	    extable.o params.o posix-timers.o \
 	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
 	    hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
 	    notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
 	    async.o range.o
 obj-$(CONFIG_HAVE_EARLY_RES) += early_res.o
 obj-y += groups.o
+ifndef CONFIG_SHIELDING_RCU
+obj-y += rcupdate.o
+endif
 
 ifdef CONFIG_FUNCTION_TRACER
 # Do not trace debug files and internal ftrace files
@@ -81,6 +84,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_t
 obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
 obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
 obj-$(CONFIG_SECCOMP) += seccomp.o
+obj-$(CONFIG_SHIELDING_RCU) += rcushield.o
 obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
 obj-$(CONFIG_TREE_RCU) += rcutree.o
 obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
Index: b/kernel/rcushield.c
===================================================================
--- /dev/null
+++ b/kernel/rcushield.c
@@ -0,0 +1,812 @@
+/*
+ * Read-Copy Update mechanism for mutual exclusion
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2001
+ *
+ * Authors: Dipankar Sarma <dipankar@in.ibm.com>
+ *	    Manfred Spraul <manfred@colorfullife.com>
+ *
+ * Based on the original work by Paul McKenney <paulmck@us.ibm.com>
+ * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
+ * Papers:
+ * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
+ * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
+ *
+ * For detailed explanation of Read-Copy Update mechanism see -
+ * 		http://lse.sourceforge.net/locking/rcupdate.html
+ *
+ * Modified by:  Jim Houston <jim.houston@ccur.com>
+ * 	This is an experimental version which uses explicit synchronization
+ *	between rcu_read_lock/rcu_read_unlock and rcu_poll_other_cpus()
+ *	to complete RCU batches without relying on timer based polling.
+ *
+ */
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/spinlock.h>
+#include <linux/smp.h>
+#include <linux/interrupt.h>
+#include <linux/sched.h>
+#include <asm/atomic.h>
+#include <linux/bitops.h>
+#include <linux/module.h>
+#include <linux/completion.h>
+#include <linux/moduleparam.h>
+#include <linux/percpu.h>
+#include <linux/notifier.h>
+#include <linux/rcupdate.h>
+#include <linux/cpu.h>
+#include <linux/jiffies.h>
+#include <linux/kthread.h>
+#include <linux/sysctl.h>
+
+/*
+ * Definition for rcu_batch.  This variable includes the flags:
+ *	RCU_NEXT_PENDING
+ * 		used to request that another batch should be
+ *		started when the current batch completes.
+ *	RCU_COMPLETE
+ *		which indicates that the last batch completed and
+ *		that rcu callback processing is stopped.
+ *
+ * Combining these states in a single word allows them to be maintained
+ * using an atomic exchange.
+ */
+long rcu_batch = (-300*RCU_INCREMENT)+RCU_COMPLETE;
+unsigned long rcu_timestamp;
+
+/* Bookkeeping of the progress of the grace period */
+struct {
+	cpumask_t	rcu_cpu_mask; /* CPUs that need to switch in order    */
+				      /* for current batch to proceed.        */
+} rcu_state ____cacheline_internodealigned_in_smp =
+	  { .rcu_cpu_mask = CPU_MASK_NONE };
+
+
+DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
+
+/*
+ * Limits to control when new batches of RCU callbacks are started.
+ */
+long rcu_max_count = 256;
+unsigned long rcu_max_time = HZ/10;
+
+static void rcu_start_batch(void);
+
+/*
+ * Make the rcu_batch available for debug.
+ */
+ctl_table rcu_table[] = {
+	{
+		.procname	= "batch",
+		.data		= &rcu_batch,
+		.maxlen		= sizeof(rcu_batch),
+		.mode		= 0444,
+		.proc_handler	= &proc_doulongvec_minmax,
+	},
+	{}
+};
+
+/*
+ * rcu_set_state maintains the RCU_COMPLETE and RCU_NEXT_PENDING
+ * bits in rcu_batch.  Multiple processors might try to mark the
+ * current batch as complete, or start a new batch at the same time.
+ * The cmpxchg() makes the state transition atomic. rcu_set_state()
+ * returns the previous state.  This allows the caller to tell if
+ * it caused the state transition.
+ */
+
+int rcu_set_state(long state)
+{
+	long batch, new, last;
+	do {
+		batch = rcu_batch;
+		if (batch & state)
+			return batch & (RCU_COMPLETE | RCU_NEXT_PENDING);
+		new = batch | state;
+		last = cmpxchg(&rcu_batch, batch, new);
+	} while (unlikely(last != batch));
+	return last & (RCU_COMPLETE | RCU_NEXT_PENDING);
+}
+
+
+static atomic_t rcu_barrier_cpu_count;
+static struct mutex rcu_barrier_mutex;
+static struct completion rcu_barrier_completion;
+
+/*
+ * If the batch in the nxt list or cur list has completed, move it to the
+ * done list.  If the grace period for the nxt list has begun,
+ * move its contents to the cur list.
+ */
+static int rcu_move_if_done(struct rcu_data *r)
+{
+	int done = 0;
+
+	if (r->cur.head && rcu_batch_complete(r->batch)) {
+		rcu_list_move(&r->done, &r->cur);
+		done = 1;
+	}
+	if (r->nxt.head) {
+		if (rcu_batch_complete(r->nxtbatch)) {
+			rcu_list_move(&r->done, &r->nxt);
+			r->nxtcount = 0;
+			done = 1;
+		} else if (r->nxtbatch == rcu_batch) {
+			/*
+			 * The grace period for the nxt list has started
+			 * move its content to the cur list.
+			 */
+			rcu_list_move(&r->cur, &r->nxt);
+			r->batch = r->nxtbatch;
+			r->nxtcount = 0;
+		}
+	}
+	return done;
+}
+
+/*
+ * Support delayed krcud wakeups.  Needed whenever we
+ * cannot wake up krcud directly; this happens whenever
+ * rcu_read_lock ... rcu_read_unlock is used under
+ * rq->lock.
+ */
+static cpumask_t rcu_wake_mask = CPU_MASK_NONE;
+static cpumask_t rcu_wake_mask_copy;
+static DEFINE_RAW_SPINLOCK(rcu_wake_lock);
+static int rcu_delayed_wake_count;
+
+void do_delayed_rcu_daemon_wakeups(void)
+{
+	int cpu;
+	unsigned long flags;
+	struct rcu_data *r;
+	struct task_struct *p;
+
+	if (likely(cpumask_empty(&rcu_wake_mask)))
+		return;
+
+	raw_spin_lock_irqsave(&rcu_wake_lock, flags);
+	cpumask_copy(&rcu_wake_mask_copy, &rcu_wake_mask);
+	cpumask_clear(&rcu_wake_mask);
+	raw_spin_unlock_irqrestore(&rcu_wake_lock, flags);
+
+	for_each_cpu(cpu, &rcu_wake_mask_copy) {
+		r = &per_cpu(rcu_data, cpu);
+		p = r->krcud;
+		if (p && p->state != TASK_RUNNING) {
+			wake_up_process(p);
+			rcu_delayed_wake_count++;
+		}
+	}
+}
+
+void rcu_wake_daemon_delayed(struct rcu_data *r)
+{
+	unsigned long flags;
+	raw_spin_lock_irqsave(&rcu_wake_lock, flags);
+	cpumask_set_cpu(task_cpu(r->krcud), &rcu_wake_mask);
+	raw_spin_unlock_irqrestore(&rcu_wake_lock, flags);
+}
+
+/*
+ * Wake rcu daemon if it is not already running.  Note that
+ * we avoid invoking wake_up_process if RCU is being used under
+ * the rq lock.
+ */
+void rcu_wake_daemon(struct rcu_data *r)
+{
+	struct task_struct *p = r->krcud;
+
+	if (p && p->state != TASK_RUNNING) {
+#ifdef BROKEN
+		/* runqueue_is_locked is racy, let us use only
+		 * the delayed approach.
+		 */
+		if (unlikely(runqueue_is_locked(smp_processor_id())))
+			rcu_wake_daemon_delayed(r);
+		else
+			wake_up_process(p);
+#else
+		rcu_wake_daemon_delayed(r);
+#endif
+	}
+}
+
+/**
+ * rcu_read_lock - mark the beginning of an RCU read-side critical section.
+ *
+ * When synchronize_rcu() is invoked on one CPU while other CPUs
+ * are within RCU read-side critical sections, then the
+ * synchronize_rcu() is guaranteed to block until after all the other
+ * CPUs exit their critical sections.  Similarly, if call_rcu() is invoked
+ * on one CPU while other CPUs are within RCU read-side critical
+ * sections, invocation of the corresponding RCU callback is deferred
+ * until after all the other CPUs exit their critical sections.
+ *
+ * Note, however, that RCU callbacks are permitted to run concurrently
+ * with RCU read-side critical sections.  One way that this can happen
+ * is via the following sequence of events: (1) CPU 0 enters an RCU
+ * read-side critical section, (2) CPU 1 invokes call_rcu() to register
+ * an RCU callback, (3) CPU 0 exits the RCU read-side critical section,
+ * (4) CPU 2 enters a RCU read-side critical section, (5) the RCU
+ * callback is invoked.  This is legal, because the RCU read-side critical
+ * section that was running concurrently with the call_rcu() (and which
+ * therefore might be referencing something that the corresponding RCU
+ * callback would free up) has completed before the corresponding
+ * RCU callback is invoked.
+ *
+ * RCU read-side critical sections may be nested.  Any deferred actions
+ * will be deferred until the outermost RCU read-side critical section
+ * completes.
+ *
+ * It is illegal to block while in an RCU read-side critical section.
+ */
+void __rcu_read_lock(void)
+{
+	struct rcu_data *r;
+
+	r = &per_cpu(rcu_data, smp_processor_id());
+	if (r->nest_count++ == 0)
+		/*
+		 * Set the flags value to show that we are in
+		 * a read side critical section.  The code starting
+		 * a batch uses this to determine if a processor
+		 * needs to participate in the batch.  Including
+		 * a sequence allows the remote processor to tell
+		 * that a critical section has completed and another
+		 * has begun.
+		 */
+		r->flags = IN_RCU_READ_LOCK | (r->sequence++ << 2);
+}
+EXPORT_SYMBOL(__rcu_read_lock);
+
+/**
+ * rcu_read_unlock - marks the end of an RCU read-side critical section.
+ * Check if an RCU batch was started while we were in the critical
+ * section.  If so, call rcu_quiescent() to join the rendezvous.
+ *
+ * See rcu_read_lock() for more information.
+ */
+void __rcu_read_unlock(void)
+{
+	struct rcu_data *r;
+	int	cpu, flags;
+
+	cpu = smp_processor_id();
+	r = &per_cpu(rcu_data, cpu);
+	if (--r->nest_count == 0) {
+		flags = xchg(&r->flags, 0);
+		if (flags & DO_RCU_COMPLETION)
+			rcu_quiescent(cpu);
+	}
+}
+EXPORT_SYMBOL(__rcu_read_unlock);
+
+/**
+ * call_rcu - Queue an RCU callback for invocation after a grace period.
+ * @head: structure to be used for queueing the RCU updates.
+ * @func: actual update function to be invoked after the grace period
+ *
+ * The update function will be invoked some time after a full grace
+ * period elapses, in other words after all currently executing RCU
+ * read-side critical sections have completed.  RCU read-side critical
+ * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
+ * and may be nested.
+ */
+void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
+{
+	struct rcu_data *r;
+	unsigned long flags;
+	int cpu;
+
+	head->func = func;
+	head->next = NULL;
+	local_irq_save(flags);
+	cpu = smp_processor_id();
+	r = &per_cpu(rcu_data, cpu);
+	/*
+	 * Avoid mixing new entries with batches which have already
+	 * completed or have a grace period in progress.
+	 */
+	if (r->nxt.head && rcu_move_if_done(r))
+		rcu_wake_daemon(r);
+
+	rcu_list_add(&r->nxt, head);
+	if (r->nxtcount++ == 0) {
+		r->nxtbatch = (rcu_batch & RCU_BATCH_MASK) + RCU_INCREMENT;
+		barrier();
+		if (!rcu_timestamp)
+			rcu_timestamp = jiffies ?: 1;
+	}
+	/* If we reach the limit start a batch. */
+	if (r->nxtcount > rcu_max_count) {
+		if (rcu_set_state(RCU_NEXT_PENDING) == RCU_COMPLETE)
+			rcu_start_batch();
+	}
+	local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(call_rcu);
+
+/*
+ * Revisit - my patch treats any code not protected by rcu_read_lock(),
+ * rcu_read_unlock() as a quiescent state.  I suspect that the call_rcu_bh()
+ * interface is not needed.
+ */
+void call_rcu_bh(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
+{
+	call_rcu(head, func);
+}
+EXPORT_SYMBOL_GPL(call_rcu_bh);
+
+static void rcu_barrier_callback(struct rcu_head *notused)
+{
+	if (atomic_dec_and_test(&rcu_barrier_cpu_count))
+		complete(&rcu_barrier_completion);
+}
+
+/*
+ * Called with preemption disabled, and from cross-cpu IRQ context.
+ */
+static void rcu_barrier_func(void *notused)
+{
+	int cpu = smp_processor_id();
+	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
+	struct rcu_head *head;
+
+	head = &rdp->barrier;
+	atomic_inc(&rcu_barrier_cpu_count);
+	call_rcu(head, rcu_barrier_callback);
+}
+
+/**
+ * rcu_barrier - Wait until all the in-flight RCUs are complete.
+ */
+void rcu_barrier(void)
+{
+	BUG_ON(in_interrupt());
+	/* Take cpucontrol semaphore to protect against CPU hotplug */
+	mutex_lock(&rcu_barrier_mutex);
+	init_completion(&rcu_barrier_completion);
+	atomic_set(&rcu_barrier_cpu_count, 0);
+	on_each_cpu(rcu_barrier_func, NULL, 1);
+	wait_for_completion(&rcu_barrier_completion);
+	mutex_unlock(&rcu_barrier_mutex);
+}
+EXPORT_SYMBOL(rcu_barrier);
+
+
+/*
+ * cpu went through a quiescent state since the beginning of the grace period.
+ * Clear it from the cpu mask and complete the grace period if it was the last
+ * cpu. Start another grace period if someone has further entries pending
+ */
+
+static void rcu_grace_period_complete(void)
+{
+	struct rcu_data *r;
+	int cpu, last;
+
+	/*
+	 * Mark the batch as complete.  If RCU_COMPLETE was
+	 * already set we raced with another processor
+	 * and it will finish the completion processing.
+	 */
+	last = rcu_set_state(RCU_COMPLETE);
+	if (last & RCU_COMPLETE)
+		return;
+	/*
+	 * If RCU_NEXT_PENDING is set, start the new batch.
+	 */
+	if (last & RCU_NEXT_PENDING)
+		rcu_start_batch();
+	/*
+	 * Wake the krcud for any cpu which has requests queued.
+	 */
+	for_each_online_cpu(cpu) {
+		r = &per_cpu(rcu_data, cpu);
+		if (r->nxt.head || r->cur.head || r->done.head)
+			rcu_wake_daemon(r);
+	}
+}
+
+/*
+ * rcu_quiescent() is called from rcu_read_unlock() when a
+ * RCU batch was started while the rcu_read_lock/rcu_read_unlock
+ * critical section was executing.
+ */
+
+void rcu_quiescent(int cpu)
+{
+	cpu_clear(cpu, rcu_state.rcu_cpu_mask);
+	if (cpus_empty(rcu_state.rcu_cpu_mask))
+		rcu_grace_period_complete();
+}
+
+/*
+ * Check if the other cpus are in rcu_read_lock/rcu_read_unlock protected code.
+ * If not they are assumed to be quiescent and we can clear the bit in
+ * bitmap.  If not set DO_RCU_COMPLETION to request a quiescent point on
+ * the rcu_read_unlock.
+ *
+ * Do this in two passes.  On the first pass we sample the flags value.
+ * The second pass only looks at processors which were found in the read
+ * side critical section on the first pass.  The flags value contains
+ * a sequence value so we can tell if the processor has completed a
+ * critical section even if it has started another.
+ */
+long rcu_grace_periods;
+long rcu_count1;
+long rcu_count2;
+long rcu_count3;
+
+void rcu_poll_other_cpus(void)
+{
+	struct rcu_data *r;
+	int cpu;
+	cpumask_t mask;
+	unsigned int f, flags[NR_CPUS];
+
+	rcu_grace_periods++;
+	for_each_online_cpu(cpu) {
+		r = &per_cpu(rcu_data, cpu);
+		f = flags[cpu] = r->flags;
+		if (f == 0) {
+			cpu_clear(cpu, rcu_state.rcu_cpu_mask);
+			rcu_count1++;
+		}
+	}
+	mask = rcu_state.rcu_cpu_mask;
+	for_each_cpu_mask(cpu, mask) {
+		r = &per_cpu(rcu_data, cpu);
+		/*
+		 * If the remote processor is still in the same read-side
+		 * critical section set DO_RCU_COMPLETION to request that
+		 * the cpu participate in the grace period.
+		 */
+		f = r->flags;
+		if (f == flags[cpu])
+			f = cmpxchg(&r->flags, f, f | DO_RCU_COMPLETION);
+		/*
+		 * If the other processor's flags value changes before
+		 * the cmpxchg(), that processor is no longer in the
+		 * read-side critical section, so we clear its bit.
+		 */
+		if (f != flags[cpu]) {
+			cpu_clear(cpu, rcu_state.rcu_cpu_mask);
+			rcu_count2++;
+		} else
+			rcu_count3++;
+
+	}
+	if (cpus_empty(rcu_state.rcu_cpu_mask))
+		rcu_grace_period_complete();
+}
+
+/*
+ * Grace period handling:
+ * The grace period handling consists of two steps:
+ * - A new grace period is started.
+ *   This is done by rcu_start_batch. The rcu_poll_other_cpus()
+ *   call drives the synchronization.  It loops checking if each
+ *   of the other cpus are executing in a rcu_read_lock/rcu_read_unlock
+ *   critical section.  The flags word for the cpus it finds in a
+ *   rcu_read_lock/rcu_read_unlock critical section will be updated to
+ *   request a rcu_quiescent() call.
+ * - Each of the cpus which were in the rcu_read_lock/rcu_read_unlock
+ *   critical section will eventually call rcu_quiescent() and clear
+ *   the bit corresponding to their cpu in rcu_state.rcu_cpu_mask.
+ * - The processor which clears the last bit wakes the krcud for
+ *   the cpus which have rcu callback requests queued.
+ *
+ * The process of starting a batch is arbitrated with the RCU_COMPLETE &
+ * RCU_NEXT_PENDING bits. These bits can be set in either order but the
+ * thread which sets the second bit must call rcu_start_batch().
+ * Multiple processors might try to set these bits at the same time.
+ * By using cmpxchg() we can determine which processor actually set
+ *   the bit and be sure that only a single thread tries to start the batch.
+ *
+ */
+static void rcu_start_batch(void)
+{
+	long batch, new;
+
+	batch = rcu_batch;
+	BUG_ON((batch & (RCU_COMPLETE|RCU_NEXT_PENDING)) !=
+		(RCU_COMPLETE|RCU_NEXT_PENDING));
+	rcu_timestamp = 0;
+	smp_mb();
+	/*
+	 * nohz_cpu_mask can go away because only cpus executing
+	 * rcu_read_lock/rcu_read_unlock critical sections need to
+	 * participate in the rendezvous.
+	 */
+	cpumask_andnot(&rcu_state.rcu_cpu_mask, cpu_online_mask, nohz_cpu_mask);
+	new = (batch & RCU_BATCH_MASK) + RCU_INCREMENT;
+	smp_mb();
+	rcu_batch = new;
+	smp_mb();
+	rcu_poll_other_cpus();
+}
+
+
+
+#ifdef CONFIG_HOTPLUG_CPU
+
+static void rcu_offline_cpu(int cpu)
+{
+	struct rcu_data *this_rdp = &get_cpu_var(rcu_data);
+	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
+
+#if 0
+	/*
+	 * The cpu should not have been in a read side critical
+	 * section when it was removed.  So this code is not needed.
+	 */
+	/* if the cpu going offline owns the grace period
+	 * we can block indefinitely waiting for it, so flush
+	 * it here
+	 */
+	if (!(rcu_batch & RCU_COMPLETE))
+		rcu_quiescent(cpu);
+#endif
+	local_irq_disable();
+	/*
+	 * The rcu lists are per-cpu private data only protected by
+	 * disabling interrupts.  Since we know the other cpu is dead
+	 * it should not be manipulating these lists.
+	 */
+	rcu_list_move(&this_rdp->cur, &rdp->cur);
+	rcu_list_move(&this_rdp->nxt, &rdp->nxt);
+	this_rdp->nxtbatch = (rcu_batch & RCU_BATCH_MASK) + RCU_INCREMENT;
+	local_irq_enable();
+	put_cpu_var(rcu_data);
+}
+
+#else
+
+static inline void rcu_offline_cpu(int cpu)
+{
+}
+
+#endif
+
+/*
+ * Process the completed RCU callbacks.
+ */
+static void rcu_process_callbacks(struct rcu_data *r)
+{
+	struct rcu_head *list, *next;
+
+	local_irq_disable();
+	rcu_move_if_done(r);
+	list = r->done.head;
+	rcu_list_init(&r->done);
+	local_irq_enable();
+
+	while (list) {
+		next = list->next;
+		list->func(list);
+		list = next;
+	}
+}
+
+/*
+ * Poll rcu_timestamp to start an RCU batch if there are
+ * any pending requests which have been waiting longer
+ * than rcu_max_time.
+ */
+struct timer_list rcu_timer;
+
+void rcu_timeout(unsigned long unused)
+{
+	do_delayed_rcu_daemon_wakeups();
+
+	if (rcu_timestamp
+	&& time_after(jiffies, (rcu_timestamp + rcu_max_time))) {
+		if (rcu_set_state(RCU_NEXT_PENDING) == RCU_COMPLETE)
+			rcu_start_batch();
+	}
+	init_timer(&rcu_timer);
+	rcu_timer.expires = jiffies + (rcu_max_time/2?:1);
+	add_timer(&rcu_timer);
+}
+
+static void __devinit rcu_online_cpu(int cpu)
+{
+	struct rcu_data *r = &per_cpu(rcu_data, cpu);
+
+	memset(&per_cpu(rcu_data, cpu), 0, sizeof(struct rcu_data));
+	rcu_list_init(&r->nxt);
+	rcu_list_init(&r->cur);
+	rcu_list_init(&r->done);
+}
+
+int rcu_pending(struct rcu_data *r)
+{
+	return r->done.head ||
+		(r->cur.head && rcu_batch_complete(r->batch)) ||
+		(r->nxt.head && rcu_batch_complete(r->nxtbatch));
+}
+
+static int krcud(void *__bind_cpu)
+{
+	int cpu = (int)(long) __bind_cpu;
+	struct rcu_data *r = &per_cpu(rcu_data, cpu);
+
+	set_user_nice(current, 19);
+	current->flags |= PF_NOFREEZE;
+
+	set_current_state(TASK_INTERRUPTIBLE);
+
+	while (!kthread_should_stop()) {
+		if (!rcu_pending(r))
+			schedule();
+
+		__set_current_state(TASK_RUNNING);
+
+		while (rcu_pending(r)) {
+			/* Preempt disable stops cpu going offline.
+			   If already offline, we'll be on wrong CPU:
+			   don't process */
+			preempt_disable();
+			if (cpu_is_offline((long)__bind_cpu))
+				goto wait_to_die;
+			preempt_enable();
+			rcu_process_callbacks(r);
+			cond_resched();
+		}
+
+		set_current_state(TASK_INTERRUPTIBLE);
+	}
+	__set_current_state(TASK_RUNNING);
+	return 0;
+
+wait_to_die:
+	preempt_enable();
+	/* Wait for kthread_stop */
+	set_current_state(TASK_INTERRUPTIBLE);
+	while (!kthread_should_stop()) {
+		schedule();
+		set_current_state(TASK_INTERRUPTIBLE);
+	}
+	__set_current_state(TASK_RUNNING);
+	return 0;
+}
+
+static int __devinit rcu_cpu_notify(struct notifier_block *nfb,
+				  unsigned long action,
+				  void *hcpu)
+{
+	int cpu = (unsigned long)hcpu;
+	struct rcu_data *r = &per_cpu(rcu_data, cpu);
+	struct task_struct *p;
+
+	switch (action) {
+	case CPU_UP_PREPARE:
+		rcu_online_cpu(cpu);
+		p = kthread_create(krcud, hcpu, "krcud/%d", cpu);
+		if (IS_ERR(p)) {
+			printk(KERN_INFO "krcud for %i failed\n", cpu);
+			return NOTIFY_BAD;
+		}
+		kthread_bind(p, cpu);
+		r->krcud = p;
+		break;
+	case CPU_ONLINE:
+		wake_up_process(r->krcud);
+		break;
+#ifdef CONFIG_HOTPLUG_CPU
+	case CPU_UP_CANCELED:
+		/* Unbind so it can run.  Fall thru. */
+		kthread_bind(r->krcud, smp_processor_id());
+	case CPU_DEAD:
+		p = r->krcud;
+		r->krcud = NULL;
+		kthread_stop(p);
+		rcu_offline_cpu(cpu);
+		break;
+#endif /* CONFIG_HOTPLUG_CPU */
+	}
+	return NOTIFY_OK;
+}
+
+static struct notifier_block __devinitdata rcu_nb = {
+	.notifier_call	= rcu_cpu_notify,
+};
+
+static __init int spawn_krcud(void)
+{
+	void *cpu = (void *)(long)smp_processor_id();
+	rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE, cpu);
+	rcu_cpu_notify(&rcu_nb, CPU_ONLINE, cpu);
+	register_cpu_notifier(&rcu_nb);
+	return 0;
+}
+early_initcall(spawn_krcud);
+/*
+ * Initializes the rcu mechanism.  Assumed to be called early,
+ * that is, before the local timer (SMP) or jiffy timer (uniproc) is set up.
+ * Note that rcu_qsctr and friends are implicitly
+ * initialized due to the choice of ``0'' for RCU_CTR_INVALID.
+ */
+void __init rcu_init(void)
+{
+	mutex_init(&rcu_barrier_mutex);
+	rcu_online_cpu(smp_processor_id());
+	/*
+	 * Use a timer to catch the elephants which would otherwise
+	 * fall through the cracks on local timer shielded cpus.
+	 */
+	init_timer(&rcu_timer);
+	rcu_timer.function = rcu_timeout;
+	rcu_timer.expires = jiffies + (rcu_max_time/2?:1);
+	add_timer(&rcu_timer);
+}
+
+
+struct rcu_synchronize {
+	struct rcu_head head;
+	struct completion completion;
+};
+
+/* Because of FASTCALL declaration of complete, we use this wrapper */
+static void wakeme_after_rcu(struct rcu_head  *head)
+{
+	struct rcu_synchronize *rcu;
+
+	rcu = container_of(head, struct rcu_synchronize, head);
+	complete(&rcu->completion);
+}
+
+/**
+ * synchronize_rcu - wait until a grace period has elapsed.
+ *
+ * Control will return to the caller some time after a full grace
+ * period has elapsed, in other words after all currently executing RCU
+ * read-side critical sections have completed.  RCU read-side critical
+ * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
+ * and may be nested.
+ *
+ * If your read-side code is not protected by rcu_read_lock(), do -not-
+ * use synchronize_rcu().
+ */
+void synchronize_rcu(void)
+{
+	struct rcu_synchronize rcu;
+
+	init_completion(&rcu.completion);
+	/* Will wake me after RCU finished */
+	call_rcu(&rcu.head, wakeme_after_rcu);
+
+	/* Wait for it */
+	wait_for_completion(&rcu.completion);
+}
+EXPORT_SYMBOL_GPL(synchronize_rcu);
+
+/*
+ * Deprecated, use synchronize_rcu() or synchronize_sched() instead.
+ */
+void synchronize_kernel(void)
+{
+	synchronize_rcu();
+}
+EXPORT_SYMBOL(synchronize_kernel);
+
+module_param(rcu_max_count, long, 0644);
+module_param(rcu_max_time, long, 0644);
Index: b/kernel/sysctl.c
===================================================================
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -215,6 +215,10 @@ extern struct ctl_table random_table[];
 extern struct ctl_table epoll_table[];
 #endif
 
+#ifdef CONFIG_SHIELDING_RCU
+extern ctl_table rcu_table[];
+#endif
+
 #ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
 int sysctl_legacy_va_layout;
 #endif
@@ -808,6 +812,13 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 #endif
+#ifdef CONFIG_SHIELDING_RCU
+	{
+		.procname	= "rcu",
+		.mode		= 0555,
+		.child		= rcu_table,
+	},
+#endif
 #if defined(CONFIG_S390) && defined(CONFIG_SMP)
 	{
 		.procname	= "spin_retry",
Index: b/kernel/timer.c
===================================================================
--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -1272,12 +1272,15 @@ unsigned long get_next_timer_interrupt(u
 void update_process_times(int user_tick)
 {
 	struct task_struct *p = current;
-	int cpu = smp_processor_id();
 
 	/* Note: this timer irq context must be accounted for as well. */
 	account_process_tick(p, user_tick);
 	run_local_timers();
-	rcu_check_callbacks(cpu, user_tick);
+#ifndef CONFIG_SHIELDING_RCU
+	rcu_check_callbacks(smp_processor_id(), user_tick);
+#else
+	do_delayed_rcu_daemon_wakeups();
+#endif
 	printk_tick();
 	perf_event_do_pending();
 	scheduler_tick();
Index: b/lib/Kconfig.debug
===================================================================
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -791,6 +791,7 @@ config BOOT_PRINTK_DELAY
 config RCU_TORTURE_TEST
 	tristate "torture tests for RCU"
 	depends on DEBUG_KERNEL
+	depends on !SHIELDING_RCU
 	default n
 	help
 	  This option provides a kernel module that runs torture tests
Index: b/init/Kconfig
===================================================================
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -365,6 +365,13 @@ config TINY_RCU
 	  is not required.  This option greatly reduces the
 	  memory footprint of RCU.
 
+config SHIELDING_RCU
+	bool "Shielding RCU"
+	help
+	  This option selects the RCU implementation that does not
+	  depend on a per-cpu periodic interrupt to do garbage
+	  collection.  This is good when one is trying to shield
+	  some set of CPUs from as much system activity as possible.
 endchoice
 
 config RCU_TRACE
Index: b/include/linux/hardirq.h
===================================================================
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -138,7 +138,12 @@ static inline void account_system_vtime(
 }
 #endif
 
-#if defined(CONFIG_NO_HZ)
+#if defined(CONFIG_SHIELDING_RCU)
+# define rcu_irq_enter() do { } while (0)
+# define rcu_irq_exit() do { } while (0)
+# define rcu_nmi_enter() do { } while (0)
+# define rcu_nmi_exit() do { } while (0)
+#elif defined(CONFIG_NO_HZ)
 #if defined(CONFIG_TINY_RCU)
 extern void rcu_enter_nohz(void);
 extern void rcu_exit_nohz(void);
@@ -161,13 +166,13 @@ static inline void rcu_nmi_exit(void)
 {
 }
 
-#else
+#else /* !CONFIG_TINY_RCU */
 extern void rcu_irq_enter(void);
 extern void rcu_irq_exit(void);
 extern void rcu_nmi_enter(void);
 extern void rcu_nmi_exit(void);
 #endif
-#else
+#else /* !CONFIG_NO_HZ */
 # define rcu_irq_enter() do { } while (0)
 # define rcu_irq_exit() do { } while (0)
 # define rcu_nmi_enter() do { } while (0)
Index: b/kernel/sysctl_binary.c
===================================================================
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -61,6 +61,11 @@ static const struct bin_table bin_pty_ta
 	{}
 };
 
+static const struct bin_table bin_rcu_table[] = {
+	{ CTL_INT,	RCU_BATCH,	"batch" },
+	{}
+};
+
 static const struct bin_table bin_kern_table[] = {
 	{ CTL_STR,	KERN_OSTYPE,			"ostype" },
 	{ CTL_STR,	KERN_OSRELEASE,			"osrelease" },
@@ -138,6 +143,7 @@ static const struct bin_table bin_kern_t
 	{ CTL_INT,	KERN_MAX_LOCK_DEPTH,		"max_lock_depth" },
 	{ CTL_INT,	KERN_NMI_WATCHDOG,		"nmi_watchdog" },
 	{ CTL_INT,	KERN_PANIC_ON_NMI,		"panic_on_unrecovered_nmi" },
+	{ CTL_DIR,	KERN_RCU,			"rcu", bin_rcu_table },
 	{}
 };
 
Index: b/kernel/sched.c
===================================================================
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -9119,6 +9119,7 @@ struct cgroup_subsys cpuacct_subsys = {
 };
 #endif	/* CONFIG_CGROUP_CPUACCT */
 
+#ifndef CONFIG_SHIELDING_RCU
 #ifndef CONFIG_SMP
 
 void synchronize_sched_expedited(void)
@@ -9188,3 +9189,4 @@ void synchronize_sched_expedited(void)
 EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
 
 #endif /* #else #ifndef CONFIG_SMP */
+#endif /* CONFIG_SHIELDING_RCU */

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-05 21:00 ` [PATCH] a local-timer-free version of RCU Joe Korty
@ 2010-11-06 19:28   ` Paul E. McKenney
  2010-11-06 19:34     ` Mathieu Desnoyers
                       ` (2 more replies)
  2010-11-06 20:03   ` Mathieu Desnoyers
  2010-11-09  9:22   ` Lai Jiangshan
  2 siblings, 3 replies; 63+ messages in thread
From: Paul E. McKenney @ 2010-11-06 19:28 UTC (permalink / raw)
  To: Joe Korty
  Cc: fweisbec, mathieu.desnoyers, dhowells, loic.minier, dhaval.giani,
	tglx, peterz, linux-kernel, josh

On Fri, Nov 05, 2010 at 05:00:59PM -0400, Joe Korty wrote:
> On Thu, Nov 04, 2010 at 04:21:48PM -0700, Paul E. McKenney wrote:
> > Just wanted some written record of our discussion this Wednesday.
> > I don't have an email address for Jim Houston, and I am not sure I have
> > all of the attendees, but here goes anyway.  Please don't hesitate to
> > reply with any corrections!
> > 
> > The goal is to be able to turn of scheduling-clock interrupts for
> > long-running user-mode execution when there is but one runnable task
> > on a given CPU, but while still allowing RCU to function correctly.
> > In particular, we need to minimize (or better, eliminate) any source
> > of interruption to such a CPU.  We discussed these approaches, along
> > with their advantages and disadvantages:

Thank you very much for forward-porting and sending this, Joe!!!

A few questions and comments interspersed, probably mostly reflecting
my confusion about what this is doing.  The basic approach of driving
the grace periods out of rcu_read_unlock() and a per-CPU kthread does
seem quite workable in any case.

							Thanx, Paul

> Jim Houston's timer-less version of RCU.
> 	
> This rather ancient version of RCU handles RCU garbage
> collection in the absence of a per-cpu local timer
> interrupt.
> 
> This is a minimal forward port to 2.6.36.  It works,
> but it is not yet a complete implementation of RCU.
> 
> Developed-by: Jim Houston <jim.houston@ccur.com>
> Signed-off-by: Joe Korty <joe.korty@ccur.com>
> 
> Index: b/arch/x86/kernel/cpu/mcheck/mce.c
> ===================================================================
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -167,7 +167,8 @@ void mce_log(struct mce *mce)
>  	mce->finished = 0;
>  	wmb();
>  	for (;;) {
> -		entry = rcu_dereference_check_mce(mcelog.next);
> +		entry = mcelog.next;
> +		smp_read_barrier_depends();
>  		for (;;) {
>  			/*
>  			 * If edac_mce is enabled, it will check the error type
> @@ -1558,7 +1559,8 @@ static ssize_t mce_read(struct file *fil
>  			goto out;
>  	}
> 
> -	next = rcu_dereference_check_mce(mcelog.next);
> +	next = mcelog.next;
> +	smp_read_barrier_depends();
> 
>  	/* Only supports full reads right now */
>  	err = -EINVAL;
> Index: b/include/linux/rcushield.h
> ===================================================================
> --- /dev/null
> +++ b/include/linux/rcushield.h
> @@ -0,0 +1,361 @@
> +/*
> + * Read-Copy Update mechanism for mutual exclusion
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + *
> + * Copyright (C) IBM Corporation, 2001
> + *
> + * Author: Dipankar Sarma <dipankar@in.ibm.com>
> + *
> + * Based on the original work by Paul McKenney <paul.mckenney@us.ibm.com>
> + * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
> + * Papers:
> + * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
> + * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
> + *
> + * For detailed explanation of Read-Copy Update mechanism see -
> + * 		http://lse.sourceforge.net/locking/rcupdate.html
> + *
> + */
> +
> +#ifndef __LINUX_RCUPDATE_H
> +#define __LINUX_RCUPDATE_H
> +
> +#ifdef __KERNEL__
> +
> +#include <linux/cache.h>
> +#include <linux/spinlock.h>
> +#include <linux/threads.h>
> +#include <linux/smp.h>
> +#include <linux/cpumask.h>
> +
> +/*
> + * These #includes are not used by shielded RCUs; they are here
> + * to match the #includes made by the other rcu implementations.
> + */
> +#include <linux/seqlock.h>
> +#include <linux/lockdep.h>
> +#include <linux/completion.h>
> +
> +/**
> + * struct rcu_head - callback structure for use with RCU
> + * @next: next update requests in a list
> + * @func: actual update function to call after the grace period.
> + */
> +struct rcu_head {
> +	struct rcu_head *next;
> +	void (*func)(struct rcu_head *head);
> +};
> +
> +#define RCU_HEAD_INIT 	{ .next = NULL, .func = NULL }
> +#define RCU_HEAD(head) struct rcu_head head = RCU_HEAD_INIT
> +#define INIT_RCU_HEAD(ptr) do { \
> +       (ptr)->next = NULL; (ptr)->func = NULL; \
> +} while (0)
> +
> +/*
> + * The rcu_batch variable contains the current batch number
> + * and the following flags.  The RCU_NEXT_PENDING bit requests that
> + * a new batch should start when the current batch completes.  The
> + * RCU_COMPLETE bit indicates that the most recent batch has completed
> + * and RCU processing has stopped.
> + */
> +extern long rcu_batch;
> +#define RCU_BATCH_MASK		(~3)
> +#define RCU_INCREMENT		4
> +#define RCU_COMPLETE		2
> +#define RCU_NEXT_PENDING	1
> +
> +/* Is batch a before batch b ? */
> +static inline int rcu_batch_before(long a, long b)
> +{
> +	return (a - b) < 0;
> +}
> +
> +/* Is batch a after batch b ? */
> +static inline int rcu_batch_after(long a, long b)
> +{
> +	return (a - b) > 0;
> +}
> +
> +static inline int rcu_batch_complete(long batch)
> +{
> +	return !rcu_batch_before((rcu_batch & ~RCU_NEXT_PENDING), batch);
> +}
> +
> +struct rcu_list {
> +	struct rcu_head *head;
> +	struct rcu_head **tail;
> +};
> +
> +static inline void rcu_list_init(struct rcu_list *l)
> +{
> +	l->head = NULL;
> +	l->tail = &l->head;
> +}
> +
> +static inline void rcu_list_add(struct rcu_list *l, struct rcu_head *h)
> +{
> +	*l->tail = h;
> +	l->tail = &h->next;
> +}
> +
> +static inline void rcu_list_move(struct rcu_list *to, struct rcu_list *from)
> +{
> +	if (from->head) {
> +		*to->tail = from->head;
> +		to->tail = from->tail;
> +		rcu_list_init(from);
> +	}
> +}
> +
> +/*
> + * Per-CPU data for Read-Copy UPdate.
> + * nxtlist - new callbacks are added here
> + * curlist - current batch for which quiescent cycle started if any
> + */
> +struct rcu_data {
> +	/* 1) batch handling */
> +	long  	       	batch;		/* batch # for current RCU batch */
> +	unsigned long	nxtbatch;	/* batch # for next queue */
> +	struct rcu_list nxt;
> +	struct rcu_list cur;
> +	struct rcu_list done;

Lai Jiangshan's multi-tail trick would work well here, but this works
fine too.
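
For the record, the multi-tail idea is roughly to keep a single per-CPU
callback list with one tail pointer per stage, so that advancing a batch
from one stage to the next is a pointer assignment rather than a list
splice.  A sketch only, with made-up names:

	struct rcu_seglist {
		struct rcu_head *head;
		struct rcu_head **done_tail;	/* end of callbacks ready to invoke */
		struct rcu_head **cur_tail;	/* end of the current batch */
		struct rcu_head **nxt_tail;	/* end of list; call_rcu() appends here */
	};

	/* "Move" the current batch into the done segment: */
	seg->done_tail = seg->cur_tail;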

> +	long		nxtcount;	/* number of callbacks queued */
> +	struct task_struct *krcud;
> +	struct rcu_head barrier;
> +
> +	/* 2) synchronization between rcu_read_lock and rcu_start_batch. */
> +	int		nest_count;	/* count of rcu_read_lock nesting */
> +	unsigned int	flags;
> +	unsigned int	sequence;	/* count of read locks. */
> +};
> +
> +/*
> + * Flags values used to synchronize between rcu_read_lock/rcu_read_unlock
> + * and the rcu_start_batch.  Only processors executing rcu_read_lock
> + * protected code get invited to the rendezvous.
> + */
> +#define	IN_RCU_READ_LOCK	1
> +#define	DO_RCU_COMPLETION	2
> +
> +DECLARE_PER_CPU(struct rcu_data, rcu_data);
> +
> +/**
> + * rcu_assign_pointer - assign (publicize) a pointer to a newly
> + * initialized structure that will be dereferenced by RCU read-side
> + * critical sections.  Returns the value assigned.
> + *
> + * Inserts memory barriers on architectures that require them
> + * (pretty much all of them other than x86), and also prevents
> + * the compiler from reordering the code that initializes the
> + * structure after the pointer assignment.  More importantly, this
> + * call documents which pointers will be dereferenced by RCU read-side
> + * code.
> + */
> +
> +#define rcu_assign_pointer(p, v)	({ \
> +						smp_wmb(); \
> +						(p) = (v); \
> +					})
> +
> +extern void rcu_init(void);
> +extern void rcu_restart_cpu(int cpu);
> +extern void rcu_quiescent(int cpu);
> +extern void rcu_poll(int cpu);
> +
> +/* stubs for mainline rcu features we do not need */
> +static inline void rcu_sched_qs(int cpu) { }
> +static inline void rcu_bh_qs(int cpu) { }
> +static inline int rcu_needs_cpu(int cpu) { return 0; }
> +static inline void rcu_enter_nohz(void) { }
> +static inline void rcu_exit_nohz(void) { }
> +static inline void rcu_init_sched(void) { }
> +
> +extern void __rcu_read_lock(void);
> +extern void __rcu_read_unlock(void);
> +
> +static inline void rcu_read_lock(void)
> +{
> +	preempt_disable();

We will need preemptible read-side critical sections for some workloads;
however, the HPC guys are probably OK with non-preemptible read-side
critical sections.  And it is probably not impossible to adapt something
like this for the preemptible case.

> +	__rcu_read_lock();
> +}
> +
> +static inline void rcu_read_unlock(void)
> +{
> +	__rcu_read_unlock();
> +	preempt_enable();
> +}
> +
> +#define rcu_read_lock_sched(void) rcu_read_lock()
> +#define rcu_read_unlock_sched(void) rcu_read_unlock()
> +
> +static inline void rcu_read_lock_sched_notrace(void)
> +{
> +	preempt_disable_notrace();
> +	__rcu_read_lock();
> +}
> +
> +#ifdef CONFIG_DEBUG_LOCK_ALLOC
> +#error need DEBUG_LOCK_ALLOC definitions for rcu_read_lock_*_held
> +#else
> +static inline int rcu_read_lock_held(void)
> +{
> +	return 1;
> +}
> +
> +static inline int rcu_read_lock_bh_held(void)
> +{
> +	return 1;
> +}
> +#endif /* CONFIG_DEBUG_LOCK_ALLOC */
> +
> +static inline int rcu_preempt_depth(void)
> +{
> +	return 0;
> +}
> +
> +static inline void exit_rcu(void)
> +{
> +}
> +
> +static inline void rcu_read_unlock_sched_notrace(void)
> +{
> +	__rcu_read_unlock();
> +	preempt_enable_notrace();
> +}
> +
> +#ifdef CONFIG_DEBUG_KERNEL
> +/*
> + * Try to catch code which depends on RCU but doesn't
> + * hold the rcu_read_lock.
> + */
> +static inline void rcu_read_lock_assert(void)
> +{
> +#ifdef NOTYET
> +	/* 2.6.13 has _lots_ of panics here.  Must fix up. */
> +	struct rcu_data *r;
> +
> +	r = &per_cpu(rcu_data, smp_processor_id());
> +	BUG_ON(r->nest_count == 0);
> +#endif
> +}
> +#else
> +static inline void rcu_read_lock_assert(void) {}
> +#endif
> +
> +/*
> + * So where is rcu_write_lock()?  It does not exist, as there is no
> + * way for writers to lock out RCU readers.  This is a feature, not
> + * a bug -- this property is what provides RCU's performance benefits.
> + * Of course, writers must coordinate with each other.  The normal
> + * spinlock primitives work well for this, but any other technique may be
> + * used as well.  RCU does not care how the writers keep out of each
> + * others' way, as long as they do so.
> + */
> +
> +/**
> + * rcu_read_lock_bh - mark the beginning of a softirq-only RCU critical section
> + *
> + * This is equivalent of rcu_read_lock(), but to be used when updates
> + * are being done using call_rcu_bh(). Since call_rcu_bh() callbacks
> + * consider completion of a softirq handler to be a quiescent state,
> + * a process in RCU read-side critical section must be protected by
> + * disabling softirqs. Read-side critical sections in interrupt context
> + * can use just rcu_read_lock().
> + *
> + * Hack alert.  I'm not sure if I understand the reason this interface
> + * is needed and if it is still needed with my implementation of RCU.

Given that you keep track of RCU read-side critical sections exactly
rather than relying on quiescent states, this should work fine.

> + */
> +static inline void rcu_read_lock_bh(void)
> +{
> +	local_bh_disable();
> +	rcu_read_lock();
> +}
> +
> +/*
> + * rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
> + *
> + * See rcu_read_lock_bh() for more information.
> + */
> +static inline void rcu_read_unlock_bh(void)
> +{
> +	rcu_read_unlock();
> +	local_bh_enable();
> +}
> +
> +/**
> + * rcu_dereference - fetch an RCU-protected pointer in an
> + * RCU read-side critical section.  This pointer may later
> + * be safely dereferenced.
> + *
> + * Inserts memory barriers on architectures that require them
> + * (currently only the Alpha), and, more importantly, documents
> + * exactly which pointers are protected by RCU.
> + */
> +
> +#define rcu_dereference(p)     ({ \
> +				typeof(p) _________p1 = p; \
> +				rcu_read_lock_assert(); \
> +				smp_read_barrier_depends(); \
> +				(_________p1); \
> +				})
> +
> +#define rcu_dereference_raw(p)     ({ \
> +				typeof(p) _________p1 = p; \
> +				smp_read_barrier_depends(); \
> +				(_________p1); \
> +				})
> +
> +#define rcu_dereference_sched(p) rcu_dereference(p)
> +#define rcu_dereference_check(p, c) rcu_dereference(p)
> +#define rcu_dereference_index_check(p, c) rcu_dereference(p)
> +#define rcu_dereference_protected(p, c) rcu_dereference(p)
> +#define rcu_dereference_bh(p) rcu_dereference(p)
> +
> +static inline void rcu_note_context_switch(int cpu) {}
> +
> +/**
> + * synchronize_sched - block until all CPUs have exited any non-preemptive
> + * kernel code sequences.
> + *
> + * This means that all preempt_disable code sequences, including NMI and
> + * hardware-interrupt handlers, in progress on entry will have completed
> + * before this primitive returns.  However, this does not guarantee that
> + * softirq handlers will have completed, since in some kernels

OK, so your approach treats preempt_disable code sequences as RCU
read-side critical sections by relying on the fact that the per-CPU
->krcud task cannot run until such code sequences complete, correct?

This seems to require that each CPU's ->krcud task be awakened at
least once per grace period, but I might well be missing something.

> + * This primitive provides the guarantees made by the (deprecated)
> + * synchronize_kernel() API.  In contrast, synchronize_rcu() only
> + * guarantees that rcu_read_lock() sections will have completed.
> + */
> +#define synchronize_sched synchronize_rcu
> +#define synchronize_sched_expedited synchronize_rcu
> +
> +/* Exported interfaces */
> +#define call_rcu_sched(head, func) call_rcu(head, func)
> +extern void call_rcu(struct rcu_head *head,
> +		void (*func)(struct rcu_head *head));
> +extern void call_rcu_bh(struct rcu_head *head,
> +		void (*func)(struct rcu_head *head));
> +extern __deprecated_for_modules void synchronize_kernel(void);
> +extern void synchronize_rcu(void);
> +extern void rcu_barrier(void);
> +#define rcu_barrier_sched rcu_barrier
> +#define rcu_barrier_bh rcu_barrier
> +static inline void rcu_scheduler_starting(void) {}
> +extern void do_delayed_rcu_daemon_wakeups(void);
> +
> +#endif /* __KERNEL__ */
> +#endif /* __LINUX_RCUPDATE_H */
> Index: b/include/linux/rcupdate.h
> ===================================================================
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -30,6 +30,10 @@
>   *
>   */
> 
> +#ifdef CONFIG_SHIELDING_RCU
> +#include <linux/rcushield.h>
> +#else
> +
>  #ifndef __LINUX_RCUPDATE_H
>  #define __LINUX_RCUPDATE_H
> 
> @@ -600,3 +604,4 @@ static inline void debug_rcu_head_unqueu
>  	__rcu_dereference_index_check((p), (c))
> 
>  #endif /* __LINUX_RCUPDATE_H */
> +#endif /* CONFIG_SHIELDING_RCU */
> Index: b/include/linux/sysctl.h
> ===================================================================
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -153,6 +153,7 @@ enum
>  	KERN_MAX_LOCK_DEPTH=74, /* int: rtmutex's maximum lock depth */
>  	KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
>  	KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
> +	KERN_RCU=77,	/* make rcu variables available for debug */
>  };
> 
> 
> @@ -235,6 +236,11 @@ enum
>  	RANDOM_UUID=6
>  };
> 
> +/* /proc/sys/kernel/rcu */
> +enum {
> +	RCU_BATCH=1
> +};
> +
>  /* /proc/sys/kernel/pty */
>  enum
>  {
> Index: b/init/main.c
> ===================================================================
> --- a/init/main.c
> +++ b/init/main.c
> @@ -606,13 +606,13 @@ asmlinkage void __init start_kernel(void
>  				"enabled *very* early, fixing it\n");
>  		local_irq_disable();
>  	}
> -	rcu_init();
>  	radix_tree_init();
>  	/* init some links before init_ISA_irqs() */
>  	early_irq_init();
>  	init_IRQ();
>  	prio_tree_init();
>  	init_timers();
> +	rcu_init();  /* must appear after init_timers for shielded rcu */
>  	hrtimers_init();
>  	softirq_init();
>  	timekeeping_init();
> Index: b/kernel/Makefile
> ===================================================================
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -6,13 +6,16 @@ obj-y     = sched.o fork.o exec_domain.o
>  	    cpu.o exit.o itimer.o time.o softirq.o resource.o \
>  	    sysctl.o sysctl_binary.o capability.o ptrace.o timer.o user.o \
>  	    signal.o sys.o kmod.o workqueue.o pid.o \
> -	    rcupdate.o extable.o params.o posix-timers.o \
> +	    extable.o params.o posix-timers.o \
>  	    kthread.o wait.o kfifo.o sys_ni.o posix-cpu-timers.o mutex.o \
>  	    hrtimer.o rwsem.o nsproxy.o srcu.o semaphore.o \
>  	    notifier.o ksysfs.o pm_qos_params.o sched_clock.o cred.o \
>  	    async.o range.o
>  obj-$(CONFIG_HAVE_EARLY_RES) += early_res.o
>  obj-y += groups.o
> +ifndef CONFIG_SHIELDING_RCU
> +obj-y += rcupdate.o
> +endif
> 
>  ifdef CONFIG_FUNCTION_TRACER
>  # Do not trace debug files and internal ftrace files
> @@ -81,6 +84,7 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_t
>  obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
>  obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
>  obj-$(CONFIG_SECCOMP) += seccomp.o
> +obj-$(CONFIG_SHIELDING_RCU) += rcushield.o
>  obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
>  obj-$(CONFIG_TREE_RCU) += rcutree.o
>  obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
> Index: b/kernel/rcushield.c
> ===================================================================
> --- /dev/null
> +++ b/kernel/rcushield.c
> @@ -0,0 +1,812 @@
> +/*
> + * Read-Copy Update mechanism for mutual exclusion
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + *
> + * Copyright (C) IBM Corporation, 2001
> + *
> + * Authors: Dipankar Sarma <dipankar@in.ibm.com>
> + *	    Manfred Spraul <manfred@colorfullife.com>
> + *
> + * Based on the original work by Paul McKenney <paulmck@us.ibm.com>
> + * and inputs from Rusty Russell, Andrea Arcangeli and Andi Kleen.
> + * Papers:
> + * http://www.rdrop.com/users/paulmck/paper/rclockpdcsproof.pdf
> + * http://lse.sourceforge.net/locking/rclock_OLS.2001.05.01c.sc.pdf (OLS2001)
> + *
> + * For detailed explanation of Read-Copy Update mechanism see -
> + * 		http://lse.sourceforge.net/locking/rcupdate.html
> + *
> + * Modified by:  Jim Houston <jim.houston@ccur.com>
> + * 	This is a experimental version which uses explicit synchronization
> + *	between rcu_read_lock/rcu_read_unlock and rcu_poll_other_cpus()
> + *	to complete RCU batches without relying on timer based polling.
> + *
> + */
> +#include <linux/types.h>
> +#include <linux/kernel.h>
> +#include <linux/init.h>
> +#include <linux/spinlock.h>
> +#include <linux/smp.h>
> +#include <linux/interrupt.h>
> +#include <linux/sched.h>
> +#include <asm/atomic.h>
> +#include <linux/bitops.h>
> +#include <linux/module.h>
> +#include <linux/completion.h>
> +#include <linux/moduleparam.h>
> +#include <linux/percpu.h>
> +#include <linux/notifier.h>
> +#include <linux/rcupdate.h>
> +#include <linux/cpu.h>
> +#include <linux/jiffies.h>
> +#include <linux/kthread.h>
> +#include <linux/sysctl.h>
> +
> +/*
> + * Definition for rcu_batch.  This variable includes the flags:
> + *	RCU_NEXT_PENDING
> + * 		used to request that another batch should be
> + *		started when the current batch completes.
> + *	RCU_COMPLETE
> + *		which indicates that the last batch completed and
> + *		that rcu callback processing is stopped.
> + *
> + * Combinning this state in a single word allows them to be maintained
> + * using an atomic exchange.
> + */
> +long rcu_batch = (-300*RCU_INCREMENT)+RCU_COMPLETE;
> +unsigned long rcu_timestamp;
> +
> +/* Bookkeeping of the progress of the grace period */
> +struct {
> +	cpumask_t	rcu_cpu_mask; /* CPUs that need to switch in order    */
> +				      /* for current batch to proceed.        */
> +} rcu_state ____cacheline_internodealigned_in_smp =
> +	  { .rcu_cpu_mask = CPU_MASK_NONE };
> +
> +
> +DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
> +
> +/*
> + * Limits to control when new batchs of RCU callbacks are started.
> + */
> +long rcu_max_count = 256;
> +unsigned long rcu_max_time = HZ/10;
> +
> +static void rcu_start_batch(void);
> +
> +/*
> + * Make the rcu_batch available for debug.
> + */
> +ctl_table rcu_table[] = {
> +	{
> +		.procname	= "batch",
> +		.data		= &rcu_batch,
> +		.maxlen		= sizeof(rcu_batch),
> +		.mode		= 0444,
> +		.proc_handler	= &proc_doulongvec_minmax,
> +	},
> +	{}
> +};
> +
> +/*
> + * rcu_set_state maintains the RCU_COMPLETE and RCU_NEXT_PENDING
> + * bits in rcu_batch.  Multiple processors might try to mark the
> + * current batch as complete, or start a new batch at the same time.
> + * The cmpxchg() makes the state transition atomic. rcu_set_state()
> + * returns the previous state.  This allows the caller to tell if
> + * it caused the state transition.
> + */
> +
> +int rcu_set_state(long state)
> +{
> +	long batch, new, last;
> +	do {
> +		batch = rcu_batch;
> +		if (batch & state)
> +			return batch & (RCU_COMPLETE | RCU_NEXT_PENDING);
> +		new = batch | state;
> +		last = cmpxchg(&rcu_batch, batch, new);
> +	} while (unlikely(last != batch));
> +	return last & (RCU_COMPLETE | RCU_NEXT_PENDING);
> +}
> +
> +
> +static atomic_t rcu_barrier_cpu_count;
> +static struct mutex rcu_barrier_mutex;
> +static struct completion rcu_barrier_completion;
> +
> +/*
> + * If the batch in the nxt list or cur list has completed move it to the
> + * done list.  If its grace period for the nxt list has begun
> + * move the contents to the cur list.
> + */
> +static int rcu_move_if_done(struct rcu_data *r)
> +{
> +	int done = 0;
> +
> +	if (r->cur.head && rcu_batch_complete(r->batch)) {
> +		rcu_list_move(&r->done, &r->cur);
> +		done = 1;
> +	}
> +	if (r->nxt.head) {
> +		if (rcu_batch_complete(r->nxtbatch)) {
> +			rcu_list_move(&r->done, &r->nxt);
> +			r->nxtcount = 0;
> +			done = 1;
> +		} else if (r->nxtbatch == rcu_batch) {
> +			/*
> +			 * The grace period for the nxt list has started
> +			 * move its content to the cur list.
> +			 */
> +			rcu_list_move(&r->cur, &r->nxt);
> +			r->batch = r->nxtbatch;
> +			r->nxtcount = 0;
> +		}
> +	}
> +	return done;
> +}
> +
> +/*
> + * support delayed krcud wakeups.  Needed whenever we
> + * cannot wake up krcud directly, this happens whenever
> + * rcu_read_lock ... rcu_read_unlock is used under
> + * rq->lock.
> + */
> +static cpumask_t rcu_wake_mask = CPU_MASK_NONE;
> +static cpumask_t rcu_wake_mask_copy;
> +static DEFINE_RAW_SPINLOCK(rcu_wake_lock);
> +static int rcu_delayed_wake_count;
> +
> +void do_delayed_rcu_daemon_wakeups(void)
> +{
> +	int cpu;
> +	unsigned long flags;
> +	struct rcu_data *r;
> +	struct task_struct *p;
> +
> +	if (likely(cpumask_empty(&rcu_wake_mask)))
> +		return;
> +
> +	raw_spin_lock_irqsave(&rcu_wake_lock, flags);
> +	cpumask_copy(&rcu_wake_mask_copy, &rcu_wake_mask);
> +	cpumask_clear(&rcu_wake_mask);
> +	raw_spin_unlock_irqrestore(&rcu_wake_lock, flags);
> +
> +	for_each_cpu(cpu, &rcu_wake_mask_copy) {
> +		r = &per_cpu(rcu_data, cpu);
> +		p = r->krcud;
> +		if (p && p->state != TASK_RUNNING) {
> +			wake_up_process(p);
> +			rcu_delayed_wake_count++;
> +		}
> +	}
> +}

Hmmm....  I wonder if it would make sense to use RCU_SOFTIRQ for
the delay, where needed?
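
A sketch of what I have in mind, with a made-up handler name -- and with
the caveat that raise_softirq() can itself do a wake_up_process() of
ksoftirqd when called outside of irq context, so the under-rq->lock case
would still need thought:

	/* Registered once at init time: */
	static void rcu_wakeup_softirq(struct softirq_action *unused)
	{
		do_delayed_rcu_daemon_wakeups();
	}

	open_softirq(RCU_SOFTIRQ, rcu_wakeup_softirq);

	/* ...and rcu_wake_daemon_delayed() would then end with: */
	raise_softirq(RCU_SOFTIRQ);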

> +void rcu_wake_daemon_delayed(struct rcu_data *r)
> +{
> +	unsigned long flags;
> +	raw_spin_lock_irqsave(&rcu_wake_lock, flags);
> +	cpumask_set_cpu(task_cpu(r->krcud), &rcu_wake_mask);
> +	raw_spin_unlock_irqrestore(&rcu_wake_lock, flags);
> +}
> +
> +/*
> + * Wake rcu daemon if it is not already running.  Note that
> + * we avoid invoking wake_up_process if RCU is being used under
> + * the rq lock.
> + */
> +void rcu_wake_daemon(struct rcu_data *r)
> +{
> +	struct task_struct *p = r->krcud;
> +
> +	if (p && p->state != TASK_RUNNING) {
> +#ifdef BROKEN
> +		/* runqueue_is_locked is racy, let us use only
> +		 * the delayed approach.
> +		 */
> +		if (unlikely(runqueue_is_locked(smp_processor_id())))
> +			rcu_wake_daemon_delayed(r);
> +		else
> +			wake_up_process(p);
> +#else
> +		rcu_wake_daemon_delayed(r);
> +#endif
> +	}
> +}
> +
> +/**
> + * rcu_read_lock - mark the beginning of an RCU read-side critical section.
> + *
> + * When synchronize_rcu() is invoked on one CPU while other CPUs
> + * are within RCU read-side critical sections, then the
> + * synchronize_rcu() is guaranteed to block until after all the other
> + * CPUs exit their critical sections.  Similarly, if call_rcu() is invoked
> + * on one CPU while other CPUs are within RCU read-side critical
> + * sections, invocation of the corresponding RCU callback is deferred
> + * until after the all the other CPUs exit their critical sections.
> + *
> + * Note, however, that RCU callbacks are permitted to run concurrently
> + * with RCU read-side critical sections.  One way that this can happen
> + * is via the following sequence of events: (1) CPU 0 enters an RCU
> + * read-side critical section, (2) CPU 1 invokes call_rcu() to register
> + * an RCU callback, (3) CPU 0 exits the RCU read-side critical section,
> + * (4) CPU 2 enters a RCU read-side critical section, (5) the RCU
> + * callback is invoked.  This is legal, because the RCU read-side critical
> + * section that was running concurrently with the call_rcu() (and which
> + * therefore might be referencing something that the corresponding RCU
> + * callback would free up) has completed before the corresponding
> + * RCU callback is invoked.
> + *
> + * RCU read-side critical sections may be nested.  Any deferred actions
> + * will be deferred until the outermost RCU read-side critical section
> + * completes.
> + *
> + * It is illegal to block while in an RCU read-side critical section.
> + */
> +void __rcu_read_lock(void)
> +{
> +	struct rcu_data *r;
> +
> +	r = &per_cpu(rcu_data, smp_processor_id());
> +	if (r->nest_count++ == 0)
> +		/*
> +		 * Set the flags value to show that we are in
> +		 * a read side critical section.  The code starting
> +		 * a batch uses this to determine if a processor
> +		 * needs to participate in the batch.  Including
> +		 * a sequence allows the remote processor to tell
> +		 * that a critical section has completed and another
> +		 * has begun.
> +		 */
> +		r->flags = IN_RCU_READ_LOCK | (r->sequence++ << 2);

It seems to me that we need a memory barrier here -- what am I missing?
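
That is, perhaps something like the following, so that the critical
section's loads cannot be reordered ahead of the ->flags store (otherwise
the poller could declare this CPU quiescent while it is already reading
RCU-protected data):

	if (r->nest_count++ == 0) {
		r->flags = IN_RCU_READ_LOCK | (r->sequence++ << 2);
		smp_mb();	/* order ->flags store before critical-section loads */
	}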

> +}
> +EXPORT_SYMBOL(__rcu_read_lock);
> +
> +/**
> + * rcu_read_unlock - marks the end of an RCU read-side critical section.
> + * Check if a RCU batch was started while we were in the critical
> + * section.  If so, call rcu_quiescent() join the rendezvous.
> + *
> + * See rcu_read_lock() for more information.
> + */
> +void __rcu_read_unlock(void)
> +{
> +	struct rcu_data *r;
> +	int	cpu, flags;
> +
> +	cpu = smp_processor_id();
> +	r = &per_cpu(rcu_data, cpu);
> +	if (--r->nest_count == 0) {
> +		flags = xchg(&r->flags, 0);
> +		if (flags & DO_RCU_COMPLETION)
> +			rcu_quiescent(cpu);
> +	}
> +}
> +EXPORT_SYMBOL(__rcu_read_unlock);
> +
> +/**
> + * call_rcu - Queue an RCU callback for invocation after a grace period.
> + * @head: structure to be used for queueing the RCU updates.
> + * @func: actual update function to be invoked after the grace period
> + *
> + * The update function will be invoked some time after a full grace
> + * period elapses, in other words after all currently executing RCU
> + * read-side critical sections have completed.  RCU read-side critical
> + * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
> + * and may be nested.
> + */
> +void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
> +{
> +	struct rcu_data *r;
> +	unsigned long flags;
> +	int cpu;
> +
> +	head->func = func;
> +	head->next = NULL;
> +	local_irq_save(flags);
> +	cpu = smp_processor_id();
> +	r = &per_cpu(rcu_data, cpu);
> +	/*
> +	 * Avoid mixing new entries with batches which have already
> +	 * completed or have a grace period in progress.
> +	 */
> +	if (r->nxt.head && rcu_move_if_done(r))
> +		rcu_wake_daemon(r);
> +
> +	rcu_list_add(&r->nxt, head);
> +	if (r->nxtcount++ == 0) {
> +		r->nxtbatch = (rcu_batch & RCU_BATCH_MASK) + RCU_INCREMENT;
> +		barrier();
> +		if (!rcu_timestamp)
> +			rcu_timestamp = jiffies ?: 1;
> +	}
> +	/* If we reach the limit start a batch. */
> +	if (r->nxtcount > rcu_max_count) {
> +		if (rcu_set_state(RCU_NEXT_PENDING) == RCU_COMPLETE)
> +			rcu_start_batch();
> +	}
> +	local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu);
> +
> +/*
> + * Revisit - my patch treats any code not protected by rcu_read_lock(),
> + * rcu_read_unlock() as a quiescent state.  I suspect that the call_rcu_bh()
> + * interface is not needed.
> + */
> +void call_rcu_bh(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
> +{
> +	call_rcu(head, func);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu_bh);
> +
> +static void rcu_barrier_callback(struct rcu_head *notused)
> +{
> +	if (atomic_dec_and_test(&rcu_barrier_cpu_count))
> +		complete(&rcu_barrier_completion);
> +}
> +
> +/*
> + * Called with preemption disabled, and from cross-cpu IRQ context.
> + */
> +static void rcu_barrier_func(void *notused)
> +{
> +	int cpu = smp_processor_id();
> +	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
> +	struct rcu_head *head;
> +
> +	head = &rdp->barrier;
> +	atomic_inc(&rcu_barrier_cpu_count);
> +	call_rcu(head, rcu_barrier_callback);
> +}
> +
> +/**
> + * rcu_barrier - Wait until all the in-flight RCUs are complete.
> + */
> +void rcu_barrier(void)
> +{
> +	BUG_ON(in_interrupt());
> +	/* Take cpucontrol semaphore to protect against CPU hotplug */
> +	mutex_lock(&rcu_barrier_mutex);
> +	init_completion(&rcu_barrier_completion);
> +	atomic_set(&rcu_barrier_cpu_count, 0);
> +	on_each_cpu(rcu_barrier_func, NULL, 1);
> +	wait_for_completion(&rcu_barrier_completion);
> +	mutex_unlock(&rcu_barrier_mutex);
> +}
> +EXPORT_SYMBOL(rcu_barrier);
> +
> +
> +/*
> + * cpu went through a quiescent state since the beginning of the grace period.
> + * Clear it from the cpu mask and complete the grace period if it was the last
> + * cpu. Start another grace period if someone has further entries pending
> + */
> +
> +static void rcu_grace_period_complete(void)
> +{
> +	struct rcu_data *r;
> +	int cpu, last;
> +
> +	/*
> +	 * Mark the batch as complete.  If RCU_COMPLETE was
> +	 * already set we raced with another processor
> +	 * and it will finish the completion processing.
> +	 */
> +	last = rcu_set_state(RCU_COMPLETE);
> +	if (last & RCU_COMPLETE)
> +		return;
> +	/*
> +	 * If RCU_NEXT_PENDING is set, start the new batch.
> +	 */
> +	if (last & RCU_NEXT_PENDING)
> +		rcu_start_batch();
> +	/*
> +	 * Wake the krcud for any cpu which has requests queued.
> +	 */
> +	for_each_online_cpu(cpu) {
> +		r = &per_cpu(rcu_data, cpu);
> +		if (r->nxt.head || r->cur.head || r->done.head)
> +			rcu_wake_daemon(r);
> +	}
> +}
> +
> +/*
> + * rcu_quiescent() is called from rcu_read_unlock() when a
> + * RCU batch was started while the rcu_read_lock/rcu_read_unlock
> + * critical section was executing.
> + */
> +
> +void rcu_quiescent(int cpu)
> +{

What prevents two different CPUs from calling this concurrently?
Ah, apparently nothing -- the idea being that rcu_grace_period_complete()
sorts it out.  Though if the second CPU was delayed, it seems like it
might incorrectly end a subsequent grace period as follows:

o	CPU 0 clears the second-to-last bit.

o	CPU 1 clears the last bit.

o	CPU 1 sees that the mask is empty, so invokes
	rcu_grace_period_complete(), but is delayed in the function
	preamble.

o	CPU 0 sees that the mask is empty, so invokes
	rcu_grace_period_complete(), ending the grace period.
	Because the RCU_NEXT_PENDING is set, it also starts
	a new grace period.

o	CPU 1 continues in rcu_grace_period_complete(), incorrectly
	ending the new grace period.

Or am I missing something here?

> +	cpu_clear(cpu, rcu_state.rcu_cpu_mask);
> +	if (cpus_empty(rcu_state.rcu_cpu_mask))
> +		rcu_grace_period_complete();
> +}
> +
> +/*
> + * Check if the other cpus are in rcu_read_lock/rcu_read_unlock protected code.
> + * If not they are assumed to be quiescent and we can clear the bit in
> + * bitmap.  If not set DO_RCU_COMPLETION to request a quiescent point on
> + * the rcu_read_unlock.
> + *
> + * Do this in two passes.  On the first pass we sample the flags value.
> + * The second pass only looks at processors which were found in the read
> + * side critical section on the first pass.  The flags value contains
> + * a sequence value so we can tell if the processor has completed a
> + * critical section even if it has started another.
> + */
> +long rcu_grace_periods;
> +long rcu_count1;
> +long rcu_count2;
> +long rcu_count3;

The above three rcu_countN variables are for debug, correct?

> +void rcu_poll_other_cpus(void)
> +{
> +	struct rcu_data *r;
> +	int cpu;
> +	cpumask_t mask;
> +	unsigned int f, flags[NR_CPUS];

The NR_CPUS array will be a problem for large numbers of CPUs, but
this can be worked around.
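
For example, one workaround would be to stash the first-pass sample in
rcu_data itself -- only the CPU driving the grace period ever writes it --
say in a hypothetical ->poll_flags_snap field:

	for_each_online_cpu(cpu) {
		r = &per_cpu(rcu_data, cpu);
		r->poll_flags_snap = r->flags;
		if (r->poll_flags_snap == 0) {
			cpu_clear(cpu, rcu_state.rcu_cpu_mask);
			rcu_count1++;
		}
	}

with the second pass comparing r->flags against r->poll_flags_snap.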

> +	rcu_grace_periods++;
> +	for_each_online_cpu(cpu) {
> +		r = &per_cpu(rcu_data, cpu);
> +		f = flags[cpu] = r->flags;
> +		if (f == 0) {
> +			cpu_clear(cpu, rcu_state.rcu_cpu_mask);
> +			rcu_count1++;
> +		}
> +	}

My first thought was that we needed a memory barrier here, but after some
thought the fact that we are accessing the same ->flags fields before
and after, and without any interdependencies among these variables,
seems to be why you don't need a barrier.

> +	mask = rcu_state.rcu_cpu_mask;
> +	for_each_cpu_mask(cpu, mask) {
> +		r = &per_cpu(rcu_data, cpu);
> +		/*
> +		 * If the remote processor is still in the same read-side
> +		 * critical section set DO_RCU_COMPLETION to request that
> +		 * the cpu participate in the grace period.
> +		 */
> +		f = r->flags;
> +		if (f == flags[cpu])
> +			f = cmpxchg(&r->flags, f, f | DO_RCU_COMPLETION);
> +		/*
> +		 * If the other processors flags value changes before
> +		 * the cmpxchg() that processor is nolonger in the
> +		 * read-side critical section so we clear its bit.
> +		 */
> +		if (f != flags[cpu]) {
> +			cpu_clear(cpu, rcu_state.rcu_cpu_mask);
> +			rcu_count2++;
> +		} else
> +			rcu_count3++;
> +
> +	}

At this point, one of the CPUs that we hit with DO_RCU_COMPLETION might
have finished the grace period.  So how do we know that we are still
in the same grace period that we were in when this function was called?

If this is a real problem rather than a figment of my imagination,
then one way to solve it would be to set a local flag if we
set DO_RCU_COMPLETION on any CPU's ->flags field, and to invoke
rcu_grace_period_complete() only if that local flag is clear.
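
In other words, something like this (untested, reusing your names) in
rcu_poll_other_cpus():

	int deferred = 0;

	for_each_cpu_mask(cpu, mask) {
		r = &per_cpu(rcu_data, cpu);
		f = r->flags;
		if (f == flags[cpu])
			f = cmpxchg(&r->flags, f, f | DO_RCU_COMPLETION);
		if (f != flags[cpu]) {
			cpu_clear(cpu, rcu_state.rcu_cpu_mask);
			rcu_count2++;
		} else {
			rcu_count3++;
			deferred = 1;	/* this cpu will report via rcu_quiescent() */
		}
	}
	if (!deferred && cpus_empty(rcu_state.rcu_cpu_mask))
		rcu_grace_period_complete();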

> +	if (cpus_empty(rcu_state.rcu_cpu_mask))
> +		rcu_grace_period_complete();
> +}
> +
> +/*
> + * Grace period handling:
> + * The grace period handling consists out of two steps:
> + * - A new grace period is started.
> + *   This is done by rcu_start_batch. The rcu_poll_other_cpus()
> + *   call drives the synchronization.  It loops checking if each
> + *   of the other cpus are executing in a rcu_read_lock/rcu_read_unlock
> + *   critical section.  The flags word for the cpus it finds in a
> + *   rcu_read_lock/rcu_read_unlock critical section will be updated to
> + *   request a rcu_quiescent() call.
> + * - Each of the cpus which were in the rcu_read_lock/rcu_read_unlock
> + *   critical section will eventually call rcu_quiescent() and clear
> + *   the bit corresponding to their cpu in rcu_state.rcu_cpu_mask.
> + * - The processor which clears the last bit wakes the krcud for
> + *   the cpus which have rcu callback requests queued.
> + *
> + * The process of starting a batch is arbitrated with the RCU_COMPLETE &
> + * RCU_NEXT_PENDING bits. These bits can be set in either order but the
> + * thread which sets the second bit must call rcu_start_batch().
> + * Multiple processors might try to set these bits at the same time.
> + * By using cmpxchg() we can determine which processor actually set
> + * the bit and be sure that only a single thread trys to start the batch.
> + *
> + */
> +static void rcu_start_batch(void)
> +{
> +	long batch, new;
> +
> +	batch = rcu_batch;
> +	BUG_ON((batch & (RCU_COMPLETE|RCU_NEXT_PENDING)) !=
> +		(RCU_COMPLETE|RCU_NEXT_PENDING));
> +	rcu_timestamp = 0;
> +	smp_mb();
> +	/*
> +	 * nohz_cpu_mask can go away because only cpus executing
> +	 * rcu_read_lock/rcu_read_unlock critical sections need to
> +	 * participate in the rendezvous.
> +	 */
> +	cpumask_andnot(&rcu_state.rcu_cpu_mask, cpu_online_mask, nohz_cpu_mask);

Hmmm...  Suppose that a CPU has its nohz_cpu_mask bit set, but is
currently in an interrupt handler where it is executing RCU read-side
critical sections?  If I understand this code correctly, any such
read-side critical sections would be incorrectly ignored.  (And yes,
we did have a similar bug in mainline for something like 5 years before
people started hitting it.)

> +	new = (batch & RCU_BATCH_MASK) + RCU_INCREMENT;
> +	smp_mb();
> +	rcu_batch = new;
> +	smp_mb();
> +	rcu_poll_other_cpus();
> +}
> +
> +
> +
> +#ifdef CONFIG_HOTPLUG_CPU
> +
> +static void rcu_offline_cpu(int cpu)
> +{
> +	struct rcu_data *this_rdp = &get_cpu_var(rcu_data);
> +	struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
> +
> +#if 0
> +	/*
> +	 * The cpu should not have been in a read side critical
> +	 * section when it was removed.  So this code is not needed.
> +	 */
> +	/* if the cpu going offline owns the grace period
> +	 * we can block indefinitely waiting for it, so flush
> +	 * it here
> +	 */
> +	if (!(rcu_batch & RCU_COMPLETE))
> +		rcu_quiescent(cpu);
> +#endif
> +	local_irq_disable();
> +	/*
> +	 * The rcu lists are per-cpu private data only protected by
> +	 * disabling interrupts.  Since we know the other cpu is dead
> +	 * it should not be manipulating these lists.
> +	 */
> +	rcu_list_move(&this_rdp->cur, &rdp->cur);
> +	rcu_list_move(&this_rdp->nxt, &rdp->nxt);
> +	this_rdp->nxtbatch = (rcu_batch & RCU_BATCH_MASK) + RCU_INCREMENT;
> +	local_irq_enable();
> +	put_cpu_var(rcu_data);
> +}
> +
> +#else
> +
> +static inline void rcu_offline_cpu(int cpu)
> +{
> +}
> +
> +#endif
> +
> +/*
> + * Process the completed RCU callbacks.
> + */
> +static void rcu_process_callbacks(struct rcu_data *r)
> +{
> +	struct rcu_head *list, *next;
> +
> +	local_irq_disable();
> +	rcu_move_if_done(r);
> +	list = r->done.head;
> +	rcu_list_init(&r->done);
> +	local_irq_enable();
> +
> +	while (list) {
> +		next = list->next;
> +		list->func(list);
> +		list = next;

For large systems, we need to limit the number of callbacks executed
in one shot, but this is easy to fix.
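
For example, something like the following in rcu_process_callbacks(),
where rcu_callback_limit is a made-up knob along the lines of mainline's
blimit:

	long count = 0;

	while (list && count++ < rcu_callback_limit) {
		next = list->next;
		list->func(list);
		list = next;
	}
	if (list) {
		/* Requeue the unprocessed remainder; rcu_pending() will
		 * then send krcud back here after a cond_resched(). */
		local_irq_disable();
		*r->done.tail = list;
		while (*r->done.tail)
			r->done.tail = &(*r->done.tail)->next;
		local_irq_enable();
	}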

> +	}
> +}
> +
> +/*
> + * Poll rcu_timestamp to start a RCU batch if there are
> + * any pending request which have been waiting longer
> + * than rcu_max_time.
> + */
> +struct timer_list rcu_timer;
> +
> +void rcu_timeout(unsigned long unused)
> +{
> +	do_delayed_rcu_daemon_wakeups();
> +
> +	if (rcu_timestamp
> +	&& time_after(jiffies, (rcu_timestamp + rcu_max_time))) {
> +		if (rcu_set_state(RCU_NEXT_PENDING) == RCU_COMPLETE)
> +			rcu_start_batch();
> +	}
> +	init_timer(&rcu_timer);
> +	rcu_timer.expires = jiffies + (rcu_max_time/2?:1);
> +	add_timer(&rcu_timer);

Ah, a self-spawning timer.  This needs to be on a "sacrificial lamb"
CPU.
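
One way to make the choice of lamb explicit would be to pin the initial
timer with add_timer_on(), housekeeping_cpu being whatever CPU is chosen
to take the hit:

	init_timer(&rcu_timer);
	rcu_timer.function = rcu_timeout;
	rcu_timer.expires = jiffies + (rcu_max_time/2 ?: 1);
	add_timer_on(&rcu_timer, housekeeping_cpu);

Since rcu_timeout() re-arms with add_timer() from the expiry path, the
timer should then stay on that CPU (modulo timer migration when the CPU
is idle).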

> +}
> +
> +static void __devinit rcu_online_cpu(int cpu)
> +{
> +	struct rcu_data *r = &per_cpu(rcu_data, cpu);
> +
> +	memset(&per_cpu(rcu_data, cpu), 0, sizeof(struct rcu_data));
> +	rcu_list_init(&r->nxt);
> +	rcu_list_init(&r->cur);
> +	rcu_list_init(&r->done);
> +}
> +
> +int rcu_pending(struct rcu_data *r)
> +{
> +	return r->done.head ||
> +		(r->cur.head && rcu_batch_complete(r->batch)) ||
> +		(r->nxt.head && rcu_batch_complete(r->nxtbatch));
> +}
> +
> +static int krcud(void *__bind_cpu)
> +{
> +	int cpu = (int)(long) __bind_cpu;
> +	struct rcu_data *r = &per_cpu(rcu_data, cpu);
> +
> +	set_user_nice(current, 19);
> +	current->flags |= PF_NOFREEZE;
> +
> +	set_current_state(TASK_INTERRUPTIBLE);
> +
> +	while (!kthread_should_stop()) {
> +		if (!rcu_pending(r))
> +			schedule();
> +
> +		__set_current_state(TASK_RUNNING);
> +
> +		while (rcu_pending(r)) {
> +			/* Preempt disable stops cpu going offline.
> +			   If already offline, we'll be on wrong CPU:
> +			   don't process */
> +			preempt_disable();
> +			if (cpu_is_offline((long)__bind_cpu))
> +				goto wait_to_die;
> +			preempt_enable();
> +			rcu_process_callbacks(r);
> +			cond_resched();
> +		}
> +
> +		set_current_state(TASK_INTERRUPTIBLE);
> +	}
> +	__set_current_state(TASK_RUNNING);
> +	return 0;
> +
> +wait_to_die:
> +	preempt_enable();
> +	/* Wait for kthread_stop */
> +	set_current_state(TASK_INTERRUPTIBLE);
> +	while (!kthread_should_stop()) {
> +		schedule();
> +		set_current_state(TASK_INTERRUPTIBLE);
> +	}
> +	__set_current_state(TASK_RUNNING);
> +	return 0;
> +}
> +
> +static int __devinit rcu_cpu_notify(struct notifier_block *nfb,
> +				  unsigned long action,
> +				  void *hcpu)
> +{
> +	int cpu = (unsigned long)hcpu;
> +	struct rcu_data *r = &per_cpu(rcu_data, cpu);
> +	struct task_struct *p;
> +
> +	switch (action) {
> +	case CPU_UP_PREPARE:
> +		rcu_online_cpu(cpu);
> +		p = kthread_create(krcud, hcpu, "krcud/%d", cpu);
> +		if (IS_ERR(p)) {
> +			printk(KERN_INFO "krcud for %i failed\n", cpu);
> +			return NOTIFY_BAD;
> +		}
> +		kthread_bind(p, cpu);
> +		r->krcud = p;
> +		break;
> +	case CPU_ONLINE:
> +		wake_up_process(r->krcud);
> +		break;
> +#ifdef CONFIG_HOTPLUG_CPU
> +	case CPU_UP_CANCELED:
> +		/* Unbind so it can run.  Fall thru. */
> +		kthread_bind(r->krcud, smp_processor_id());
> +	case CPU_DEAD:
> +		p = r->krcud;
> +		r->krcud = NULL;
> +		kthread_stop(p);
> +		rcu_offline_cpu(cpu);
> +		break;
> +#endif /* CONFIG_HOTPLUG_CPU */
> +	}
> +	return NOTIFY_OK;
> +}
> +
> +static struct notifier_block __devinitdata rcu_nb = {
> +	.notifier_call	= rcu_cpu_notify,
> +};
> +
> +static __init int spawn_krcud(void)
> +{
> +	void *cpu = (void *)(long)smp_processor_id();
> +	rcu_cpu_notify(&rcu_nb, CPU_UP_PREPARE, cpu);
> +	rcu_cpu_notify(&rcu_nb, CPU_ONLINE, cpu);
> +	register_cpu_notifier(&rcu_nb);
> +	return 0;
> +}
> +early_initcall(spawn_krcud);
> +/*
> + * Initializes rcu mechanism.  Assumed to be called early.
> + * That is before local timer(SMP) or jiffie timer (uniproc) is setup.
> + * Note that rcu_qsctr and friends are implicitly
> + * initialized due to the choice of ``0'' for RCU_CTR_INVALID.
> + */
> +void __init rcu_init(void)
> +{
> +	mutex_init(&rcu_barrier_mutex);
> +	rcu_online_cpu(smp_processor_id());
> +	/*
> +	 * Use a timer to catch the elephants which would otherwise
> +	 * fall through the cracks on local timer shielded cpus.
> +	 */
> +	init_timer(&rcu_timer);
> +	rcu_timer.function = rcu_timeout;
> +	rcu_timer.expires = jiffies + (rcu_max_time/2?:1);
> +	add_timer(&rcu_timer);

OK, so CPU 0 is apparently the sacrificial lamb for the timer.

> +}
> +
> +
> +struct rcu_synchronize {
> +	struct rcu_head head;
> +	struct completion completion;
> +};
> +
> +/* Because of FASTCALL declaration of complete, we use this wrapper */
> +static void wakeme_after_rcu(struct rcu_head  *head)
> +{
> +	struct rcu_synchronize *rcu;
> +
> +	rcu = container_of(head, struct rcu_synchronize, head);
> +	complete(&rcu->completion);
> +}
> +
> +/**
> + * synchronize_rcu - wait until a grace period has elapsed.
> + *
> + * Control will return to the caller some time after a full grace
> + * period has elapsed, in other words after all currently executing RCU
> + * read-side critical sections have completed.  RCU read-side critical
> + * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
> + * and may be nested.
> + *
> + * If your read-side code is not protected by rcu_read_lock(), do -not-
> + * use synchronize_rcu().
> + */
> +void synchronize_rcu(void)
> +{
> +	struct rcu_synchronize rcu;
> +
> +	init_completion(&rcu.completion);
> +	/* Will wake me after RCU finished */
> +	call_rcu(&rcu.head, wakeme_after_rcu);
> +
> +	/* Wait for it */
> +	wait_for_completion(&rcu.completion);
> +}
> +EXPORT_SYMBOL_GPL(synchronize_rcu);
> +
> +/*
> + * Deprecated, use synchronize_rcu() or synchronize_sched() instead.
> + */
> +void synchronize_kernel(void)
> +{
> +	synchronize_rcu();
> +}
> +EXPORT_SYMBOL(synchronize_kernel);
> +
> +module_param(rcu_max_count, long, 0644);
> +module_param(rcu_max_time, long, 0644);
> Index: b/kernel/sysctl.c
> ===================================================================
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -215,6 +215,10 @@ extern struct ctl_table random_table[];
>  extern struct ctl_table epoll_table[];
>  #endif
> 
> +#ifdef CONFIG_SHIELDING_RCU
> +extern ctl_table rcu_table[];
> +#endif
> +
>  #ifdef HAVE_ARCH_PICK_MMAP_LAYOUT
>  int sysctl_legacy_va_layout;
>  #endif
> @@ -808,6 +812,13 @@ static struct ctl_table kern_table[] = {
>  		.proc_handler	= proc_dointvec,
>  	},
>  #endif
> +#ifdef CONFIG_SHIELDING_RCU
> +	{
> +		.procname	= "rcu",
> +		.mode		= 0555,
> +		.child		= rcu_table,
> +	},
> +#endif
>  #if defined(CONFIG_S390) && defined(CONFIG_SMP)
>  	{
>  		.procname	= "spin_retry",
> Index: b/kernel/timer.c
> ===================================================================
> --- a/kernel/timer.c
> +++ b/kernel/timer.c
> @@ -1272,12 +1272,15 @@ unsigned long get_next_timer_interrupt(u
>  void update_process_times(int user_tick)
>  {
>  	struct task_struct *p = current;
> -	int cpu = smp_processor_id();
> 
>  	/* Note: this timer irq context must be accounted for as well. */
>  	account_process_tick(p, user_tick);
>  	run_local_timers();
> -	rcu_check_callbacks(cpu, user_tick);
> +#ifndef CONFIG_SHIELDING_RCU
> +	rcu_check_callbacks(smp_processor_id(), user_tick);
> +#else
> +	do_delayed_rcu_daemon_wakeups();
> +#endif
>  	printk_tick();
>  	perf_event_do_pending();
>  	scheduler_tick();
> Index: b/lib/Kconfig.debug
> ===================================================================
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -791,6 +791,7 @@ config BOOT_PRINTK_DELAY
>  config RCU_TORTURE_TEST
>  	tristate "torture tests for RCU"
>  	depends on DEBUG_KERNEL
> +	depends on !SHIELDING_RCU
>  	default n
>  	help
>  	  This option provides a kernel module that runs torture tests
> Index: b/init/Kconfig
> ===================================================================
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -365,6 +365,13 @@ config TINY_RCU
>  	  is not required.  This option greatly reduces the
>  	  memory footprint of RCU.
> 
> +config SHIELDING_RCU
> +	bool "Shielding RCU"
> +	help
> +	  This option selects the RCU implementation that does not
> +	  depend on a per-cpu periodic interrupt to do garbage
> +	  collection.  This is good when one is trying to shield
> +	  some set of CPUs from as much system activity as possible.
>  endchoice
> 
>  config RCU_TRACE
> Index: b/include/linux/hardirq.h
> ===================================================================
> --- a/include/linux/hardirq.h
> +++ b/include/linux/hardirq.h
> @@ -138,7 +138,12 @@ static inline void account_system_vtime(
>  }
>  #endif
> 
> -#if defined(CONFIG_NO_HZ)
> +#if defined(CONFIG_SHIELDING_RCU)
> +# define rcu_irq_enter() do { } while (0)
> +# define rcu_irq_exit() do { } while (0)
> +# define rcu_nmi_enter() do { } while (0)
> +# define rcu_nmi_exit() do { } while (0)
> +#elif defined(CONFIG_NO_HZ)
>  #if defined(CONFIG_TINY_RCU)
>  extern void rcu_enter_nohz(void);
>  extern void rcu_exit_nohz(void);
> @@ -161,13 +166,13 @@ static inline void rcu_nmi_exit(void)
>  {
>  }
> 
> -#else
> +#else /* !CONFIG_TINY_RCU */
>  extern void rcu_irq_enter(void);
>  extern void rcu_irq_exit(void);
>  extern void rcu_nmi_enter(void);
>  extern void rcu_nmi_exit(void);
>  #endif
> -#else
> +#else /* !CONFIG_NO_HZ */
>  # define rcu_irq_enter() do { } while (0)
>  # define rcu_irq_exit() do { } while (0)
>  # define rcu_nmi_enter() do { } while (0)
> Index: b/kernel/sysctl_binary.c
> ===================================================================
> --- a/kernel/sysctl_binary.c
> +++ b/kernel/sysctl_binary.c
> @@ -61,6 +61,11 @@ static const struct bin_table bin_pty_ta
>  	{}
>  };
> 
> +static const struct bin_table bin_rcu_table[] = {
> +	{ CTL_INT,	RCU_BATCH,	"batch" },
> +	{}
> +};
> +
>  static const struct bin_table bin_kern_table[] = {
>  	{ CTL_STR,	KERN_OSTYPE,			"ostype" },
>  	{ CTL_STR,	KERN_OSRELEASE,			"osrelease" },
> @@ -138,6 +143,7 @@ static const struct bin_table bin_kern_t
>  	{ CTL_INT,	KERN_MAX_LOCK_DEPTH,		"max_lock_depth" },
>  	{ CTL_INT,	KERN_NMI_WATCHDOG,		"nmi_watchdog" },
>  	{ CTL_INT,	KERN_PANIC_ON_NMI,		"panic_on_unrecovered_nmi" },
> +	{ CTL_DIR,	KERN_RCU,			"rcu", bin_rcu_table },
>  	{}
>  };
> 
> Index: b/kernel/sched.c
> ===================================================================
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -9119,6 +9119,7 @@ struct cgroup_subsys cpuacct_subsys = {
>  };
>  #endif	/* CONFIG_CGROUP_CPUACCT */
> 
> +#ifndef CONFIG_SHIELDING_RCU
>  #ifndef CONFIG_SMP
> 
>  void synchronize_sched_expedited(void)
> @@ -9188,3 +9189,4 @@ void synchronize_sched_expedited(void)
>  EXPORT_SYMBOL_GPL(synchronize_sched_expedited);
> 
>  #endif /* #else #ifndef CONFIG_SMP */
> +#endif /* CONFIG_SHIELDING_RCU */

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-06 19:28   ` Paul E. McKenney
@ 2010-11-06 19:34     ` Mathieu Desnoyers
  2010-11-06 19:42       ` Mathieu Desnoyers
  2010-11-08  2:11     ` Udo A. Steinberg
  2010-11-08 15:06     ` Frederic Weisbecker
  2 siblings, 1 reply; 63+ messages in thread
From: Mathieu Desnoyers @ 2010-11-06 19:34 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joe Korty, fweisbec, dhowells, loic.minier, dhaval.giani, tglx,
	peterz, linux-kernel, josh

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Fri, Nov 05, 2010 at 05:00:59PM -0400, Joe Korty wrote:
[...]
> > + *
> > + * RCU read-side critical sections may be nested.  Any deferred actions
> > + * will be deferred until the outermost RCU read-side critical section
> > + * completes.
> > + *
> > + * It is illegal to block while in an RCU read-side critical section.
> > + */
> > +void __rcu_read_lock(void)
> > +{
> > +	struct rcu_data *r;
> > +
> > +	r = &per_cpu(rcu_data, smp_processor_id());
> > +	if (r->nest_count++ == 0)
> > +		/*
> > +		 * Set the flags value to show that we are in
> > +		 * a read side critical section.  The code starting
> > +		 * a batch uses this to determine if a processor
> > +		 * needs to participate in the batch.  Including
> > +		 * a sequence allows the remote processor to tell
> > +		 * that a critical section has completed and another
> > +		 * has begun.
> > +		 */
> > +		r->flags = IN_RCU_READ_LOCK | (r->sequence++ << 2);
> 
> It seems to me that we need a memory barrier here -- what am I missing?

Agreed, I spotted it too. One more is needed, see below,
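
For the lock side, something along these lines ought to do it (a sketch on
top of the quoted code; the point is the barrier placement, not the exact
form):

	if (r->nest_count++ == 0) {
		r->flags = IN_RCU_READ_LOCK | (r->sequence++ << 2);
		/*
		 * Order the flags store before any loads inside the
		 * critical section, so a CPU starting a batch cannot
		 * miss this reader.
		 */
		smp_mb();
	}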

> 
> > +}
> > +EXPORT_SYMBOL(__rcu_read_lock);
> > +
> > +/**
> > + * rcu_read_unlock - marks the end of an RCU read-side critical section.
> > + * Check if a RCU batch was started while we were in the critical
> > + * section.  If so, call rcu_quiescent() join the rendezvous.
> > + *
> > + * See rcu_read_lock() for more information.
> > + */
> > +void __rcu_read_unlock(void)
> > +{
> > +	struct rcu_data *r;
> > +	int	cpu, flags;
> > +

Another memory barrier would be needed here to ensure that the memory accesses
performed within the C.S. are not reordered wrt nest_count decrement.

> > +	cpu = smp_processor_id();
> > +	r = &per_cpu(rcu_data, cpu);
> > +	if (--r->nest_count == 0) {
> > +		flags = xchg(&r->flags, 0);
> > +		if (flags & DO_RCU_COMPLETION)
> > +			rcu_quiescent(cpu);
> > +	}
> > +}
> > +EXPORT_SYMBOL(__rcu_read_unlock);

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-06 19:34     ` Mathieu Desnoyers
@ 2010-11-06 19:42       ` Mathieu Desnoyers
  2010-11-06 19:44         ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread
From: Mathieu Desnoyers @ 2010-11-06 19:42 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joe Korty, fweisbec, dhowells, loic.minier, dhaval.giani, tglx,
	peterz, linux-kernel, josh

* Mathieu Desnoyers (mathieu.desnoyers@efficios.com) wrote:
> > > +/**
> > > + * rcu_read_unlock - marks the end of an RCU read-side critical section.
> > > + * Check if a RCU batch was started while we were in the critical
> > > + * section.  If so, call rcu_quiescent() join the rendezvous.
> > > + *
> > > + * See rcu_read_lock() for more information.
> > > + */
> > > +void __rcu_read_unlock(void)
> > > +{
> > > +	struct rcu_data *r;
> > > +	int	cpu, flags;
> > > +
> 
> Another memory barrier would be needed here to ensure that the memory accesses
> performed within the C.S. are not reordered wrt nest_count decrement.

Nevermind. xchg() acts as a memory barrier, and nest_count is only ever touched
by the local CPU. No memory barrier needed here.

Thanks,

Mathieu

> 
> > > +	cpu = smp_processor_id();
> > > +	r = &per_cpu(rcu_data, cpu);
> > > +	if (--r->nest_count == 0) {
> > > +		flags = xchg(&r->flags, 0);
> > > +		if (flags & DO_RCU_COMPLETION)
> > > +			rcu_quiescent(cpu);
> > > +	}
> > > +}
> > > +EXPORT_SYMBOL(__rcu_read_unlock);
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-06 19:42       ` Mathieu Desnoyers
@ 2010-11-06 19:44         ` Paul E. McKenney
  0 siblings, 0 replies; 63+ messages in thread
From: Paul E. McKenney @ 2010-11-06 19:44 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: Joe Korty, fweisbec, dhowells, loic.minier, dhaval.giani, tglx,
	peterz, linux-kernel, josh

On Sat, Nov 06, 2010 at 03:42:19PM -0400, Mathieu Desnoyers wrote:
> * Mathieu Desnoyers (mathieu.desnoyers@efficios.com) wrote:
> > > > +/**
> > > > + * rcu_read_unlock - marks the end of an RCU read-side critical section.
> > > > + * Check if a RCU batch was started while we were in the critical
> > > > + * section.  If so, call rcu_quiescent() join the rendezvous.
> > > > + *
> > > > + * See rcu_read_lock() for more information.
> > > > + */
> > > > +void __rcu_read_unlock(void)
> > > > +{
> > > > +	struct rcu_data *r;
> > > > +	int	cpu, flags;
> > > > +
> > 
> > Another memory barrier would be needed here to ensure that the memory accesses
> > performed within the C.S. are not reordered wrt nest_count decrement.
> 
> Nevermind. xchg() acts as a memory barrier, and nest_count is only ever touched
> by the local CPU. No memory barrier needed here.

You beat me to it.  ;-)

							Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> > 
> > > > +	cpu = smp_processor_id();
> > > > +	r = &per_cpu(rcu_data, cpu);
> > > > +	if (--r->nest_count == 0) {
> > > > +		flags = xchg(&r->flags, 0);
> > > > +		if (flags & DO_RCU_COMPLETION)
> > > > +			rcu_quiescent(cpu);
> > > > +	}
> > > > +}
> > > > +EXPORT_SYMBOL(__rcu_read_unlock);
> > 
> > Thanks,
> > 
> > Mathieu
> > 
> > -- 
> > Mathieu Desnoyers
> > Operating System Efficiency R&D Consultant
> > EfficiOS Inc.
> > http://www.efficios.com
> 
> -- 
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-05 21:00 ` [PATCH] a local-timer-free version of RCU Joe Korty
  2010-11-06 19:28   ` Paul E. McKenney
@ 2010-11-06 20:03   ` Mathieu Desnoyers
  2010-11-09  9:22   ` Lai Jiangshan
  2 siblings, 0 replies; 63+ messages in thread
From: Mathieu Desnoyers @ 2010-11-06 20:03 UTC (permalink / raw)
  To: Joe Korty
  Cc: Paul E. McKenney, fweisbec, dhowells, loic.minier, dhaval.giani,
	tglx, peterz, linux-kernel, josh

Hi Joe,

Thanks for sending these patches. Here are some comments below,

* Joe Korty (joe.korty@ccur.com) wrote:
[...]
> 
> Jim Houston's timer-less version of RCU.
> 	
> This rather ancient version of RCU handles RCU garbage
> collection in the absence of a per-cpu local timer
> interrupt.
> 
> This is a minimal forward port to 2.6.36.  It works,
> but it is not yet a complete implementation of RCU.
> 
> Developed-by: Jim Houston <jim.houston@ccur.com>
> Signed-off-by: Joe Korty <joe.korty@ccur.com>
> 
[...]
> +/*
> + * rcu_quiescent() is called from rcu_read_unlock() when a
> + * RCU batch was started while the rcu_read_lock/rcu_read_unlock
> + * critical section was executing.
> + */
> +
> +void rcu_quiescent(int cpu)
> +{
> +	cpu_clear(cpu, rcu_state.rcu_cpu_mask);
> +	if (cpus_empty(rcu_state.rcu_cpu_mask))
> +		rcu_grace_period_complete();
> +}

There seems to be a race here when the number of CPUs is large enough to occupy
a bitmap larger than one word. Two racing unlocks could each clear their own bit
on different words, with each thinking that the cpu mask is not yet empty, which
would hold the grace period forever.
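
A sketch of one way to close it (rcu_cpus_left is hypothetical; it would be
set to the number of participating CPUs when a batch is started, so the last
CPU to report is decided atomically no matter how wide the bitmap is):

	static atomic_t rcu_cpus_left;

	void rcu_quiescent(int cpu)
	{
		cpu_clear(cpu, rcu_state.rcu_cpu_mask);
		if (atomic_dec_and_test(&rcu_cpus_left))
			rcu_grace_period_complete();
	}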

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-06 19:28   ` Paul E. McKenney
  2010-11-06 19:34     ` Mathieu Desnoyers
@ 2010-11-08  2:11     ` Udo A. Steinberg
  2010-11-08  2:19       ` Udo A. Steinberg
  2010-11-08 15:06     ` Frederic Weisbecker
  2 siblings, 1 reply; 63+ messages in thread
From: Udo A. Steinberg @ 2010-11-08  2:11 UTC (permalink / raw)
  To: paulmck
  Cc: Joe Korty, fweisbec, mathieu.desnoyers, dhowells, loic.minier,
	dhaval.giani, tglx, peterz, linux-kernel, josh

On Sat, 6 Nov 2010 12:28:12 -0700 Paul E. McKenney (PEM) wrote:

PEM> > + * rcu_quiescent() is called from rcu_read_unlock() when a
PEM> > + * RCU batch was started while the rcu_read_lock/rcu_read_unlock
PEM> > + * critical section was executing.
PEM> > + */
PEM> > +
PEM> > +void rcu_quiescent(int cpu)
PEM> > +{
PEM> 
PEM> What prevents two different CPUs from calling this concurrently?
PEM> Ah, apparently nothing -- the idea being that
PEM> rcu_grace_period_complete() sorts it out.  Though if the second CPU was
PEM> delayed, it seems like it might incorrectly end a subsequent grace
PEM> period as follows:
PEM> 
PEM> o	CPU 0 clears the second-to-last bit.
PEM> 
PEM> o	CPU 1 clears the last bit.
PEM> 
PEM> o	CPU 1 sees that the mask is empty, so invokes
PEM> 	rcu_grace_period_complete(), but is delayed in the function
PEM> 	preamble.
PEM> 
PEM> o	CPU 0 sees that the mask is empty, so invokes
PEM> 	rcu_grace_period_complete(), ending the grace period.
PEM> 	Because the RCU_NEXT_PENDING is set, it also starts
PEM> 	a new grace period.
PEM> 
PEM> o	CPU 1 continues in rcu_grace_period_complete(), incorrectly
PEM> 	ending the new grace period.
PEM> 
PEM> Or am I missing something here?

The scenario you describe seems possible. However, it should be easily fixed
by passing the perceived batch number as another parameter to rcu_set_state()
and making it part of the cmpxchg. So if the caller tries to set state bits
on a stale batch number (e.g., batch != rcu_batch), it can be detected.
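
Roughly like this -- a sketch only, assuming rcu_batch is a single word
carrying both the batch number and the state bits, as the
RCU_BATCH_MASK/RCU_COMPLETE tests in the patch suggest; the real signature
may well differ:

	static long rcu_set_state(long perceived_batch, long state)
	{
		long old = rcu_batch;

		/* Refuse to act on behalf of a stale batch. */
		if ((old & RCU_BATCH_MASK) != (perceived_batch & RCU_BATCH_MASK))
			return -1;

		/* The batch number is now part of the cmpxchg. */
		if (cmpxchg(&rcu_batch, old, old | state) != old)
			return -1;	/* a real version would re-read and retry */

		return old & (RCU_COMPLETE | RCU_NEXT_PENDING);
	}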

There is a similar, although harmless, issue in call_rcu(): Two CPUs can
concurrently add callbacks to their respective nxt list and compute the same
value for nxtbatch. One CPU succeeds in setting the PENDING bit while
observing COMPLETE to be clear, so it starts a new batch. Afterwards, the
other CPU also sets the PENDING bit, but this time for the next batch. So
it ends up requesting nxtbatch+1, although there is no need to. This also
would be fixed by making the batch number part of the cmpxchg.

Cheers,

	- Udo


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-08  2:11     ` Udo A. Steinberg
@ 2010-11-08  2:19       ` Udo A. Steinberg
  2010-11-08  2:54         ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread
From: Udo A. Steinberg @ 2010-11-08  2:19 UTC (permalink / raw)
  To: paulmck
  Cc: Joe Korty, fweisbec, mathieu.desnoyers, dhowells, loic.minier,
	dhaval.giani, tglx, peterz, linux-kernel, josh

On Mon, 8 Nov 2010 03:11:36 +0100 Udo A. Steinberg (UAS) wrote:

UAS> On Sat, 6 Nov 2010 12:28:12 -0700 Paul E. McKenney (PEM) wrote:
UAS> 
UAS> PEM> > + * rcu_quiescent() is called from rcu_read_unlock() when a
UAS> PEM> > + * RCU batch was started while the rcu_read_lock/rcu_read_unlock
UAS> PEM> > + * critical section was executing.
UAS> PEM> > + */
UAS> PEM> > +
UAS> PEM> > +void rcu_quiescent(int cpu)
UAS> PEM> > +{
UAS> PEM> 
UAS> PEM> What prevents two different CPUs from calling this concurrently?
UAS> PEM> Ah, apparently nothing -- the idea being that
UAS> PEM> rcu_grace_period_complete() sorts it out.  Though if the second
UAS> PEM> CPU was delayed, it seems like it might incorrectly end a
UAS> PEM> subsequent grace period as follows:
UAS> PEM> 
UAS> PEM> o	CPU 0 clears the second-to-last bit.
UAS> PEM> 
UAS> PEM> o	CPU 1 clears the last bit.
UAS> PEM> 
UAS> PEM> o	CPU 1 sees that the mask is empty, so invokes
UAS> PEM> 	rcu_grace_period_complete(), but is delayed in the function
UAS> PEM> 	preamble.
UAS> PEM> 
UAS> PEM> o	CPU 0 sees that the mask is empty, so invokes
UAS> PEM> 	rcu_grace_period_complete(), ending the grace period.
UAS> PEM> 	Because the RCU_NEXT_PENDING is set, it also starts
UAS> PEM> 	a new grace period.
UAS> PEM> 
UAS> PEM> o	CPU 1 continues in rcu_grace_period_complete(),
UAS> PEM> incorrectly ending the new grace period.
UAS> PEM> 
UAS> PEM> Or am I missing something here?
UAS> 
UAS> The scenario you describe seems possible. However, it should be easily
UAS> fixed by passing the perceived batch number as another parameter to
UAS> rcu_set_state() and making it part of the cmpxchg. So if the caller
UAS> tries to set state bits on a stale batch number (e.g., batch !=
UAS> rcu_batch), it can be detected.
UAS> 
UAS> There is a similar, although harmless, issue in call_rcu(): Two CPUs can
UAS> concurrently add callbacks to their respective nxt list and compute the
UAS> same value for nxtbatch. One CPU succeeds in setting the PENDING bit
UAS> while observing COMPLETE to be clear, so it starts a new batch.

Correction: while observing COMPLETE to be set!

UAS> Afterwards, the other CPU also sets the PENDING bit, but this time for
UAS> the next batch. So it ends up requesting nxtbatch+1, although there is
UAS> no need to. This also would be fixed by making the batch number part of
UAS> the cmpxchg.

Cheers,

	- Udo


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-08  2:19       ` Udo A. Steinberg
@ 2010-11-08  2:54         ` Paul E. McKenney
  2010-11-08 15:32           ` Frederic Weisbecker
  0 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2010-11-08  2:54 UTC (permalink / raw)
  To: Udo A. Steinberg
  Cc: Joe Korty, fweisbec, mathieu.desnoyers, dhowells, loic.minier,
	dhaval.giani, tglx, peterz, linux-kernel, josh

On Mon, Nov 08, 2010 at 03:19:36AM +0100, Udo A. Steinberg wrote:
> On Mon, 8 Nov 2010 03:11:36 +0100 Udo A. Steinberg (UAS) wrote:
> 
> UAS> On Sat, 6 Nov 2010 12:28:12 -0700 Paul E. McKenney (PEM) wrote:
> UAS> 
> UAS> PEM> > + * rcu_quiescent() is called from rcu_read_unlock() when a
> UAS> PEM> > + * RCU batch was started while the rcu_read_lock/rcu_read_unlock
> UAS> PEM> > + * critical section was executing.
> UAS> PEM> > + */
> UAS> PEM> > +
> UAS> PEM> > +void rcu_quiescent(int cpu)
> UAS> PEM> > +{
> UAS> PEM> 
> UAS> PEM> What prevents two different CPUs from calling this concurrently?
> UAS> PEM> Ah, apparently nothing -- the idea being that
> UAS> PEM> rcu_grace_period_complete() sorts it out.  Though if the second
> UAS> PEM> CPU was delayed, it seems like it might incorrectly end a
> UAS> PEM> subsequent grace period as follows:
> UAS> PEM> 
> UAS> PEM> o	CPU 0 clears the second-to-last bit.
> UAS> PEM> 
> UAS> PEM> o	CPU 1 clears the last bit.
> UAS> PEM> 
> UAS> PEM> o	CPU 1 sees that the mask is empty, so invokes
> UAS> PEM> 	rcu_grace_period_complete(), but is delayed in the function
> UAS> PEM> 	preamble.
> UAS> PEM> 
> UAS> PEM> o	CPU 0 sees that the mask is empty, so invokes
> UAS> PEM> 	rcu_grace_period_complete(), ending the grace period.
> UAS> PEM> 	Because the RCU_NEXT_PENDING is set, it also starts
> UAS> PEM> 	a new grace period.
> UAS> PEM> 
> UAS> PEM> o	CPU 1 continues in rcu_grace_period_complete(),
> UAS> PEM> incorrectly ending the new grace period.
> UAS> PEM> 
> UAS> PEM> Or am I missing something here?
> UAS> 
> UAS> The scenario you describe seems possible. However, it should be easily
> UAS> fixed by passing the perceived batch number as another parameter to
> UAS> rcu_set_state() and making it part of the cmpxchg. So if the caller
> UAS> tries to set state bits on a stale batch number (e.g., batch !=
> UAS> rcu_batch), it can be detected.
> UAS> 
> UAS> There is a similar, although harmless, issue in call_rcu(): Two CPUs can
> UAS> concurrently add callbacks to their respective nxt list and compute the
> UAS> same value for nxtbatch. One CPU succeeds in setting the PENDING bit
> UAS> while observing COMPLETE to be clear, so it starts a new batch.
> 
> Correction: while observing COMPLETE to be set!
> 
> UAS> Afterwards, the other CPU also sets the PENDING bit, but this time for
> UAS> the next batch. So it ends up requesting nxtbatch+1, although there is
> UAS> no need to. This also would be fixed by making the batch number part of
> UAS> the cmpxchg.

Another approach is to map the underlying algorithm onto the TREE_RCU
data structures.  And make preempt_disable(), local_irq_save(), and
friends invoke rcu_read_lock() -- irq and nmi handlers already have
the dyntick calls into RCU, so should be easy to handle as well.
Famous last words.  ;-)

						Thanx, Paul

> Cheers,
> 
> 	- Udo



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: dyntick-hpc and RCU
  2010-11-05 15:04   ` Paul E. McKenney
@ 2010-11-08 14:10     ` Frederic Weisbecker
  0 siblings, 0 replies; 63+ messages in thread
From: Frederic Weisbecker @ 2010-11-08 14:10 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	peterz, linux-kernel, josh

On Fri, Nov 05, 2010 at 08:04:36AM -0700, Paul E. McKenney wrote:
> On Fri, Nov 05, 2010 at 06:27:46AM +0100, Frederic Weisbecker wrote:
> > Yet another solution is to require users of bh and sched rcu flavours to
> > call a specific rcu_read_lock_sched()/bh, or something similar, that would
> > be only implemented in this new rcu config. We would only need to touch the
> > existing users and the future ones instead of adding an explicit call
> > to every implicit paths.
> 
> This approach would be a much nicer solution, and I do wish I had required
> this to start with.  Unfortunately, at that time, there was no preemptible
> RCU, CONFIG_PREEMPT, nor any RCU-bh, so there was no way to enforce this.
> Besides which, I was thinking in terms of maybe 100 occurrences of the RCU
> API in the kernel.  ;-)



Ok, I'll continue the discussion about this specific point in the
non-timer based rcu patch thread.



 
> > > 4.	Substitute an RCU implementation based on one of the
> > > 	user-level RCU implementations.  This has roughly the same
> > > 	advantages and disadvantages as does #3 above.
> > > 
> > > 5.	Don't tell RCU about dyntick-hpc mode, but instead make RCU
> > > 	push processing through via some processor that is kept out
> > > 	of dyntick-hpc mode.
> > 
> > I don't understand what you mean.
> > Do you mean that dyntick-hpc cpu would enqueue rcu callbacks to
> > another CPU? But how does that protect rcu critical sections
> > in our dyntick-hpc CPU?
> 
> There is a large range of possible solutions, but any solution will need
> to check for RCU read-side critical sections on the dyntick-hpc CPU.  I
> was thinking in terms of IPIing the dyntick-hpc CPUs, but very infrequently,
> say once per second.



Everytime we want to notify a quiescent state, right?
But I fear that forcing an IPI, even only once per second, breaks our
initial requirement.



> > >       This requires that the rcutree RCU
> > > 	priority boosting be pushed further along so that RCU grace period
> > > 	and callback processing is done in kthread context, permitting
> > > 	remote forcing of grace periods.
> > 
> > 
> > 
> > I should have a look at the rcu priority boosting to understand what you
> > mean here.
> 
> The only thing that you really need to know about it is that I will be
> moving the current softirq processing to kthread context.  The key point
> here is that we can wake up a kthread on some other CPU.


Ok.



> > >       The RCU_JIFFIES_TILL_FORCE_QS
> > > 	macro is promoted to a config variable, retaining its value
> > > 	of 3 in absence of dyntick-hpc, but getting value of HZ
> > > 	(or thereabouts) for dyntick-hpc builds.  In dyntick-hpc
> > > 	builds, force_quiescent_state() would push grace periods
> > > 	for CPUs lacking a scheduling-clock interrupt.
> > > 
> > > 	+	Relatively small changes to RCU, some of which is
> > > 		coming with RCU priority boosting anyway.
> > > 
> > > 	+	No need to inform RCU of user/kernel transitions.
> > > 
> > > 	+	No need to turn scheduling-clock interrupts on
> > > 		at each user/kernel transition.
> > > 
> > > 	-	Some IPIs to dyntick-hpc CPUs remain, but these
> > > 		are down in the every-second-or-so frequency,
> > > 		so hopefully are not a real problem.
> > 
> > 
> > Hmm, I hope we could avoid that, ideally the task in userspace shouldn't be
> > interrupted at all.
> 
> Yep.  But if we do need to interrupt it, let's do it as infrequently as
> we can!



If we have no other solution, yeah, but I'm not sure that's the right way
to go.



> > I wonder if we shouldn't go back to #3 eventually.
> 
> And there are variants of #3 that permit preemption of RCU read-side
> critical sections.


Ok.



> > At that time yeah.
> > 
> > But now I don't know, I really need to dig deeper into it and really
> > understand how #5 works before picking that orientation :)
> 
> This is probably true for all of us for all of the options.  ;-)


Hehe ;-)


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-06 19:28   ` Paul E. McKenney
  2010-11-06 19:34     ` Mathieu Desnoyers
  2010-11-08  2:11     ` Udo A. Steinberg
@ 2010-11-08 15:06     ` Frederic Weisbecker
  2010-11-08 15:18       ` Joe Korty
  2010-11-08 19:49       ` Paul E. McKenney
  2 siblings, 2 replies; 63+ messages in thread
From: Frederic Weisbecker @ 2010-11-08 15:06 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joe Korty, mathieu.desnoyers, dhowells, loic.minier,
	dhaval.giani, tglx, peterz, linux-kernel, josh

On Sat, Nov 06, 2010 at 12:28:12PM -0700, Paul E. McKenney wrote:
> On Fri, Nov 05, 2010 at 05:00:59PM -0400, Joe Korty wrote:
> > +/**
> > + * synchronize_sched - block until all CPUs have exited any non-preemptive
> > + * kernel code sequences.
> > + *
> > + * This means that all preempt_disable code sequences, including NMI and
> > + * hardware-interrupt handlers, in progress on entry will have completed
> > + * before this primitive returns.  However, this does not guarantee that
> > + * softirq handlers will have completed, since in some kernels
> 
> OK, so your approach treats preempt_disable code sequences as RCU
> read-side critical sections by relying on the fact that the per-CPU
> ->krcud task cannot run until such code sequences complete, correct?
> 
> This seems to require that each CPU's ->krcud task be awakened at
> least once per grace period, but I might well be missing something.



I understood it differently, but I might also be wrong as well. krcud
executes the callbacks, but it is only woken up for CPUs that want to
execute callbacks, not for those that only signal a quiescent state,
which is only determined in two ways through rcu_poll_other_cpus():

- if the CPU is in an rcu_read_lock() critical section, it has the
  IN_RCU_READ_LOCK flag. If so then we set up its DO_RCU_COMPLETION flag so
  that it signals its quiescent state on rcu_read_unlock().

- otherwise it's in a quiescent state.
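
Roughly, as I read it (only a sketch of the flow, not the patch's actual
body):

	static void rcu_poll_other_cpus(void)
	{
		int cpu;

		for_each_online_cpu(cpu) {
			struct rcu_data *r = &per_cpu(rcu_data, cpu);
			int old = r->flags;

			if (!(old & IN_RCU_READ_LOCK)) {
				/* Not in a read-side section: quiescent now. */
				rcu_quiescent(cpu);
				continue;
			}
			/*
			 * Ask the reader to report when it leaves its critical
			 * section; if it already raced out, report on its behalf.
			 */
			if (cmpxchg(&r->flags, old, old | DO_RCU_COMPLETION) != old)
				rcu_quiescent(cpu);
		}
	}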


This works for rcu and rcu bh critical sections.
But this works in rcu sched critical sections only if rcu_read_lock_sched() has
been called explicitly, otherwise that doesn't work (in preempt_disable(),
local_irq_save(), etc...). I think this is what is not complete when
Joe said it's not yet a complete rcu implementation.

This is also the part that scaries me most :)


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-08 15:06     ` Frederic Weisbecker
@ 2010-11-08 15:18       ` Joe Korty
  2010-11-08 19:50         ` Paul E. McKenney
  2010-11-08 19:49       ` Paul E. McKenney
  1 sibling, 1 reply; 63+ messages in thread
From: Joe Korty @ 2010-11-08 15:18 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, mathieu.desnoyers, dhowells, loic.minier,
	dhaval.giani, tglx, peterz, linux-kernel, josh, Jim Houston

On Mon, Nov 08, 2010 at 10:06:47AM -0500, Frederic Weisbecker wrote:
> On Sat, Nov 06, 2010 at 12:28:12PM -0700, Paul E. McKenney wrote:
>> OK, so your approach treats preempt_disable code sequences as RCU
>> read-side critical sections by relying on the fact that the per-CPU
>> ->krcud task cannot run until such code sequences complete, correct?
>> 
>> This seems to require that each CPU's ->krcud task be awakened at
>> least once per grace period, but I might well be missing something.
> 
> I understood it differently, but I might also be wrong as well. krcud
> executes the callbacks, but it is only woken up for CPUs that want to
> execute callbacks, not for those that only signal a quiescent state,
> which is only determined in two ways through rcu_poll_other_cpus():
> 
> - if the CPU is in an rcu_read_lock() critical section, it has the
>   IN_RCU_READ_LOCK flag. If so then we set up its DO_RCU_COMPLETION flag so
>   that it signals its quiescent state on rcu_read_unlock().
> 
> - otherwise it's in a quiescent state.
> 
> This works for rcu and rcu bh critical sections.
> But this works in rcu sched critical sections only if rcu_read_lock_sched() has
> been called explicitly, otherwise that doesn't work (in preempt_disable(),
> local_irq_save(), etc...). I think this is what is not complete when
> Joe said it's not yet a complete rcu implementation.
> 
> This is also the part that scaries me most :)

Mostly, I meant that the new RCU API interfaces that have come into
existence since 2004 were only hastily wrapped or NOPed by me to get
things going.

Jim's method only works with explicit rcu_read_lock..unlock sequences,
implicit sequences via preempt_disable..enable and the like are not
handled.  I had thought all such sequences were converted to rcu_read_lock
but maybe that is not yet correct.

Jim will have to comment on the full history.  He is incommunicado
at the moment; hopefully he will be able to participate sometime in
the next few days.

Regards,
Joe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-08  2:54         ` Paul E. McKenney
@ 2010-11-08 15:32           ` Frederic Weisbecker
  2010-11-08 19:38             ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread
From: Frederic Weisbecker @ 2010-11-08 15:32 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Udo A. Steinberg, Joe Korty, mathieu.desnoyers, dhowells,
	loic.minier, dhaval.giani, tglx, peterz, linux-kernel, josh

On Sun, Nov 07, 2010 at 06:54:00PM -0800, Paul E. McKenney wrote:
> On Mon, Nov 08, 2010 at 03:19:36AM +0100, Udo A. Steinberg wrote:
> > On Mon, 8 Nov 2010 03:11:36 +0100 Udo A. Steinberg (UAS) wrote:
> > 
> > UAS> On Sat, 6 Nov 2010 12:28:12 -0700 Paul E. McKenney (PEM) wrote:
> > UAS> 
> > UAS> PEM> > + * rcu_quiescent() is called from rcu_read_unlock() when a
> > UAS> PEM> > + * RCU batch was started while the rcu_read_lock/rcu_read_unlock
> > UAS> PEM> > + * critical section was executing.
> > UAS> PEM> > + */
> > UAS> PEM> > +
> > UAS> PEM> > +void rcu_quiescent(int cpu)
> > UAS> PEM> > +{
> > UAS> PEM> 
> > UAS> PEM> What prevents two different CPUs from calling this concurrently?
> > UAS> PEM> Ah, apparently nothing -- the idea being that
> > UAS> PEM> rcu_grace_period_complete() sorts it out.  Though if the second
> > UAS> PEM> CPU was delayed, it seems like it might incorrectly end a
> > UAS> PEM> subsequent grace period as follows:
> > UAS> PEM> 
> > UAS> PEM> o	CPU 0 clears the second-to-last bit.
> > UAS> PEM> 
> > UAS> PEM> o	CPU 1 clears the last bit.
> > UAS> PEM> 
> > UAS> PEM> o	CPU 1 sees that the mask is empty, so invokes
> > UAS> PEM> 	rcu_grace_period_complete(), but is delayed in the function
> > UAS> PEM> 	preamble.
> > UAS> PEM> 
> > UAS> PEM> o	CPU 0 sees that the mask is empty, so invokes
> > UAS> PEM> 	rcu_grace_period_complete(), ending the grace period.
> > UAS> PEM> 	Because the RCU_NEXT_PENDING is set, it also starts
> > UAS> PEM> 	a new grace period.
> > UAS> PEM> 
> > UAS> PEM> o	CPU 1 continues in rcu_grace_period_complete(),
> > UAS> PEM> incorrectly ending the new grace period.
> > UAS> PEM> 
> > UAS> PEM> Or am I missing something here?
> > UAS> 
> > UAS> The scenario you describe seems possible. However, it should be easily
> > UAS> fixed by passing the perceived batch number as another parameter to
> > UAS> rcu_set_state() and making it part of the cmpxchg. So if the caller
> > UAS> tries to set state bits on a stale batch number (e.g., batch !=
> > UAS> rcu_batch), it can be detected.
> > UAS> 
> > UAS> There is a similar, although harmless, issue in call_rcu(): Two CPUs can
> > UAS> concurrently add callbacks to their respective nxt list and compute the
> > UAS> same value for nxtbatch. One CPU succeeds in setting the PENDING bit
> > UAS> while observing COMPLETE to be clear, so it starts a new batch.
> > 
> > Correction: while observing COMPLETE to be set!
> > 
> > UAS> Afterwards, the other CPU also sets the PENDING bit, but this time for
> > UAS> the next batch. So it ends up requesting nxtbatch+1, although there is
> > UAS> no need to. This also would be fixed by making the batch number part of
> > UAS> the cmpxchg.
> 
> Another approach is to map the underlying algorithm onto the TREE_RCU
> data structures.  And make preempt_disable(), local_irq_save(), and
> friends invoke rcu_read_lock() -- irq and nmi handlers already have
> the dyntick calls into RCU, so should be easy to handle as well.
> Famous last words.  ;-)


So, this looks very scary for performance to add rcu_read_lock() in
preempt_disable() and local_irq_save(), and that notwithstanding, it won't
handle the "raw" rcu sched implicit path. We should check all rcu_dereference_sched
users to ensure they are not in such a raw path.


There is also my idea from the other discussion: change rcu_read_lock_sched()
semantics and map it to rcu_read_lock() in this rcu config (would be a nop
in other configs). So every users of rcu_dereference_sched() would now need
to protect their critical section with this.
Would it be too late to change this semantic?

What is scary with this is that it also changes rcu sched semantics, and users
of call_rcu_sched() and synchronize_sched(), who rely on that to do more
tricky things than just waiting for rcu_derefence_sched() pointer grace periods,
like really wanting for preempt_disable and local_irq_save/disable, those
users will be screwed... :-(  ...unless we also add relevant rcu_read_lock_sched()
for them...


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-08 15:32           ` Frederic Weisbecker
@ 2010-11-08 19:38             ` Paul E. McKenney
  2010-11-08 20:40               ` Frederic Weisbecker
  0 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2010-11-08 19:38 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Udo A. Steinberg, Joe Korty, mathieu.desnoyers, dhowells,
	loic.minier, dhaval.giani, tglx, peterz, linux-kernel, josh

On Mon, Nov 08, 2010 at 04:32:17PM +0100, Frederic Weisbecker wrote:
> On Sun, Nov 07, 2010 at 06:54:00PM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 08, 2010 at 03:19:36AM +0100, Udo A. Steinberg wrote:
> > > On Mon, 8 Nov 2010 03:11:36 +0100 Udo A. Steinberg (UAS) wrote:
> > > 
> > > UAS> On Sat, 6 Nov 2010 12:28:12 -0700 Paul E. McKenney (PEM) wrote:
> > > UAS> 
> > > UAS> PEM> > + * rcu_quiescent() is called from rcu_read_unlock() when a
> > > UAS> PEM> > + * RCU batch was started while the rcu_read_lock/rcu_read_unlock
> > > UAS> PEM> > + * critical section was executing.
> > > UAS> PEM> > + */
> > > UAS> PEM> > +
> > > UAS> PEM> > +void rcu_quiescent(int cpu)
> > > UAS> PEM> > +{
> > > UAS> PEM> 
> > > UAS> PEM> What prevents two different CPUs from calling this concurrently?
> > > UAS> PEM> Ah, apparently nothing -- the idea being that
> > > UAS> PEM> rcu_grace_period_complete() sorts it out.  Though if the second
> > > UAS> PEM> CPU was delayed, it seems like it might incorrectly end a
> > > UAS> PEM> subsequent grace period as follows:
> > > UAS> PEM> 
> > > UAS> PEM> o	CPU 0 clears the second-to-last bit.
> > > UAS> PEM> 
> > > UAS> PEM> o	CPU 1 clears the last bit.
> > > UAS> PEM> 
> > > UAS> PEM> o	CPU 1 sees that the mask is empty, so invokes
> > > UAS> PEM> 	rcu_grace_period_complete(), but is delayed in the function
> > > UAS> PEM> 	preamble.
> > > UAS> PEM> 
> > > UAS> PEM> o	CPU 0 sees that the mask is empty, so invokes
> > > UAS> PEM> 	rcu_grace_period_complete(), ending the grace period.
> > > UAS> PEM> 	Because the RCU_NEXT_PENDING is set, it also starts
> > > UAS> PEM> 	a new grace period.
> > > UAS> PEM> 
> > > UAS> PEM> o	CPU 1 continues in rcu_grace_period_complete(),
> > > UAS> PEM> incorrectly ending the new grace period.
> > > UAS> PEM> 
> > > UAS> PEM> Or am I missing something here?
> > > UAS> 
> > > UAS> The scenario you describe seems possible. However, it should be easily
> > > UAS> fixed by passing the perceived batch number as another parameter to
> > > UAS> rcu_set_state() and making it part of the cmpxchg. So if the caller
> > > UAS> tries to set state bits on a stale batch number (e.g., batch !=
> > > UAS> rcu_batch), it can be detected.
> > > UAS> 
> > > UAS> There is a similar, although harmless, issue in call_rcu(): Two CPUs can
> > > UAS> concurrently add callbacks to their respective nxt list and compute the
> > > UAS> same value for nxtbatch. One CPU succeeds in setting the PENDING bit
> > > UAS> while observing COMPLETE to be clear, so it starts a new batch.
> > > 
> > > Correction: while observing COMPLETE to be set!
> > > 
> > > UAS> Afterwards, the other CPU also sets the PENDING bit, but this time for
> > > UAS> the next batch. So it ends up requesting nxtbatch+1, although there is
> > > UAS> no need to. This also would be fixed by making the batch number part of
> > > UAS> the cmpxchg.
> > 
> > Another approach is to map the underlying algorithm onto the TREE_RCU
> > data structures.  And make preempt_disable(), local_irq_save(), and
> > friends invoke rcu_read_lock() -- irq and nmi handlers already have
> > the dyntick calls into RCU, so should be easy to handle as well.
> > Famous last words.  ;-)
> 
> 
> > So, this looks very scary for performance to add rcu_read_lock() in
> > preempt_disable() and local_irq_save(), and that notwithstanding, it won't
> > handle the "raw" rcu sched implicit path.

Ah -- I would arrange for the rcu_read_lock() to be added only in the
dyntick-hpc case.  So no effect on normal builds, overhead is added only
in the dyntick-hpc case.
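
For example (sketch; the hook names are made up, and CONFIG_SHIELDING_RCU
from Joe's posting stands in for whatever the dyntick-hpc config ends up
being called):

	#ifdef CONFIG_SHIELDING_RCU
	# define rcu_note_implicit_read_lock()		rcu_read_lock()
	# define rcu_note_implicit_read_unlock()	rcu_read_unlock()
	#else
	# define rcu_note_implicit_read_lock()		do { } while (0)
	# define rcu_note_implicit_read_unlock()	do { } while (0)
	#endif

with preempt_disable()/local_irq_save() invoking the former and the
matching enable/restore paths invoking the latter.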

>                                           We should check all rcu_dereference_sched
> users to ensure they are not in such a raw path.

Indeed!  ;-)

> There is also my idea from the other discussion: change rcu_read_lock_sched()
> semantics and map it to rcu_read_lock() in this rcu config (would be a nop
> in other configs). So every users of rcu_dereference_sched() would now need
> to protect their critical section with this.
> Would it be too late to change this semantic?

I was expecting that we would fold RCU, RCU bh, and RCU sched into
the same set of primitives (as Jim Houston did), but again only in the
dyntick-hpc case.  However, rcu_read_lock_bh() would still disable BH,
and similarly, rcu_read_lock_sched() would still disable preemption.

> What is scary with this is that it also changes rcu sched semantics, and users
> of call_rcu_sched() and synchronize_sched(), who rely on that to do more
> tricky things than just waiting for rcu_derefence_sched() pointer grace periods,
> like really wanting for preempt_disable and local_irq_save/disable, those
> users will be screwed... :-(  ...unless we also add relevant rcu_read_lock_sched()
> for them...

So rcu_read_lock() would be the underlying primitive.  The implementation
of rcu_read_lock_sched() would disable preemption and then invoke
rcu_read_lock().  The implementation of rcu_read_lock_bh() would
disable BH and then invoke rcu_read_lock().  This would allow
synchronize_rcu_sched() and synchronize_rcu_bh() to simply invoke
synchronize_rcu().
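
In other words, something like this in the dyntick-hpc build (illustrative
definitions only, not a patch):

	static inline void rcu_read_lock_sched(void)
	{
		preempt_disable();
		rcu_read_lock();
	}

	static inline void rcu_read_unlock_sched(void)
	{
		rcu_read_unlock();
		preempt_enable();
	}

	static inline void rcu_read_lock_bh(void)
	{
		local_bh_disable();
		rcu_read_lock();
	}

	static inline void rcu_read_unlock_bh(void)
	{
		rcu_read_unlock();
		local_bh_enable();
	}

	void synchronize_rcu_bh(void)
	{
		synchronize_rcu();	/* one grace period covers all flavors */
	}

	void synchronize_sched(void)
	{
		synchronize_rcu();
	}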

Seem reasonable?

							Thanx, Paul

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-08 15:06     ` Frederic Weisbecker
  2010-11-08 15:18       ` Joe Korty
@ 2010-11-08 19:49       ` Paul E. McKenney
  2010-11-08 20:51         ` Frederic Weisbecker
  1 sibling, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2010-11-08 19:49 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Joe Korty, mathieu.desnoyers, dhowells, loic.minier,
	dhaval.giani, tglx, peterz, linux-kernel, josh

On Mon, Nov 08, 2010 at 04:06:47PM +0100, Frederic Weisbecker wrote:
> On Sat, Nov 06, 2010 at 12:28:12PM -0700, Paul E. McKenney wrote:
> > On Fri, Nov 05, 2010 at 05:00:59PM -0400, Joe Korty wrote:
> > > +/**
> > > + * synchronize_sched - block until all CPUs have exited any non-preemptive
> > > + * kernel code sequences.
> > > + *
> > > + * This means that all preempt_disable code sequences, including NMI and
> > > + * hardware-interrupt handlers, in progress on entry will have completed
> > > + * before this primitive returns.  However, this does not guarantee that
> > > + * softirq handlers will have completed, since in some kernels
> > 
> > OK, so your approach treats preempt_disable code sequences as RCU
> > read-side critical sections by relying on the fact that the per-CPU
> > ->krcud task cannot run until such code sequences complete, correct?
> > 
> > This seems to require that each CPU's ->krcud task be awakened at
> > least once per grace period, but I might well be missing something.
> 
> 
> 
> I understood it differently, but I might also be wrong as well. krcud
> executes the callbacks, but it is only woken up for CPUs that want to
> execute callbacks, not for those that only signal a quiescent state,
> which is only determined in two ways through rcu_poll_other_cpus():
> 
> - if the CPU is in an rcu_read_lock() critical section, it has the
>   IN_RCU_READ_LOCK flag. If so then we set up its DO_RCU_COMPLETION flag so
>   that it signals its quiescent state on rcu_read_unlock().
> 
> - otherwise it's in a quiescent state.
> 
> 
> This works for rcu and rcu bh critical sections.

Unfortunately, local_irq_save() is allowed to stand in for
rcu_read_lock_bh().  :-/

> But this works in rcu sched critical sections only if rcu_read_lock_sched() has
> been called explicitly, otherwise that doesn't work (in preempt_disable(),
> local_irq_save(), etc...). I think this is what is not complete when
> Joe said it's not yet a complete rcu implementation.
> 
> This is also the part that scaries me most :)

And if we can make all the preempt_disable(), local_irq_disable(), ...
invoke rcu_read_lock(), then we have some chance of being able to dispense
with the IPIs to CPUs not having callbacks that need to be executed.

							Thanx, Paul

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-08 15:18       ` Joe Korty
@ 2010-11-08 19:50         ` Paul E. McKenney
  0 siblings, 0 replies; 63+ messages in thread
From: Paul E. McKenney @ 2010-11-08 19:50 UTC (permalink / raw)
  To: Joe Korty
  Cc: Frederic Weisbecker, mathieu.desnoyers, dhowells, loic.minier,
	dhaval.giani, tglx, peterz, linux-kernel, josh, Jim Houston

On Mon, Nov 08, 2010 at 10:18:29AM -0500, Joe Korty wrote:
> On Mon, Nov 08, 2010 at 10:06:47AM -0500, Frederic Weisbecker wrote:
> > On Sat, Nov 06, 2010 at 12:28:12PM -0700, Paul E. McKenney wrote:
> >> OK, so your approach treats preempt_disable code sequences as RCU
> >> read-side critical sections by relying on the fact that the per-CPU
> >> ->krcud task cannot run until such code sequences complete, correct?
> >> 
> >> This seems to require that each CPU's ->krcud task be awakened at
> >> least once per grace period, but I might well be missing something.
> > 
> > I understood it differently, but I might also be wrong as well. krcud
> > executes the callbacks, but it is only woken up for CPUs that want to
> > execute callbacks, not for those that only signal a quiescent state,
> > which is only determined in two ways through rcu_poll_other_cpus():
> > 
> > - if the CPU is in an rcu_read_lock() critical section, it has the
> >   IN_RCU_READ_LOCK flag. If so then we set up its DO_RCU_COMPLETION flag so
> >   that it signals its quiescent state on rcu_read_unlock().
> > 
> > - otherwise it's in a quiescent state.
> > 
> > This works for rcu and rcu bh critical sections.
> > But this works in rcu sched critical sections only if rcu_read_lock_sched() has
> > been called explicitly, otherwise that doesn't work (in preempt_disable(),
> > local_irq_save(), etc...). I think this is what is not complete when
> > Joe said it's not yet a complete rcu implementation.
> > 
> > This is also the part that scaries me most :)
> 
> Mostly, I meant that the new RCU API interfaces that have come into
> existence since 2004 were only hastily wrapped or NOPed by me to get
> things going.

Ah, understood.

> Jim's method only works with explicit rcu_read_lock..unlock sequences,
> implicit sequences via preempt_disable..enable and the like are not
> handled.  I had thought all such sequences were converted to rcu_read_lock
> but maybe that is not yet correct.

Not yet, unfortunately.  Having them all marked, for lockdep if nothing
else, could be a big benefit.

> Jim will have to comment on the full history.  He is incommunicado
> at the moment; hopefully he will be able to participate sometime in
> the next few days.

Sounds good!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-08 19:38             ` Paul E. McKenney
@ 2010-11-08 20:40               ` Frederic Weisbecker
  2010-11-10 18:08                 ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread
From: Frederic Weisbecker @ 2010-11-08 20:40 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Udo A. Steinberg, Joe Korty, mathieu.desnoyers, dhowells,
	loic.minier, dhaval.giani, tglx, peterz, linux-kernel, josh

On Mon, Nov 08, 2010 at 11:38:32AM -0800, Paul E. McKenney wrote:
> On Mon, Nov 08, 2010 at 04:32:17PM +0100, Frederic Weisbecker wrote:
> > So, this looks very scary for performance to add rcu_read_lock() in
> > preempt_disable() and local_irq_save(), and that notwithstanding, it won't
> > handle the "raw" rcu sched implicit path.
> 
> Ah -- I would arrange for the rcu_read_lock() to be added only in the
> dyntick-hpc case.  So no effect on normal builds, overhead is added only
> in the dyntick-hpc case.



Yeah sure, but I wonder if the resulting rcu config will have a
large performance impact because of that.

In fact, my worry is: if the last resort to have a sane non-timer based
rcu is to bloat fast path functions like preempt_disable() or local_irq...
(that notwithstanding we have a bloated rcu_read_unlock() on this rcu config
because of its main nature), wouldn't it be better to eventually pick the
syscall/exception tweaked fast path version?

Perhaps I'll need to measure the impact of both, but I suspect I'll get
conflicting results depending on the workload.


 
> > There is also my idea from the other discussion: change rcu_read_lock_sched()
> > semantics and map it to rcu_read_lock() in this rcu config (would be a nop
> > in other configs). So every users of rcu_dereference_sched() would now need
> > to protect their critical section with this.
> > Would it be too late to change this semantic?
> 
> I was expecting that we would fold RCU, RCU bh, and RCU sched into
> the same set of primitives (as Jim Houston did), but again only in the
> dyntick-hpc case.


Yeah, the resulting change must be NULL in others rcu configs.



> However, rcu_read_lock_bh() would still disable BH,
> and similarly, rcu_read_lock_sched() would still disable preemption.



Probably yeah, otherwise there would be a kind of semantic split from
the usual rcu_read_lock() and everybody would be confused.

Perhaps we need a different API for the underlying rcu_read_lock()
call in the other flavours when preempt is already disabled or
bh is already disabled:

	rcu_enter_read_lock_sched();
	__rcu_read_lock_sched();
	rcu_start_read_lock_sched();

	(same for bh)

Hmm...
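
I mean something like this (sketch; the caller is assumed to have disabled
preemption already, so only the rcu_read_lock() part remains):

	static inline void __rcu_read_lock_sched(void)
	{
		/* preemption already disabled by the caller */
		rcu_read_lock();
	}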


> > What is scary with this is that it also changes rcu sched semantics, and users
> > of call_rcu_sched() and synchronize_sched(), who rely on that to do more
> > tricky things than just waiting for rcu_derefence_sched() pointer grace periods,
> > like really wanting for preempt_disable and local_irq_save/disable, those
> > users will be screwed... :-(  ...unless we also add relevant rcu_read_lock_sched()
> > for them...
> 
> So rcu_read_lock() would be the underlying primitive.  The implementation
> of rcu_read_lock_sched() would disable preemption and then invoke
> rcu_read_lock().  The implementation of rcu_read_lock_bh() would
> disable BH and then invoke rcu_read_lock().  This would allow
> synchronize_rcu_sched() and synchronize_rcu_bh() to simply invoke
> synchronize_rcu().
> 
> Seem reasonable?


Perfect. That could be further optimized with what I said above but
other than that, that's what I was thinking about.


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-08 19:49       ` Paul E. McKenney
@ 2010-11-08 20:51         ` Frederic Weisbecker
  0 siblings, 0 replies; 63+ messages in thread
From: Frederic Weisbecker @ 2010-11-08 20:51 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joe Korty, mathieu.desnoyers, dhowells, loic.minier,
	dhaval.giani, tglx, peterz, linux-kernel, josh

On Mon, Nov 08, 2010 at 11:49:04AM -0800, Paul E. McKenney wrote:
> On Mon, Nov 08, 2010 at 04:06:47PM +0100, Frederic Weisbecker wrote:
> > On Sat, Nov 06, 2010 at 12:28:12PM -0700, Paul E. McKenney wrote:
> > > On Fri, Nov 05, 2010 at 05:00:59PM -0400, Joe Korty wrote:
> > > > +/**
> > > > + * synchronize_sched - block until all CPUs have exited any non-preemptive
> > > > + * kernel code sequences.
> > > > + *
> > > > + * This means that all preempt_disable code sequences, including NMI and
> > > > + * hardware-interrupt handlers, in progress on entry will have completed
> > > > + * before this primitive returns.  However, this does not guarantee that
> > > > + * softirq handlers will have completed, since in some kernels
> > > 
> > > OK, so your approach treats preempt_disable code sequences as RCU
> > > read-side critical sections by relying on the fact that the per-CPU
> > > ->krcud task cannot run until such code sequences complete, correct?
> > > 
> > > This seems to require that each CPU's ->krcud task be awakened at
> > > least once per grace period, but I might well be missing something.
> > 
> > 
> > 
> > I understood it differently, but I might also be wrong as well. krcud
> > executes the callbacks, but it is only woken up for CPUs that want to
> > execute callbacks, not for those that only signal a quiescent state,
> > which is only determined in two ways through rcu_poll_other_cpus():
> > 
> > - if the CPU is in an rcu_read_lock() critical section, it has the
> >   IN_RCU_READ_LOCK flag. If so then we set up its DO_RCU_COMPLETION flag so
> >   that it signals its quiescent state on rcu_read_unlock().
> > 
> > - otherwise it's in a quiescent state.
> > 
> > 
> > This works for rcu and rcu bh critical sections.
> 
> Unfortunately, local_irq_save() is allowed to stand in for
> rcu_read_lock_bh().  :-/


Ah...right I missed that.



> 
> > But this works for rcu sched critical sections only if rcu_read_lock_sched() has
> > been called explicitly; otherwise it doesn't work (for preempt_disable(),
> > local_irq_save(), etc...). I think this is what Joe meant when he said
> > it's not yet a complete rcu implementation.
> > 
> > This is also the part that scares me most :)
> 
> And if we can make all the preempt_disable(), local_irq_disable(), ...
> invoke rcu_read_lock(), then we have some chance of being able to dispense
> with the IPIs to CPUs not having callbacks that need to be executed.


Right.
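
To put the polling I described above into rough code (a simplification of
my reading of Joe's patch, not the patch code itself):

	static void rcu_poll_other_cpus(void)
	{
		int cpu;

		for_each_online_cpu(cpu) {
			struct rcu_data *r = &per_cpu(rcu_data, cpu);
			int old = ACCESS_ONCE(r->flags);

			if (old & IN_RCU_READ_LOCK) {
				/* A reader is in progress: ask it to signal
				 * its quiescent state from rcu_read_unlock(). */
				cmpxchg(&r->flags, old, old | DO_RCU_COMPLETION);
			}
			/* else: no reader in progress, so this CPU already
			 * counts as quiescent for the current batch. */
		}
	}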


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-05 21:00 ` [PATCH] a local-timer-free version of RCU Joe Korty
  2010-11-06 19:28   ` Paul E. McKenney
  2010-11-06 20:03   ` Mathieu Desnoyers
@ 2010-11-09  9:22   ` Lai Jiangshan
  2010-11-10 15:54     ` Frederic Weisbecker
  2 siblings, 1 reply; 63+ messages in thread
From: Lai Jiangshan @ 2010-11-09  9:22 UTC (permalink / raw)
  To: Joe Korty
  Cc: Paul E. McKenney, fweisbec, mathieu.desnoyers, dhowells,
	loic.minier, dhaval.giani, tglx, peterz, linux-kernel, josh,
	houston.jim, Lai Jiangshan



On Sat, Nov 6, 2010 at 5:00 AM, Joe Korty <joe.korty@ccur.com> wrote:
> +}
> +
> +/**
> + * rcu_read_lock - mark the beginning of an RCU read-side critical section.
> + *
> + * When synchronize_rcu() is invoked on one CPU while other CPUs
> + * are within RCU read-side critical sections, then the
> + * synchronize_rcu() is guaranteed to block until after all the other
> + * CPUs exit their critical sections.  Similarly, if call_rcu() is invoked
> + * on one CPU while other CPUs are within RCU read-side critical
> + * sections, invocation of the corresponding RCU callback is deferred
> + * until after the all the other CPUs exit their critical sections.
> + *
> + * Note, however, that RCU callbacks are permitted to run concurrently
> + * with RCU read-side critical sections.  One way that this can happen
> + * is via the following sequence of events: (1) CPU 0 enters an RCU
> + * read-side critical section, (2) CPU 1 invokes call_rcu() to register
> + * an RCU callback, (3) CPU 0 exits the RCU read-side critical section,
> + * (4) CPU 2 enters a RCU read-side critical section, (5) the RCU
> + * callback is invoked.  This is legal, because the RCU read-side critical
> + * section that was running concurrently with the call_rcu() (and which
> + * therefore might be referencing something that the corresponding RCU
> + * callback would free up) has completed before the corresponding
> + * RCU callback is invoked.
> + *
> + * RCU read-side critical sections may be nested.  Any deferred actions
> + * will be deferred until the outermost RCU read-side critical section
> + * completes.
> + *
> + * It is illegal to block while in an RCU read-side critical section.
> + */
> +void __rcu_read_lock(void)
> +{
> +       struct rcu_data *r;
> +
> +       r = &per_cpu(rcu_data, smp_processor_id());
> +       if (r->nest_count++ == 0)
> +               /*
> +                * Set the flags value to show that we are in
> +                * a read side critical section.  The code starting
> +                * a batch uses this to determine if a processor
> +                * needs to participate in the batch.  Including
> +                * a sequence allows the remote processor to tell
> +                * that a critical section has completed and another
> +                * has begun.
> +                */

memory barrier is needed as Paul noted.

> +               r->flags = IN_RCU_READ_LOCK | (r->sequence++ << 2);
> +}
> +EXPORT_SYMBOL(__rcu_read_lock);
> +
> +/**
> + * rcu_read_unlock - marks the end of an RCU read-side critical section.
> + * Check if a RCU batch was started while we were in the critical
> + * section.  If so, call rcu_quiescent() join the rendezvous.
> + *
> + * See rcu_read_lock() for more information.
> + */
> +void __rcu_read_unlock(void)
> +{
> +       struct rcu_data *r;
> +       int     cpu, flags;
> +
> +       cpu = smp_processor_id();
> +       r = &per_cpu(rcu_data, cpu);
> +       if (--r->nest_count == 0) {
> +               flags = xchg(&r->flags, 0);
> +               if (flags & DO_RCU_COMPLETION)
> +                       rcu_quiescent(cpu);
> +       }
> +}
> +EXPORT_SYMBOL(__rcu_read_unlock);

It is hardly acceptable to have memory barriers or
atomic operations in the fast paths of rcu_read_lock() and
rcu_read_unlock().

We need something to drive the completion of the GP
(and the callback processing). There is no free lunch:
if the completion of the GP is driven by rcu_read_unlock(),
we very probably need memory barriers or atomic operations
in the fast paths of rcu_read_lock() and rcu_read_unlock().

We need to look for some periodic/continuous kernel events
to drive the GP; the schedule-tick and schedule() are the
most ideal event sources in the kernel, I think.

But the schedule-tick and schedule() do not happen under NO_HZ
and dyntick-hpc, so we need some approach to fix that. I vote
for Paul's #5 approach.

Also, I propose an immature idea here.

	Don't tell RCU about dyntick-hpc mode, but instead
	stop the RCU function of a CPU the first time RCU disturbs
	dyntick-hpc mode or NO_HZ mode.

	rcu_read_lock()
		if the RCU function of this CPU is not started, start it and
		start a RCU timer.
		handle rcu_read_lock()

	enter NO_HZ
		if interrupts have just been happening very frequently, do nothing; else:
		stop the rcu function and the rcu timer of the current CPU.

	exit interrupt:
		if this interrupt was caused only by the RCU timer && it just disturbs
		dyntick-hpc mode or NO_HZ mode (and will reenter these modes),
		stop the rcu function and stop the rcu timer of the current CPU.

	schedule-tick:
		requeue the rcu timer before it causes an unneeded interrupt.
		handle rcu things.

	+	No big changes to RCU; the same code handles
		dyntick-hpc mode and NO_HZ mode, reusing some code from rcu_offline_cpu().

	+	No need to inform RCU of user/kernel transitions.

	+	No need to turn scheduling-clock interrupts on
		at each user/kernel transition.

	-	Must carefully handle some critical regions that also imply
		rcu critical regions.

	-	Introduces some unneeded interrupts, but they are very infrequent.

Thanks,
Lai

> +
> +/**
> + * call_rcu - Queue an RCU callback for invocation after a grace period.
> + * @head: structure to be used for queueing the RCU updates.
> + * @func: actual update function to be invoked after the grace period
> + *
> + * The update function will be invoked some time after a full grace
> + * period elapses, in other words after all currently executing RCU
> + * read-side critical sections have completed.  RCU read-side critical
> + * sections are delimited by rcu_read_lock() and rcu_read_unlock(),
> + * and may be nested.
> + */
> +void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
> +{
> +       struct rcu_data *r;
> +       unsigned long flags;
> +       int cpu;
> +
> +       head->func = func;
> +       head->next = NULL;
> +       local_irq_save(flags);
> +       cpu = smp_processor_id();
> +       r = &per_cpu(rcu_data, cpu);
> +       /*
> +        * Avoid mixing new entries with batches which have already
> +        * completed or have a grace period in progress.
> +        */
> +       if (r->nxt.head && rcu_move_if_done(r))
> +               rcu_wake_daemon(r);
> +
> +       rcu_list_add(&r->nxt, head);
> +       if (r->nxtcount++ == 0) {

memory barrier is needed. (before read the rcu_batch)

> +               r->nxtbatch = (rcu_batch & RCU_BATCH_MASK) + RCU_INCREMENT;
> +               barrier();
> +               if (!rcu_timestamp)
> +                       rcu_timestamp = jiffies ?: 1;
> +       }
> +       /* If we reach the limit start a batch. */
> +       if (r->nxtcount > rcu_max_count) {
> +               if (rcu_set_state(RCU_NEXT_PENDING) == RCU_COMPLETE)
> +                       rcu_start_batch();
> +       }
> +       local_irq_restore(flags);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu);
> +
> +


> +/*
> + * Process the completed RCU callbacks.
> + */
> +static void rcu_process_callbacks(struct rcu_data *r)
> +{
> +       struct rcu_head *list, *next;
> +
> +       local_irq_disable();
> +       rcu_move_if_done(r);
> +       list = r->done.head;
> +       rcu_list_init(&r->done);
> +       local_irq_enable();
> +

memory barrier is needed. (after read the rcu_batch)

> +       while (list) {
> +               next = list->next;
> +               list->func(list);
> +               list = next;
> +       }
> +}

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-09  9:22   ` Lai Jiangshan
@ 2010-11-10 15:54     ` Frederic Weisbecker
  2010-11-10 17:31       ` Peter Zijlstra
  0 siblings, 1 reply; 63+ messages in thread
From: Frederic Weisbecker @ 2010-11-10 15:54 UTC (permalink / raw)
  To: Lai Jiangshan
  Cc: Joe Korty, Paul E. McKenney, mathieu.desnoyers, dhowells,
	loic.minier, dhaval.giani, tglx, peterz, linux-kernel, josh,
	houston.jim

On Tue, Nov 09, 2010 at 05:22:49PM +0800, Lai Jiangshan wrote:
> It is hardly acceptable to have memory barriers or
> atomic operations in the fast paths of rcu_read_lock() and
> rcu_read_unlock().
> 
> We need something to drive the completion of the GP
> (and the callback processing). There is no free lunch:
> if the completion of the GP is driven by rcu_read_unlock(),
> we very probably need memory barriers or atomic operations
> in the fast paths of rcu_read_lock() and rcu_read_unlock().
> 
> We need to look for some periodic/continuous kernel events
> to drive the GP; the schedule-tick and schedule() are the
> most ideal event sources in the kernel, I think.
> 
> But the schedule-tick and schedule() do not happen under NO_HZ
> and dyntick-hpc, so we need some approach to fix that. I vote
> for Paul's #5 approach.
> 
> Also, I propose an immature idea here.
> 
> 	Don't tell RCU about dyntick-hpc mode, but instead
> 	stop the RCU function of a CPU the first time RCU disturbs
> 	dyntick-hpc mode or NO_HZ mode.
> 
> 	rcu_read_lock()
> 		if the RCU function of this CPU is not started, start it and
> 		start a RCU timer.
> 		handle rcu_read_lock()
> 
> 	enter NO_HZ
> 		if interrupts have just been happening very frequently, do nothing; else:
> 		stop the rcu function and the rcu timer of the current CPU.
> 
> 	exit interrupt:
> 		if this interrupt was caused only by the RCU timer && it just disturbs
> 		dyntick-hpc mode or NO_HZ mode (and will reenter these modes),
> 		stop the rcu function and stop the rcu timer of the current CPU.
> 
> 	schedule-tick:
> 		requeue the rcu timer before it causes an unneeded interrupt.
> 		handle rcu things.
> 
> 	+	No big changes to RCU; the same code handles
> 		dyntick-hpc mode and NO_HZ mode, reusing some code from rcu_offline_cpu().
> 
> 	+	No need to inform RCU of user/kernel transitions.
> 
> 	+	No need to turn scheduling-clock interrupts on
> 		at each user/kernel transition.
> 
> 	-	Must carefully handle some critical regions that also imply
> 		rcu critical regions.
> 
> 	-	Introduces some unneeded interrupts, but they are very infrequent.



I like this idea.

I would just merge the concept of "rcu timer" into the sched tick,
as per Peter Zijlstra idea:

In this CPU isolation mode, we are going to do a kind of adaptive
sched tick already: run the sched tick and if there was nothing to do
for some time and we are in userspace, deactivate it. If suddenly
a new task arrives on the CPU (which means there is now more than
one task running and we want preemption tick back), or if we have
timers enqueued, or so, reactivate it.

So I would simply merge your rcu idea into this, + the reactivation
on rcu_read_lock().
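
Roughly, the reactivation part could look like this (the per-cpu
rcu_active flag is made up for illustration, and in reality the tick
restart would have to be deferred to a context where it is safe):

	void rcu_read_lock(void)
	{
		/* First RCU use since this CPU went quiet: pull the tick
		 * back so that grace periods can make progress again. */
		if (unlikely(!__this_cpu_read(rcu_active))) {
			__this_cpu_write(rcu_active, 1);
			tick_nohz_restart_sched_tick();
		}
		__rcu_read_lock();
	}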

Does anyone see a problem with this idea?


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-10 15:54     ` Frederic Weisbecker
@ 2010-11-10 17:31       ` Peter Zijlstra
  2010-11-10 17:45         ` Frederic Weisbecker
  2010-11-11  4:19         ` Paul E. McKenney
  0 siblings, 2 replies; 63+ messages in thread
From: Peter Zijlstra @ 2010-11-10 17:31 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Lai Jiangshan, Joe Korty, Paul E. McKenney, mathieu.desnoyers,
	dhowells, loic.minier, dhaval.giani, tglx, linux-kernel, josh,
	houston.jim

On Wed, 2010-11-10 at 16:54 +0100, Frederic Weisbecker wrote:
> run the sched tick and if there was nothing to do
> for some time and we are in userspace, deactivate it. 

Not for some time, immediately, have the tick track if it was useful, if
it was not, have it stop itself, like:

tick()
{
 int stop = 1;

 if (nr_running > 1)
  stop = 0;

 if(rcu_needs_cpu())
  stop = 0;

 ...


 if (stop)
  enter_nohz_mode();
}



^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-10 17:31       ` Peter Zijlstra
@ 2010-11-10 17:45         ` Frederic Weisbecker
  2010-11-11  4:19         ` Paul E. McKenney
  1 sibling, 0 replies; 63+ messages in thread
From: Frederic Weisbecker @ 2010-11-10 17:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Lai Jiangshan, Joe Korty, Paul E. McKenney, mathieu.desnoyers,
	dhowells, loic.minier, dhaval.giani, tglx, linux-kernel, josh,
	houston.jim

On Wed, Nov 10, 2010 at 06:31:11PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-10 at 16:54 +0100, Frederic Weisbecker wrote:
> > run the sched tick and if there was nothing to do
> > for some time and we are in userspace, deactivate it. 
> 
> Not for some time, immediately, have the tick track if it was useful, if
> it was not, have it stop itself, like:
> 
> tick()
> {
>  int stop = 1;
> 
>  if (nr_running > 1)
>   stop = 0;
> 
>  if(rcu_needs_cpu())
>   stop = 0;
> 
>  ...
> 
> 
>  if (stop)
>   enter_nohz_mode();



If we are in userspace then yeah, let's enter nohz immediately.
But if we are in kernelspace, we should probably wait a bit before
switching to nohz, like one HZ, as rcu critical sections are quite
numerous.
We probably don't want to reprogram the timer every time we enter
a new rcu_read_lock().


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-08 20:40               ` Frederic Weisbecker
@ 2010-11-10 18:08                 ` Paul E. McKenney
  0 siblings, 0 replies; 63+ messages in thread
From: Paul E. McKenney @ 2010-11-10 18:08 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Udo A. Steinberg, Joe Korty, mathieu.desnoyers, dhowells,
	loic.minier, dhaval.giani, tglx, peterz, linux-kernel, josh

On Mon, Nov 08, 2010 at 09:40:11PM +0100, Frederic Weisbecker wrote:
> On Mon, Nov 08, 2010 at 11:38:32AM -0800, Paul E. McKenney wrote:
> > On Mon, Nov 08, 2010 at 04:32:17PM +0100, Frederic Weisbecker wrote:
> > > So, this looks very scary for performances to add rcu_read_lock() in
> > > preempt_disable() and local_irq_save(), that notwithstanding it won't
> > > handle the "raw" rcu sched implicit path.
> > 
> > Ah -- I would arrange for the rcu_read_lock() to be added only in the
> > dyntick-hpc case.  So no effect on normal builds, overhead is added only
> > in the dyntick-hpc case.
> 
> 
> 
> Yeah sure, but I wonder if the resulting rcu config will have a
> large performance impact because of that.
> 
> In fact, my worry is: if the last resort to have a sane non-timer based
> rcu is to bloat fast path functions like preempt_disable() or local_irq...
> (notwithstanding that we already have a bloated rcu_read_unlock() in this rcu config
> because of its very nature), wouldn't it be better to eventually pick the
> syscall/exception tweaked fast path version?
> 
> Perhaps I'll need to measure the impact of both, but I suspect I'll get
> controversial results depending on the workload.

Do you have a workload that you can use to measure this?  If so, I would
be very interested in seeing the result of upping the value of
RCU_JIFFIES_TILL_FORCE_QS to 30, 300, and HZ.
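
(For reference, that experiment would be a one-liner, assuming the
default of 3 jiffies in kernel/rcutree.h of that era:)

	-#define RCU_JIFFIES_TILL_FORCE_QS	 3
	+#define RCU_JIFFIES_TILL_FORCE_QS	 30	/* then 300, then HZ */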

> > > There is also my idea from the other discussion: change rcu_read_lock_sched()
> > > semantics and map it to rcu_read_lock() in this rcu config (would be a nop
> > > in other configs). So every users of rcu_dereference_sched() would now need
> > > to protect their critical section with this.
> > > Would it be too late to change this semantic?
> > 
> > I was expecting that we would fold RCU, RCU bh, and RCU sched into
> > the same set of primitives (as Jim Houston did), but again only in the
> > dyntick-hpc case.
> 
> Yeah, the resulting change must be NULL in others rcu configs.

Indeed!!!

> > However, rcu_read_lock_bh() would still disable BH,
> > and similarly, rcu_read_lock_sched() would still disable preemption.
> 
> Probably, yeah; otherwise there would be a kind of semantic split against
> the usual rcu_read_lock() and everybody would be confused.
> 
> Perhaps we need a different API for the underlying rcu_read_lock()
> call in the other flavours when preempt is already disabled or
> bh is already disabled:
> 
> 	rcu_enter_read_lock_sched();
> 	__rcu_read_lock_sched();
> 	rcu_start_read_lock_sched();
> 
> 	(same for bh)
> 
> Hmm...

I would really really like to avoid adding to the already-large RCU API.  ;-)

> > > What is scary with this is that it also changes rcu sched semantics, and users
> > > of call_rcu_sched() and synchronize_sched(), who rely on that to do more
> > > tricky things than just waiting for rcu_dereference_sched() pointer grace periods,
> > > like really waiting for preempt_disable and local_irq_save/disable, those
> > > users will be screwed... :-(  ...unless we also add relevant rcu_read_lock_sched()
> > > for them...
> > 
> > So rcu_read_lock() would be the underlying primitive.  The implementation
> > of rcu_read_lock_sched() would disable preemption and then invoke
> > rcu_read_lock().  The implementation of rcu_read_lock_bh() would
> > disable BH and then invoke rcu_read_lock().  This would allow
> > synchronize_rcu_sched() and synchronize_rcu_bh() to simply invoke
> > synchronize_rcu().
> > 
> > Seem reasonable?
> 
> Perfect. That could be further optimized with what I said above but
> other than that, that's what I was thinking about.

OK, sounds good!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-10 17:31       ` Peter Zijlstra
  2010-11-10 17:45         ` Frederic Weisbecker
@ 2010-11-11  4:19         ` Paul E. McKenney
  2010-11-13 22:30           ` Frederic Weisbecker
  1 sibling, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2010-11-11  4:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Frederic Weisbecker, Lai Jiangshan, Joe Korty, mathieu.desnoyers,
	dhowells, loic.minier, dhaval.giani, tglx, linux-kernel, josh,
	houston.jim

On Wed, Nov 10, 2010 at 06:31:11PM +0100, Peter Zijlstra wrote:
> On Wed, 2010-11-10 at 16:54 +0100, Frederic Weisbecker wrote:
> > run the sched tick and if there was nothing to do
> > for some time and we are in userspace, deactivate it. 
> 
> Not for some time, immediately, have the tick track if it was useful, if
> it was not, have it stop itself, like:
> 
> tick()
> {
>  int stop = 1;
> 
>  if (nr_running > 1)
>   stop = 0;
> 
>  if(rcu_needs_cpu())
>   stop = 0;
> 
>  ...
> 
> 
>  if (stop)
>   enter_nohz_mode();
> }

I am still holding out for a dyntick-hpc version of RCU that does not
need the tick.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-11  4:19         ` Paul E. McKenney
@ 2010-11-13 22:30           ` Frederic Weisbecker
  2010-11-16  1:28             ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread
From: Frederic Weisbecker @ 2010-11-13 22:30 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Lai Jiangshan, Joe Korty, mathieu.desnoyers,
	dhowells, loic.minier, dhaval.giani, tglx, linux-kernel, josh,
	houston.jim

On Wed, Nov 10, 2010 at 08:19:20PM -0800, Paul E. McKenney wrote:
> On Wed, Nov 10, 2010 at 06:31:11PM +0100, Peter Zijlstra wrote:
> > On Wed, 2010-11-10 at 16:54 +0100, Frederic Weisbecker wrote:
> > > run the sched tick and if there was nothing to do
> > > for some time and we are in userspace, deactivate it. 
> > 
> > Not for some time, immediately, have the tick track if it was useful, if
> > it was not, have it stop itself, like:
> > 
> > tick()
> > {
> >  int stop = 1;
> > 
> >  if (nr_running > 1)
> >   stop = 0;
> > 
> >  if(rcu_needs_cpu())
> >   stop = 0;
> > 
> >  ...
> > 
> > 
> >  if (stop)
> >   enter_nohz_mode();
> > }
> 
> I am still holding out for a dyntick-hpc version of RCU that does not
> need the tick.  ;-)


So you don't think it would be an appropriate solution? Keeping the tick for short
periods of time while we need it only, that looks quite a good way to try.

Hmm?


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-13 22:30           ` Frederic Weisbecker
@ 2010-11-16  1:28             ` Paul E. McKenney
  2010-11-16 13:52               ` Frederic Weisbecker
  0 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2010-11-16  1:28 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Lai Jiangshan, Joe Korty, mathieu.desnoyers,
	dhowells, loic.minier, dhaval.giani, tglx, linux-kernel, josh,
	houston.jim

On Sat, Nov 13, 2010 at 11:30:49PM +0100, Frederic Weisbecker wrote:
> On Wed, Nov 10, 2010 at 08:19:20PM -0800, Paul E. McKenney wrote:
> > On Wed, Nov 10, 2010 at 06:31:11PM +0100, Peter Zijlstra wrote:
> > > On Wed, 2010-11-10 at 16:54 +0100, Frederic Weisbecker wrote:
> > > > run the sched tick and if there was nothing to do
> > > > for some time and we are in userspace, deactivate it. 
> > > 
> > > Not for some time, immediately, have the tick track if it was useful, if
> > > it was not, have it stop itself, like:
> > > 
> > > tick()
> > > {
> > >  int stop = 1;
> > > 
> > >  if (nr_running > 1)
> > >   stop = 0;
> > > 
> > >  if(rcu_needs_cpu())
> > >   stop = 0;
> > > 
> > >  ...
> > > 
> > > 
> > >  if (stop)
> > >   enter_nohz_mode();
> > > }
> > 
> > I am still holding out for a dyntick-hpc version of RCU that does not
> > need the tick.  ;-)
> 
> 
> So you don't think it would be an appropriate solution? Keeping the tick for short
> periods of time while we need it only, that looks quite a good way to try.

My concern is not the tick -- it is really easy to work around lack of a
tick from an RCU viewpoint.  In fact, this happens automatically given the
current implementations!  If there is a callback anywhere in the system,
then RCU will prevent the corresponding CPU from entering dyntick-idle
mode, and that CPU's clock will drive the rest of RCU as needed via
force_quiescent_state().  The force_quiescent_state() workings would
want to be slightly different for dyntick-hpc, but not significantly so
(especially once I get TREE_RCU moved to kthreads).

My concern is rather all the implicit RCU-sched read-side critical
sections, particularly those that arch-specific code is creating.
And it recently occurred to me that there are necessarily more implicit
irq/preempt disables than there are exception entries.

So would you be OK with telling RCU about kernel entries/exits, but
simply not enabling the tick?  The irq and NMI kernel entries/exits are
already covered, of course.

This seems to me to work out as follows:

1.	If there are no RCU callbacks anywhere in the system, RCU
	is quiescent and does not cause any IPIs or interrupts of
	any kind.  For HPC workloads, this should be the common case.

2.	If there is an RCU callback, then one CPU keeps a tick going
	and drives RCU core processing on all CPUs.  (This probably
	works with RCU as is, but somewhat painfully.)  This results
	in some IPIs, but only to those CPUs that remain running in
	the kernel for extended time periods.  Appropriate adjustment
	of RCU_JIFFIES_TILL_FORCE_QS, possibly promoted to be a
	kernel configuration parameter, should make such IPIs
	-extremely- rare.  After all, how many kernel code paths
	are going to consume (say) 10 jiffies of CPU time?  (Keep
	in mind that if the system call blocks, the CPU will enter
	dyntick-idle mode, and RCU will still recognize it as an
	innocent bystander without needing to IPI it.)

3.	The implicit RCU-sched read-side critical sections just work
	as they do today.

Or am I missing some other problems with this approach?
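
For concreteness, the "telling RCU about kernel entries/exits" part could
be as small as the following sketch (the hook names and their exact
placement in the syscall/exception paths are assumptions; the irq and NMI
paths already do the equivalent via rcu_irq_enter()/rcu_irq_exit()):

	/* On return to user space from a syscall or exception,
	 * for a dyntick-hpc CPU: */
	static inline void dyntick_hpc_user_enter(void)
	{
		rcu_enter_nohz();	/* RCU stops watching this CPU */
	}

	/* On kernel entry from user space (syscall or exception): */
	static inline void dyntick_hpc_user_exit(void)
	{
		rcu_exit_nohz();	/* RCU watches this CPU again */
	}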

							Thanx, Paul

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-16  1:28             ` Paul E. McKenney
@ 2010-11-16 13:52               ` Frederic Weisbecker
  2010-11-16 15:51                 ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread
From: Frederic Weisbecker @ 2010-11-16 13:52 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Lai Jiangshan, Joe Korty, mathieu.desnoyers,
	dhowells, loic.minier, dhaval.giani, tglx, linux-kernel, josh,
	houston.jim

On Mon, Nov 15, 2010 at 05:28:46PM -0800, Paul E. McKenney wrote:
> My concern is not the tick -- it is really easy to work around lack of a
> tick from an RCU viewpoint.  In fact, this happens automatically given the
> current implementations!  If there is a callback anywhere in the system,
> then RCU will prevent the corresponding CPU from entering dyntick-idle
> mode, and that CPU's clock will drive the rest of RCU as needed via
> force_quiescent_state().



Now, I'm confused, I thought a CPU entering idle nohz had nothing to do
if it has no local callbacks, and rcu_enter_nohz already deals with
everything.

There is certainly tons of subtle things in RCU anyway :)



> The force_quiescent_state() workings would
> want to be slightly different for dyntick-hpc, but not significantly so
> (especially once I get TREE_RCU moved to kthreads).
> 
> My concern is rather all the implicit RCU-sched read-side critical
> sections, particularly those that arch-specific code is creating.
> And it recently occurred to me that there are necessarily more implicit
> irq/preempt disables than there are exception entries.



Doh! You're right, I don't know why I thought that adaptive tick would
solve the implicit rcu sched/bh cases, my vision took a shortcut.



> So would you be OK with telling RCU about kernel entries/exits, but
> simply not enabling the tick?



Let's try that.



> The irq and NMI kernel entries/exits are
> already covered, of course.


Yep.


> This seems to me to work out as follows:
> 
> 1.	If there are no RCU callbacks anywhere in the system, RCU
> 	is quiescent and does not cause any IPIs or interrupts of
> 	any kind.  For HPC workloads, this should be the common case.


Right.



> 2.	If there is an RCU callback, then one CPU keeps a tick going
> 	and drives RCU core processing on all CPUs.  (This probably
> 	works with RCU as is, but somewhat painfully.)  This results
> 	in some IPIs, but only to those CPUs that remain running in
> 	the kernel for extended time periods.  Appropriate adjustment
> 	of RCU_JIFFIES_TILL_FORCE_QS, possibly promoted to be a
> 	kernel configuration parameter, should make such IPIs
> 	-extremely- rare.  After all, how many kernel code paths
> 	are going to consume (say) 10 jiffies of CPU time?  (Keep
> 	in mind that if the system call blocks, the CPU will enter
> 	dyntick-idle mode, and RCU will still recognize it as an
> 	innocent bystander without needing to IPI it.)



Makes sense. Also, there may be periods when these "isolated" CPUs
will restart the tick, like when there is more than one task running
on that CPU, in which case we can of course fall back to the usual
grace-period processing.



> 
> 3.	The implicit RCU-sched read-side critical sections just work
> 	as they do today.
> 
> Or am I missing some other problems with this approach?


No, looks good, now I'm going to implement/test a draft of these ideas.

Thanks a lot!


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-16 13:52               ` Frederic Weisbecker
@ 2010-11-16 15:51                 ` Paul E. McKenney
  2010-11-17  0:52                   ` Frederic Weisbecker
  0 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2010-11-16 15:51 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Lai Jiangshan, Joe Korty, mathieu.desnoyers,
	dhowells, loic.minier, dhaval.giani, tglx, linux-kernel, josh,
	houston.jim

On Tue, Nov 16, 2010 at 02:52:34PM +0100, Frederic Weisbecker wrote:
> On Mon, Nov 15, 2010 at 05:28:46PM -0800, Paul E. McKenney wrote:
> > My concern is not the tick -- it is really easy to work around lack of a
> > tick from an RCU viewpoint.  In fact, this happens automatically given the
> > current implementations!  If there is a callback anywhere in the system,
> > then RCU will prevent the corresponding CPU from entering dyntick-idle
> > mode, and that CPU's clock will drive the rest of RCU as needed via
> > force_quiescent_state().
> 
> Now, I'm confused, I thought a CPU entering idle nohz had nothing to do
> if it has no local callbacks, and rcu_enter_nohz already deals with
> everything.
> 
> There is certainly tons of subtle things in RCU anyway :)

Well, I wasn't being all that clear above, apologies!!!

If a given CPU hasn't responded to the current RCU grace period,
perhaps due to being in a longer-than-average irq handler, then it
doesn't necessarily need its own scheduler tick enabled.  If there is a
callback anywhere else in the system, then there is some other CPU with
its scheduler tick enabled.  That other CPU can drive the slow-to-respond
CPU through the grace-period process.

The current RCU code should work in the common case.  There are probably
a few bugs, but I will make you a deal.  You find them, I will fix them.
Particularly if you are willing to test the  fixes.

> > The force_quiescent_state() workings would
> > want to be slightly different for dyntick-hpc, but not significantly so
> > (especially once I get TREE_RCU moved to kthreads).
> > 
> > My concern is rather all the implicit RCU-sched read-side critical
> > sections, particularly those that arch-specific code is creating.
> > And it recently occurred to me that there are necessarily more implicit
> > irq/preempt disables than there are exception entries.
> 
> Doh! You're right, I don't know why I thought that adaptive tick would
> solve the implicit rcu sched/bh cases, my vision took a shortcut.

Yeah, and I was clearly suffering from a bit of sleep deprivation when
we discussed this in Boston.  :-/

> > So would you be OK with telling RCU about kernel entries/exits, but
> > simply not enabling the tick?
> 
> Let's try that.

Cool!!!

> > The irq and NMI kernel entries/exits are
> > already covered, of course.
> 
> Yep.
> 
> > This seems to me to work out as follows:
> > 
> > 1.	If there are no RCU callbacks anywhere in the system, RCU
> > 	is quiescent and does not cause any IPIs or interrupts of
> > 	any kind.  For HPC workloads, this should be the common case.
> 
> Right.
> 
> > 2.	If there is an RCU callback, then one CPU keeps a tick going
> > 	and drives RCU core processing on all CPUs.  (This probably
> > 	works with RCU as is, but somewhat painfully.)  This results
> > 	in some IPIs, but only to those CPUs that remain running in
> > 	the kernel for extended time periods.  Appropriate adjustment
> > 	of RCU_JIFFIES_TILL_FORCE_QS, possibly promoted to be a
> > 	kernel configuration parameter, should make such IPIs
> > 	-extremely- rare.  After all, how many kernel code paths
> > 	are going to consume (say) 10 jiffies of CPU time?  (Keep
> > 	in mind that if the system call blocks, the CPU will enter
> > 	dyntick-idle mode, and RCU will still recognize it as an
> > 	innocent bystander without needing to IPI it.)
> 
> Makes sense. Also, there may be periods when these "isolated" CPUs
> will restart the tick, like when there is more than one task running
> on that CPU, in which case we can of course fall back to the usual
> grace-period processing.

Yep!

> > 3.	The implicit RCU-sched read-side critical sections just work
> > 	as they do today.
> > 
> > Or am I missing some other problems with this approach?
> 
> No, looks good, now I'm going to implement/test a draft of these ideas.
> 
> Thanks a lot!

Very cool, and thank you!!!  I am sure that you will not be shy about
letting me know of any RCU problems that you might encounter.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-16 15:51                 ` Paul E. McKenney
@ 2010-11-17  0:52                   ` Frederic Weisbecker
  2010-11-17  1:25                     ` Paul E. McKenney
  2011-03-07 20:31                     ` [PATCH] An RCU for SMP with a single CPU garbage collector Joe Korty
  0 siblings, 2 replies; 63+ messages in thread
From: Frederic Weisbecker @ 2010-11-17  0:52 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Peter Zijlstra, Lai Jiangshan, Joe Korty, mathieu.desnoyers,
	dhowells, loic.minier, dhaval.giani, tglx, linux-kernel, josh,
	houston.jim

On Tue, Nov 16, 2010 at 07:51:04AM -0800, Paul E. McKenney wrote:
> On Tue, Nov 16, 2010 at 02:52:34PM +0100, Frederic Weisbecker wrote:
> > On Mon, Nov 15, 2010 at 05:28:46PM -0800, Paul E. McKenney wrote:
> > > My concern is not the tick -- it is really easy to work around lack of a
> > > tick from an RCU viewpoint.  In fact, this happens automatically given the
> > > current implementations!  If there is a callback anywhere in the system,
> > > then RCU will prevent the corresponding CPU from entering dyntick-idle
> > > mode, and that CPU's clock will drive the rest of RCU as needed via
> > > force_quiescent_state().
> > 
> > Now, I'm confused, I thought a CPU entering idle nohz had nothing to do
> > if it has no local callbacks, and rcu_enter_nohz already deals with
> > everything.
> > 
> > There is certainly tons of subtle things in RCU anyway :)
> 
> Well, I wasn't being all that clear above, apologies!!!
> 
> If a given CPU hasn't responded to the current RCU grace period,
> perhaps due to being in a longer-than-average irq handler, then it
> doesn't necessarily need its own scheduler tick enabled.  If there is a
> callback anywhere else in the system, then there is some other CPU with
> its scheduler tick enabled.  That other CPU can drive the slow-to-respond
> CPU through the grace-period process.



So, the scenario is that a first CPU (CPU 0) enqueues a callback and then
starts a new GP. But the GP is abnormally long because another CPU (CPU 1)
takes too much time to respond. Then CPU 2 enqueues a new callback.

What you're saying is that CPU 2 will take care of the current grace period
that hasn't finished, because it needs to start another one?
> So this CPU 2 is going to be more insistent and will then send IPIs to
CPU 1.

Or am I completely confused? :-D

Ah, and if I understood well, if nobody like CPU 2 had been starting a new
grace period, then nobody would send those IPIs?

Looking at the rcu tree code, the IPI is sent from the state machine in
force_quiescent_state(), if the given CPU is not in dyntick mode.
And force_quiescent_state() is either called from the rcu softirq
or when one queues a callback. So, yeah, I think I understood correctly :)

But it also means that if we have only two CPUs, and CPU 0 starts a grace
period and then goes idle, CPU 1 may never respond and the grace period
may take a long while to end.



> The current RCU code should work in the common case.  There are probably
> a few bugs, but I will make you a deal.  You find them, I will fix them.
> Particularly if you are willing to test the  fixes.


Of course :)


 
> > > The force_quiescent_state() workings would
> > > want to be slightly different for dyntick-hpc, but not significantly so
> > > (especially once I get TREE_RCU moved to kthreads).
> > > 
> > > My concern is rather all the implicit RCU-sched read-side critical
> > > sections, particularly those that arch-specific code is creating.
> > > And it recently occurred to me that there are necessarily more implicit
> > > irq/preempt disables than there are exception entries.
> > 
> > Doh! You're right, I don't know why I thought that adaptive tick would
> > solve the implicit rcu sched/bh cases, my vision took a shortcut.
> 
> Yeah, and I was clearly suffering from a bit of sleep deprivation when
> we discussed this in Boston.  :-/



I suspect the real problem was my oral english understanding ;-)



> > > 3.	The implicit RCU-sched read-side critical sections just work
> > > 	as they do today.
> > > 
> > > Or am I missing some other problems with this approach?
> > 
> > No, looks good, now I'm going to implement/test a draft of these ideas.
> > 
> > Thanks a lot!
> 
> Very cool, and thank you!!!  I am sure that you will not be shy about
> letting me know of any RCU problems that you might encounter.  ;-)


Of course not ;-)

Thanks!!


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-17  0:52                   ` Frederic Weisbecker
@ 2010-11-17  1:25                     ` Paul E. McKenney
  2011-03-07 20:31                     ` [PATCH] An RCU for SMP with a single CPU garbage collector Joe Korty
  1 sibling, 0 replies; 63+ messages in thread
From: Paul E. McKenney @ 2010-11-17  1:25 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Peter Zijlstra, Lai Jiangshan, Joe Korty, mathieu.desnoyers,
	dhowells, loic.minier, dhaval.giani, tglx, linux-kernel, josh,
	houston.jim

On Wed, Nov 17, 2010 at 01:52:33AM +0100, Frederic Weisbecker wrote:
> On Tue, Nov 16, 2010 at 07:51:04AM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 16, 2010 at 02:52:34PM +0100, Frederic Weisbecker wrote:
> > > On Mon, Nov 15, 2010 at 05:28:46PM -0800, Paul E. McKenney wrote:
> > > > My concern is not the tick -- it is really easy to work around lack of a
> > > > tick from an RCU viewpoint.  In fact, this happens automatically given the
> > > > current implementations!  If there is a callback anywhere in the system,
> > > > then RCU will prevent the corresponding CPU from entering dyntick-idle
> > > > mode, and that CPU's clock will drive the rest of RCU as needed via
> > > > force_quiescent_state().
> > > 
> > > Now, I'm confused, I thought a CPU entering idle nohz had nothing to do
> > > if it has no local callbacks, and rcu_enter_nohz already deals with
> > > everything.
> > > 
> > > There is certainly tons of subtle things in RCU anyway :)
> > 
> > Well, I wasn't being all that clear above, apologies!!!
> > 
> > If a given CPU hasn't responded to the current RCU grace period,
> > perhaps due to being in a longer-than-average irq handler, then it
> > doesn't necessarily need its own scheduler tick enabled.  If there is a
> > callback anywhere else in the system, then there is some other CPU with
> > its scheduler tick enabled.  That other CPU can drive the slow-to-respond
> > CPU through the grace-period process.
> 
> So, the scenario is that a first CPU (CPU 0) enqueues a callback and then
> starts a new GP. But the GP is abnormally long because another CPU (CPU 1)
> takes too much time to respond. Then CPU 2 enqueues a new callback.
> 
> What you're saying is that CPU 2 will take care of the current grace period
> that hasn't finished, because it needs to start another one?
> So this CPU 2 is going to be more insistent and will then send IPIs to
> CPU 1.
> 
> Or am I completely confused? :-D

The main thing is that all CPUs that have at least one callback queued
will also have their scheduler tick enabled.  So in your example above,
both CPU 0 and CPU 2 would get insistent at about the same time.  Internal
RCU locking would choose which one of the two actually sends the IPIs
(currently just resched IPIs, but can be changed fairly easily if needed).
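
(To make the "CPUs with callbacks keep their tick" point concrete, the
nohz path already asks RCU before stopping the tick, roughly along these
lines -- a simplification, not the exact tick_nohz code:)

	static int cpu_can_stop_tick(int cpu)
	{
		if (rcu_needs_cpu(cpu))		/* RCU callbacks queued here */
			return 0;
		if (local_softirq_pending())	/* pending softirq work */
			return 0;
		return 1;			/* safe to go dyntick-idle */
	}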

> Ah, and if I understood well, if nobody like CPU 2 had been starting a new
> grace period, then nobody would send those IPIs?

Yep, if there are no callbacks, there is no grace period, so RCU would
have no reason to send any IPIs.  And again, this should be the common
case for HPC applications.

> Looking at the rcu tree code, the IPI is sent from the state machine in
> force_quiescent_state(), if the given CPU is not in dyntick mode.
> And force_quiescent_state() is either called from the rcu softirq
> or when one queues a callback. So, yeah, I think I understood correctly :)

Yep!!!

> But it also means that if we have only two CPUs, and CPU 0 starts a grace
> period and then goes idle, CPU 1 may never respond and the grace period
> may take a long while to end.

Well, if CPU 0 started a grace period, there must have been an RCU
callback in the system somewhere.  (Otherwise, there is an RCU bug, though
a fairly minor one -- if there are no RCU callbacks, then there isn't
too much of a problem if the needless RCU grace period takes forever.)
That RCU callback will be enqueued on one of the two CPUs, and that CPU
will keep its scheduler tick running, and thus will help the grace period
along as needed.

> > The current RCU code should work in the common case.  There are probably
> > a few bugs, but I will make you a deal.  You find them, I will fix them.
> > Particularly if you are willing to test the  fixes.
> 
> Of course :)
> 
> > > > The force_quiescent_state() workings would
> > > > want to be slightly different for dyntick-hpc, but not significantly so
> > > > (especially once I get TREE_RCU moved to kthreads).
> > > > 
> > > > My concern is rather all the implicit RCU-sched read-side critical
> > > > sections, particularly those that arch-specific code is creating.
> > > > And it recently occurred to me that there are necessarily more implicit
> > > > irq/preempt disables than there are exception entries.
> > > 
> > > Doh! You're right, I don't know why I thought that adaptive tick would
> > > solve the implicit rcu sched/bh cases, my vision took a shortcut.
> > 
> > Yeah, and I was clearly suffering from a bit of sleep deprivation when
> > we discussed this in Boston.  :-/
> 
> I suspect the real problem was my oral english understanding ;-)

Mostly I didn't think to ask if re-enabling the scheduler tick was
the only problem.  ;-)

> > > > 3.	The implicit RCU-sched read-side critical sections just work
> > > > 	as they do today.
> > > > 
> > > > Or am I missing some other problems with this approach?
> > > 
> > > No, looks good, now I'm going to implement/test a draft of these ideas.
> > > 
> > > Thanks a lot!
> > 
> > Very cool, and thank you!!!  I am sure that you will not be shy about
> > letting me know of any RCU problems that you might encounter.  ;-)
> 
> Of course not ;-)

Sounds good!  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH] An RCU for SMP with a single CPU garbage collector
  2010-11-17  0:52                   ` Frederic Weisbecker
  2010-11-17  1:25                     ` Paul E. McKenney
@ 2011-03-07 20:31                     ` Joe Korty
       [not found]                       ` <20110307210157.GG3104@linux.vnet.ibm.com>
                                         ` (5 more replies)
  1 sibling, 6 replies; 63+ messages in thread
From: Joe Korty @ 2011-03-07 20:31 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, Peter Zijlstra, Lai Jiangshan,
	mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	linux-kernel, josh, houston.jim

Hi Paul & Frederic & other interested parties.

We would like for Linux to eventually support user
dedicated cpus.  That is, cpus that are completely free of
periodic system duties, leaving 100% of their capacity
available to service user applications.  This will
eventually be important for those realtime applications
requiring full use of dedicated cpus.

An important milestone to that goal would be to have
an offical, supported RCU implementation which did not
make each and every CPU periodically participate in RCU
garbage collection.

The attached RCU implementation does precisely that.
Like TinyRCU it is both tiny and very simple, but unlike
TinyRCU it supports SMP, and unlike the other SMP RCUs,
it does its garbage collection from a single CPU.

For performance, each cpu is given its own 'current' and
'previous' callback queue, and all interactions between
these queues and the global garbage collector proceed in
a lock-free manner.

This patch is a quick port to 2.6.38-rc7 from a 2.6.36.4
implementation developed over the last two weeks. The
earlier version was tested over the weekend under i386 and
x86_64, this version has only been spot tested on x86_64.

Signed-off-by: Joe Korty <joe.korty@ccur.com>

Index: b/kernel/jrcu.c
===================================================================
--- /dev/null
+++ b/kernel/jrcu.c
@@ -0,0 +1,604 @@
+/*
+ * Joe's tiny single-cpu RCU, for small SMP systems.
+ *
+ * Running RCU end-of-batch operations from a single cpu relieves the
+ * other CPUs from this periodic responsibility.  This will eventually
+ * be important for those realtime applications requiring full use of
+ * dedicated cpus.  JRCU is also a lockless implementation, currently,
+ * although some anticipated features will eventually require a per
+ * cpu rcu_lock along some minimal-contention paths.
+ *
+ * Author: Joe Korty <joe.korty@ccur.com>
+ *
+ * Acknowledgements: Paul E. McKenney's 'TinyRCU for uniprocessors' inspired
+ * the thought that there could be something similarly simple for SMP.
+ * The rcu_list chain operators are from Jim Houston's Alternative RCU.
+ *
+ * Copyright Concurrent Computer Corporation, 2011
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or (at your
+ * option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+ * or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+ * for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ */
+
+/*
+ * This RCU maintains three callback lists: the current batch (per cpu),
+ * the previous batch (also per cpu), and the pending list (global).
+ */
+
+#include <linux/bug.h>
+#include <linux/smp.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+#include <linux/kernel.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/stddef.h>
+#include <linux/preempt.h>
+#include <linux/compiler.h>
+#include <linux/irqflags.h>
+#include <linux/rcupdate.h>
+
+#include <asm/system.h>
+
+/*
+ * Define an rcu list type and operators.  An rcu list has only ->next
+ * pointers for the chain nodes; the list head however is special and
+ * has pointers to both the first and last nodes of the chain.  Tweaked
+ * so that null head, tail pointers can be used to signify an empty list.
+ */
+struct rcu_list {
+	struct rcu_head *head;
+	struct rcu_head **tail;
+	int count;		/* stats-n-debug */
+};
+
+static inline void rcu_list_init(struct rcu_list *l)
+{
+	l->head = NULL;
+	l->tail = NULL;
+	l->count = 0;
+}
+
+/*
+ * Add an element to the tail of an rcu list
+ */
+static inline void rcu_list_add(struct rcu_list *l, struct rcu_head *h)
+{
+	if (unlikely(l->tail == NULL))
+		l->tail = &l->head;
+	*l->tail = h;
+	l->tail = &h->next;
+	l->count++;
+	h->next = NULL;
+}
+
+/*
+ * Append the contents of one rcu list to another.  The 'from' list is left
+ * corrupted on exit; the caller must re-initialize it before it can be used
+ * again.
+ */
+static inline void rcu_list_join(struct rcu_list *to, struct rcu_list *from)
+{
+	if (from->head) {
+		if (unlikely(to->tail == NULL)) {
+			to->tail = &to->head;
+			to->count = 0;
+		}
+		*to->tail = from->head;
+		to->tail = from->tail;
+		to->count += from->count;
+	}
+}
+
+
+#define RCU_HZ 20		/* max rate at which batches are retired */
+
+struct rcu_data {
+	u8 wait;		/* goes false when this cpu consents to
+				 * the retirement of the current batch */
+	u8 which;		/* selects the current callback list */
+	struct rcu_list cblist[2]; /* current & previous callback lists */
+} ____cacheline_aligned_in_smp;
+
+static struct rcu_data rcu_data[NR_CPUS];
+
+/* debug & statistics stuff */
+static struct rcu_stats {
+	unsigned nbatches;	/* #end-of-batches (eobs) seen */
+	atomic_t nbarriers;	/* #rcu barriers processed */
+	u64 ninvoked;		/* #invoked (ie, finished) callbacks */
+	atomic_t nleft;		/* #callbacks left (ie, not yet invoked) */
+	unsigned nforced;	/* #forced eobs (should be zero) */
+} rcu_stats;
+
+int rcu_scheduler_active __read_mostly;
+int rcu_nmi_seen __read_mostly;
+static u64 rcu_timestamp;
+
+/*
+ * Return our CPU id or zero if we are too early in the boot process to
+ * know what that is.  For RCU to work correctly, a cpu named '0' must
+ * eventually be present (but need not ever be online).
+ */
+static inline int rcu_cpu(void)
+{
+	return current_thread_info()->cpu;
+}
+
+/*
+ * Invoke whenever the calling CPU consents to end-of-batch.  All CPUs
+ * must so consent before the batch is truly ended.
+ */
+static inline void rcu_eob(int cpu)
+{
+	struct rcu_data *rd = &rcu_data[cpu];
+	if (unlikely(rd->wait)) {
+		rd->wait = 0;
+#ifdef CONFIG_RCU_PARANOID
+		/* not needed, we can tolerate some fuzziness on exactly
+		 * when other CPUs see the above write insn. */
+		smp_wmb();
+#endif
+	}
+}
+
+void rcu_note_context_switch(int cpu)
+{
+	rcu_eob(cpu);
+}
+
+void __rcu_preempt_sub(void)
+{
+	rcu_eob(rcu_cpu());
+}
+EXPORT_SYMBOL(__rcu_preempt_sub);
+
+void rcu_barrier(void)
+{
+	struct rcu_synchronize rcu;
+
+	if (!rcu_scheduler_active)
+		return;
+
+	init_completion(&rcu.completion);
+	call_rcu(&rcu.head, wakeme_after_rcu);
+	wait_for_completion(&rcu.completion);
+	atomic_inc(&rcu_stats.nbarriers);
+
+}
+EXPORT_SYMBOL_GPL(rcu_barrier);
+
+void rcu_force_quiescent_state(void)
+{
+}
+EXPORT_SYMBOL_GPL(rcu_force_quiescent_state);
+
+
+/*
+ * Insert an RCU callback onto the calling CPUs list of 'current batch'
+ * callbacks.  Lockless version, can be invoked anywhere except under NMI.
+ */
+void call_rcu(struct rcu_head *cb, void (*func)(struct rcu_head *rcu))
+{
+	unsigned long flags;
+	struct rcu_data *rd;
+	struct rcu_list *cblist;
+	int which;
+
+	cb->func = func;
+	cb->next = NULL;
+
+	raw_local_irq_save(flags);
+	smp_rmb();
+
+	rd = &rcu_data[rcu_cpu()];
+	which = ACCESS_ONCE(rd->which) & 1;
+	cblist = &rd->cblist[which];
+
+	/* The following is not NMI-safe, therefore call_rcu()
+	 * cannot be invoked under NMI. */
+	rcu_list_add(cblist, cb);
+	smp_wmb();
+	raw_local_irq_restore(flags);
+	atomic_inc(&rcu_stats.nleft);
+}
+EXPORT_SYMBOL_GPL(call_rcu);
+
+/*
+ * For a given cpu, push the previous batch of callbacks onto a (global)
+ * pending list, then make the current batch the previous.  A new, empty
+ * current batch exists after this operation.
+ *
+ * Locklessly tolerates changes being made by call_rcu() to the current
+ * batch, locklessly tolerates the current batch becoming the previous
+ * batch, and locklessly tolerates a new, empty current batch becoming
+ * available.  Requires that the previous batch be quiescent by the time
+ * rcu_end_batch is invoked.
+ */
+static void rcu_end_batch(struct rcu_data *rd, struct rcu_list *pending)
+{
+	int prev;
+	struct rcu_list *plist;	/* some cpus' previous list */
+
+	prev = (ACCESS_ONCE(rd->which) & 1) ^ 1;
+	plist = &rd->cblist[prev];
+
+	/* Chain previous batch of callbacks, if any, to the pending list */
+	if (plist->head) {
+		rcu_list_join(pending, plist);
+		rcu_list_init(plist);
+		smp_wmb();
+	}
+	/*
+	 * Swap current and previous lists.  Other cpus must not see this
+	 * out-of-order w.r.t. the just-completed plist init, hence the above
+	 * smp_wmb().
+	 */
+	rd->which++;
+}
+
+/*
+ * Invoke all callbacks on the passed-in list.
+ */
+static void rcu_invoke_callbacks(struct rcu_list *pending)
+{
+	struct rcu_head *curr, *next;
+
+	for (curr = pending->head; curr;) {
+		next = curr->next;
+		curr->func(curr);
+		curr = next;
+		rcu_stats.ninvoked++;
+		atomic_dec(&rcu_stats.nleft);
+	}
+}
+
+/*
+ * Check if the conditions for ending the current batch are true. If
+ * so then end it.
+ *
+ * Must be invoked periodically, and the periodic invocations must be
+ * far enough apart in time for the previous batch to become quiescent.
+ * This is a few tens of microseconds unless NMIs are involved; an NMI
+ * stretches out the requirement by the duration of the NMI.
+ *
+ * "Quiescent" means the owning cpu is no longer appending callbacks
+ * and has completed execution of a trailing write-memory-barrier insn.
+ */
+static void __rcu_delimit_batches(struct rcu_list *pending)
+{
+	struct rcu_data *rd;
+	int cpu, eob;
+	u64 rcu_now;
+
+	/* If an NMI occurred then the previous batch may not yet be
+	 * quiescent.  Let's wait till it is.
+	 */
+	if (rcu_nmi_seen) {
+		rcu_nmi_seen = 0;
+		return;
+	}
+
+	if (!rcu_scheduler_active)
+		return;
+
+	/*
+	 * Find out if the current batch has ended
+	 * (end-of-batch).
+	 */
+	eob = 1;
+	for_each_online_cpu(cpu) {
+		rd = &rcu_data[cpu];
+		if (rd->wait) {
+			eob = 0;
+			break;
+		}
+	}
+
+	/*
+	 * Force end-of-batch if too much time (n seconds) has
+	 * gone by.  The forcing method is slightly questionable,
+	 * hence the WARN_ON.
+	 */
+	rcu_now = sched_clock();
+	if (!eob && !rcu_timestamp
+	&& ((rcu_now - rcu_timestamp) > 3LL * NSEC_PER_SEC)) {
+		rcu_stats.nforced++;
+		WARN_ON_ONCE(1);
+		eob = 1;
+	}
+
+	/*
+	 * Just return if the current batch has not yet
+	 * ended.  Also, keep track of just how long it
+	 * has been since we've actually seen end-of-batch.
+	 */
+
+	if (!eob)
+		return;
+
+	rcu_timestamp = rcu_now;
+
+	/*
+	 * End the current RCU batch and start a new one.
+	 */
+	for_each_present_cpu(cpu) {
+		rd = &rcu_data[cpu];
+		rcu_end_batch(rd, pending);
+		if (cpu_online(cpu)) /* wins race with offlining every time */
+			rd->wait = preempt_count_cpu(cpu) > idle_cpu(cpu);
+		else
+			rd->wait = 0;
+	}
+	rcu_stats.nbatches++;
+}
+
+static void rcu_delimit_batches(void)
+{
+	unsigned long flags;
+	struct rcu_list pending;
+
+	rcu_list_init(&pending);
+
+	raw_local_irq_save(flags);
+	smp_rmb();
+	__rcu_delimit_batches(&pending);
+	smp_wmb();
+	raw_local_irq_restore(flags);
+
+	if (pending.head)
+		rcu_invoke_callbacks(&pending);
+}
+
+/* ------------------ interrupt driver section ------------------ */
+
+/*
+ * We drive RCU from a periodic interrupt during most of boot. Once boot
+ * is complete we (optionally) transition to a daemon.
+ */
+
+#include <linux/time.h>
+#include <linux/delay.h>
+#include <linux/hrtimer.h>
+#include <linux/interrupt.h>
+
+#define RCU_PERIOD_NS		(NSEC_PER_SEC / RCU_HZ)
+#define RCU_PERIOD_DELTA_NS	(((NSEC_PER_SEC / HZ) * 3) / 2)
+
+#define RCU_PERIOD_MIN_NS	RCU_PERIOD_NS
+#define RCU_PERIOD_MAX_NS	(RCU_PERIOD_NS + RCU_PERIOD_DELTA_NS)
+
+static struct hrtimer rcu_timer;
+
+static void rcu_softirq_func(struct softirq_action *h)
+{
+	rcu_delimit_batches();
+}
+
+static enum hrtimer_restart rcu_timer_func(struct hrtimer *t)
+{
+	ktime_t next;
+
+	raise_softirq(RCU_SOFTIRQ);
+
+	next = ktime_add_ns(ktime_get(), RCU_PERIOD_NS);
+	hrtimer_set_expires_range_ns(&rcu_timer, next, RCU_PERIOD_DELTA_NS);
+	return HRTIMER_RESTART;
+}
+
+static void rcu_timer_restart(void)
+{
+	pr_info("JRCU: starting timer. rate is %d Hz\n", RCU_HZ);
+	hrtimer_forward_now(&rcu_timer, ns_to_ktime(RCU_PERIOD_NS));
+	hrtimer_start_expires(&rcu_timer, HRTIMER_MODE_ABS);
+}
+
+static __init int rcu_timer_start(void)
+{
+	open_softirq(RCU_SOFTIRQ, rcu_softirq_func);
+
+	hrtimer_init(&rcu_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
+	rcu_timer.function = rcu_timer_func;
+	rcu_timer_restart();
+
+	return 0;
+}
+
+#ifdef CONFIG_JRCU_DAEMON
+static void rcu_timer_stop(void)
+{
+	int stat;
+
+	stat = hrtimer_cancel(&rcu_timer);
+	if (stat)
+		pr_info("JRCU: timer canceled.\n");
+}
+#endif
+
+/*
+ * Transition from a simple to a full featured, interrupt driven RCU.
+ *
+ * This is to protect us against RCU being used very very early in the boot
+ * process, where ideas like 'tasks' and 'cpus' and 'timers' and such are
+ * not yet fully formed.  During this very early time, we use a simple,
+ * not-fully-functional braindead version of RCU.
+ *
+ * Invoked from main() at the earliest point where scheduling and timers
+ * are functional.
+ */
+void __init rcu_scheduler_starting(void)
+{
+	int stat;
+
+	stat = rcu_timer_start();
+	if (stat) {
+		pr_err("JRCU: failed to start.  This is fatal.\n");
+		return;
+	}
+
+	rcu_scheduler_active = 1;
+	smp_wmb();
+
+	pr_info("JRCU: started\n");
+}
+
+#ifdef CONFIG_JRCU_DAEMON
+
+/* ------------------ daemon driver section --------------------- */
+
+#define RCU_PERIOD_MIN_US	(RCU_PERIOD_MIN_NS / NSEC_PER_USEC)
+#define RCU_PERIOD_MAX_US	(RCU_PERIOD_MAX_NS / NSEC_PER_USEC)
+
+/*
+ * Once the system is fully up, we will drive the periodic-polling part
+ * of JRCU from a kernel daemon, jrcud.  Until then it is driven by
+ * an interrupt.
+ */
+#include <linux/err.h>
+#include <linux/param.h>
+#include <linux/kthread.h>
+
+static int jrcud_func(void *arg)
+{
+	set_user_nice(current, -19);
+	current->flags |= PF_NOFREEZE;
+
+	pr_info("JRCU: daemon started. Will operate at ~%d Hz.\n", RCU_HZ);
+	rcu_timer_stop();
+
+	while (!kthread_should_stop()) {
+		usleep_range(RCU_PERIOD_MIN_US, RCU_PERIOD_MAX_US);
+		rcu_delimit_batches();
+	}
+
+	pr_info("JRCU: daemon exiting\n");
+	rcu_timer_restart();
+	return 0;
+}
+
+static __init int jrcud_start(void)
+{
+	struct task_struct *p;
+
+	p = kthread_run(jrcud_func, NULL, "jrcud");
+	if (IS_ERR(p)) {
+		pr_warn("JRCU: daemon not started\n");
+		return -ENODEV;
+	}
+	return 0;
+}
+late_initcall(jrcud_start);
+
+#endif /* CONFIG_JRCU_DAEMON */
+
+/* ------------------ debug and statistics section -------------- */
+
+#ifdef CONFIG_RCU_TRACE
+
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+
+static int rcu_debugfs_show(struct seq_file *m, void *unused)
+{
+	int cpu, q, s[2], msecs;
+
+	raw_local_irq_disable();
+	msecs = div_s64(sched_clock() - rcu_timestamp, NSEC_PER_MSEC);
+	raw_local_irq_enable();
+
+	seq_printf(m, "%14u: #batches seen\n",
+		rcu_stats.nbatches);
+	seq_printf(m, "%14u: #barriers seen\n",
+		atomic_read(&rcu_stats.nbarriers));
+	seq_printf(m, "%14llu: #callbacks invoked\n",
+		rcu_stats.ninvoked);
+	seq_printf(m, "%14u: #callbacks left to invoke\n",
+		atomic_read(&rcu_stats.nleft));
+	seq_printf(m, "%14u: #msecs since last end-of-batch\n",
+		msecs);
+	seq_printf(m, "%14u: #passes forced (0 is best)\n",
+		rcu_stats.nforced);
+	seq_printf(m, "\n");
+
+	for_each_online_cpu(cpu)
+		seq_printf(m, "%4d ", cpu);
+	seq_printf(m, "  CPU\n");
+
+	s[1] = s[0] = 0;
+	for_each_online_cpu(cpu) {
+		struct rcu_data *rd = &rcu_data[cpu];
+		int w = ACCESS_ONCE(rd->which) & 1;
+		seq_printf(m, "%c%c%c%d ",
+			'-',
+			idle_cpu(cpu) ? 'I' : '-',
+			rd->wait ? 'W' : '-',
+			w);
+		s[w]++;
+	}
+	seq_printf(m, "  FLAGS\n");
+
+	for (q = 0; q < 2; q++) {
+		for_each_online_cpu(cpu) {
+			struct rcu_data *rd = &rcu_data[cpu];
+			struct rcu_list *l = &rd->cblist[q];
+			seq_printf(m, "%4d ", l->count);
+		}
+		seq_printf(m, "  Q%d%c\n", q, " *"[s[q] > s[q^1]]);
+	}
+	seq_printf(m, "\nFLAGS:\n");
+	seq_printf(m, "  I - cpu idle, 0|1 - Q0 or Q1 is current Q, other is previous Q,\n");
+	seq_printf(m, "  W - cpu does not permit current batch to end (waiting),\n");
+	seq_printf(m, "  * - marks the Q that is current for most CPUs.\n");
+
+	return 0;
+}
+
+static int rcu_debugfs_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, rcu_debugfs_show, NULL);
+}
+
+static const struct file_operations rcu_debugfs_fops = {
+	.owner = THIS_MODULE,
+	.open = rcu_debugfs_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = single_release,
+};
+
+static struct dentry *rcudir;
+
+static int __init rcu_debugfs_init(void)
+{
+	struct dentry *retval;
+
+	rcudir = debugfs_create_dir("rcu", NULL);
+	if (!rcudir)
+		goto error;
+
+	retval = debugfs_create_file("rcudata", 0444, rcudir,
+			NULL, &rcu_debugfs_fops);
+	if (!retval)
+		goto error;
+
+	pr_info("JRCU: Created debugfs files\n");
+	return 0;
+
+error:
+	debugfs_remove_recursive(rcudir);
+	pr_warn("JRCU: Could not create debugfs files\n");
+	return -ENOSYS;
+}
+late_initcall(rcu_debugfs_init);
+#endif /* CONFIG_RCU_TRACE */
Index: b/include/linux/jrcu.h
===================================================================
--- /dev/null
+++ b/include/linux/jrcu.h
@@ -0,0 +1,75 @@
+/*
+ * JRCU - A tiny single-cpu RCU for small SMP systems.
+ *
+ * Author: Joe Korty <joe.korty@ccur.com>
+ * Copyright Concurrent Computer Corporation, 2011
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the
+ * Free Software Foundation; either version 2 of the License, or (at your
+ * option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
+ * or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+ * for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, write to the Free Software Foundation, Inc.,
+ * 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ */
+#ifndef __LINUX_JRCU_H
+#define __LINUX_JRCU_H
+
+#define __rcu_read_lock()			preempt_disable()
+#define __rcu_read_unlock()			preempt_enable()
+
+#define __rcu_read_lock_bh()			__rcu_read_lock()
+#define __rcu_read_unlock_bh()			__rcu_read_unlock()
+
+extern void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu));
+
+#define call_rcu_sched				call_rcu
+#define call_rcu_bh				call_rcu
+
+extern void rcu_barrier(void);
+
+#define rcu_barrier_sched			rcu_barrier
+#define rcu_barrier_bh				rcu_barrier
+
+#define synchronize_rcu				rcu_barrier
+#define synchronize_sched			rcu_barrier
+#define synchronize_sched_expedited		rcu_barrier
+#define synchronize_rcu_bh			rcu_barrier
+#define synchronize_rcu_expedited		rcu_barrier
+#define synchronize_rcu_bh_expedited		rcu_barrier
+
+#define rcu_init(cpu)				do { } while (0)
+#define rcu_init_sched()			do { } while (0)
+#define exit_rcu()				do { } while (0)
+
+static inline void __rcu_check_callbacks(int cpu, int user) { }
+#define rcu_check_callbacks			__rcu_check_callbacks
+
+#define rcu_needs_cpu(cpu)			(0)
+#define rcu_batches_completed()			(0)
+#define rcu_batches_completed_bh()		(0)
+#define rcu_preempt_depth()			(0)
+
+extern void rcu_force_quiescent_state(void);
+
+#define rcu_sched_force_quiescent_state		rcu_force_quiescent_state
+#define rcu_bh_force_quiescent_state		rcu_force_quiescent_state
+
+#define rcu_enter_nohz()			do { } while (0)
+#define rcu_exit_nohz()				do { } while (0)
+
+extern void rcu_note_context_switch(int cpu);
+
+#define rcu_sched_qs				rcu_note_context_switch
+#define rcu_bh_qs				rcu_note_context_switch
+
+extern void rcu_scheduler_starting(void);
+extern int rcu_scheduler_active __read_mostly;
+
+#endif /* __LINUX_JRCU_H */
Index: b/include/linux/rcupdate.h
===================================================================
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -145,6 +145,8 @@ static inline void rcu_exit_nohz(void)
 #include <linux/rcutree.h>
 #elif defined(CONFIG_TINY_RCU) || defined(CONFIG_TINY_PREEMPT_RCU)
 #include <linux/rcutiny.h>
+#elif defined(CONFIG_JRCU)
+#include <linux/jrcu.h>
 #else
 #error "Unknown RCU implementation specified to kernel configuration"
 #endif
Index: b/init/Kconfig
===================================================================
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -384,6 +384,22 @@ config TREE_PREEMPT_RCU
 	  is also required.  It also scales down nicely to
 	  smaller systems.
 
+config JRCU
+	bool "A tiny single-CPU RCU for small SMP systems"
+	depends on PREEMPT
+	depends on SMP
+	help
+	  This option selects a minimal-footprint RCU suitable for small SMP
+	  systems -- that is, those with fewer than 16 or perhaps 32, and
+	  certainly less than 64 processors.
+
+	  This RCU variant may be a good choice for systems with low latency
+	  requirements.  It does RCU garbage collection from a single CPU
+	  rather than have each CPU do its own.  This frees up all but one
+	  CPU from interference by this periodic requirement.
+
+	  Most users should say N here.
+
 config TINY_RCU
 	bool "UP-only small-memory-footprint RCU"
 	depends on !SMP
@@ -409,6 +425,17 @@ config PREEMPT_RCU
 	  This option enables preemptible-RCU code that is common between
 	  the TREE_PREEMPT_RCU and TINY_PREEMPT_RCU implementations.
 
+config JRCU_DAEMON
+	bool "Drive JRCU from a daemon"
+	depends on JRCU
+	default Y
+	help
+	  Normally JRCU end-of-batch processing is driven from a SoftIRQ
+	  'interrupt' driver.  If you consider this to be too invasive,
+	  this option can be used to drive JRCU from a kernel daemon.
+
+	  If unsure, say Y here.
+
 config RCU_TRACE
 	bool "Enable tracing for RCU"
 	help
Index: b/kernel/Makefile
===================================================================
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -87,6 +87,7 @@ obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutre
 obj-$(CONFIG_TREE_RCU_TRACE) += rcutree_trace.o
 obj-$(CONFIG_TINY_RCU) += rcutiny.o
 obj-$(CONFIG_TINY_PREEMPT_RCU) += rcutiny.o
+obj-$(CONFIG_JRCU) += jrcu.o
 obj-$(CONFIG_RELAY) += relay.o
 obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
 obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
Index: b/include/linux/hardirq.h
===================================================================
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -146,7 +146,13 @@ static inline void account_system_vtime(
 extern void account_system_vtime(struct task_struct *tsk);
 #endif
 
-#if defined(CONFIG_NO_HZ)
+#if defined(CONFIG_JRCU)
+extern int rcu_nmi_seen;
+#define rcu_irq_enter() do { } while (0)
+#define rcu_irq_exit() do { } while (0)
+#define rcu_nmi_enter() do { rcu_nmi_seen = 1; } while (0)
+#define rcu_nmi_exit() do { } while (0)
+#elif defined(CONFIG_NO_HZ)
 #if defined(CONFIG_TINY_RCU) || defined(CONFIG_TINY_PREEMPT_RCU)
 extern void rcu_enter_nohz(void);
 extern void rcu_exit_nohz(void);
@@ -168,7 +174,6 @@ static inline void rcu_nmi_enter(void)
 static inline void rcu_nmi_exit(void)
 {
 }
-
 #else
 extern void rcu_irq_enter(void);
 extern void rcu_irq_exit(void);
Index: b/kernel/sched.c
===================================================================
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2658,6 +2658,21 @@ void sched_fork(struct task_struct *p, i
 }
 
 /*
+ * Fetch the preempt count of some cpu's current task.  Must be called
+ * with interrupts blocked.  Stale return value.
+ *
+ * No locking needed as this always wins the race with context-switch-out
+ * + task destruction, since that is so heavyweight.  The smp_rmb() is
+ * to protect the pointers in that race, not the data being pointed to
+ * (which, being guaranteed stale, can stand a bit of fuzziness).
+ */
+int preempt_count_cpu(int cpu)
+{
+	smp_rmb(); /* stop data prefetch until program ctr gets here */
+	return task_thread_info(cpu_curr(cpu))->preempt_count;
+}
+
+/*
  * wake_up_new_task - wake up a newly created task for the first time.
  *
  * This function will do some initial scheduler statistics housekeeping
@@ -3811,7 +3826,7 @@ void __kprobes add_preempt_count(int val
 	if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0)))
 		return;
 #endif
-	preempt_count() += val;
+	__add_preempt_count(val);
 #ifdef CONFIG_DEBUG_PREEMPT
 	/*
 	 * Spinlock count overflowing soon?
@@ -3842,7 +3857,7 @@ void __kprobes sub_preempt_count(int val
 
 	if (preempt_count() == val)
 		trace_preempt_on(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1));
-	preempt_count() -= val;
+	__sub_preempt_count(val);
 }
 EXPORT_SYMBOL(sub_preempt_count);
 
@@ -3994,6 +4009,7 @@ need_resched_nonpreemptible:
 
 		rq->nr_switches++;
 		rq->curr = next;
+		smp_wmb(); /* for preempt_count_cpu() */
 		++*switch_count;
 
 		context_switch(rq, prev, next); /* unlocks the rq */
@@ -8209,6 +8225,7 @@ struct task_struct *curr_task(int cpu)
 void set_curr_task(int cpu, struct task_struct *p)
 {
 	cpu_curr(cpu) = p;
+	smp_wmb(); /* for preempt_count_cpu() */
 }
 
 #endif
Index: b/include/linux/preempt.h
===================================================================
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -10,18 +10,33 @@
 #include <linux/linkage.h>
 #include <linux/list.h>
 
+# define __add_preempt_count(val) do { preempt_count() += (val); } while (0)
+
+#ifndef CONFIG_JRCU
+# define __sub_preempt_count(val) do { preempt_count() -= (val); } while (0)
+#else
+  extern void __rcu_preempt_sub(void);
+# define __sub_preempt_count(val) do { \
+	if (!(preempt_count() -= (val))) { \
+		/* preempt is enabled, RCU OK with consequent stale result */ \
+		__rcu_preempt_sub(); \
+	} \
+} while (0)
+#endif
+
 #if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_PREEMPT_TRACER)
   extern void add_preempt_count(int val);
   extern void sub_preempt_count(int val);
 #else
-# define add_preempt_count(val)	do { preempt_count() += (val); } while (0)
-# define sub_preempt_count(val)	do { preempt_count() -= (val); } while (0)
+# define add_preempt_count(val)	__add_preempt_count(val)
+# define sub_preempt_count(val)	__sub_preempt_count(val)
 #endif
 
 #define inc_preempt_count() add_preempt_count(1)
 #define dec_preempt_count() sub_preempt_count(1)
 
 #define preempt_count()	(current_thread_info()->preempt_count)
+extern int preempt_count_cpu(int cpu);
 
 #ifdef CONFIG_PREEMPT
 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] An RCU for SMP with a single CPU garbage collector
       [not found]                       ` <20110307210157.GG3104@linux.vnet.ibm.com>
@ 2011-03-07 21:16                         ` Joe Korty
  2011-03-07 21:33                           ` Joe Korty
                                             ` (2 more replies)
  0 siblings, 3 replies; 63+ messages in thread
From: Joe Korty @ 2011-03-07 21:16 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, linux-kernel

On Mon, Mar 07, 2011 at 04:01:57PM -0500, Paul E. McKenney wrote:
> Interesting!
> 
> But I would really prefer leveraging the existing RCU implementations
> to the extent possible.  Are the user-dedicated CPUs able to invoke
> system calls?  If so, something like Frederic's approach should permit
> the existing RCU implementations to operate normally.  If not, what is
> doing the RCU read-side critical sections on the dedicated CPUs?
> 
>                                                         Thanx, Paul

I haven't seen Frederic's patch.  Sorry I missed it!
It might have saved a bit of work...

I thought about the system call approach but rejected it.
Some (maybe many) customers needing dedicated CPUs will
have apps that never make any system calls at all.

Regards,
Joe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] An RCU for SMP with a single CPU garbage collector
  2011-03-07 21:16                         ` Joe Korty
@ 2011-03-07 21:33                           ` Joe Korty
  2011-03-07 22:51                           ` Joe Korty
  2011-03-09 22:29                           ` Frederic Weisbecker
  2 siblings, 0 replies; 63+ messages in thread
From: Joe Korty @ 2011-03-07 21:33 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, linux-kernel

On Mon, Mar 07, 2011 at 04:16:13PM -0500, Korty, Joe wrote:
> On Mon, Mar 07, 2011 at 04:01:57PM -0500, Paul E. McKenney wrote:
>> If not, what is
>> doing the RCU read-side critical sections on the dedicated CPUs?

Oops, forgot to answer this.  RCU critical regions are
delimited by preempt_disable() ... preempt_enable().  There is
a tie-in to preempt_enable(): it sets a special per-cpu
variable to zero whenever preempt_count() goes to zero.

The per-cpu variables are all periodically examined by
the global garbage collector.  The current batch ends
when all of the per-cpu variables have gone to zero.  It
then resets each to 1 or to 0, depending on the current
state of the corresponding cpu.
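
A minimal sketch of that flag protocol (illustrative names only; the
posted patch keeps the flag in rcu_data[cpu].wait and clears it from
rcu_eob(), and its re-arm step also samples remote preempt counts
rather than blindly setting every flag to 1):

	/* 1 = collector must wait for this cpu, 0 = cpu has passed a
	 * quiescent point since the current batch began */
	static int rcu_wait_sketch[NR_CPUS];

	/* tap: called where a cpu knows it is quiescent, e.g. at
	 * context switch or when preempt_count() drops to zero */
	static inline void rcu_tap_sketch(int cpu)
	{
		rcu_wait_sketch[cpu] = 0;
	}

	/* garbage collector, polled every 50 msecs: end the batch only
	 * when every cpu has cleared its flag, then re-arm the flags */
	static int rcu_try_end_batch_sketch(void)
	{
		int cpu;

		for_each_online_cpu(cpu)
			if (rcu_wait_sketch[cpu])
				return 0;
		for_each_online_cpu(cpu)
			rcu_wait_sketch[cpu] = 1;
		return 1;
	}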

Regards,
Joe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] An RCU for SMP with a single CPU garbage collector
  2011-03-07 21:16                         ` Joe Korty
  2011-03-07 21:33                           ` Joe Korty
@ 2011-03-07 22:51                           ` Joe Korty
  2011-03-08  9:07                             ` Paul E. McKenney
  2011-03-09 22:29                           ` Frederic Weisbecker
  2 siblings, 1 reply; 63+ messages in thread
From: Joe Korty @ 2011-03-07 22:51 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, linux-kernel

On Mon, Mar 07, 2011 at 04:16:13PM -0500, Joe Korty wrote:
> On Mon, Mar 07, 2011 at 04:01:57PM -0500, Paul E. McKenney wrote:
>> But I would really prefer leveraging the existing RCU implementations
>> to the extent possible.  Are the user-dedicated CPUs able to invoke
>> system calls?  If so, something like Frederic's approach should permit
>> the existing RCU implementations to operate normally.  If not, what is
>> doing the RCU read-side critical sections on the dedicated CPUs?
> 
> I thought about the system call approach but rejected it.
> Some (maybe many) customers needing dedicated CPUs will
> have apps that never make any system calls at all.

Hi Paul,
Thinking about it some more, the tap-into-syscall approach might
work in my implementation, in which case the tap-into-preempt-enable
code could go away.

Nice thing about RCU, the algorithms are infinitely malleable :)

Joe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] An RCU for SMP with a single CPU garbage collector
  2011-03-07 22:51                           ` Joe Korty
@ 2011-03-08  9:07                             ` Paul E. McKenney
  2011-03-08 15:57                               ` Joe Korty
  0 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2011-03-08  9:07 UTC (permalink / raw)
  To: Joe Korty; +Cc: Frederic Weisbecker, linux-kernel

On Mon, Mar 07, 2011 at 05:51:10PM -0500, Joe Korty wrote:
> On Mon, Mar 07, 2011 at 04:16:13PM -0500, Joe Korty wrote:
> > On Mon, Mar 07, 2011 at 04:01:57PM -0500, Paul E. McKenney wrote:
> >> But I would really prefer leveraging the existing RCU implementations
> >> to the extent possible.  Are the user-dedicated CPUs able to invoke
> >> system calls?  If so, something like Frederic's approach should permit
> >> the existing RCU implementations to operate normally.  If not, what is
> >> doing the RCU read-side critical sections on the dedicated CPUs?
> > 
> > I thought about the system call approach but rejected it.
> > Some (maybe many) customers needing dedicated CPUs will
> > have apps that never make any system calls at all.
> 
> Hi Paul,
> Thinking about it some more, the tap-into-syscall approach might
> work in my implementation, in which case the tap-into-preempt-enable
> code could go away.

OK, please let me know how that goes!

> Nice thing about RCU, the algorithms are infinitely malleable :)

Just trying to keep the code size finite.  ;-)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] An RCU for SMP with a single CPU garbage collector
  2011-03-08  9:07                             ` Paul E. McKenney
@ 2011-03-08 15:57                               ` Joe Korty
  2011-03-08 22:53                                 ` Joe Korty
  2011-03-10  0:28                                 ` Paul E. McKenney
  0 siblings, 2 replies; 63+ messages in thread
From: Joe Korty @ 2011-03-08 15:57 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, linux-kernel

On Tue, Mar 08, 2011 at 04:07:42AM -0500, Paul E. McKenney wrote:
>> Thinking about it some more, the tap-into-syscall approach might
>> work in my implementation, in which case the tap-into-preempt-enable
>> code could go away.
> 
> OK, please let me know how that goes!
> 
>> Nice thing about RCU, the algorithms are infinitely malleable :)
> 
> Just trying to keep the code size finite.  ;-)

I hope to get to it this afternoon!  I especially like
the lockless nature of JRCU, and that the dedicated cpus
are not loaded down with callback invocations either.
Not sure how to support the PREEMPT_RCU mode though; so
if Frederic is planning to support that, that alone would
make his approach the very best.

Joe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] An RCU for SMP with a single CPU garbage collector
  2011-03-08 15:57                               ` Joe Korty
@ 2011-03-08 22:53                                 ` Joe Korty
  2011-03-10  0:30                                   ` Paul E. McKenney
  2011-03-10  0:28                                 ` Paul E. McKenney
  1 sibling, 1 reply; 63+ messages in thread
From: Joe Korty @ 2011-03-08 22:53 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Frederic Weisbecker, linux-kernel

On Tue, Mar 08, 2011 at 10:57:10AM -0500, Joe Korty wrote:
> On Tue, Mar 08, 2011 at 04:07:42AM -0500, Paul E. McKenney wrote:
>>> Thinking about it some more, the tap-into-syscall approach might
>>> work in my implementation, in which case the tap-into-preempt-enable
>>> code could go away.
> > 
>> OK, please let me know how that goes!
>> 
>>> Nice thing about RCU, the algorithms are infinitely malleable :)
>> 
>> Just trying to keep the code size finite.  ;-)
> 
> I hope to get to it this afternoon!  I especially like
> the lockless nature of JRCU, and that the dedicated cpus
> are not loaded down with callback invocations either.
> Not sure how to support the PREEMPT_RCU mode though; so
> if Frederic is planning to support that, that alone would
> make his approach the very best.



Hi Paul,
I had a brainstorm. It _seems_ that JRCU might work fine if
all I did was remove the expensive preempt_enable() tap.
No new taps on system calls or anywhere else.  That would
leave only the context switch tap plus the batch start/end
sampling that is remotely performed on each cpu by the
garbage collector.  Not even rcu_read_unlock has a tap --
it is just a plain-jane preempt_enable() now.

And indeed it works!  I am able to turn off the local
timer interrupt on one (of 15) cpus and the batches
keep flowing on.  I have two user test apps that run at
100% cpu (one of them does no system calls); when I run one
on the timer-disabled cpu the batches still advance.
Admittedly the batches do not advance as fast as before:
they used to advance at the max rate of 50 msecs/batch.
Now I regularly see batch lengths approaching 400 msecs.

I plan to put some taps into some other low overhead places
-- at all the voluntary preemption points, at might_sleep,
at rcu_read_unlock, for safety purposes.  But it is nice
to see a zero overhead approach that works fine without
any of that.

Regards,
Joe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 2/4] jrcu: tap rcu_read_unlock
  2011-03-07 20:31                     ` [PATCH] An RCU for SMP with a single CPU garbage collector Joe Korty
       [not found]                       ` <20110307210157.GG3104@linux.vnet.ibm.com>
@ 2011-03-09 22:15                       ` Joe Korty
  2011-03-10  0:34                         ` Paul E. McKenney
  2011-03-09 22:16                       ` [PATCH 3/4] jrcu: tap might_resched() Joe Korty
                                         ` (3 subsequent siblings)
  5 siblings, 1 reply; 63+ messages in thread
From: Joe Korty @ 2011-03-09 22:15 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, Peter Zijlstra, Lai Jiangshan,
	mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	linux-kernel, josh, houston.jim

jrcu: tap into rcu_read_unlock().

All places where rcu_read_unlock() is the final lock in
a set of nested rcu locks are known rcu quiescent points.
This patch recognizes that subset of those which also make
the task preemptable.  The others are left unrecognized.

Not fundamentally needed, accelerates rcu batching.

Signed-off-by: Joe Korty <joe.korty@ccur.com>

Index: b/include/linux/jrcu.h
===================================================================
--- a/include/linux/jrcu.h
+++ b/include/linux/jrcu.h
@@ -21,8 +21,10 @@
 #ifndef __LINUX_JRCU_H
 #define __LINUX_JRCU_H
 
+extern void rcu_read_unlock_jrcu(void);
+
 #define __rcu_read_lock()			preempt_disable()
-#define __rcu_read_unlock()			preempt_enable()
+#define __rcu_read_unlock()			rcu_read_unlock_jrcu()
 
 #define __rcu_read_lock_bh()			__rcu_read_lock()
 #define __rcu_read_unlock_bh()			__rcu_read_unlock()
Index: b/kernel/jrcu.c
===================================================================
--- a/kernel/jrcu.c
+++ b/kernel/jrcu.c
@@ -153,6 +153,14 @@ static inline void rcu_eob(int cpu)
 	}
 }
 
+void rcu_read_unlock_jrcu(void)
+{
+	if (preempt_count() == 1)
+		rcu_eob(rcu_cpu());
+	preempt_enable();
+}
+EXPORT_SYMBOL_GPL(rcu_read_unlock_jrcu);
+
 void rcu_note_context_switch(int cpu)
 {
 	rcu_eob(cpu);

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 3/4] jrcu: tap might_resched()
  2011-03-07 20:31                     ` [PATCH] An RCU for SMP with a single CPU garbage collector Joe Korty
       [not found]                       ` <20110307210157.GG3104@linux.vnet.ibm.com>
  2011-03-09 22:15                       ` [PATCH 2/4] jrcu: tap rcu_read_unlock Joe Korty
@ 2011-03-09 22:16                       ` Joe Korty
  2011-03-09 22:17                       ` [PATCH 4/4] jrcu: add new stat to /sys/kernel/debug/rcu/rcudata Joe Korty
                                         ` (2 subsequent siblings)
  5 siblings, 0 replies; 63+ messages in thread
From: Joe Korty @ 2011-03-09 22:16 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, Peter Zijlstra, Lai Jiangshan,
	mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	linux-kernel, josh, houston.jim

jrcu: tap into might_resched.

All places where the voluntary preemption patches have marked
as safe to context switch are also known rcu quiescent points.

Not fundamentally needed, accelerates rcu batching.

Signed-off-by: Joe Korty <joe.korty@ccur.com>

Index: b/include/linux/kernel.h
===================================================================
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -114,11 +114,18 @@ struct completion;
 struct pt_regs;
 struct user;
 
+/* cannot bring in linux/rcupdate.h at this point */
+#ifdef CONFIG_JRCU
+extern void rcu_note_might_resched(void);
+#else
+#define rcu_note_might_resched()
+#endif /*JRCU */
+
 #ifdef CONFIG_PREEMPT_VOLUNTARY
 extern int _cond_resched(void);
-# define might_resched() _cond_resched()
+# define might_resched() do { _cond_resched(); rcu_note_might_resched(); } while (0)
 #else
-# define might_resched() do { } while (0)
+# define might_resched() do { rcu_note_might_resched(); } while (0)
 #endif
 
 #ifdef CONFIG_DEBUG_SPINLOCK_SLEEP
Index: b/include/linux/jrcu.h
===================================================================
--- a/include/linux/jrcu.h
+++ b/include/linux/jrcu.h
@@ -71,6 +71,8 @@ extern void rcu_note_context_switch(int 
 #define rcu_sched_qs				rcu_note_context_switch
 #define rcu_bh_qs				rcu_note_context_switch
 
+extern void rcu_note_might_resched(void);
+
 extern void rcu_scheduler_starting(void);
 extern int rcu_scheduler_active __read_mostly;
 
Index: b/kernel/jrcu.c
===================================================================
--- a/kernel/jrcu.c
+++ b/kernel/jrcu.c
@@ -166,6 +166,16 @@ void rcu_note_context_switch(int cpu)
 	rcu_eob(cpu);
 }
 
+void rcu_note_might_resched(void)
+{
+	unsigned long flags;
+
+	raw_local_irq_save(flags);
+	rcu_eob(rcu_cpu());
+	raw_local_irq_restore(flags);
+}
+EXPORT_SYMBOL(rcu_note_might_resched);
+
 void rcu_barrier(void)
 {
 	struct rcu_synchronize rcu;

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 4/4] jrcu: add new stat to /sys/kernel/debug/rcu/rcudata
  2011-03-07 20:31                     ` [PATCH] An RCU for SMP with a single CPU garbage collector Joe Korty
                                         ` (2 preceding siblings ...)
  2011-03-09 22:16                       ` [PATCH 3/4] jrcu: tap might_resched() Joe Korty
@ 2011-03-09 22:17                       ` Joe Korty
  2011-03-09 22:19                       ` [PATCH 1/4] jrcu: remove preempt_enable() tap [resend] Joe Korty
  2011-03-12 14:36                       ` [PATCH] An RCU for SMP with a single CPU garbage collector Paul E. McKenney
  5 siblings, 0 replies; 63+ messages in thread
From: Joe Korty @ 2011-03-09 22:17 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, Peter Zijlstra, Lai Jiangshan,
	mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	linux-kernel, josh, houston.jim

jrcu: display #passes in /sys/kernel/debug/rcu/rcudata.

Sometimes, when things are not working right, you cannot
tell if that is because the daemon is not running at all,
or because it is running, but on every pass it decides
to do no work.  With this change those two states can
now be distinguished.

Signed-off-by: Joe Korty <joe.korty@ccur.com>

Index: b/kernel/jrcu.c
===================================================================
--- a/kernel/jrcu.c
+++ b/kernel/jrcu.c
@@ -115,6 +115,7 @@ static struct rcu_data rcu_data[NR_CPUS]
 
 /* debug & statistics stuff */
 static struct rcu_stats {
+	unsigned npasses;	/* #passes made */
 	unsigned nbatches;	/* #end-of-batches (eobs) seen */
 	atomic_t nbarriers;	/* #rcu barriers processed */
 	u64 ninvoked;		/* #invoked (ie, finished) callbacks */
@@ -362,6 +363,7 @@ static void rcu_delimit_batches(void)
 	struct rcu_list pending;
 
 	rcu_list_init(&pending);
+	rcu_stats.npasses++;
 
 	raw_local_irq_save(flags);
 	smp_rmb();
@@ -529,6 +531,8 @@ static int rcu_debugfs_show(struct seq_f
 	msecs = div_s64(sched_clock() - rcu_timestamp, NSEC_PER_MSEC);
 	raw_local_irq_enable();
 
+	seq_printf(m, "%14u: #passes seen\n",
+		rcu_stats.npasses);
 	seq_printf(m, "%14u: #batches seen\n",
 		rcu_stats.nbatches);
 	seq_printf(m, "%14u: #barriers seen\n",

^ permalink raw reply	[flat|nested] 63+ messages in thread

* [PATCH 1/4] jrcu: remove preempt_enable() tap [resend]
  2011-03-07 20:31                     ` [PATCH] An RCU for SMP with a single CPU garbage collector Joe Korty
                                         ` (3 preceding siblings ...)
  2011-03-09 22:17                       ` [PATCH 4/4] jrcu: add new stat to /sys/kernel/debug/rcu/rcudata Joe Korty
@ 2011-03-09 22:19                       ` Joe Korty
  2011-03-12 14:36                       ` [PATCH] An RCU for SMP with a single CPU garbage collector Paul E. McKenney
  5 siblings, 0 replies; 63+ messages in thread
From: Joe Korty @ 2011-03-09 22:19 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Paul E. McKenney, Peter Zijlstra, Lai Jiangshan,
	mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	linux-kernel, josh, houston.jim

jrcu: remove the preempt_enable() tap.

This is expensive, and does not seem to be needed.

Without this tap present, jrcu was able to successfully
recognize end-of-batch in a timely manner, under all the
tests I could throw at it.

It did however take longer. Batch lengths approaching 400
msecs now easily occur. Before it was very difficult to get
a batch length greater than one RCU_HZ period, 50 msecs.

One interesting side effect: with this change the daemon
approach is now required. If the invoking cpu is otherwise
idle, then the context switch on daemon exit is the only
source of recognized quiescent points for that cpu.

Signed-off-by: Joe Korty <joe.korty@ccur.com>

-----

From: Joe Korty <joe.korty@ccur.com>
To: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Date: Tue, 8 Mar 2011 17:53:55 -0500
Subject: Re: [PATCH] An RCU for SMP with a single CPU garbage collector

Hi Paul,
I had a brainstorm. It _seems_ that JRCU might work fine if
all I did was remove the expensive preempt_enable() tap.
No new taps on system calls or anywhere else.  That would
leave only the context switch tap plus the batch start/end
sampling that is remotely performed on each cpu by the
garbage collector.  Not even rcu_read_unlock has a tap --
it is just a plain-jane preempt_enable() now.

And indeed it works!  I am able to turn off the local
timer interrupt on one (of 15) cpus and the batches
keep flowing on.  I have two user test apps that run at
100% cpu (one of them does no system calls); when I run one
on the timer-disabled cpu the batches still advance.
Admittedly the batches do not advance as fast as before:
they used to advance at the max rate of 50 msecs/batch.
Now I regularly see batch lengths approaching 400 msecs.

I plan to put some taps into some other low overhead places
-- at all the voluntary preemption points, at might_sleep,
at rcu_read_unlock, for safety purposes.  But it is nice
to see a zero overhead approach that works fine without
any of that.

Regards,
Joe

Index: b/include/linux/preempt.h
===================================================================
--- a/include/linux/preempt.h
+++ b/include/linux/preempt.h
@@ -10,26 +10,12 @@
 #include <linux/linkage.h>
 #include <linux/list.h>
 
-# define __add_preempt_count(val) do { preempt_count() += (val); } while (0)
-
-#ifndef CONFIG_JRCU
-# define __sub_preempt_count(val) do { preempt_count() -= (val); } while (0)
-#else
-  extern void __rcu_preempt_sub(void);
-# define __sub_preempt_count(val) do { \
-	if (!(preempt_count() -= (val))) { \
-		/* preempt is enabled, RCU OK with consequent stale result */ \
-		__rcu_preempt_sub(); \
-	} \
-} while (0)
-#endif
-
 #if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_PREEMPT_TRACER)
   extern void add_preempt_count(int val);
   extern void sub_preempt_count(int val);
 #else
-# define add_preempt_count(val)	__add_preempt_count(val)
-# define sub_preempt_count(val)	__sub_preempt_count(val)
+# define add_preempt_count(val)	do { preempt_count() += (val); } while (0)
+# define sub_preempt_count(val)	do { preempt_count() -= (val); } while (0)
 #endif
 
 #define inc_preempt_count() add_preempt_count(1)
Index: b/init/Kconfig
===================================================================
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -426,15 +426,12 @@ config PREEMPT_RCU
 	  the TREE_PREEMPT_RCU and TINY_PREEMPT_RCU implementations.
 
 config JRCU_DAEMON
-	bool "Drive JRCU from a daemon"
+	bool
 	depends on JRCU
-	default Y
+	default y
 	help
-	  Normally JRCU end-of-batch processing is driven from a SoftIRQ
-	  'interrupt' driver.  If you consider this to be too invasive,
-	  this option can be used to drive JRCU from a kernel daemon.
-
-	  If unsure, say Y here.
+	  Required. The context switch when leaving the daemon is needed
+	  to get the CPU to reliably participate in end-of-batch processing.
 
 config RCU_TRACE
 	bool "Enable tracing for RCU"
Index: b/kernel/sched.c
===================================================================
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -3826,7 +3826,7 @@ void __kprobes add_preempt_count(int val
 	if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0)))
 		return;
 #endif
-	__add_preempt_count(val);
+	preempt_count() += val;
 #ifdef CONFIG_DEBUG_PREEMPT
 	/*
 	 * Spinlock count overflowing soon?
@@ -3857,7 +3857,7 @@ void __kprobes sub_preempt_count(int val
 
 	if (preempt_count() == val)
 		trace_preempt_on(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1));
-	__sub_preempt_count(val);
+	preempt_count() -= val;
 }
 EXPORT_SYMBOL(sub_preempt_count);
 
Index: b/kernel/jrcu.c
===================================================================
--- a/kernel/jrcu.c
+++ b/kernel/jrcu.c
@@ -158,12 +158,6 @@ void rcu_note_context_switch(int cpu)
 	rcu_eob(cpu);
 }
 
-void __rcu_preempt_sub(void)
-{
-	rcu_eob(rcu_cpu());
-}
-EXPORT_SYMBOL(__rcu_preempt_sub);
-
 void rcu_barrier(void)
 {
 	struct rcu_synchronize rcu;

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] An RCU for SMP with a single CPU garbage collector
  2011-03-07 21:16                         ` Joe Korty
  2011-03-07 21:33                           ` Joe Korty
  2011-03-07 22:51                           ` Joe Korty
@ 2011-03-09 22:29                           ` Frederic Weisbecker
  2 siblings, 0 replies; 63+ messages in thread
From: Frederic Weisbecker @ 2011-03-09 22:29 UTC (permalink / raw)
  To: Joe Korty; +Cc: Paul E. McKenney, linux-kernel

On Mon, Mar 07, 2011 at 04:16:13PM -0500, Joe Korty wrote:
> On Mon, Mar 07, 2011 at 04:01:57PM -0500, Paul E. McKenney wrote:
> > Interesting!
> > 
> > But I would really prefer leveraging the existing RCU implementations
> > to the extent possible.  Are the user-dedicated CPUs able to invoke
> > system calls?  If so, something like Frederic's approach should permit
> > the existing RCU implementations to operate normally.  If not, what is
> > doing the RCU read-side critical sections on the dedicated CPUs?
> > 
> >                                                         Thanx, Paul
> 
> I haven't seen Frederic's patch.  Sorry I missed it!
> It might have saved a bit of work...

I'm sorry, it's my fault: I should have Cc'ed you on my nohz task series.

It's here: https://lkml.org/lkml/2010/12/20/209 and the rcu changes
are spread across several patches of the series. The idea is to
switch to extended quiescent state when we resume to userspace but
temporarily exit that state when we trigger an exception or an
irq. Then exit extended quiescent state when we enter the kernel
again.
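
Roughly, the hook placement looks like this (a sketch of the idea only;
the wrapper names here are hypothetical, and the actual hooks and their
exact placement are in the series linked above):

	/* return to user space: cpu enters the RCU extended quiescent state */
	static inline void sketch_user_enter(void)
	{
		rcu_enter_nohz();
	}

	/* syscall or exception back into the kernel: leave that state */
	static inline void sketch_user_exit(void)
	{
		rcu_exit_nohz();
	}

	/* irqs taken while in the extended quiescent state are already
	 * bracketed by rcu_irq_enter()/rcu_irq_exit() in the dyntick code */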

I'll soon look at the last patchset you've posted.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] An RCU for SMP with a single CPU garbage collector
  2011-03-08 15:57                               ` Joe Korty
  2011-03-08 22:53                                 ` Joe Korty
@ 2011-03-10  0:28                                 ` Paul E. McKenney
  1 sibling, 0 replies; 63+ messages in thread
From: Paul E. McKenney @ 2011-03-10  0:28 UTC (permalink / raw)
  To: Joe Korty; +Cc: Frederic Weisbecker, linux-kernel

On Tue, Mar 08, 2011 at 10:57:10AM -0500, Joe Korty wrote:
> On Tue, Mar 08, 2011 at 04:07:42AM -0500, Paul E. McKenney wrote:
> >> Thinking about it some more, the tap-into-syscall approach might
> >> work in my implementation, in which case the tap-into-preempt-enable
> >> code could go away.
> > 
> > OK, please let me know how that goes!
> > 
> >> Nice thing about RCU, the algorithms are infinitely malleable :)
> > 
> > Just trying to keep the code size finite.  ;-)
> 
> I hope to get to it this afternoon!  I especially like
> the lockless nature of JRCU, and that the dedicated cpus
> are not loaded down with callback inovcations either.
> Not sure how to support the PREEMPT_RCU mode though; so
> if Fredrick is planning to support that, that alone would
> make his approach the very best.

One thing for PREEMPT_RCU -- given that you will have other CPUs reading
the writes to the counters, you will need memory barriers in both
rcu_read_lock() and rcu_read_unlock().  Unless you can do the reading
of these counters locally from rcu_note_context_switch() or some such.
But in that case, your grace periods would extend indefinitely if the
thread in question never entered the scheduler.  (Not sure if you are
supporting this notion, but Frederic is.)
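
For illustration, the ordering constraint is roughly this (a sketch
using a hypothetical per-CPU nesting counter and the usual percpu
helpers, not the actual PREEMPT_RCU code):

	DEFINE_PER_CPU(int, rcu_nesting_sketch);	/* read by other CPUs */

	static inline void sketch_rcu_read_lock(void)
	{
		__this_cpu_inc(rcu_nesting_sketch);
		/* the counter store must be visible before any access
		 * made inside the critical section */
		smp_mb();
	}

	static inline void sketch_rcu_read_unlock(void)
	{
		/* accesses made inside the critical section must be
		 * visible before the counter store */
		smp_mb();
		__this_cpu_dec(rcu_nesting_sketch);
	}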

							Thanx, Paul

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] An RCU for SMP with a single CPU garbage collector
  2011-03-08 22:53                                 ` Joe Korty
@ 2011-03-10  0:30                                   ` Paul E. McKenney
  0 siblings, 0 replies; 63+ messages in thread
From: Paul E. McKenney @ 2011-03-10  0:30 UTC (permalink / raw)
  To: Joe Korty; +Cc: Frederic Weisbecker, linux-kernel

On Tue, Mar 08, 2011 at 05:53:55PM -0500, Joe Korty wrote:
> On Tue, Mar 08, 2011 at 10:57:10AM -0500, Joe Korty wrote:
> > On Tue, Mar 08, 2011 at 04:07:42AM -0500, Paul E. McKenney wrote:
> >>> Thinking about it some more, the tap-into-syscall approach might
> >>> work in my implementation, in which case the tap-into-preempt-enable
> >>> code could go away.
> > > 
> >> OK, please let me know how that goes!
> >> 
> >>> Nice thing about RCU, the algorithms are infinitely malleable :)
> >> 
> >> Just trying to keep the code size finite.  ;-)
> > 
> > I hope to get to it this afternoon!  I especially like
> > the lockless nature of JRCU, and that the dedicated cpus
> > are not loaded down with callback invocations either.
> > Not sure how to support the PREEMPT_RCU mode though; so
> > if Frederic is planning to support that, that alone would
> > make his approach the very best.
> 
> 
> 
> Hi Paul,
> I had a brainstorm. It _seems_ that JRCU might work fine if
> all I did was remove the expensive preempt_enable() tap.
> No new taps on system calls or anywhere else.  That would
> leave only the context switch tap plus the batch start/end
> sampling that is remotely performed on each cpu by the
> garbage collector.  Not even rcu_read_unlock has a tap --
> it is just a plain-jane preempt_enable() now.
> 
> And indeed it works!  I am able to turn off the local
> timer interrupt on one (of 15) cpus and the batches
> keep flowing on.  I have two user test apps that run at
> 100% cpu (one of them does no system calls); when I run one
> on the timer-disabled cpu the batches still advance.
> Admittedly the batches do not advance as fast as before:
> they used to advance at the max rate of 50 msecs/batch.
> Now I regularly see batch lengths approaching 400 msecs.
> 
> I plan to put some taps into some other low overhead places
> -- at all the voluntary preemption points, at might_sleep,
> at rcu_read_unlock, for safety purposes.  But it is nice
> to see a zero overhead approach that works fine without
> any of that.

If you had a user-level process that never did system calls and never
entered the scheduler, what do you do to force forward progress of the RCU
grace periods?  (This is force_quiescent_state()'s job in TREE_RCU, FYI.)

							Thanx, Paul

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH 2/4] jrcu: tap rcu_read_unlock
  2011-03-09 22:15                       ` [PATCH 2/4] jrcu: tap rcu_read_unlock Joe Korty
@ 2011-03-10  0:34                         ` Paul E. McKenney
  2011-03-10 19:50                           ` JRCU Theory of Operation Joe Korty
  0 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2011-03-10  0:34 UTC (permalink / raw)
  To: Joe Korty
  Cc: Frederic Weisbecker, Peter Zijlstra, Lai Jiangshan,
	mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	linux-kernel, josh, houston.jim

On Wed, Mar 09, 2011 at 05:15:17PM -0500, Joe Korty wrote:
> jrcu: tap into rcu_read_unlock().
> 
> All places where rcu_read_unlock() is the final lock in
> a set of nested rcu locks are known rcu quiescent points.
> This patch recognizes that subset of those which also make
> the task preemptable.  The others are left unrecognized.
> 
> Not fundamentally needed, accelerates rcu batching.

Wouldn't you need to hook rcu_read_lock() as well, at least in
the CONFIG_PREEMPT_RCU case?  Otherwise, the RCU read-side critical
section's accesses could leak out, possibly causing an RCU read-side
critical section that looked like it started after a given grace period
(thus not blocking that grace period) actually have accesses that precede
the grace period?  If this situation could arise, the grace period could
end too soon, resulting in memory corruption.

Or am I missing something here?

							Thanx, Paul

> Signed-off-by: Joe Korty <joe.korty@ccur.com>
> 
> Index: b/include/linux/jrcu.h
> ===================================================================
> --- a/include/linux/jrcu.h
> +++ b/include/linux/jrcu.h
> @@ -21,8 +21,10 @@
>  #ifndef __LINUX_JRCU_H
>  #define __LINUX_JRCU_H
> 
> +extern void rcu_read_unlock_jrcu(void);
> +
>  #define __rcu_read_lock()			preempt_disable()
> -#define __rcu_read_unlock()			preempt_enable()
> +#define __rcu_read_unlock()			rcu_read_unlock_jrcu()
> 
>  #define __rcu_read_lock_bh()			__rcu_read_lock()
>  #define __rcu_read_unlock_bh()			__rcu_read_unlock()
> Index: b/kernel/jrcu.c
> ===================================================================
> --- a/kernel/jrcu.c
> +++ b/kernel/jrcu.c
> @@ -153,6 +153,14 @@ static inline void rcu_eob(int cpu)
>  	}
>  }
> 
> +void rcu_read_unlock_jrcu(void)
> +{
> +	if (preempt_count() == 1)
> +		rcu_eob(rcu_cpu());
> +	preempt_enable();
> +}
> +EXPORT_SYMBOL_GPL(rcu_read_unlock_jrcu);
> +
>  void rcu_note_context_switch(int cpu)
>  {
>  	rcu_eob(cpu);

^ permalink raw reply	[flat|nested] 63+ messages in thread

* JRCU Theory of Operation
  2011-03-10  0:34                         ` Paul E. McKenney
@ 2011-03-10 19:50                           ` Joe Korty
  2011-03-12 14:36                             ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread
From: Joe Korty @ 2011-03-10 19:50 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Frederic Weisbecker, Peter Zijlstra, Lai Jiangshan,
	mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	linux-kernel, josh, houston.jim, corbet

On Wed, Mar 09, 2011 at 07:34:19PM -0500, Paul E. McKenney wrote:
> On Wed, Mar 09, 2011 at 05:15:17PM -0500, Joe Korty wrote:
>> jrcu: tap into rcu_read_unlock().
>> 
>> All places where rcu_read_unlock() is the final lock in
>> a set of nested rcu locks are known rcu quiescent points.
>> This patch recognizes that subset of those which also make
>> the task preemptable.  The others are left unrecognized.
>> 
>> Not fundamentally needed, accelerates rcu batching.
> 
> Wouldn't you need to hook rcu_read_lock() as well, at least in
> the CONFIG_PREEMPT_RCU case?  Otherwise, the RCU read-side critical
> section's accesses could leak out, possibly causing an RCU read-side
> critical section that looked like it started after a given grace period
> (thus not blocking that grace period) actually have accesses that precede
> the grace period?  If this situation could arise, the grace period could
> end too soon, resulting in memory corruption.
> 
> Or am I missing something here?
> 
> 							Thanx, Paul




Hi Paul,
The short answer is, jrcu tolerates data leaks (but not
pointer leaks) in certain critical variables, up to the
length of one RCU_HZ sampling period (50 milliseconds).
That is, changes being made to these variables on one cpu
must be 'seeable' on other cpus within the 50 msec window.
Jrcu uses memory barriers for this purpose, but perhaps
not yet in every place such are actually needed.

  An aside: AFAICT, cpu hardware makes an active effort to
  push out store buffers to cache, where they can be snooped.
  Cpus just don't leave writes indefinitely in store buffers
  for no reason. If, however, one believes (or knows) that
  there are modes where a cpu can hold store buffers out
  of cache indefinitely, say over 50 msecs, then I need
  to scatter a few more memory barriers (like those
  currently under CONFIG_RCU_PARANOID) around the kernel.
  One (maybe the only) place that would need this is in
  add_preempt_count().

The above variables being sampled by the rcu garbage
collector are either stable (unchanging) or unstable.
When stable, jrcu by definition makes correct decisions.
When unstable, it doesn't make any difference what jrcu
decides -- instability means that within the last 50 msecs
there were one or more quiescent periods; therefore, jrcu
can either end or not end the current batch one RCU_HZ period
from now. No matter what it does, it will be correct.


A longer answer, on a slightly expanded topic, goes as follows.  The heart
of jrcu is in this (slightly edited) line,

  rcu_data[cpu].wait = preempt_count_cpu(cpu) > idle_cpu(cpu);

This is in the garbage collector, at the point where it is initializing
the state of all cpus as part of setting up a new 'current batch'.
Let us first consider the consequences of a simplification of the above,

  rcu_data[cpu].wait = 1;

A value of '1' means that that cpu has not yet consented to end-of-batch.
A value of '0' means that this cpu has no objection to the current batch
ending.  The current batch actually ends only when the garbage collector
wakes up and notices that all the wait states are zero.  It does this
at a RCU_HZ==20 (50 msec) rate.

In this simplified scenario, each cpu has an obligation to zero its wait
state at at least some of the places where it knows it has no objection
to the current batch ending (quiescent point taps).  One such point is
in context switch, another possible point for a tap is all the places
where preempt_count() goes to zero.
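
In the patches earlier in this thread those two taps boil down to the
following (simplified from the posted code; the preempt_count() tap is
the one that patch 1/4 removes again):

	/* context-switch tap */
	void rcu_note_context_switch(int cpu)
	{
		rcu_eob(cpu);		/* clear this cpu's wait state */
	}

	/* preempt_count()-reaches-zero tap, called from __sub_preempt_count() */
	void __rcu_preempt_sub(void)
	{
		rcu_eob(rcu_cpu());
	}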

The problem with the above "initialize all wait states to 1" scenario is
that every cpu must then pass through some tapped quiescent point before
the garbage collector will ever consider the current batch to have ended.
This does not work well for completely idle cpus or for cpus spending 100%
of their time in user space, as these are in a quiescent region that will
never execute a tap.  Hence we now get to a more involved simplification
of the original expression,

  rcu_data[cpu].wait = preempt_count_cpu(cpu) > 0;

Here, the garbage collector is making an attempt to deduce, at the
start of the current batch, whether or not some cpu is executing code
in a quiescent region.  If it is, then that cpu's wait state can be set
to zero right away -- we don't have to wait for that cpu to execute a
quiescent point tap later on to discover that fact.  This nicely covers
the user app and idle cpu situations discussed above.

Now, we all know that fetching the preempt_count of some process running on
another cpu is guaranteed to return a stale (obsolete) value, and may even
be dangerous (pointers are being followed after all).  Putting aside the
question of safety, for now, leaves us with a trio of questions: are there
times when this inherently unstable value is in fact stable and useful?
When it is not stable, is that fact relevant or irrelevant to the correct
operation of jrcu? And finally, is the fact that we cannot tell when
it is stable and when it is not also relevant?

First, we know the preempt_count_cpu() > 0 expression will gradually
become stable during certain critical times: when an application stays
constantly in user space, and when a cpu stays idle, doing no interrupt or
softirq processing.  In these cases the stable value converged to will be
'0' (0 == cpu is in a quiescent state).

We also know that the preempt_count_cpu() > 0 expression will gradually
get stable whenever a cpu stays in an rcu-protected region for some
long period of time.  In this case the expression converges on '1' (1 ==
cpu is executing code in an rcu-protected region).

For both of the above cases, jrcu requires that the value being fetched by
preempt_count_cpu() actually have been that cpu's preempt_count sometime
within the last 50 msecs.  Thus, within 50 msecs of a cpu getting a stable
preempt count, preempt_count_cpu() will start returning that same value
to the garbage collector, which will then start making correct decisions
for as long as the returned value remains stable.

Next we come to the situation where preempt_count_cpu() is unstable.  It may
be unstable because its value is constantly transitioning between a zero
and nonzero state (we don't care about transitions between various nonzero
states), or it may be unstable because of context switch.  It doesn't
matter which, all that really counts is that the instability means that the
remote cpu is making transitions between rcu protected and rcu quiescent
states, and that it has done this in the recent past .. 'recent' meaning,
the time it takes for writes on the remote cpu to become 'seeable' on
the cpu with the garbage collector (which jrcu requires to be < 50msecs).

What really counts here is that instability means there was a recent
quiescent state which in turn means that we can set the cpu wait state
to either '1' (wait for a quiescent point tap) or '0' (tap not needed,
a quiescent state has already happened), and all will work correctly.

A final issue is the stability of the remote thread_info pointer that
preempt_count_cpu() uses.  First off, if this pointer is not stable that
means that the cpu has recently gone through a task switch which means it
doesn't matter what the cpu wait state is set to, as context switches by
definition are quiescent points.  But, we would also like to make sure
that the pointer actually points to valid memory, specifically to the
thread_info structure actually associated with a task recently executed
(within 50 msecs) on the remote cpu.  There is only one way for such a
pointer to become invalid: between the time it was fetched and when it
is dereferenced to get the remote preempt_count, the remote task switched
away and executed the overhead of performing a task exit, including doing
the final kfree of the task's thread_info structure.  That is a pretty
heavyweight execution sequence.  In terms of races, the preempt_count_cpu()
code will win out over context_switch + task exit every time, _if_ 1) it is
always invoked under raw_local_irq_disable, and 2) a write memory barrier
is added everywhere in the kernel this pointer is changed, and 3) a read
memory barrier exists just before everywhere preempt_count_cpu is executed.
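
That pairing is visible in the patch itself; schematically (not the
exact patch text):

	/* writer side: the context switch in kernel/sched.c */
	rq->curr = next;
	smp_wmb();	/* publish the new curr before it can be sampled */

	/* reader side: preempt_count_cpu(), run with irqs disabled */
	int preempt_count_cpu(int cpu)
	{
		smp_rmb();	/* see the most recently published curr */
		return task_thread_info(cpu_curr(cpu))->preempt_count;
	}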

In the original expression,

  rcu_data[cpu].wait = preempt_count_cpu(cpu) > idle_cpu(cpu);

The 'idle_cpu' part is needed only because preempt_count() == 1 is the
default state of the idle task.  Without it jrcu will never see idle cpus as
being in a quiescent state.  The same rules about stability and instability
for preempt_count_cpu() also apply to idle_cpu().  However, idle_cpu()
follows no pointer so that complication thankfully does not apply to it.
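
Spelled out (a worked illustration of the cases above, taking the idle
task's usual preempt_count() of 1 as given):

	cpu state                                  preempt_count_cpu  idle_cpu  wait
	idle, sitting in the idle loop                     1              1      0
	idle, but inside an irq handler                  >= 2             1      1
	busy, in user space or preemptible kernel          0              0      0
	busy, inside preempt_disable()/rcu lock          >= 1             0      1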

I am sure there are other interesting details, but my mind is burned out
now from putting this together, and I can't think of them.

Regards,
Joe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: JRCU Theory of Operation
  2011-03-10 19:50                           ` JRCU Theory of Operation Joe Korty
@ 2011-03-12 14:36                             ` Paul E. McKenney
  2011-03-13  0:43                               ` Joe Korty
  0 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2011-03-12 14:36 UTC (permalink / raw)
  To: Joe Korty
  Cc: Frederic Weisbecker, Peter Zijlstra, Lai Jiangshan,
	mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	linux-kernel, josh, houston.jim, corbet

On Thu, Mar 10, 2011 at 02:50:45PM -0500, Joe Korty wrote:
> On Wed, Mar 09, 2011 at 07:34:19PM -0500, Paul E. McKenney wrote:
> > On Wed, Mar 09, 2011 at 05:15:17PM -0500, Joe Korty wrote:
> >> jrcu: tap into rcu_read_unlock().
> >> 
> >> All places where rcu_read_unlock() is the final lock in
> >> a set of nested rcu locks are known rcu quiescent points.
> >> This patch recognizes that subset of those which also make
> >> the task preemptable.  The others are left unrecognized.
> >> 
> >> Not fundamentally needed, accelerates rcu batching.
> > 
> > Wouldn't you need to hook rcu_read_lock() as well, at least in
> > the CONFIG_PREEMPT_RCU case?  Otherwise, the RCU read-side critical
> > section's accesses could leak out, possibly causing an RCU read-side
> > critical section that looked like it started after a given grace period
> > (thus not blocking that grace period) actually have accesses that precede
> > the grace period?  If this situation could arise, the grace period could
> > end too soon, resulting in memory corruption.
> > 
> > Or am I missing something here?
> > 
> > 							Thanx, Paul
> 
> 
> 
> 
> Hi Paul,
> The short answer is, jrcu tolerates data leaks (but not
> pointer leaks) in certain critical variables, up to the
> length of one RCU_HZ sampling period (50 milliseconds).
> That is, changes being made to these variables on one cpu
> must be 'seeable' on other cpus within the 50 msec window.
> Jrcu uses memory barriers for this purpose, but perhaps
> not yet in every place such are actually needed.
> 
>   An aside: AFAICT, cpu hardware makes an active effort to
>   push out store buffers to cache, where they can be snooped.
>   Cpus just don't leave writes indefinitely in store buffers
>   for no reason. If, however, one believes (or knows) that
>   there are modes where a cpu can hold store buffers out
>   of cache indefinitely, say over 50 msecs, then I need
>   to scatter a few more memory barriers (like those
>   currently under CONFIG_RCU_PARANOID) around the kernel.
>   One (maybe the only) place that would need this is in
>   add_preempt_count().

Ah, because you don't expedite grace periods.

With expedited grace periods, it would be unwise to assume that
the CPU drained its store buffer within a grace period.

Actually, CPU manufacturers generally don't give out specs for the
maximum time that a write can stay in the store buffer.  At least
I have not managed to get this info from them.  So it would be way
better not to rely on this assumption.

(And yes, I am reworking the RCU-dynticks interface to get rid of
this assumption...)

> The above variables being sampled by the rcu garbage
> collector are either stable (unchanging) or unstable.
> When stable, jrcu by definition makes correct decisions.
> When unstable, it doesn't make any difference what jrcu
> decides -- instability means that within the last 50 msecs
> there was one or more quiescent periods, therefore, jrcu
> can either end or not end the current batch one RCU_HZ period
> from now. No matter what it does, it will be correct.

50 milliseconds plus or minus the maximum store-buffer residency
time, but otherwise, yes.

> A longer answer, on a slightly expanded topic, goes as follows.  The heart
> of jrcu is in this (slightly edited) line,
> 
>   rcu_data[cpu].wait = preempt_count_cpu(cpu) > idle_cpu(cpu);

So, if we are idle, then the preemption count must be 2 or greater
to make the current grace period wait on a CPU.  But if we are not
idle, then the preemption count need only be 1 or greater to make
the current grace period wait on a CPU.

But why should an idle CPU block the current RCU grace period
in any case?  The idle loop is defined to be a quiescent state
for rcu_sched.  (Not that permitting RCU read-side critical sections
in the idle loop would be a bad thing, as long as the associated
pitfalls were all properly avoided.)

Also, this all assumes CONFIG_PREEMPT=y?  Ah, yes, you have
CONFIG_JRCU depending on CONFIG_PREEMPT in your patch.

> This is in the garbage collector, at the point where it is initializing
> the state of all cpus as part of setting up a new 'current batch'.
> Let us first consider the consequences of a simplification of the above,
> 
>   rcu_data[cpu].wait = 1;
> 
> A value of '1' means that that cpu has not yet consented to end-of-batch.
> A value of '0' means that this cpu has no objection to the current batch
> ending.  The current batch actually ends only when the garbage collector
> wakes up and notices that all the wait states are zero.  It does this
> at a RCU_HZ==20 (50 msec) rate.
> 
> In this simplified scenario, each cpu has an obligation to zero its wait
> state at at least some of the places where it knows it has no objection
> to the current batch ending (quiescent point taps).  One such point is
> in context switch, another possible point for a tap is all the places
> where preempt_count() goes to zero.
> 
> The problem with the above "initialize all wait states to 1" scenario, is
> that every cpu must then pass through some tapped quiescent point before
> the garbage collector will ever consider the current batch to have ended.
> This does not work well for completely idle cpus or for cpus spending 100%
> of their time in user space, as these are in a quiescent region that will
> never execute a tap.  Hence we now get to a more involved simplification
> of the original expression,
> 
>   rcu_data[cpu].wait = preempt_count_cpu(cpu) > 0;
> 
> Here, the garbage collector is making an attempt to deduce, at the
> start of the current batch, whether or not some cpu is executing code
> in a quiescent region.  If it is, then that cpu's wait state can be set
> to zero right away -- we don't have to wait for that cpu to execute a
> quiescent point tap later on to discover that fact.  This nicely covers
> the user app and idle cpu situations discussed above.
> 
> Now, we all know that fetching the preempt_count of some process running on
> another cpu is guaranteed to return a stale (obsolete) value, and may even
> be dangerous (pointers are being followed after all).  Putting aside the
> question of safety, for now, leaves us with a trio of questions: are there
> times when this inherently unstable value is in fact stable and useful?
> When it is not stable, is that fact relevant or irrelevant to the correct
> operation of jrcu? And finally, is the fact that we cannot tell when
> it is stable and when it is not also relevant?

And there is also the ordering of the preempt_disable() and the accesses
within the critical section...  Just because you recently saw a quiescent
state doesn't mean that the preceding critical section has completed --
even x86 is happy to spill stores out of a critical section ended by
preempt_enable.  If one of those stores is to an RCU protected
data structure, you might end up freeing the structure before the
store completed.

Or is the idea that you would wait 50 milliseconds after detecting
the quiescent state before invoking the corresponding RCU callbacks?

> First, we know the preempt_count_cpu() > 0 expression will gradually
> become stable during certain critical times: when an application stays
> constantly in user space, and when a cpu stays idle, doing no interrupt or
> softirq processing.  In these cases the stable value converged to will be
> '0' (0 == cpu is in a quiescent state).
> 
> We also know that the preempt_count_cpu() > 0 expression will gradually
> get stable whenever a cpu stays in an rcu-protected region for some
> long period of time.  In this case the expression converges on '1' (1 ==
> cpu is executing code in an rcu-protected region).
> 
> For both of the above cases, jrcu requires that the value being fetched by
> preempt_count_cpu() actually was that cpu's preempt_count sometime
> within the last 50 msecs.  Thus, within 50 msecs of a cpu getting a stable
> preempt count, preempt_count_cpu() will start returning that same value
> to the garbage collector, which will then start making correct decisions
> for as long as the returned value remains stable.
> 
> Next we come to the situation where preempt_count_cpu() is unstable.  It may
> be unstable because its value is constantly transitioning between a zero
> and nonzero state (we don't care about transitions between various nonzero
> states), or it may be unstable because of context switch.  It doesn't
> matter which, all that really counts is that the instability means that the
> remote cpu is making transitions between rcu protected and rcu quiescent
> states, and that it has done this in the recent past .. 'recent' meaning,
> the time it takes for writes on the remote cpu to become 'seeable' on
> the cpu with the garbage collector (which jrcu requires to be < 50msecs).
> 
> What really counts here is that instability means there was a recent
> quiescent state which in turn means that we can set the cpu wait state
> to either '1' (wait for a quiescent point tap) or '0' (tap not needed,
> a quiescent state has already happened), and all will work correctly.
> 
> A final issue is the stability of the remote thread_info pointer that
> preempt_count_cpu() uses.  First off, if this pointer is not stable that
> means that the cpu has recently gone through a task switch which means it
> doesn't matter what the cpu wait state is set to, as context switches by
> definition are quiescent points.  But, we would also like to make sure
> that the pointer actually points to valid memory, specifically to the
> thread_info structure actually associated with a task recently executed
> (within 50 msecs) on the remote cpu.  There is only one way for such a
> pointer to become invalid: between the time it was fetched and when it
> is dereferenced to get the remote preempt_count, the remote task switched
> away and executed the overhead of performing a task exit, including doing
> the final kfree of the task's thread_info structure.  That is a pretty
> heavyweight execution sequence.  In terms of races, the preempt_count_cpu()
> code will win out over context_switch + task exit every time, _if_ 1) it is
> always invoked under raw_local_irq_disable, and 2) a write memory barrier
> is added everywhere in the kernel this pointer is changed, and 3) a read
> memory barrier exists just before every place where preempt_count_cpu() is executed.
> 
> In the original expression,
> 
>   rcu_data[cpu].wait = preempt_count_cpu(cpu) > idle_cpu(cpu);
> 
> The 'idle_cpu' part is needed only because preempt_count() == 1 is the
> default state of the idle task.  Without it jrcu will never see idle cpus as
> being in a quiescent state.  The same rules about stability and instability
> for preempt_count_cpu() also apply to idle_cpu().  However, idle_cpu()
> follows no pointer so that complication thankfully does not apply to it.
> 
> I am sure there are other interesting details, but my mind is burned out
> now from putting this together, and I can't think of them.

I am missing how ->which switching is safe, given the possibility of
access from other CPUs.

							Thanx, Paul

> Regards,
> Joe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] An RCU for SMP with a single CPU garbage collector
  2011-03-07 20:31                     ` [PATCH] An RCU for SMP with a single CPU garbage collector Joe Korty
                                         ` (4 preceding siblings ...)
  2011-03-09 22:19                       ` [PATCH 1/4] jrcu: remove preempt_enable() tap [resend] Joe Korty
@ 2011-03-12 14:36                       ` Paul E. McKenney
  2011-03-13  1:25                         ` Joe Korty
  5 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2011-03-12 14:36 UTC (permalink / raw)
  To: Joe Korty
  Cc: Frederic Weisbecker, Peter Zijlstra, Lai Jiangshan,
	mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	linux-kernel, josh, houston.jim

On Mon, Mar 07, 2011 at 03:31:06PM -0500, Joe Korty wrote:
> Hi Paul & Frederic & other interested parties.
> 
> We would like for Linux to eventually support user
> dedicated cpus.  That is, cpus that are completely free of
> periodic system duties, leaving 100% of their capacity
> available to service user applications.  This will
> eventually be important for those realtime applications
> requiring full use of dedicated cpus.
> 
> An important milestone to that goal would be to have
> an official, supported RCU implementation which did not
> make each and every CPU periodically participate in RCU
> garbage collection.
> 
> The attached RCU implementation does precisely that.
> Like TinyRCU it is both tiny and very simple, but unlike
> TinyRCU it supports SMP, and unlike the other SMP RCUs,
> it does its garbage collection from a single CPU.
> 
> For performance, each cpu is given its own 'current' and
> 'previous' callback queue, and all interactions between
> these queues and the global garbage collector proceed in
> a lock-free manner.
> 
> This patch is a quick port to 2.6.38-rc7 from a 2.6.36.4
> implementation developed over the last two weeks. The
> earlier version was tested over the weekend under i386 and
> x86_64; this version has only been spot tested on x86_64.

Hello, Joe,

My biggest question is "what does JRCU do that Frederic's patchset
does not do?"  I am not seeing it at the moment.  Given that Frederic's
patchset integrates into RCU, thus providing the full RCU API, I really
need a good answer to consider JRCU.

In the meantime, some questions and comments below.

For one, what sort of validation have you done?

							Thanx, Paul

> Signed-off-by: Joe Korty <joe.korty@ccur.com>
> 
> Index: b/kernel/jrcu.c
> ===================================================================
> --- /dev/null
> +++ b/kernel/jrcu.c
> @@ -0,0 +1,604 @@
> +/*
> + * Joe's tiny single-cpu RCU, for small SMP systems.
> + *
> + * Running RCU end-of-batch operations from a single cpu relieves the
> + * other CPUs from this periodic responsibility.  This will eventually
> + * be important for those realtime applications requiring full use of
> + * dedicated cpus.  JRCU is also a lockless implementation, currently,
> + * although some anticipated features will eventually require a per
> + * cpu rcu_lock along some minimal-contention paths.
> + *
> + * Author: Joe Korty <joe.korty@ccur.com>
> + *
> + * Acknowledgements: Paul E. McKenney's 'TinyRCU for uniprocessors' inspired
> + * the thought that there could be something similarly simple for SMP.
> + * The rcu_list chain operators are from Jim Houston's Alternative RCU.
> + *
> + * Copyright Concurrent Computer Corporation, 2011
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the
> + * Free Software Foundation; either version 2 of the License, or (at your
> + * option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
> + * or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> + * for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, write to the Free Software Foundation, Inc.,
> + * 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + */
> +
> +/*
> + * This RCU maintains three callback lists: the current batch (per cpu),
> + * the previous batch (also per cpu), and the pending list (global).
> + */
> +
> +#include <linux/bug.h>
> +#include <linux/smp.h>
> +#include <linux/sched.h>
> +#include <linux/types.h>
> +#include <linux/kernel.h>
> +#include <linux/module.h>
> +#include <linux/percpu.h>
> +#include <linux/stddef.h>
> +#include <linux/preempt.h>
> +#include <linux/compiler.h>
> +#include <linux/irqflags.h>
> +#include <linux/rcupdate.h>
> +
> +#include <asm/system.h>
> +
> +/*
> + * Define an rcu list type and operators.  An rcu list has only ->next
> + * pointers for the chain nodes; the list head however is special and
> + * has pointers to both the first and last nodes of the chain.  Tweaked
> + * so that null head, tail pointers can be used to signify an empty list.
> + */
> +struct rcu_list {
> +	struct rcu_head *head;
> +	struct rcu_head **tail;
> +	int count;		/* stats-n-debug */
> +};
> +
> +static inline void rcu_list_init(struct rcu_list *l)
> +{
> +	l->head = NULL;
> +	l->tail = NULL;
> +	l->count = 0;
> +}
> +
> +/*
> + * Add an element to the tail of an rcu list
> + */
> +static inline void rcu_list_add(struct rcu_list *l, struct rcu_head *h)
> +{
> +	if (unlikely(l->tail == NULL))
> +		l->tail = &l->head;
> +	*l->tail = h;
> +	l->tail = &h->next;
> +	l->count++;
> +	h->next = NULL;
> +}

This has interrupts disabled?  Or is there some other form of
mutual exclusion?  (The only caller does have interrupts disabled,
and calls this only on per-CPU data, so should be OK.)

> +/*
> + * Append the contents of one rcu list to another.  The 'from' list is left
> + * corrupted on exit; the caller must re-initialize it before it can be used
> + * again.
> + */
> +static inline void rcu_list_join(struct rcu_list *to, struct rcu_list *from)
> +{
> +	if (from->head) {
> +		if (unlikely(to->tail == NULL)) {
> +			to->tail = &to->head;
> +			to->count = 0;
> +		}
> +		*to->tail = from->head;
> +		to->tail = from->tail;
> +		to->count += from->count;
> +	}
> +}
> +
> +
> +#define RCU_HZ 20		/* max rate at which batches are retired */
> +
> +struct rcu_data {
> +	u8 wait;		/* goes false when this cpu consents to
> +				 * the retirement of the current batch */
> +	u8 which;		/* selects the current callback list */
> +	struct rcu_list cblist[2]; /* current & previous callback lists */
> +} ____cacheline_aligned_in_smp;
> +
> +static struct rcu_data rcu_data[NR_CPUS];

Why not DEFINE_PER_CPU(struct rcu_data, rcu_data)?
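
Something along these lines, say (sketch only, reusing your struct rcu_data
from above):

	#include <linux/percpu.h>

	static DEFINE_PER_CPU(struct rcu_data, rcu_data);

	/* cross-cpu access:  rd = &per_cpu(rcu_data, cpu);  */
	/* local-cpu access:  rd = this_cpu_ptr(&rcu_data);  */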

> +
> +/* debug & statistics stuff */
> +static struct rcu_stats {
> +	unsigned nbatches;	/* #end-of-batches (eobs) seen */
> +	atomic_t nbarriers;	/* #rcu barriers processed */
> +	u64 ninvoked;		/* #invoked (ie, finished) callbacks */
> +	atomic_t nleft;		/* #callbacks left (ie, not yet invoked) */
> +	unsigned nforced;	/* #forced eobs (should be zero) */
> +} rcu_stats;
> +
> +int rcu_scheduler_active __read_mostly;
> +int rcu_nmi_seen __read_mostly;
> +static u64 rcu_timestamp;
> +
> +/*
> + * Return our CPU id or zero if we are too early in the boot process to
> + * know what that is.  For RCU to work correctly, a cpu named '0' must
> + * eventually be present (but need not ever be online).
> + */
> +static inline int rcu_cpu(void)
> +{
> +	return current_thread_info()->cpu;

OK, I'll bite...  Why not smp_processor_id()?

And what to do about the architectures that put the CPU number somewhere
else?
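
That is, just something like (sketch):

	static inline int rcu_cpu(void)
	{
		/* the raw_ form only to sidestep the CONFIG_DEBUG_PREEMPT check;
		 * plain smp_processor_id() is what I am really asking about */
		return raw_smp_processor_id();
	}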

> +}
> +
> +/*
> + * Invoke whenever the calling CPU consents to end-of-batch.  All CPUs
> + * must so consent before the batch is truly ended.
> + */
> +static inline void rcu_eob(int cpu)
> +{
> +	struct rcu_data *rd = &rcu_data[cpu];
> +	if (unlikely(rd->wait)) {
> +		rd->wait = 0;
> +#ifdef CONFIG_RCU_PARANOID
> +		/* not needed, we can tolerate some fuzziness on exactly
> +		 * when other CPUs see the above write insn. */
> +		smp_wmb();
> +#endif
> +	}
> +}
> +
> +void rcu_note_context_switch(int cpu)
> +{
> +	rcu_eob(cpu);
> +}
> +
> +void __rcu_preempt_sub(void)
> +{
> +	rcu_eob(rcu_cpu());
> +}
> +EXPORT_SYMBOL(__rcu_preempt_sub);
> +
> +void rcu_barrier(void)
> +{
> +	struct rcu_synchronize rcu;
> +
> +	if (!rcu_scheduler_active)
> +		return;
> +
> +	init_completion(&rcu.completion);
> +	call_rcu(&rcu.head, wakeme_after_rcu);
> +	wait_for_completion(&rcu.completion);
> +	atomic_inc(&rcu_stats.nbarriers);
> +
> +}
> +EXPORT_SYMBOL_GPL(rcu_barrier);

The rcu_barrier() function must wait on all RCU callbacks, regardless of
which CPU they are queued on.  This is important when unloading modules
that use call_rcu().  In contrast, the above looks to me like it waits
only on the current CPU's callbacks.

So, what am I missing?
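
For comparison, the usual shape of a cross-CPU barrier is roughly as follows
(sketch only, ignoring CPU-hotplug exclusion; the barrier_* names are made
up, while call_rcu(), on_each_cpu() and the completion API are real):

	static DEFINE_PER_CPU(struct rcu_head, barrier_head);
	static atomic_t barrier_count;
	static struct completion barrier_done;

	static void barrier_cb(struct rcu_head *unused)
	{
		/* the last callback to run releases the waiter */
		if (atomic_dec_and_test(&barrier_count))
			complete(&barrier_done);
	}

	static void barrier_queue_cb(void *unused)
	{
		/* runs on each cpu: queue behind that cpu's existing callbacks */
		call_rcu(&__get_cpu_var(barrier_head), barrier_cb);
	}

	void rcu_barrier(void)
	{
		init_completion(&barrier_done);
		atomic_set(&barrier_count, num_online_cpus());
		on_each_cpu(barrier_queue_cb, NULL, 1);
		wait_for_completion(&barrier_done);
	}

The pre-loaded count guarantees the completion cannot fire until every
online cpu's barrier callback has been invoked.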

> +
> +void rcu_force_quiescent_state(void)
> +{
> +}
> +EXPORT_SYMBOL_GPL(rcu_force_quiescent_state);
> +
> +
> +/*
> + * Insert an RCU callback onto the calling CPUs list of 'current batch'
> + * callbacks.  Lockless version, can be invoked anywhere except under NMI.
> + */
> +void call_rcu(struct rcu_head *cb, void (*func)(struct rcu_head *rcu))
> +{
> +	unsigned long flags;
> +	struct rcu_data *rd;
> +	struct rcu_list *cblist;
> +	int which;
> +
> +	cb->func = func;
> +	cb->next = NULL;
> +
> +	raw_local_irq_save(flags);
> +	smp_rmb();
> +
> +	rd = &rcu_data[rcu_cpu()];

Why not a per-CPU variable rather than an array?

> +	which = ACCESS_ONCE(rd->which) & 1;
> +	cblist = &rd->cblist[which];
> +
> +	/* The following is not NMI-safe, therefore call_rcu()
> +	 * cannot be invoked under NMI. */
> +	rcu_list_add(cblist, cb);
> +	smp_wmb();
> +	raw_local_irq_restore(flags);
> +	atomic_inc(&rcu_stats.nleft);
> +}
> +EXPORT_SYMBOL_GPL(call_rcu);
> +
> +/*
> + * For a given cpu, push the previous batch of callbacks onto a (global)
> + * pending list, then make the current batch the previous.  A new, empty
> + * current batch exists after this operation.
> + *
> + * Locklessly tolerates changes being made by call_rcu() to the current
> + * batch, locklessly tolerates the current batch becoming the previous
> + * batch, and locklessly tolerates a new, empty current batch becoming
> + * available.  Requires that the previous batch be quiescent by the time
> + * rcu_end_batch is invoked.
> + */
> +static void rcu_end_batch(struct rcu_data *rd, struct rcu_list *pending)
> +{
> +	int prev;
> +	struct rcu_list *plist;	/* some cpus' previous list */
> +
> +	prev = (ACCESS_ONCE(rd->which) & 1) ^ 1;
> +	plist = &rd->cblist[prev];
> +
> +	/* Chain previous batch of callbacks, if any, to the pending list */
> +	if (plist->head) {
> +		rcu_list_join(pending, plist);
> +		rcu_list_init(plist);
> +		smp_wmb();
> +	}
> +	/*
> +	 * Swap current and previous lists.  Other cpus must not see this
> +	 * out-of-order w.r.t. the just-completed plist init, hence the above
> +	 * smp_wmb().
> +	 */
> +	rd->which++;

You do seem to have interrupts disabled when sampling ->which, but
this is not safe for cross-CPU accesses to ->which, right?  The other
CPU might queue onto the wrong element.  This would mean that you
would not be guaranteed a full 50ms delay from quiescent state to
corresponding RCU callback invocation.

Or am I missing something subtle here?

> +}
> +
> +/*
> + * Invoke all callbacks on the passed-in list.
> + */
> +static void rcu_invoke_callbacks(struct rcu_list *pending)
> +{
> +	struct rcu_head *curr, *next;
> +
> +	for (curr = pending->head; curr;) {
> +		next = curr->next;
> +		curr->func(curr);
> +		curr = next;
> +		rcu_stats.ninvoked++;
> +		atomic_dec(&rcu_stats.nleft);
> +	}
> +}
> +
> +/*
> + * Check if the conditions for ending the current batch are true. If
> + * so then end it.
> + *
> + * Must be invoked periodically, and the periodic invocations must be
> + * far enough apart in time for the previous batch to become quiescent.
> + * This is a few tens of microseconds unless NMIs are involved; an NMI
> + * stretches out the requirement by the duration of the NMI.
> + *
> + * "Quiescent" means the owning cpu is no longer appending callbacks
> + * and has completed execution of a trailing write-memory-barrier insn.
> + */
> +static void __rcu_delimit_batches(struct rcu_list *pending)
> +{
> +	struct rcu_data *rd;
> +	int cpu, eob;
> +	u64 rcu_now;
> +
> +	/* If an NMI occurred then the previous batch may not yet be
> +	 * quiescent.  Let's wait till it is.
> +	 */
> +	if (rcu_nmi_seen) {
> +		rcu_nmi_seen = 0;
> +		return;
> +	}
> +
> +	if (!rcu_scheduler_active)
> +		return;
> +
> +	/*
> +	 * Find out if the current batch has ended
> +	 * (end-of-batch).
> +	 */
> +	eob = 1;
> +	for_each_online_cpu(cpu) {
> +		rd = &rcu_data[cpu];
> +		if (rd->wait) {
> +			eob = 0;
> +			break;
> +		}
> +	}
> +
> +	/*
> +	 * Force end-of-batch if too much time (n seconds) has
> +	 * gone by.  The forcing method is slightly questionable,
> +	 * hence the WARN_ON.
> +	 */
> +	rcu_now = sched_clock();
> +	if (!eob && !rcu_timestamp
> +	&& ((rcu_now - rcu_timestamp) > 3LL * NSEC_PER_SEC)) {
> +		rcu_stats.nforced++;
> +		WARN_ON_ONCE(1);
> +		eob = 1;
> +	}
> +
> +	/*
> +	 * Just return if the current batch has not yet
> +	 * ended.  Also, keep track of just how long it
> +	 * has been since we've actually seen end-of-batch.
> +	 */
> +
> +	if (!eob)
> +		return;
> +
> +	rcu_timestamp = rcu_now;
> +
> +	/*
> +	 * End the current RCU batch and start a new one.
> +	 */
> +	for_each_present_cpu(cpu) {
> +		rd = &rcu_data[cpu];

And here we get the cross-CPU accesses that I was worried about above.

> +		rcu_end_batch(rd, pending);
> +		if (cpu_online(cpu)) /* wins race with offlining every time */
> +			rd->wait = preempt_count_cpu(cpu) > idle_cpu(cpu);
> +		else
> +			rd->wait = 0;
> +	}
> +	rcu_stats.nbatches++;
> +}
> +
> +static void rcu_delimit_batches(void)
> +{
> +	unsigned long flags;
> +	struct rcu_list pending;
> +
> +	rcu_list_init(&pending);
> +
> +	raw_local_irq_save(flags);
> +	smp_rmb();
> +	__rcu_delimit_batches(&pending);
> +	smp_wmb();
> +	raw_local_irq_restore(flags);
> +
> +	if (pending.head)
> +		rcu_invoke_callbacks(&pending);
> +}
> +
> +/* ------------------ interrupt driver section ------------------ */
> +
> +/*
> + * We drive RCU from a periodic interrupt during most of boot. Once boot
> + * is complete we (optionally) transition to a daemon.
> + */
> +
> +#include <linux/time.h>
> +#include <linux/delay.h>
> +#include <linux/hrtimer.h>
> +#include <linux/interrupt.h>
> +
> +#define RCU_PERIOD_NS		(NSEC_PER_SEC / RCU_HZ)
> +#define RCU_PERIOD_DELTA_NS	(((NSEC_PER_SEC / HZ) * 3) / 2)
> +
> +#define RCU_PERIOD_MIN_NS	RCU_PERIOD_NS
> +#define RCU_PERIOD_MAX_NS	(RCU_PERIOD_NS + RCU_PERIOD_DELTA_NS)
> +
> +static struct hrtimer rcu_timer;
> +
> +static void rcu_softirq_func(struct softirq_action *h)
> +{
> +	rcu_delimit_batches();
> +}
> +
> +static enum hrtimer_restart rcu_timer_func(struct hrtimer *t)
> +{
> +	ktime_t next;
> +
> +	raise_softirq(RCU_SOFTIRQ);
> +
> +	next = ktime_add_ns(ktime_get(), RCU_PERIOD_NS);
> +	hrtimer_set_expires_range_ns(&rcu_timer, next, RCU_PERIOD_DELTA_NS);
> +	return HRTIMER_RESTART;
> +}
> +
> +static void rcu_timer_restart(void)
> +{
> +	pr_info("JRCU: starting timer. rate is %d Hz\n", RCU_HZ);
> +	hrtimer_forward_now(&rcu_timer, ns_to_ktime(RCU_PERIOD_NS));
> +	hrtimer_start_expires(&rcu_timer, HRTIMER_MODE_ABS);
> +}
> +
> +static __init int rcu_timer_start(void)
> +{
> +	open_softirq(RCU_SOFTIRQ, rcu_softirq_func);
> +
> +	hrtimer_init(&rcu_timer, CLOCK_MONOTONIC, HRTIMER_MODE_ABS);
> +	rcu_timer.function = rcu_timer_func;
> +	rcu_timer_restart();
> +
> +	return 0;
> +}
> +
> +#ifdef CONFIG_JRCU_DAEMON
> +static void rcu_timer_stop(void)
> +{
> +	int stat;
> +
> +	stat = hrtimer_cancel(&rcu_timer);
> +	if (stat)
> +		pr_info("JRCU: timer canceled.\n");
> +}
> +#endif
> +
> +/*
> + * Transition from a simple to a full featured, interrupt driven RCU.
> + *
> + * This is to protect us against RCU being used very very early in the boot
> + * process, where ideas like 'tasks' and 'cpus' and 'timers' and such are
> + * not yet fully formed.  During this very early time, we use a simple,
> + * not-fully-functional braindead version of RCU.
> + *
> + * Invoked from main() at the earliest point where scheduling and timers
> + * are functional.
> + */
> +void __init rcu_scheduler_starting(void)
> +{
> +	int stat;
> +
> +	stat = rcu_timer_start();
> +	if (stat) {
> +		pr_err("JRCU: failed to start.  This is fatal.\n");
> +		return;
> +	}
> +
> +	rcu_scheduler_active = 1;
> +	smp_wmb();
> +
> +	pr_info("JRCU: started\n");
> +}
> +
> +#ifdef CONFIG_JRCU_DAEMON
> +
> +/* ------------------ daemon driver section --------------------- */
> +
> +#define RCU_PERIOD_MIN_US	(RCU_PERIOD_MIN_NS / NSEC_PER_USEC)
> +#define RCU_PERIOD_MAX_US	(RCU_PERIOD_MAX_NS / NSEC_PER_USEC)
> +
> +/*
> + * Once the system is fully up, we will drive the periodic-polling part
> + * of JRCU from a kernel daemon, jrcud.  Until then it is driven by
> + * an interrupt.
> + */
> +#include <linux/err.h>
> +#include <linux/param.h>
> +#include <linux/kthread.h>
> +
> +static int jrcud_func(void *arg)
> +{
> +	set_user_nice(current, -19);
> +	current->flags |= PF_NOFREEZE;
> +
> +	pr_info("JRCU: daemon started. Will operate at ~%d Hz.\n", RCU_HZ);
> +	rcu_timer_stop();
> +
> +	while (!kthread_should_stop()) {
> +		usleep_range(RCU_PERIOD_MIN_US, RCU_PERIOD_MAX_US);
> +		rcu_delimit_batches();
> +	}
> +
> +	pr_info("JRCU: daemon exiting\n");
> +	rcu_timer_restart();
> +	return 0;
> +}
> +
> +static __init int jrcud_start(void)
> +{
> +	struct task_struct *p;
> +
> +	p = kthread_run(jrcud_func, NULL, "jrcud");
> +	if (IS_ERR(p)) {
> +		pr_warn("JRCU: daemon not started\n");
> +		return -ENODEV;
> +	}
> +	return 0;
> +}
> +late_initcall(jrcud_start);
> +
> +#endif /* CONFIG_JRCU_DAEMON */
> +
> +/* ------------------ debug and statistics section -------------- */
> +
> +#ifdef CONFIG_RCU_TRACE
> +
> +#include <linux/debugfs.h>
> +#include <linux/seq_file.h>
> +
> +static int rcu_debugfs_show(struct seq_file *m, void *unused)
> +{
> +	int cpu, q, s[2], msecs;
> +
> +	raw_local_irq_disable();
> +	msecs = div_s64(sched_clock() - rcu_timestamp, NSEC_PER_MSEC);
> +	raw_local_irq_enable();
> +
> +	seq_printf(m, "%14u: #batches seen\n",
> +		rcu_stats.nbatches);
> +	seq_printf(m, "%14u: #barriers seen\n",
> +		atomic_read(&rcu_stats.nbarriers));
> +	seq_printf(m, "%14llu: #callbacks invoked\n",
> +		rcu_stats.ninvoked);
> +	seq_printf(m, "%14u: #callbacks left to invoke\n",
> +		atomic_read(&rcu_stats.nleft));
> +	seq_printf(m, "%14u: #msecs since last end-of-batch\n",
> +		msecs);
> +	seq_printf(m, "%14u: #passes forced (0 is best)\n",
> +		rcu_stats.nforced);
> +	seq_printf(m, "\n");
> +
> +	for_each_online_cpu(cpu)
> +		seq_printf(m, "%4d ", cpu);
> +	seq_printf(m, "  CPU\n");
> +
> +	s[1] = s[0] = 0;
> +	for_each_online_cpu(cpu) {
> +		struct rcu_data *rd = &rcu_data[cpu];
> +		int w = ACCESS_ONCE(rd->which) & 1;
> +		seq_printf(m, "%c%c%c%d ",
> +			'-',
> +			idle_cpu(cpu) ? 'I' : '-',
> +			rd->wait ? 'W' : '-',
> +			w);
> +		s[w]++;
> +	}
> +	seq_printf(m, "  FLAGS\n");
> +
> +	for (q = 0; q < 2; q++) {
> +		for_each_online_cpu(cpu) {
> +			struct rcu_data *rd = &rcu_data[cpu];
> +			struct rcu_list *l = &rd->cblist[q];
> +			seq_printf(m, "%4d ", l->count);
> +		}
> +		seq_printf(m, "  Q%d%c\n", q, " *"[s[q] > s[q^1]]);
> +	}
> +	seq_printf(m, "\nFLAGS:\n");
> +	seq_printf(m, "  I - cpu idle, 0|1 - Q0 or Q1 is current Q, other is previous Q,\n");
> +	seq_printf(m, "  W - cpu does not permit current batch to end (waiting),\n");
> +	seq_printf(m, "  * - marks the Q that is current for most CPUs.\n");
> +
> +	return 0;
> +}
> +
> +static int rcu_debugfs_open(struct inode *inode, struct file *file)
> +{
> +	return single_open(file, rcu_debugfs_show, NULL);
> +}
> +
> +static const struct file_operations rcu_debugfs_fops = {
> +	.owner = THIS_MODULE,
> +	.open = rcu_debugfs_open,
> +	.read = seq_read,
> +	.llseek = seq_lseek,
> +	.release = single_release,
> +};
> +
> +static struct dentry *rcudir;
> +
> +static int __init rcu_debugfs_init(void)
> +{
> +	struct dentry *retval;
> +
> +	rcudir = debugfs_create_dir("rcu", NULL);
> +	if (!rcudir)
> +		goto error;
> +
> +	retval = debugfs_create_file("rcudata", 0444, rcudir,
> +			NULL, &rcu_debugfs_fops);
> +	if (!retval)
> +		goto error;
> +
> +	pr_info("JRCU: Created debugfs files\n");
> +	return 0;
> +
> +error:
> +	debugfs_remove_recursive(rcudir);
> +	pr_warn("JRCU: Could not create debugfs files\n");
> +	return -ENOSYS;
> +}
> +late_initcall(rcu_debugfs_init);
> +#endif /* CONFIG_RCU_TRACE */
> Index: b/include/linux/jrcu.h
> ===================================================================
> --- /dev/null
> +++ b/include/linux/jrcu.h
> @@ -0,0 +1,75 @@
> +/*
> + * JRCU - A tiny single-cpu RCU for small SMP systems.
> + *
> + * Author: Joe Korty <joe.korty@ccur.com>
> + * Copyright Concurrent Computer Corporation, 2011
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License as published by the
> + * Free Software Foundation; either version 2 of the License, or (at your
> + * option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
> + * or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
> + * for more details.
> + *
> + * You should have received a copy of the GNU General Public License along
> + * with this program; if not, write to the Free Software Foundation, Inc.,
> + * 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + */
> +#ifndef __LINUX_JRCU_H
> +#define __LINUX_JRCU_H
> +
> +#define __rcu_read_lock()			preempt_disable()
> +#define __rcu_read_unlock()			preempt_enable()
> +
> +#define __rcu_read_lock_bh()			__rcu_read_lock()
> +#define __rcu_read_unlock_bh()			__rcu_read_unlock()
> +
> +extern void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu));
> +
> +#define call_rcu_sched				call_rcu
> +#define call_rcu_bh				call_rcu
> +
> +extern void rcu_barrier(void);
> +
> +#define rcu_barrier_sched			rcu_barrier
> +#define rcu_barrier_bh				rcu_barrier
> +
> +#define synchronize_rcu				rcu_barrier
> +#define synchronize_sched			rcu_barrier
> +#define synchronize_sched_expedited		rcu_barrier
> +#define synchronize_rcu_bh			rcu_barrier
> +#define synchronize_rcu_expedited		rcu_barrier
> +#define synchronize_rcu_bh_expedited		rcu_barrier
> +
> +#define rcu_init(cpu)				do { } while (0)
> +#define rcu_init_sched()			do { } while (0)
> +#define exit_rcu()				do { } while (0)
> +
> +static inline void __rcu_check_callbacks(int cpu, int user) { }
> +#define rcu_check_callbacks			__rcu_check_callbacks
> +
> +#define rcu_needs_cpu(cpu)			(0)
> +#define rcu_batches_completed()			(0)
> +#define rcu_batches_completed_bh()		(0)
> +#define rcu_preempt_depth()			(0)
> +
> +extern void rcu_force_quiescent_state(void);
> +
> +#define rcu_sched_force_quiescent_state		rcu_force_quiescent_state
> +#define rcu_bh_force_quiescent_state		rcu_force_quiescent_state
> +
> +#define rcu_enter_nohz()			do { } while (0)
> +#define rcu_exit_nohz()				do { } while (0)
> +
> +extern void rcu_note_context_switch(int cpu);
> +
> +#define rcu_sched_qs				rcu_note_context_switch
> +#define rcu_bh_qs				rcu_note_context_switch
> +
> +extern void rcu_scheduler_starting(void);
> +extern int rcu_scheduler_active __read_mostly;
> +
> +#endif /* __LINUX_JRCU_H */
> Index: b/include/linux/rcupdate.h
> ===================================================================
> --- a/include/linux/rcupdate.h
> +++ b/include/linux/rcupdate.h
> @@ -145,6 +145,8 @@ static inline void rcu_exit_nohz(void)
>  #include <linux/rcutree.h>
>  #elif defined(CONFIG_TINY_RCU) || defined(CONFIG_TINY_PREEMPT_RCU)
>  #include <linux/rcutiny.h>
> +#elif defined(CONFIG_JRCU)
> +#include <linux/jrcu.h>
>  #else
>  #error "Unknown RCU implementation specified to kernel configuration"
>  #endif
> Index: b/init/Kconfig
> ===================================================================
> --- a/init/Kconfig
> +++ b/init/Kconfig
> @@ -384,6 +384,22 @@ config TREE_PREEMPT_RCU
>  	  is also required.  It also scales down nicely to
>  	  smaller systems.
> 
> +config JRCU
> +	bool "A tiny single-CPU RCU for small SMP systems"
> +	depends on PREEMPT
> +	depends on SMP
> +	help
> +	  This option selects a minimal-footprint RCU suitable for small SMP
> +	  systems -- that is, those with fewer than 16 or perhaps 32, and
> +	  certainly less than 64 processors.
> +
> +	  This RCU variant may be a good choice for systems with low latency
> +	  requirements.  It does RCU garbage collection from a single CPU
> +	  rather than have each CPU do its own.  This frees up all but one
> +	  CPU from interference by this periodic requirement.
> +
> +	  Most users should say N here.
> +
>  config TINY_RCU
>  	bool "UP-only small-memory-footprint RCU"
>  	depends on !SMP
> @@ -409,6 +425,17 @@ config PREEMPT_RCU
>  	  This option enables preemptible-RCU code that is common between
>  	  the TREE_PREEMPT_RCU and TINY_PREEMPT_RCU implementations.
> 
> +config JRCU_DAEMON
> +	bool "Drive JRCU from a daemon"
> +	depends on JRCU
> +	default Y
> +	help
> +	  Normally JRCU end-of-batch processing is driven from a SoftIRQ
> +	  'interrupt' driver.  If you consider this to be too invasive,
> +	  this option can be used to drive JRCU from a kernel daemon.
> +
> +	  If unsure, say Y here.
> +
>  config RCU_TRACE
>  	bool "Enable tracing for RCU"
>  	help
> Index: b/kernel/Makefile
> ===================================================================
> --- a/kernel/Makefile
> +++ b/kernel/Makefile
> @@ -87,6 +87,7 @@ obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutre
>  obj-$(CONFIG_TREE_RCU_TRACE) += rcutree_trace.o
>  obj-$(CONFIG_TINY_RCU) += rcutiny.o
>  obj-$(CONFIG_TINY_PREEMPT_RCU) += rcutiny.o
> +obj-$(CONFIG_JRCU) += jrcu.o
>  obj-$(CONFIG_RELAY) += relay.o
>  obj-$(CONFIG_SYSCTL) += utsname_sysctl.o
>  obj-$(CONFIG_TASK_DELAY_ACCT) += delayacct.o
> Index: b/include/linux/hardirq.h
> ===================================================================
> --- a/include/linux/hardirq.h
> +++ b/include/linux/hardirq.h
> @@ -146,7 +146,13 @@ static inline void account_system_vtime(
>  extern void account_system_vtime(struct task_struct *tsk);
>  #endif
> 
> -#if defined(CONFIG_NO_HZ)
> +#if defined(CONFIG_JRCU)
> +extern int rcu_nmi_seen;
> +#define rcu_irq_enter() do { } while (0)
> +#define rcu_irq_exit() do { } while (0)
> +#define rcu_nmi_enter() do { rcu_nmi_seen = 1; } while (0)
> +#define rcu_nmi_exit() do { } while (0)
> +#elif defined(CONFIG_NO_HZ)
>  #if defined(CONFIG_TINY_RCU) || defined(CONFIG_TINY_PREEMPT_RCU)
>  extern void rcu_enter_nohz(void);
>  extern void rcu_exit_nohz(void);
> @@ -168,7 +174,6 @@ static inline void rcu_nmi_enter(void)
>  static inline void rcu_nmi_exit(void)
>  {
>  }
> -
>  #else
>  extern void rcu_irq_enter(void);
>  extern void rcu_irq_exit(void);
> Index: b/kernel/sched.c
> ===================================================================
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2658,6 +2658,21 @@ void sched_fork(struct task_struct *p, i
>  }
> 
>  /*
> + * Fetch the preempt count of some cpu's current task.  Must be called
> + * with interrupts blocked.  Stale return value.
> + *
> + * No locking needed as this always wins the race with context-switch-out
> + * + task destruction, since that is so heavyweight.  The smp_rmb() is
> + * to protect the pointers in that race, not the data being pointed to
> + * (which, being guaranteed stale, can stand a bit of fuzziness).
> + */
> +int preempt_count_cpu(int cpu)
> +{
> +	smp_rmb(); /* stop data prefetch until program ctr gets here */
> +	return task_thread_info(cpu_curr(cpu))->preempt_count;
> +}
> +
> +/*
>   * wake_up_new_task - wake up a newly created task for the first time.
>   *
>   * This function will do some initial scheduler statistics housekeeping
> @@ -3811,7 +3826,7 @@ void __kprobes add_preempt_count(int val
>  	if (DEBUG_LOCKS_WARN_ON((preempt_count() < 0)))
>  		return;
>  #endif
> -	preempt_count() += val;
> +	__add_preempt_count(val);
>  #ifdef CONFIG_DEBUG_PREEMPT
>  	/*
>  	 * Spinlock count overflowing soon?
> @@ -3842,7 +3857,7 @@ void __kprobes sub_preempt_count(int val
> 
>  	if (preempt_count() == val)
>  		trace_preempt_on(CALLER_ADDR0, get_parent_ip(CALLER_ADDR1));
> -	preempt_count() -= val;
> +	__sub_preempt_count(val);
>  }
>  EXPORT_SYMBOL(sub_preempt_count);
> 
> @@ -3994,6 +4009,7 @@ need_resched_nonpreemptible:
> 
>  		rq->nr_switches++;
>  		rq->curr = next;
> +		smp_wmb(); /* for preempt_count_cpu() */
>  		++*switch_count;
> 
>  		context_switch(rq, prev, next); /* unlocks the rq */
> @@ -8209,6 +8225,7 @@ struct task_struct *curr_task(int cpu)
>  void set_curr_task(int cpu, struct task_struct *p)
>  {
>  	cpu_curr(cpu) = p;
> +	smp_wmb(); /* for preempt_count_cpu() */
>  }
> 
>  #endif
> Index: b/include/linux/preempt.h
> ===================================================================
> --- a/include/linux/preempt.h
> +++ b/include/linux/preempt.h
> @@ -10,18 +10,33 @@
>  #include <linux/linkage.h>
>  #include <linux/list.h>
> 
> +# define __add_preempt_count(val) do { preempt_count() += (val); } while (0)
> +
> +#ifndef CONFIG_JRCU
> +# define __sub_preempt_count(val) do { preempt_count() -= (val); } while (0)
> +#else
> +  extern void __rcu_preempt_sub(void);
> +# define __sub_preempt_count(val) do { \
> +	if (!(preempt_count() -= (val))) { \
> +		/* preempt is enabled, RCU OK with consequent stale result */ \
> +		__rcu_preempt_sub(); \
> +	} \
> +} while (0)
> +#endif
> +
>  #if defined(CONFIG_DEBUG_PREEMPT) || defined(CONFIG_PREEMPT_TRACER)
>    extern void add_preempt_count(int val);
>    extern void sub_preempt_count(int val);
>  #else
> -# define add_preempt_count(val)	do { preempt_count() += (val); } while (0)
> -# define sub_preempt_count(val)	do { preempt_count() -= (val); } while (0)
> +# define add_preempt_count(val)	__add_preempt_count(val)
> +# define sub_preempt_count(val)	__sub_preempt_count(val)
>  #endif
> 
>  #define inc_preempt_count() add_preempt_count(1)
>  #define dec_preempt_count() sub_preempt_count(1)
> 
>  #define preempt_count()	(current_thread_info()->preempt_count)
> +extern int preempt_count_cpu(int cpu);
> 
>  #ifdef CONFIG_PREEMPT
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: JRCU Theory of Operation
  2011-03-12 14:36                             ` Paul E. McKenney
@ 2011-03-13  0:43                               ` Joe Korty
  2011-03-13  5:56                                 ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread
From: Joe Korty @ 2011-03-13  0:43 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Frederic Weisbecker, Peter Zijlstra, Lai Jiangshan,
	mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	linux-kernel, josh, houston.jim, corbet

On Sat, Mar 12, 2011 at 09:36:29AM -0500, Paul E. McKenney wrote:
> On Thu, Mar 10, 2011 at 02:50:45PM -0500, Joe Korty wrote:
>>
>> A longer answer, on a slightly expanded topic, goes as follows.  The heart
>> of jrcu is in this (slightly edited) line,
>>
>>   rcu_data[cpu].wait = preempt_count_cpu(cpu) > idle_cpu(cpu);
> 
> So, if we are idle, then the preemption count must be 2 or greater
> to make the current grace period wait on a CPU.  But if we are not
> idle, then the preemption count need only be 1 or greater to make
> the current grace period wait on a CPU.
> 
> But why should an idle CPU block the current RCU grace period
> in any case?  The idle loop is defined to be a quiescent state
> for rcu_sched.  (Not that permitting RCU read-side critical sections
> in the idle loop would be a bad thing, as long as the associated
> pitfalls were all properly avoided.)

Amazingly enough, the base preemption level for idle is '1', not '0'.
This surprised me deeply, but on reflection it made sense.  When idle
needs to be preempted, there is no need to actually preempt it .. one
just kick starts it and it will go execute the schedule for you.




>> Here, the garbage collector is making an attempt to deduce, at the
>> start of the current batch, whether or not some cpu is executing code
>> in a quiescent region.  If it is, then that cpu's wait state can be set
>> to zero right away -- we don't have to wait for that cpu to execute a
>> quiescent point tap later on to discover that fact.  This nicely covers
>> the user app and idle cpu situations discussed above.
>>
>> Now, we all know that fetching the preempt_count of some process running on
>> another cpu is guaranteed to return a stale (obsolete) value, and may even
>> be dangerous (pointers are being followed after all).  Putting aside the
>> question of safety, for now, leaves us with a trio of questions: are there
>> times when this inherently unstable value is in fact stable and useful?
>> When it is not stable, is that fact relevant or irrelevant to the correct
>> operation of jrcu? And finally, is the fact that we cannot tell when
>> it is stable and when it is not also relevant?
> 
> And there is also the ordering of the preempt_disable() and the accesses
> within the critical section...  Just because you recently saw a quiescent
> state doesn't mean that the preceding critical section has completed --
> even x86 is happy to spill stores out of a critical section ended by
> preempt_enable.  If one of those stores is to an RCU protected
> data structure, you might end up freeing the structure before the
> store completed.
> 
> Or is the idea that you would wait 50 milliseconds after detecting
> the quiescent state before invoking the corresponding RCU callbacks?

Yep.  



> I am missing how ->which switching is safe, given the possibility of
> access from other CPUs.

JRCU allows writes to continue through the old '->which'
value for a period of time.  All it requires is that
within 50 msecs the writes have ceased and that
the writing cpu has executed a smp_wmb() and the effects
of the smp_wmb() have propagated throughout the system.

Even though I keep saying 50msecs for everything, I
suspect that the Q switching meets all the above quiescent
requirements in a few tens of microseconds.  Thus even
a 1 msec JRCU sampling period is expected to be safe,
at least in regard to Q switching.
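
As a purely illustrative toy (single-threaded, so it ignores every
memory-ordering question we are discussing), the queue rotation and its
built-in one-period delay look like this:

	#include <stdio.h>

	static int cb[2];		/* callbacks queued on Q0 / Q1 */
	static unsigned int which;	/* low bit selects the current Q */

	static void enqueue(void)	/* call_rcu() analogue */
	{
		cb[which & 1]++;
	}

	static void collect(void)	/* one sampling period of the collector */
	{
		int prev = (which & 1) ^ 1;

		printf("invoking %d callback(s) from Q%d\n", cb[prev], prev);
		cb[prev] = 0;
		which++;		/* swap current and previous queues */
	}

	int main(void)
	{
		enqueue(); enqueue();	/* two callbacks land on Q0 */
		collect();		/* drains Q1 (empty); Q1 becomes current */
		enqueue();		/* lands on Q1 */
		collect();		/* only now are the two Q0 callbacks invoked */
		collect();		/* and one period later, the Q1 callback */
		return 0;
	}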

Regards,
Joe


^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] An RCU for SMP with a single CPU garbage collector
  2011-03-12 14:36                       ` [PATCH] An RCU for SMP with a single CPU garbage collector Paul E. McKenney
@ 2011-03-13  1:25                         ` Joe Korty
  2011-03-13  6:09                           ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread
From: Joe Korty @ 2011-03-13  1:25 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Frederic Weisbecker, Peter Zijlstra, Lai Jiangshan,
	mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	linux-kernel, josh, houston.jim

On Sat, Mar 12, 2011 at 09:36:53AM -0500, Paul E. McKenney wrote:
> Hello, Joe,
> 
> My biggest question is "what does JRCU do that Frederic's patchset
> does not do?"  I am not seeing it at the moment.  Given that Frederic's
> patchset integrates into RCU, thus providing the full RCU API, I really
> need a good answer to consider JRCU.

Well, it's tiny, it's fast, and it does exactly one thing
and does that really well.  If a user doesn't need that
one thing they shouldn't use JRCU.  But mostly it is an
exciting thought-experiment on another interesting way to
do RCU.  Who knows, maybe it may end up being better than
for what it was aimed at.

> For one, what sort of validation have you done?
> 
>                                                         Thanx, Paul

Not much, I'm writing the code and sending it out for
comment.  And it is currently missing many of the tweaks
needed to make it a production RCU.



>> +struct rcu_data {
>> +     u8 wait;                /* goes false when this cpu consents to
>> +                              * the retirement of the current batch */
>> +     u8 which;               /* selects the current callback list */
>> +     struct rcu_list cblist[2]; /* current & previous callback lists */
>> +} ____cacheline_aligned_in_smp;
>> +
>> +static struct rcu_data rcu_data[NR_CPUS];
> 
> Why not DEFINE_PER_CPU(struct rcu_data, rcu_data)?

All part of being lockless.  I didn't want to have to tie
into cpu onlining and offlining and wanted to eliminate
sprinkling special tests and/or online locks throughout
the code.  Also, note the single for_each_present_cpu(cpu)
statement in JRCU .. this loops over all offline cpus and
gradually expires any residuals they have left behind.


>> +/*
>> + * Return our CPU id or zero if we are too early in the boot process to
>> + * know what that is.  For RCU to work correctly, a cpu named '0' must
>> + * eventually be present (but need not ever be online).
>> + */
>> +static inline int rcu_cpu(void)
>> +{
>> +     return current_thread_info()->cpu;
> 
> OK, I'll bite...  Why not smp_processor_id()?

Until recently, it was :) but it was a multiline thing,
with 'if' stmts and such, to handle early boot conditions
when smp_processor_id() isn't valid.

JRCU, perhaps quixotically,  tries to do something
meaningful all the way back to the first microsecond of
existence, when the CPU is switched from 16 to 32 bit mode.
In that early epoch, things like 'cpus' and 'interrupts'
and 'tasks' don't quite yet exist in the form we are used
to for them.

> And what to do about the architectures that put the CPU number somewhere
> else?

I confess I keep forgetting to look at the other 21 or
so architectures; I had thought they all had ->cpu.
I'll look into it and, at least for those, reintroduce the
old smp_processor_id() expression.


>> +void rcu_barrier(void)
>> +{
>> +     struct rcu_synchronize rcu;
>> +
>> +     if (!rcu_scheduler_active)
>> +             return;
>> +
>> +     init_completion(&rcu.completion);
>> +     call_rcu(&rcu.head, wakeme_after_rcu);
>> +     wait_for_completion(&rcu.completion);
>> +     atomic_inc(&rcu_stats.nbarriers);
>> +
>> +}
>> +EXPORT_SYMBOL_GPL(rcu_barrier);
> 
> The rcu_barrier() function must wait on all RCU callbacks, regardless of
> which CPU they are queued on.  This is important when unloading modules
> that use call_rcu().  In contrast, the above looks to me like it waits
> only on the current CPU's callbacks.

Oops.  I'll come up with an alternate mechanism.  Thanks for finding this.


> So, what am I missing?

Nothing.  You were right :)



>> +     /*
>> +      * Swap current and previous lists.  Other cpus must not see this
>> +      * out-of-order w.r.t. the just-completed plist init, hence the above
>> +      * smp_wmb().
>> +      */
>> +     rd->which++;
> 
> You do seem to have interrupts disabled when sampling ->which, but
> this is not safe for cross-CPU accesses to ->which, right?  The other
> CPU might queue onto the wrong element.  This would mean that you
> would not be guaranteed a full 50ms delay from quiescent state to
> corresponding RCU callback invocation.
> 
> Or am I missing something subtle here?


JRCU expects updates to the old queue to continue for a
while, it only requires that they end and a trailing wmb
be fully executed before the next sampling period ends.


>> +     /*
>> +      * End the current RCU batch and start a new one.
>> +      */
>> +     for_each_present_cpu(cpu) {
>> +             rd = &rcu_data[cpu];
> 
> And here we get the cross-CPU accesses that I was worried about above.


Yep.  This is one of the trio of reasons why JRCU is for
small SMP systems.  It's the tradeoff I made to move the
entire RCU load off onto one CPU.  If that is not important
(and it won't be to any but specialized systems), one
is expected to use RCU_TREE.

The other two of the trio of reasons: doing kfree's on the
'wrong' cpu puts the freed buffer in the 'wrong' per-cpu
free queue, and putting all the load on one cpu means
that cpu could hit 100% cpu utilization just doing rcu
callbacks, for systems with thousands of cpus and
the io fabrics necessary to keep those cpus busy.

Regards,
Joe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: JRCU Theory of Operation
  2011-03-13  0:43                               ` Joe Korty
@ 2011-03-13  5:56                                 ` Paul E. McKenney
  2011-03-13 23:53                                   ` Joe Korty
  0 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2011-03-13  5:56 UTC (permalink / raw)
  To: Joe Korty
  Cc: Frederic Weisbecker, Peter Zijlstra, Lai Jiangshan,
	mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	linux-kernel, josh, houston.jim, corbet

On Sat, Mar 12, 2011 at 07:43:36PM -0500, Joe Korty wrote:
> On Sat, Mar 12, 2011 at 09:36:29AM -0500, Paul E. McKenney wrote:
> > On Thu, Mar 10, 2011 at 02:50:45PM -0500, Joe Korty wrote:
> >>
> >> A longer answer, on a slightly expanded topic, goes as follows.  The heart
> >> of jrcu is in this (slightly edited) line,
> >>
> >>   rcu_data[cpu].wait = preempt_count_cpu(cpu) > idle_cpu(cpu);
> > 
> > So, if we are idle, then the preemption count must be 2 or greater
> > to make the current grace period wait on a CPU.  But if we are not
> > idle, then the preemption count need only be 1 or greater to make
> > the current grace period wait on a CPU.
> > 
> > But why should an idle CPU block the current RCU grace period
> > in any case?  The idle loop is defined to be a quiescent state
> > for rcu_sched.  (Not that permitting RCU read-side critical sections
> > in the idle loop would be a bad thing, as long as the associated
> > pitfalls were all properly avoided.)
> 
> Amazingly enough, the base preemption level for idle is '1', not '0'.
> This surprised me deeply, but on reflection it made sense.  When idle
> needs to be preempted, there is no need to actually preempt it .. one
> just kick starts it and it will go execute the schedule for you.

Ah, got it, thank you!

> >> Here, the garbage collector is making an attempt to deduce, at the
> >> start of the current batch, whether or not some cpu is executing code
> >> in a quiescent region.  If it is, then that cpu's wait state can be set
> >> to zero right away -- we don't have to wait for that cpu to execute a
> >> quiescent point tap later on to discover that fact.  This nicely covers
> >> the user app and idle cpu situations discussed above.
> >>
> >> Now, we all know that fetching the preempt_count of some process running on
> >> another cpu is guaranteed to return a stale (obsolete) value, and may even
> >> be dangerous (pointers are being followed after all).  Putting aside the
> >> question of safety, for now, leaves us with a trio of questions: are there
> >> times when this inherently unstable value is in fact stable and useful?
> >> When it is not stable, is that fact relevant or irrelevant to the correct
> >> operation of jrcu? And finally, is the fact that we cannot tell when
> >> it is stable and when it is not also relevant?
> > 
> > And there is also the ordering of the preempt_disable() and the accesses
> > within the critical section...  Just because you recently saw a quiescent
> > state doesn't mean that the preceding critical section has completed --
> > even x86 is happy to spill stores out of a critical section ended by
> > preempt_enable.  If one of those stores is to an RCU protected
> > data structure, you might end up freeing the structure before the
> > store completed.
> > 
> > Or is the idea that you would wait 50 milliseconds after detecting
> > the quiescent state before invoking the corresponding RCU callbacks?
> 
> Yep.  

OK.

> > I am missing how ->which switching is safe, given the possibility of
> > access from other CPUs.
> 
> JRCU allows writes to continue through the old '->which'
> value for a period of time.  All it requires is that
> within 50 msecs the writes have ceased and that
> the writing cpu has executed a smp_wmb() and the effects
> of the smp_wmb() have propagated throughout the system.
> 
> Even though I keep saying 50msecs for everything, I
> suspect that the Q switching meets all the above quiescent
> requirements in a few tens of microseconds.  Thus even
> a 1 msec JRCU sampling period is expected to be safe,
> at least in regard to Q switching.

I would feel better about this if the CPU vendors were willing to give
an upper bound...

							Thanx, Paul

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] An RCU for SMP with a single CPU garbage collector
  2011-03-13  1:25                         ` Joe Korty
@ 2011-03-13  6:09                           ` Paul E. McKenney
  0 siblings, 0 replies; 63+ messages in thread
From: Paul E. McKenney @ 2011-03-13  6:09 UTC (permalink / raw)
  To: Joe Korty
  Cc: Frederic Weisbecker, Peter Zijlstra, Lai Jiangshan,
	mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	linux-kernel, josh, houston.jim

On Sat, Mar 12, 2011 at 08:25:04PM -0500, Joe Korty wrote:
> On Sat, Mar 12, 2011 at 09:36:53AM -0500, Paul E. McKenney wrote:
> > Hello, Joe,
> > 
> > My biggest question is "what does JRCU do that Frederic's patchset
> > does not do?"  I am not seeing it at the moment.  Given that Frederic's
> > patchset integrates into RCU, thus providing the full RCU API, I really
> > need a good answer to consider JRCU.
> 
> Well, it's tiny, it's fast, and it does exactly one thing
> and does that really well.  If a user doesn't need that
> one thing they shouldn't use JRCU.  But mostly it is an
> exciting thought-experiment on another interesting way to
> do RCU.  Who knows, maybe it may end up being better than
> for what it was aimed at.

Fair enough!  From my perspective, I am likely to learn something from
watching you work on it, and it is reasonably likely that you will
come up with something useful for the existing RCU implementations.
And who knows what you might come up with?

> > For one, what sort of validation have you done?
> > 
> >                                                         Thanx, Paul
> 
> Not much, I'm writing the code and sending it out for
> comment.  And it is currently missing many of the tweaks
> needed to make it a production RCU.

Ah, OK.  I strongly recommend rcutorture.

> >> +struct rcu_data {
> >> +     u8 wait;                /* goes false when this cpu consents to
> >> +                              * the retirement of the current batch */
> >> +     u8 which;               /* selects the current callback list */
> >> +     struct rcu_list cblist[2]; /* current & previous callback lists */
> >> +} ____cacheline_aligned_in_smp;
> >> +
> >> +static struct rcu_data rcu_data[NR_CPUS];
> > 
> > Why not DEFINE_PER_CPU(struct rcu_data, rcu_data)?
> 
> All part of being lockless.  I didn't want to have to tie
> into cpu onlining and offlining and wanted to eliminate
> sprinking special tests and/or online locks throughout
> the code.  Also, note the single for_each_present_cpu(cpu)
> statement in JRCU .. this loops over all offline cpus and
> gradually expires any residuals they have left behind.

OK, but the per-CPU variables do not come and go with the CPUs.
So I am still not seeing how the array is helping compared to
per-CPU variables.

> >> +/*
> >> + * Return our CPU id or zero if we are too early in the boot process to
> >> + * know what that is.  For RCU to work correctly, a cpu named '0' must
> >> + * eventually be present (but need not ever be online).
> >> + */
> >> +static inline int rcu_cpu(void)
> >> +{
> >> +     return current_thread_info()->cpu;
> > 
> > OK, I'll bite...  Why not smp_processor_id()?
> 
> Until recently, it was :) but it was a multiline thing,
> with 'if' stmts and such, to handle early boot conditions
> when smp_processor_id() isn't valid.
> 
> JRCU, perhaps quixotically, tries to do something
> meaningful all the way back to the first microsecond of
> existence, when the CPU is switched from 16- to 32-bit mode.
> In that early epoch, things like 'cpus' and 'interrupts'
> and 'tasks' don't quite yet exist in the form we are used
> to.

OK.

> > And what to do about the architectures that put the CPU number somewhere
> > else?
> 
> I confess I keep forgetting to look at the other 21 or
> so architectures; I had thought they all had ->cpu.
> I'll look into it and, at least for the ones that don't,
> reintroduce the old smp_processor_id() expression.

But those that have ->cpu can safely use smp_processor_id(), right?
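
A hedged sketch of one way the early-boot fallback might look; the
rcu_scheduler_active test below is purely illustrative, standing in
for whatever check the original multi-line version used:

        static inline int rcu_cpu(void)
        {
                /* Too early in boot to know which cpu this is: report
                 * cpu 0, which must eventually be present (though it
                 * need not ever be online). */
                if (unlikely(!rcu_scheduler_active))
                        return 0;
                return raw_smp_processor_id();
        }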

> >> +void rcu_barrier(void)
> >> +{
> >> +     struct rcu_synchronize rcu;
> >> +
> >> +     if (!rcu_scheduler_active)
> >> +             return;
> >> +
> >> +     init_completion(&rcu.completion);
> >> +     call_rcu(&rcu.head, wakeme_after_rcu);
> >> +     wait_for_completion(&rcu.completion);
> >> +     atomic_inc(&rcu_stats.nbarriers);
> >> +
> >> +}
> >> +EXPORT_SYMBOL_GPL(rcu_barrier);
> > 
> > The rcu_barrier() function must wait on all RCU callbacks, regardless of
> > which CPU they are queued on.  This is important when unloading modules
> > that use call_rcu().  In contrast, the above looks to me like it waits
> > only on the current CPU's callbacks.
> 
> Oops.  I'll come up with an alternate mechanism.  Thanks for finding this.

NP.  ;-)
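
For reference, the usual pattern (roughly what mainline's rcu_barrier()
does) is to queue one callback on every online cpu and count the
completions; the names below are illustrative:

        static DEFINE_PER_CPU(struct rcu_head, barrier_head);
        static atomic_t barrier_count;
        static struct completion barrier_done;

        static void rcu_barrier_callback(struct rcu_head *unused)
        {
                if (atomic_dec_and_test(&barrier_count))
                        complete(&barrier_done);
        }

        static void rcu_barrier_func(void *unused)
        {
                /* Runs on each cpu in turn, so every cpu's callback
                 * list gets one entry and the wait covers them all. */
                call_rcu(&__get_cpu_var(barrier_head), rcu_barrier_callback);
        }

        void rcu_barrier(void)
        {
                init_completion(&barrier_done);
                atomic_set(&barrier_count, num_online_cpus());
                on_each_cpu(rcu_barrier_func, NULL, 1);
                wait_for_completion(&barrier_done);
        }

A production version would also need to serialize concurrent callers
and guard against cpu hotplug during the window, as mainline does with
a mutex.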

> > So, what am I missing?
> 
> Nothing.  You were right :)
> 
> >> +     /*
> >> +      * Swap current and previous lists.  Other cpus must not see this
> >> +      * out-of-order w.r.t. the just-completed plist init, hence the above
> >> +      * smp_wmb().
> >> +      */
> >> +     rd->which++;
> > 
> > You do seem to have interrupts disabled when sampling ->which, but
> > this is not safe for cross-CPU accesses to ->which, right?  The other
> > CPU might queue onto the wrong element.  This would mean that you
> > would not be guaranteed a full 50ms delay from quiescent state to
> > corresponding RCU callback invocation.
> > 
> > Or am I missing something subtle here?
> 
> JRCU expects updates to the old queue to continue for a
> while; it only requires that they end, and that a trailing wmb
> be fully executed, before the next sampling period ends.

OK.  I remain skeptical, but mainly because similar setups have proven
buggy in the past.
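
To spell out the ordering being relied on, a collector-side sketch
(rcu_list_init() is an illustrative helper, not the exact patch code):

        static void rcu_flip_lists(struct rcu_data *rd)
        {
                int next = (rd->which + 1) & 1;

                rcu_list_init(&rd->cblist[next]);  /* empty the soon-to-be-current list */
                smp_wmb();      /* the init must be globally visible before the flip */
                rd->which++;    /* other cpus may briefly keep queueing onto the old
                                 * list; their stores plus a trailing smp_wmb() need
                                 * only complete before the next sampling period ends */
        }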

> >> +     /*
> >> +      * End the current RCU batch and start a new one.
> >> +      */
> >> +     for_each_present_cpu(cpu) {
> >> +             rd = &rcu_data[cpu];
> > 
> > And here we get the cross-CPU accesses that I was worried about above.
> 
> Yep.  This is one of the trio of reasons why JRCU is for
> small SMP systems.  It's the tradeoff I made to move the
> entire RCU load off onto one CPU.  If that is not important
> (and it won't be to any but specialized systems), one
> is expected to use TREE_RCU.

OK, so here is one of the things that you are trying to do with JRCU that
the current in-kernel RCU implementations do not do, namely, cause RCU
callbacks to be consistently invoked on some other CPU.  If I remember
correctly, Frederic's changes do not support this either, because his
workloads run almost entirely in user space, and are therefore not
expected to generate RCU callbacks.

Oddly enough, user-space RCU -does- support this notion.  ;-)

> The other two of the trio of reasons: doing kfree's on the
> 'wrong' cpu puts the freed buffer in the 'wrong' per-cpu
> free queue, and putting all the load on one cpu means
> that cpu could hit 100% cpu utilization just doing rcu
> callbacks, on systems with thousands of cpus and the
> I/O fabrics necessary to keep those cpus busy.

That would be one reason I have been reluctant to implement callback
offloading in advance of a definite need!

Though I must admit that I am unsure how JRCU's call_rcu() implementation
does this safely -- it looks to me like it is only excluding irqs,
not other CPUs.
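
To illustrate the concern, a hypothetical reconstruction of such a
call_rcu() (not the actual patch code; rcu_list_add() is a made-up
helper).  Disabling irqs serializes only against this cpu, so the
collector's cross-cpu flip of ->which can land anywhere in the middle:

        void call_rcu(struct rcu_head *cb, void (*func)(struct rcu_head *))
        {
                struct rcu_data *rd;
                unsigned long flags;

                cb->func = func;
                cb->next = NULL;
                local_irq_save(flags);          /* excludes local irqs only */
                rd = &rcu_data[rcu_cpu()];
                /* The collector cpu may increment rd->which at any point
                 * here; this callback would then land on the "previous"
                 * list and wait one extra batch -- assuming the old list
                 * is still honored, which is the property in question. */
                rcu_list_add(&rd->cblist[rd->which & 1], cb);
                local_irq_restore(flags);
        }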

						Thanx, Paul

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: JRCU Theory of Operation
  2011-03-13  5:56                                 ` Paul E. McKenney
@ 2011-03-13 23:53                                   ` Joe Korty
  2011-03-14  0:50                                     ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread
From: Joe Korty @ 2011-03-13 23:53 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Frederic Weisbecker, Peter Zijlstra, Lai Jiangshan,
	mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	linux-kernel, josh, houston.jim, corbet

On Sun, Mar 13, 2011 at 12:56:27AM -0500, Paul E. McKenney wrote:
> > Even though I keep saying 50msecs for everything, I
> > suspect that the Q switching meets all the above quiescent
> > requirements in a few tens of microseconds.  Thus even
> > a 1 msec JRCU sampling period is expected to be safe,
> > at least in regard to Q switching.
> 
> I would feel better about this if the CPU vendors were willing to give
> an upper bound...

I suspect they don't because they don't really know
themselves .. whatever the bound is, it keeps changing
from chip to chip, describing it precisely would be beyond
the English language, and any published description would
tie them down on what they could do in future chip designs.

But, there is a hint in current behavior.  It is well known
that many multithreaded apps don't use barriers at all;
the authors had no idea what they are for.  Yet such apps
largely work.  This implies that the chip designers are
very aggressive in doing implied memory barriers wherever
possible, and they are very aggressive in pushing out
stores to caches very quickly even when memory barriers,
implied or not, are not present.

Joe

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: JRCU Theory of Operation
  2011-03-13 23:53                                   ` Joe Korty
@ 2011-03-14  0:50                                     ` Paul E. McKenney
  2011-03-14  0:55                                       ` Josh Triplett
  0 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2011-03-14  0:50 UTC (permalink / raw)
  To: Joe Korty
  Cc: Frederic Weisbecker, Peter Zijlstra, Lai Jiangshan,
	mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	linux-kernel, josh, houston.jim, corbet

On Sun, Mar 13, 2011 at 07:53:51PM -0400, Joe Korty wrote:
> On Sun, Mar 13, 2011 at 12:56:27AM -0500, Paul E. McKenney wrote:
> > > Even though I keep saying 50msecs for everything, I
> > > suspect that the Q switching meets all the above quiescent
> > > requirements in a few tens of microseconds.  Thus even
> > > a 1 msec JRCU sampling period is expected to be safe,
> > > at least in regard to Q switching.
> > 
> > I would feel better about this if the CPU vendors were willing to give
> > an upper bound...
> 
> I suspect they don't because they don't really know
> themselves .. whatever the bound is, it keeps changing
> from chip to chip, describing it precisely would be beyond
> the English language, and any published description would
> tie them down on what they could do in future chip designs.

Indeed!

> But, there is a hint in current behavior.  It is well known
> that many multithreaded apps don't use barriers at all;
> the authors had no idea what they are for.  Yet such apps
> largely work.  This implies that the chip designers are
> very aggressive in doing implied memory barriers wherever
> possible, and they are very aggressive in pushing out
> stores to caches very quickly even when memory barriers,
> implied or not, are not present.

Ahem.  Or that many barrier-omission failures have a low probability
of occurring.  One case in point is a bug in RCU a few years back,
where ten-hour rcutorture runs produced only a handful of errors (see
http://paulmck.livejournal.com/14639.html).  Other cases are turned up by
Peter Sewell's work, which tests code sequences with and without memory
barriers (http://www.cl.cam.ac.uk/~pes20/).  In many cases, broken code
sequences have failure rates in the parts per billion.
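
As a concrete example of the kind of sequence involved (a generic
store-buffering litmus test, not code from this thread): with the
smp_mb() calls omitted, both loads are permitted to observe zero, yet
on typical hardware that outcome is rare enough to hide in ordinary
testing.

        int x, y, r0, r1;

        void cpu0(void)
        {
                x = 1;
                /* smp_mb() belongs here */
                r0 = y;
        }

        void cpu1(void)
        {
                y = 1;
                /* smp_mb() belongs here */
                r1 = x;
        }

        /* With both barriers present, the outcome r0 == 0 && r1 == 0 is
         * forbidden; with them omitted it is allowed, but rarely seen. */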

This should not be a surprise.  You can see the same effect with locking.
If you have very little contention on a given lock, then there will be a very
low probability of encountering bugs involving forgetting to acquire
that lock.

If the CPU count continues increasing, these sorts of latent bugs
will have increasing probabilities of biting us.

						Thanx, Paul

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: JRCU Theory of Operation
  2011-03-14  0:50                                     ` Paul E. McKenney
@ 2011-03-14  0:55                                       ` Josh Triplett
  0 siblings, 0 replies; 63+ messages in thread
From: Josh Triplett @ 2011-03-14  0:55 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Joe Korty, Frederic Weisbecker, Peter Zijlstra, Lai Jiangshan,
	mathieu.desnoyers, dhowells, loic.minier, dhaval.giani, tglx,
	linux-kernel, houston.jim, corbet

On Sun, Mar 13, 2011 at 05:50:52PM -0700, Paul E. McKenney wrote:
> On Sun, Mar 13, 2011 at 07:53:51PM -0400, Joe Korty wrote:
> > But, there is a hint in current behavior.  It is well known
> > that many multithreaded apps don't use barriers at all;
> > the authors had no idea what they are for.  Yet such apps
> > largely work.  This implies that the chip designers are
> > very aggressive in doing implied memory barriers wherever
> > possible, and they are very aggressive in pushing out
> > stores to caches very quickly even when memory barriers,
> > implied or not, are not present.
> 
> Ahem.  Or that many barrier-omission failures have a low probability
> of occurring.  One case in point is a bug in RCU a few years back,
> where ten-hour rcutorture runs produced only a handful of errors (see
> http://paulmck.livejournal.com/14639.html).  Other cases are turned up by
> Peter Sewell's work, which tests code sequences with and without memory
> barriers (http://www.cl.cam.ac.uk/~pes20/).  In many cases, broken code
> sequences have failure rates in the parts per billion.

And for that matter, locking primitives tend to necessarily imply
barriers (what good does a lock do if the memory accesses can leak out
of it), and the vast majority of multithreaded code uses locking.  So,
most multithreaded apps implicitly get all the barriers they require.
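
In other words, the everyday pattern below already orders the accesses
(a generic illustration, not code from any particular app):

        static DEFINE_SPINLOCK(mylock);
        static int shared_count;

        static void bump(void)
        {
                spin_lock(&mylock);     /* acquire: later accesses cannot
                                         * move ahead of taking the lock */
                shared_count++;
                spin_unlock(&mylock);   /* release: earlier accesses cannot
                                         * move past dropping the lock */
        }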

- Josh Triplett

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
  2010-11-08 16:15 ` [PATCH] a local-timer-free version of RCU houston.jim
@ 2010-11-08 19:52   ` Paul E. McKenney
  0 siblings, 0 replies; 63+ messages in thread
From: Paul E. McKenney @ 2010-11-08 19:52 UTC (permalink / raw)
  To: houston.jim
  Cc: Frederic Weisbecker, Udo A. Steinberg, Joe Korty,
	mathieu desnoyers, dhowells, loic minier, dhaval giani, tglx,
	peterz, linux-kernel, josh

On Mon, Nov 08, 2010 at 04:15:38PM +0000, houston.jim@comcast.net wrote:
> Hi Everyone,
> 
> I'm sorry I started this thread and have not been able to keep up
> with the discussion.  I agree that the problems described are real.

Not a problem -- your patch is helpful in any case.

> > > UAS> PEM> o	CPU 1 continues in rcu_grace_period_complete(),
> > > UAS> PEM> incorrectly ending the new grace period.
> > > UAS> PEM> 
> > > UAS> PEM> Or am I missing something here?
> > > UAS> 
> > > UAS> The scenario you describe seems possible. However, it should be easily
> > > UAS> fixed by passing the perceived batch number as another parameter to
> > > UAS> rcu_set_state() and making it part of the cmpxchg. So if the caller
> > > UAS> tries to set state bits on a stale batch number (e.g., batch !=
> > > UAS> rcu_batch), it can be detected.
> 
> My thought on how to fix this case is to only hand off the DO_RCU_COMPLETION
> to a single cpu.  The rcu_unlock which receives this hand off would clear its
> own bit and then call rcu_poll_other_cpus to complete the process.

Or we could map to TREE_RCU's data structures, with one thread per
leaf rcu_node structure.

> > What is scary with this is that it also changes rcu sched semantics, and users
> > of call_rcu_sched() and synchronize_sched(), who rely on that to do more
> > tricky things than just waiting for rcu_dereference_sched() pointer grace periods,
> > like really waiting for preempt_disable and local_irq_save/disable, those
> > users will be screwed... :-(  ...unless we also add relevant rcu_read_lock_sched()
> > for them...
> 
> I need to stare at the code and get back up to speed. I expect that the synchronize_sched
> path in my patch is just plain broken.

Again, not a problem -- we have a couple approaches that might work.
That said, additional ideas are always welcome!

							Thanx, Paul

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH] a local-timer-free version of RCU
       [not found] <757455806.950179.1289232791283.JavaMail.root@sz0076a.westchester.pa.mail.comcast.net>
@ 2010-11-08 16:15 ` houston.jim
  2010-11-08 19:52   ` Paul E. McKenney
  0 siblings, 1 reply; 63+ messages in thread
From: houston.jim @ 2010-11-08 16:15 UTC (permalink / raw)
  To: Frederic Weisbecker
  Cc: Udo A. Steinberg, Joe Korty, mathieu desnoyers, dhowells,
	loic minier, dhaval giani, tglx, peterz, linux-kernel, josh,
	Paul E. McKenney

Hi Everyone,

I'm sorry I started this thread and have not been able to keep up
with the discussion.  I agree that the problems described are real.

> > UAS> PEM> o	CPU 1 continues in rcu_grace_period_complete(),
> > UAS> PEM> incorrectly ending the new grace period.
> > UAS> PEM> 
> > UAS> PEM> Or am I missing something here?
> > UAS> 
> > UAS> The scenario you describe seems possible. However, it should be easily
> > UAS> fixed by passing the perceived batch number as another parameter to
> > UAS> rcu_set_state() and making it part of the cmpxchg. So if the caller
> > UAS> tries to set state bits on a stale batch number (e.g., batch !=
> > UAS> rcu_batch), it can be detected.

My thought on how to fix this case is to only hand off the DO_RCU_COMPLETION
to a single cpu.  The rcu_unlock which receives this hand off would clear its
own bit and then call rcu_poll_other_cpus to complete the process.

> What is scary with this is that it also changes rcu sched semantics, and users
> of call_rcu_sched() and synchronize_sched(), who rely on that to do more
> tricky things than just waiting for rcu_dereference_sched() pointer grace periods,
> like really waiting for preempt_disable and local_irq_save/disable, those
> users will be screwed... :-(  ...unless we also add relevant rcu_read_lock_sched()
> for them...

I need to stare at the code and get back up to speed. I expect that the synchronize_sched
path in my patch is just plain broken.

Jim Houston

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2011-03-14  0:58 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-04 23:21 dyntick-hpc and RCU Paul E. McKenney
2010-11-05  5:27 ` Frederic Weisbecker
2010-11-05  5:38   ` Frederic Weisbecker
2010-11-05 15:06     ` Paul E. McKenney
2010-11-05 20:06       ` Dhaval Giani
2010-11-05 15:04   ` Paul E. McKenney
2010-11-08 14:10     ` Frederic Weisbecker
2010-11-05 21:00 ` [PATCH] a local-timer-free version of RCU Joe Korty
2010-11-06 19:28   ` Paul E. McKenney
2010-11-06 19:34     ` Mathieu Desnoyers
2010-11-06 19:42       ` Mathieu Desnoyers
2010-11-06 19:44         ` Paul E. McKenney
2010-11-08  2:11     ` Udo A. Steinberg
2010-11-08  2:19       ` Udo A. Steinberg
2010-11-08  2:54         ` Paul E. McKenney
2010-11-08 15:32           ` Frederic Weisbecker
2010-11-08 19:38             ` Paul E. McKenney
2010-11-08 20:40               ` Frederic Weisbecker
2010-11-10 18:08                 ` Paul E. McKenney
2010-11-08 15:06     ` Frederic Weisbecker
2010-11-08 15:18       ` Joe Korty
2010-11-08 19:50         ` Paul E. McKenney
2010-11-08 19:49       ` Paul E. McKenney
2010-11-08 20:51         ` Frederic Weisbecker
2010-11-06 20:03   ` Mathieu Desnoyers
2010-11-09  9:22   ` Lai Jiangshan
2010-11-10 15:54     ` Frederic Weisbecker
2010-11-10 17:31       ` Peter Zijlstra
2010-11-10 17:45         ` Frederic Weisbecker
2010-11-11  4:19         ` Paul E. McKenney
2010-11-13 22:30           ` Frederic Weisbecker
2010-11-16  1:28             ` Paul E. McKenney
2010-11-16 13:52               ` Frederic Weisbecker
2010-11-16 15:51                 ` Paul E. McKenney
2010-11-17  0:52                   ` Frederic Weisbecker
2010-11-17  1:25                     ` Paul E. McKenney
2011-03-07 20:31                     ` [PATCH] An RCU for SMP with a single CPU garbage collector Joe Korty
     [not found]                       ` <20110307210157.GG3104@linux.vnet.ibm.com>
2011-03-07 21:16                         ` Joe Korty
2011-03-07 21:33                           ` Joe Korty
2011-03-07 22:51                           ` Joe Korty
2011-03-08  9:07                             ` Paul E. McKenney
2011-03-08 15:57                               ` Joe Korty
2011-03-08 22:53                                 ` Joe Korty
2011-03-10  0:30                                   ` Paul E. McKenney
2011-03-10  0:28                                 ` Paul E. McKenney
2011-03-09 22:29                           ` Frederic Weisbecker
2011-03-09 22:15                       ` [PATCH 2/4] jrcu: tap rcu_read_unlock Joe Korty
2011-03-10  0:34                         ` Paul E. McKenney
2011-03-10 19:50                           ` JRCU Theory of Operation Joe Korty
2011-03-12 14:36                             ` Paul E. McKenney
2011-03-13  0:43                               ` Joe Korty
2011-03-13  5:56                                 ` Paul E. McKenney
2011-03-13 23:53                                   ` Joe Korty
2011-03-14  0:50                                     ` Paul E. McKenney
2011-03-14  0:55                                       ` Josh Triplett
2011-03-09 22:16                       ` [PATCH 3/4] jrcu: tap might_resched() Joe Korty
2011-03-09 22:17                       ` [PATCH 4/4] jrcu: add new stat to /sys/kernel/debug/rcu/rcudata Joe Korty
2011-03-09 22:19                       ` [PATCH 1/4] jrcu: remove preempt_enable() tap [resend] Joe Korty
2011-03-12 14:36                       ` [PATCH] An RCU for SMP with a single CPU garbage collector Paul E. McKenney
2011-03-13  1:25                         ` Joe Korty
2011-03-13  6:09                           ` Paul E. McKenney
     [not found] <757455806.950179.1289232791283.JavaMail.root@sz0076a.westchester.pa.mail.comcast.net>
2010-11-08 16:15 ` [PATCH] a local-timer-free version of RCU houston.jim
2010-11-08 19:52   ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).