* Kernel RCU: shrink the size of the struct rcu_head
@ 2009-10-18 23:29 Mathieu Desnoyers
  2009-10-20 22:07 ` Paul E. McKenney
  0 siblings, 1 reply; 7+ messages in thread
From: Mathieu Desnoyers @ 2009-10-18 23:29 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel

Hi Paul,

I noticed that you already discussed the possibility of shrinking the
struct rcu_head by removing the function pointer.
(http://kernel.org/pub/linux/kernel/people/paulmck/rcutodo.html)

The ideas brought up so far require per-callback lists, which involve
a bit of management overhead and do not permit keeping the call_rcu()
callbacks in CPU order.

You might want to look into the Userspace RCU urcu-defer.c
implementation, where I use pointer encoding to compact the common
case: the same callback being passed to call_rcu() multiple times in a
row. This is very typical with multiple free() calls for different data
structures next to each other.

This typically keeps the information to encode per callback down to a
minimum: a single pointer. It would be good to trace the kernel usage
of call_rcu() to see whether my assumption holds.
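
To give a rough idea of the encoding -- this is only an illustrative
sketch, not the actual urcu-defer.c code, and the defer_queue /
DQ_FCT_MARK names are made up for the example -- the per-thread queue
can be a flat array of void *, with a marker entry flagging a change of
callback:

/*
 * Illustrative sketch only.  Overflow handling and the corner case
 * where a data pointer happens to equal DQ_FCT_MARK are omitted.
 */
#define DQ_SIZE		4096
#define DQ_FCT_MARK	((void *)1UL)	/* "next entry is a new callback" */

struct defer_queue {
	void *buf[DQ_SIZE];
	unsigned long head;
	void (*last_fct)(void *p);	/* callback of the previous entry */
};

static void defer_enqueue(struct defer_queue *q,
			  void (*fct)(void *p), void *p)
{
	if (fct != q->last_fct) {
		/* Callback changed: push a marker, then the new callback. */
		q->buf[q->head++ & (DQ_SIZE - 1)] = DQ_FCT_MARK;
		q->buf[q->head++ & (DQ_SIZE - 1)] = (void *)fct;
		q->last_fct = fct;
	}
	/* Common case: one word per deferred call. */
	q->buf[q->head++ & (DQ_SIZE - 1)] = p;
}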

I just thought I should tell you before you start looking at this
issue further.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* Re: Kernel RCU: shrink the size of the struct rcu_head
  2009-10-18 23:29 Kernel RCU: shrink the size of the struct rcu_head Mathieu Desnoyers
@ 2009-10-20 22:07 ` Paul E. McKenney
  2009-10-21 14:53   ` Mathieu Desnoyers
  0 siblings, 1 reply; 7+ messages in thread
From: Paul E. McKenney @ 2009-10-20 22:07 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: linux-kernel

On Sun, Oct 18, 2009 at 07:29:18PM -0400, Mathieu Desnoyers wrote:
> Hi Paul,
> 
> I noticed that you already discussed the possibility of shrinking the
> struct rcu_head by removing the function pointer.
> (http://kernel.org/pub/linux/kernel/people/paulmck/rcutodo.html)
> 
> The ideas brought in so far require having per-callback lists, which
> involves a bit of management overhead and don't permit keeping the
> call_rcu() in cpu order.

But please note that this is on the "Possibly Dubious Changes" list.  ;-)

> You might want to look into the Userspace RCU urcu-defer.c
> implementation, where I perform pointer encoding to compact the usual
> case, expected to be the same callback passed as parameter multiple
> times in a row to call_rcu(). This is very typical with multiple free()
> calls for different data structures next to each other.
> 
> This typically keeps the size of the information to encode per callback
> down to a minimum: the size of a single pointer. It would be good to
> trace the kernel usage of call_rcu() to see if my assumption holds.
> 
> I just thought I should tell you before you start looking at this
> issue further.

So the idea is to maintain a per-CPU queue of function pointers, but
with the pointers on this queue encoded to save space, correct?  If I
understand correctly, the user-level rcu-defer implementation relies on
the following:

1.	It is illegal to call _rcu_defer_queue() within an RCU read-side
	critical section (due to the call to rcu_defer_barrier_thread(),
	which in turn calls synchronize_rcu()).  This is necessary to
	handle queue overflow.  (Which appears to be why you introduce
	a new API, as it is legal to invoke call_rcu() from within an
	RCU read-side critical section.)

2.	It is OK to wait for a grace period when a thread calls
	rcu_defer_unregister_thread() while exiting.  In the kernel,
	this is roughly equivalent to the CPU_DYING notifier, which
	cannot block, thus cannot wait for a grace period.

	I could imagine copying the per-CPU buffer somewhere, though
	my experience with the RCU/CPU-hotplug interface does not
	encourage me in this direction.  ;-)

							Thanx, Paul

* Re: Kernel RCU: shrink the size of the struct rcu_head
  2009-10-20 22:07 ` Paul E. McKenney
@ 2009-10-21 14:53   ` Mathieu Desnoyers
  2009-10-23  0:40     ` Paul E. McKenney
  0 siblings, 1 reply; 7+ messages in thread
From: Mathieu Desnoyers @ 2009-10-21 14:53 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Sun, Oct 18, 2009 at 07:29:18PM -0400, Mathieu Desnoyers wrote:
> > Hi Paul,
> > 
> > I noticed that you already discussed the possibility of shrinking the
> > struct rcu_head by removing the function pointer.
> > (http://kernel.org/pub/linux/kernel/people/paulmck/rcutodo.html)
> > 
> > The ideas brought in so far require having per-callback lists, which
> > involves a bit of management overhead and don't permit keeping the
> > call_rcu() in cpu order.
> 
> But please note that this is on the "Possibly Dubious Changes" list.  ;-)
> 
> > You might want to look into the Userspace RCU urcu-defer.c
> > implementation, where I perform pointer encoding to compact the usual
> > case, expected to be the same callback passed as parameter multiple
> > times in a row to call_rcu(). This is very typical with multiple free()
> > calls for different data structures next to each other.
> > 
> > This typically keeps the size of the information to encode per callback
> > down to a minimum: the size of a single pointer. It would be good to
> > trace the kernel usage of call_rcu() to see if my assumption holds.
> > 
> > I just thought I should tell you before you start looking at this
> > issue further.
> 
> So the idea is to maintain a per-CPU queue of function pointers, but
> with the pointers on this queue encoded to save space, correct?

Yes, exactly.

>  If I
> understand correctly, the user-level rcu-defer implementation relies on
> the following:
> 
> 1.	It is illegal to call _rcu_defer_queue() within an RCU read-side
> 	critical section (due to the call to rcu_defer_barrier_thread()
> 	which in turn calls synchronize_rcu().  This is necessary to
> 	handle queue overflow.  (Which appears to be why you introduce
> 	a new API, as it is legal to invoke call_rcu() from within an
> 	RCU read-side critical section.)

When dealing with queue overflow, I figured we have 4 alternatives:

1, 2, 3) We execute {the single, all, thread-local} callback(s) on the
         spot after a synchronize_rcu().

4) We expand the queue by allocating more memory.

The idea of pointer encoding to save space could be used with any of 1,
2, 3, or 4. As you say, call_rcu() requires (4), because it tolerates
being called from an RCU read-side C.S. Options 1, 2 and 3 are
incompatible with read-side C.S. context because they call
synchronize_rcu() within the C.S., which deadlocks: the grace period
would wait for the very read-side C.S. it is called from (see the
sketch below).
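
For instance (an illustrative fragment only, using the
rcu_defer_queue() name from this discussion):

#include <urcu.h>
#include <stdlib.h>

/* Assume the fixed-size-queue deferral discussed above. */
extern void rcu_defer_queue(void (*fct)(void *p), void *p);

static void broken_usage(void *obj)
{
	rcu_read_lock();
	/*
	 * If the per-thread queue happens to be full, the call below
	 * falls back to synchronize_rcu(), which waits for every
	 * reader -- including the read-side C.S. we are still inside:
	 * self-deadlock.
	 */
	rcu_defer_queue(free, obj);
	rcu_read_unlock();
}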

Now, there is a rationale for the choice of (3) in my urcu-defer
implementation:

* It's how I can deal with out-of-memory conditions (-ENOMEM) without
  letting the system die with exit(). How does the kernel call_rcu()
  deal with this currently? BUG_ON, WARN_ON?

* It acts as a rate limiter for urcu_defer_queue(). Basically, if a
  thread starts enqueuing callbacks too fast, it will eventually fill
  its queue and have to empty it itself. AFAIK, it's not possible to do
  that if you allow call_rcu() to be called from a read-side C.S.

I could even extend rcu_defer_queue() to take a second rate-limiter
callback, which would check whether the thread went over some threshold
and express the limit more precisely (e.g. in terms of the amount of
memory to be freed) than the current "4096 callbacks in flight max",
which was chosen by benchmarking but is a bit arbitrary with respect to
the overall effect of the callbacks.

How important is it, in terms of real-life usage, to permit enqueuing
callbacks from within an RCU read-side C.S.? If this use case really
matters, then I could have a mode for call_rcu() that expands the RCU
callback queue upon overflow. But as I argue above, I really prefer the
control we get with a fixed-size queue.

> 
> 2.	It is OK to wait for a grace period when a thread calls
> 	rcu_defer_unregister_thread() while exiting.  In the kernel,
> 	this is roughly equivalent to the CPU_DYING notifier, which
> 	cannot block, thus cannot wait for a grace period.
> 
> 	I could imagine copying the per-CPU buffer somewhere, though
> 	my experience with the RCU/CPU-hotplug interface does not
> 	encourage me in this direction.  ;-)

As you say, we don't _have_ to empty the queue before taking a
thread/CPU offline. We could simply copy the unplugged CPU's queue to
an orphan queue, as you currently do in your implementation. I agree
that this would be better suited to the CPU-hotplug CPU_DYING execution
context, given its inherent trickiness.

Thanks,

Mathieu

> 
> 							Thanx, Paul

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* Re: Kernel RCU: shrink the size of the struct rcu_head
  2009-10-21 14:53   ` Mathieu Desnoyers
@ 2009-10-23  0:40     ` Paul E. McKenney
  2009-10-23 12:29       ` Mathieu Desnoyers
  0 siblings, 1 reply; 7+ messages in thread
From: Paul E. McKenney @ 2009-10-23  0:40 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: linux-kernel

On Wed, Oct 21, 2009 at 10:53:15AM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Sun, Oct 18, 2009 at 07:29:18PM -0400, Mathieu Desnoyers wrote:
> > > Hi Paul,
> > > 
> > > I noticed that you already discussed the possibility of shrinking the
> > > struct rcu_head by removing the function pointer.
> > > (http://kernel.org/pub/linux/kernel/people/paulmck/rcutodo.html)
> > > 
> > > The ideas brought in so far require having per-callback lists, which
> > > involves a bit of management overhead and don't permit keeping the
> > > call_rcu() in cpu order.
> > 
> > But please note that this is on the "Possibly Dubious Changes" list.  ;-)
> > 
> > > You might want to look into the Userspace RCU urcu-defer.c
> > > implementation, where I perform pointer encoding to compact the usual
> > > case, expected to be the same callback passed as parameter multiple
> > > times in a row to call_rcu(). This is very typical with multiple free()
> > > calls for different data structures next to each other.
> > > 
> > > This typically keeps the size of the information to encode per callback
> > > down to a minimum: the size of a single pointer. It would be good to
> > > trace the kernel usage of call_rcu() to see if my assumption holds.
> > > 
> > > I just thought I should tell you before you start looking at this
> > > issue further.
> > 
> > So the idea is to maintain a per-CPU queue of function pointers, but
> > with the pointers on this queue encoded to save space, correct?
> 
> Yes, exactly.

OK, I will add something to this effect on my rcutodo page.

> >  If I
> > understand correctly, the user-level rcu-defer implementation relies on
> > the following:
> > 
> > 1.	It is illegal to call _rcu_defer_queue() within an RCU read-side
> > 	critical section (due to the call to rcu_defer_barrier_thread()
> > 	which in turn calls synchronize_rcu().  This is necessary to
> > 	handle queue overflow.  (Which appears to be why you introduce
> > 	a new API, as it is legal to invoke call_rcu() from within an
> > 	RCU read-side critical section.)
> 
> When dealing with queue overflow, I figured we have 4 alternatives.
> Either:
> 
> 1, 2, 3) We proceed to execution of {the single, all, thread local}
>          callback(s) on the spot after a synchronize_rcu().
> 
> 4) We expand the queue by allocating more memory.
> 
> The idea of pointer encoding to save space could be used with any of 1,
> 2, 3, or 4. As you say, call_rcu() requires (4), because it tolerates
> being called from an rcu read-side C.S.. 1, 2, 3 are incompatible with
> read-side C.S. context because they require to use synchronize_rcu()
> within the C.S., which would deadlock on its calling context.
> 
> Now, there is a rationale for the choice of (3) in my urcu-defer
> implementation:
> 
> * It's how I can deal with memory full (-ENOMEM) without letting the
>   system die with exit(). How does the kernel call_rcu() deal with this
>   currently ? BUG_ON, WARN_ON ?

It doesn't have to do anything -- the caller of call_rcu() is responsible
for allocating any required memory.  So call_rcu() never allocates memory
and thus never needs to worry about a memory-allocation failure.
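
For example, the usual in-kernel pattern looks something like the
following (struct foo being just an illustration):

#include <linux/kernel.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct foo {
	int data;
	struct rcu_head rcu;	/* storage reserved when foo is allocated */
};

static void foo_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct foo, rcu));
}

static void foo_release(struct foo *fp)
{
	/* No allocation here, so this cannot fail. */
	call_rcu(&fp->rcu, foo_free_rcu);
}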

> * It acts as a rate limiter for urcu_defer_queue(). Basically, if a
>   thread starts enqueuing callbacks too fast, it will eventually fill its
>   queue and have to empty it itself. AFAIK, It's not possible to do that
>   if you allow call_rcu() to be called from read-side C.S..

Yep!  ;-)

> I could even extend rcu_defer_queue() to take a second rate-limiter
> callback, which would check if the thread went over some threshold and
> give a more precise limit (e.g. amount of memory to be freed) on the
> rate than the "4096 callbacks in flight max", which have been chosen by
> benchmarks, but is a bit arbitrary in terms of overall callback effect.
> 
> How important is it to permit enqueuing callbacks from within rcu
> read-side C.S. in terms of real-life usage ? If it is really that
> important to fill this use-case, then I could have a mode for call_rcu()
> that expands the RCU callback queue upon overflow. But as I argue above,
> I really prefer the control we have with a fixed-sized queue.

There are occurrences of this in the Linux kernel.  In theory, you could
always hang onto the object until leaving the outermost RCU read-side
critical section, but in practice this is not always consistent with
good software-engineering practice.

One use case is when you have an RCU-protected list, each element of
which has an RCU-protected list hanging off it.  In this case, you might
scan the upper-level list under RCU protection, but during the scan you
might need to remove elements from the lower-level list and pass them
to call_rcu().
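
Roughly like this (the structures and the spinlock protecting the
lower-level list are of course just an illustration):

#include <linux/rculist.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct inner {
	struct list_head node;
	struct rcu_head rcu;
};

struct outer {
	struct list_head node;
	struct list_head inner_list;	/* RCU-protected sublist */
	spinlock_t inner_lock;		/* serializes updates to inner_list */
};

static void inner_free_rcu(struct rcu_head *head)
{
	kfree(container_of(head, struct inner, rcu));
}

static void prune_inner_elements(struct list_head *outer_list)
{
	struct outer *o;
	struct inner *i, *tmp;

	rcu_read_lock();
	list_for_each_entry_rcu(o, outer_list, node) {
		spin_lock(&o->inner_lock);
		list_for_each_entry_safe(i, tmp, &o->inner_list, node) {
			/* A real caller would apply some condition here. */
			list_del_rcu(&i->node);
			call_rcu(&i->rcu, inner_free_rcu); /* inside the read-side C.S. */
		}
		spin_unlock(&o->inner_lock);
	}
	rcu_read_unlock();
}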

So it really needs to be legal for call_rcu() to be invoked from within
an RCU read-side critical section.

> > 2.	It is OK to wait for a grace period when a thread calls
> > 	rcu_defer_unregister_thread() while exiting.  In the kernel,
> > 	this is roughly equivalent to the CPU_DYING notifier, which
> > 	cannot block, thus cannot wait for a grace period.
> > 
> > 	I could imagine copying the per-CPU buffer somewhere, though
> > 	my experience with the RCU/CPU-hotplug interface does not
> > 	encourage me in this direction.  ;-)
> 
> As you say, we don't _have_ to empty the queue before putting a
> thread/cpu offline. We could simply copy the unplugged cpu queue to an
> orphan queue, as you currently do in your implementation. I agree that
> it would be more suitable to the cpu hotplug CPU_DYING execution
> context, due to its inherent trickiness.

Especially if you want something like rcu_barrier() to continue working.

Hmmm...  Can user applications unload dynamically linked libraries?  ;-)

							Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> > 
> > 							Thanx, Paul
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* Re: Kernel RCU: shrink the size of the struct rcu_head
  2009-10-23  0:40     ` Paul E. McKenney
@ 2009-10-23 12:29       ` Mathieu Desnoyers
  2009-10-23 16:22         ` Paul E. McKenney
  0 siblings, 1 reply; 7+ messages in thread
From: Mathieu Desnoyers @ 2009-10-23 12:29 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Wed, Oct 21, 2009 at 10:53:15AM -0400, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Sun, Oct 18, 2009 at 07:29:18PM -0400, Mathieu Desnoyers wrote:
> > > > Hi Paul,
> > > > 
> > > > I noticed that you already discussed the possibility of shrinking the
> > > > struct rcu_head by removing the function pointer.
> > > > (http://kernel.org/pub/linux/kernel/people/paulmck/rcutodo.html)
> > > > 
> > > > The ideas brought in so far require having per-callback lists, which
> > > > involves a bit of management overhead and don't permit keeping the
> > > > call_rcu() in cpu order.
> > > 
> > > But please note that this is on the "Possibly Dubious Changes" list.  ;-)
> > > 
> > > > You might want to look into the Userspace RCU urcu-defer.c
> > > > implementation, where I perform pointer encoding to compact the usual
> > > > case, expected to be the same callback passed as parameter multiple
> > > > times in a row to call_rcu(). This is very typical with multiple free()
> > > > calls for different data structures next to each other.
> > > > 
> > > > This typically keeps the size of the information to encode per callback
> > > > down to a minimum: the size of a single pointer. It would be good to
> > > > trace the kernel usage of call_rcu() to see if my assumption holds.
> > > > 
> > > > I just thought I should tell you before you start looking at this
> > > > issue further.
> > > 
> > > So the idea is to maintain a per-CPU queue of function pointers, but
> > > with the pointers on this queue encoded to save space, correct?
> > 
> > Yes, exactly.
> 
> OK, I will add something to this effect on my rcutodo page.
> 
> > >  If I
> > > understand correctly, the user-level rcu-defer implementation relies on
> > > the following:
> > > 
> > > 1.	It is illegal to call _rcu_defer_queue() within an RCU read-side
> > > 	critical section (due to the call to rcu_defer_barrier_thread()
> > > 	which in turn calls synchronize_rcu().  This is necessary to
> > > 	handle queue overflow.  (Which appears to be why you introduce
> > > 	a new API, as it is legal to invoke call_rcu() from within an
> > > 	RCU read-side critical section.)
> > 
> > When dealing with queue overflow, I figured we have 4 alternatives.
> > Either:
> > 
> > 1, 2, 3) We proceed to execution of {the single, all, thread local}
> >          callback(s) on the spot after a synchronize_rcu().
> > 
> > 4) We expand the queue by allocating more memory.
> > 
> > The idea of pointer encoding to save space could be used with any of 1,
> > 2, 3, or 4. As you say, call_rcu() requires (4), because it tolerates
> > being called from an rcu read-side C.S.. 1, 2, 3 are incompatible with
> > read-side C.S. context because they require to use synchronize_rcu()
> > within the C.S., which would deadlock on its calling context.
> > 
> > Now, there is a rationale for the choice of (3) in my urcu-defer
> > implementation:
> > 
> > * It's how I can deal with memory full (-ENOMEM) without letting the
> >   system die with exit(). How does the kernel call_rcu() deal with this
> >   currently ? BUG_ON, WARN_ON ?
> 
> It doesn't have to do anything -- the caller of call_rcu() is responsible
> for allocating any required memory.  So call_rcu() never allocates memory
> and thus never needs to worry about a memory-allocation failure.
> 

I see. So it shifts the burden of memory allocation to the creation of
the original object we are dealing with. I understand now why it's done
that way: it does not add another memory-allocation failure point.

> > * It acts as a rate limiter for urcu_defer_queue(). Basically, if a
> >   thread starts enqueuing callbacks too fast, it will eventually fill its
> >   queue and have to empty it itself. AFAIK, It's not possible to do that
> >   if you allow call_rcu() to be called from read-side C.S..
> 
> Yep!  ;-)

From a userland point of view, it would be good to provide an API that
allows rate-limiting, especially to deal with DoS types of attacks.
I just renamed rcu_defer_queue() to defer_rcu() and removed the define
for call_rcu(), because that mapping introduced an API with the same
name as the kernel API but with different parameters.

So the idea is to have:

* defer_rcu(fct, p): fixed-size-queue RCU work deferral. Cannot be
  called from an RCU read-side C.S.
* defer_rcu_ratelimit(fct, p, rl): same as above, but with an added
  rate-limiter callback.

and, eventually, to add back a standard call_rcu(), which can be called
from within an RCU read-side C.S. but provides no rate limiting
whatsoever.
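
Tentative prototypes (the rate-limiter signature below is just a
strawman, nothing is set in stone yet):

void defer_rcu(void (*fct)(void *p), void *p);

/*
 * rl() returns nonzero when the caller should flush its own queue,
 * e.g. based on the amount of memory pending, rather than on the
 * fixed "4096 callbacks in flight" threshold.
 */
void defer_rcu_ratelimit(void (*fct)(void *p), void *p,
			 int (*rl)(void *p));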

> 
> > I could even extend rcu_defer_queue() to take a second rate-limiter
> > callback, which would check if the thread went over some threshold and
> > give a more precise limit (e.g. amount of memory to be freed) on the
> > rate than the "4096 callbacks in flight max", which have been chosen by
> > benchmarks, but is a bit arbitrary in terms of overall callback effect.
> > 
> > How important is it to permit enqueuing callbacks from within rcu
> > read-side C.S. in terms of real-life usage ? If it is really that
> > important to fill this use-case, then I could have a mode for call_rcu()
> > that expands the RCU callback queue upon overflow. But as I argue above,
> > I really prefer the control we have with a fixed-sized queue.
> 
> There are occurrences of this in the Linux kernel.  In theory, you could
> always hang onto the object until leaving the outermost RCU read-side
> critical section, but in practice this is not always consistent with
> good software-engineering practice.
> 
> One use case is when you have an RCU-protected list, each element of
> which has an RCU-protected list hanging off it.  In this case, you might
> scan the upper-level list under RCU protection, but during the scan you
> might need to remove elements from the lower-level list and pass them
> to call_rcu().
> 
> So it really needs to be legal for call_rcu() to be invoked from within
> an RCU read-side critical section.

I agree that not being able to use call_rcu() from a read-side C.S.
could really be a problem here.

This is why I think having two interfaces seems like a good tradeoff:
one that permits calling call_rcu() from within a C.S. but requires the
added struct rcu_head, and the other using bounded per-thread queues
with rate limiting, but which cannot be used in a read-side C.S.

> 
> > > 2.	It is OK to wait for a grace period when a thread calls
> > > 	rcu_defer_unregister_thread() while exiting.  In the kernel,
> > > 	this is roughly equivalent to the CPU_DYING notifier, which
> > > 	cannot block, thus cannot wait for a grace period.
> > > 
> > > 	I could imagine copying the per-CPU buffer somewhere, though
> > > 	my experience with the RCU/CPU-hotplug interface does not
> > > 	encourage me in this direction.  ;-)
> > 
> > As you say, we don't _have_ to empty the queue before putting a
> > thread/cpu offline. We could simply copy the unplugged cpu queue to an
> > orphan queue, as you currently do in your implementation. I agree that
> > it would be more suitable to the cpu hotplug CPU_DYING execution
> > context, due to its inherent trickiness.
> 
> Especially if you want something like rcu_barrier() to continue working.
> 
> Hmmm...  Can user applications unload dynamically linked libraries?  ;-)

Yes. With dlclose(). But I expect library developers to call
rcu_defer_barrier() in their destructor if they have queued any work.
(/me sneaks a README update in the git tree to that effect) ;)
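
Something along these lines (GCC destructor attribute; sketch only):

/* rcu_defer_barrier() is provided by urcu-defer; header left out here. */
extern void rcu_defer_barrier(void);

static void __attribute__((destructor))
mylib_fini(void)
{
	/*
	 * Flush any callbacks this library queued before dlclose()
	 * unmaps the code their function pointers refer to.
	 */
	rcu_defer_barrier();
}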

Thanks,

Mathieu

> 
> 							Thanx, Paul
> 
> > Thanks,
> > 
> > Mathieu
> > 
> > > 
> > > 							Thanx, Paul
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* Re: Kernel RCU: shrink the size of the struct rcu_head
  2009-10-23 12:29       ` Mathieu Desnoyers
@ 2009-10-23 16:22         ` Paul E. McKenney
  2009-10-23 17:41           ` Mathieu Desnoyers
  0 siblings, 1 reply; 7+ messages in thread
From: Paul E. McKenney @ 2009-10-23 16:22 UTC (permalink / raw)
  To: Mathieu Desnoyers; +Cc: linux-kernel

On Fri, Oct 23, 2009 at 08:29:38AM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > On Wed, Oct 21, 2009 at 10:53:15AM -0400, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > On Sun, Oct 18, 2009 at 07:29:18PM -0400, Mathieu Desnoyers wrote:
> > > > > Hi Paul,
> > > > > 
> > > > > I noticed that you already discussed the possibility of shrinking the
> > > > > struct rcu_head by removing the function pointer.
> > > > > (http://kernel.org/pub/linux/kernel/people/paulmck/rcutodo.html)
> > > > > 
> > > > > The ideas brought in so far require having per-callback lists, which
> > > > > involves a bit of management overhead and don't permit keeping the
> > > > > call_rcu() in cpu order.
> > > > 
> > > > But please note that this is on the "Possibly Dubious Changes" list.  ;-)
> > > > 
> > > > > You might want to look into the Userspace RCU urcu-defer.c
> > > > > implementation, where I perform pointer encoding to compact the usual
> > > > > case, expected to be the same callback passed as parameter multiple
> > > > > times in a row to call_rcu(). This is very typical with multiple free()
> > > > > calls for different data structures next to each other.
> > > > > 
> > > > > This typically keeps the size of the information to encode per callback
> > > > > down to a minimum: the size of a single pointer. It would be good to
> > > > > trace the kernel usage of call_rcu() to see if my assumption holds.
> > > > > 
> > > > > I just thought I should tell you before you start looking at this
> > > > > issue further.
> > > > 
> > > > So the idea is to maintain a per-CPU queue of function pointers, but
> > > > with the pointers on this queue encoded to save space, correct?
> > > 
> > > Yes, exactly.
> > 
> > OK, I will add something to this effect on my rcutodo page.
> > 
> > > >  If I
> > > > understand correctly, the user-level rcu-defer implementation relies on
> > > > the following:
> > > > 
> > > > 1.	It is illegal to call _rcu_defer_queue() within an RCU read-side
> > > > 	critical section (due to the call to rcu_defer_barrier_thread()
> > > > 	which in turn calls synchronize_rcu().  This is necessary to
> > > > 	handle queue overflow.  (Which appears to be why you introduce
> > > > 	a new API, as it is legal to invoke call_rcu() from within an
> > > > 	RCU read-side critical section.)
> > > 
> > > When dealing with queue overflow, I figured we have 4 alternatives.
> > > Either:
> > > 
> > > 1, 2, 3) We proceed to execution of {the single, all, thread local}
> > >          callback(s) on the spot after a synchronize_rcu().
> > > 
> > > 4) We expand the queue by allocating more memory.
> > > 
> > > The idea of pointer encoding to save space could be used with any of 1,
> > > 2, 3, or 4. As you say, call_rcu() requires (4), because it tolerates
> > > being called from an rcu read-side C.S.. 1, 2, 3 are incompatible with
> > > read-side C.S. context because they require to use synchronize_rcu()
> > > within the C.S., which would deadlock on its calling context.
> > > 
> > > Now, there is a rationale for the choice of (3) in my urcu-defer
> > > implementation:
> > > 
> > > * It's how I can deal with memory full (-ENOMEM) without letting the
> > >   system die with exit(). How does the kernel call_rcu() deal with this
> > >   currently ? BUG_ON, WARN_ON ?
> > 
> > It doesn't have to do anything -- the caller of call_rcu() is responsible
> > for allocating any required memory.  So call_rcu() never allocates memory
> > and thus never needs to worry about a memory-allocation failure.
> 
> I see. So it puts the burden of memory allocation onto the creation of
> the original object we are dealing with. I understand why it's done like
> that now: it does not add another memory allocation failure point.
> 
> > > * It acts as a rate limiter for urcu_defer_queue(). Basically, if a
> > >   thread starts enqueuing callbacks too fast, it will eventually fill its
> > >   queue and have to empty it itself. AFAIK, It's not possible to do that
> > >   if you allow call_rcu() to be called from read-side C.S..
> > 
> > Yep!  ;-)
> 
> From userland point of view, it would be good to provide an API that
> allows rate-limiting, especially to deal with DoS types of attacks. I
> I just renamed rcu_defer_queue() to defer_rcu() and removed the define
> for call_rcu(), because this last mapping introduced an API with the
> same name as the kernel API, but with different parameters.
> 
> So the idea is to have:
> 
> * defer_rcu(fct, p): fixed-sized queue rcu work deferral. Cannot be
>   called from rcu read-side C.S..
> * defer_rcu_ratelimit(fct, p, rl): same as above, but with added rate
>   limiter callback.
> 
> and, eventually, to add back a standard call_rcu(), which can be called
> from within RCU read-side C.S., but which provides no rate limiting
> whatsoever.

These sound to me like interesting experiments, but I strongly
suggest publishing them as "experimental, may change" or whatever.
One thing to keep in mind is that the rate-limiting should not be used
in real-time applications.  Alternatively, you could allow defer_rcu()
and defer_rcu_ratelimit() to return failure if the callback could not
be queued immediately.  Not that figuring out what to do in the failure
case is particularly attractive...

Of course, the memory savings will depend on the workload.  The
defer_rcu*() primitives eliminate the need for a struct rcu_head in
each data structure, but require that the per-CPU/task callback queues
be present.
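
Back-of-envelope, assuming a 64-bit system where struct rcu_head is two
pointers (16 bytes) and a 4096-entry queue of 8-byte words:

	embedded rcu_head:  16 bytes reserved in every object, whether
	                    or not a callback is pending
	per-CPU queue:      4096 * 8 bytes = 32 KB fixed per CPU/task,
	                    plus ~8 bytes per callback actually in flight

So with, say, a million RCU-freed objects live at once, embedding costs
about 16 MB of always-present headers, while the queue approach costs
32 KB per CPU/task plus one word per pending callback.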

> > > I could even extend rcu_defer_queue() to take a second rate-limiter
> > > callback, which would check if the thread went over some threshold and
> > > give a more precise limit (e.g. amount of memory to be freed) on the
> > > rate than the "4096 callbacks in flight max", which have been chosen by
> > > benchmarks, but is a bit arbitrary in terms of overall callback effect.
> > > 
> > > How important is it to permit enqueuing callbacks from within rcu
> > > read-side C.S. in terms of real-life usage ? If it is really that
> > > important to fill this use-case, then I could have a mode for call_rcu()
> > > that expands the RCU callback queue upon overflow. But as I argue above,
> > > I really prefer the control we have with a fixed-sized queue.
> > 
> > There are occurrences of this in the Linux kernel.  In theory, you could
> > always hang onto the object until leaving the outermost RCU read-side
> > critical section, but in practice this is not always consistent with
> > good software-engineering practice.
> > 
> > One use case is when you have an RCU-protected list, each element of
> > which has an RCU-protected list hanging off it.  In this case, you might
> > scan the upper-level list under RCU protection, but during the scan you
> > might need to remove elements from the lower-level list and pass them
> > to call_rcu().
> > 
> > So it really needs to be legal for call_rcu() to be invoked from within
> > an RCU read-side critical section.
> 
> I agree that not being able to use call_rcu() from a read-side C.S.
> could really be a problem here.
> 
> This is why I think aving two interfaces, one permitting calling
> call_rcu() from within C.S., but requiring the added struct rcu_head,
> and the other using per-thread queues with maximum size, rate limiting,
> but which cannot be used in read-side C.S. seems like a good tradeoff.

The defer_rcu() interfaces might work out quite well, but it would be
good to see how they do in real life.  ;-)

> > > > 2.	It is OK to wait for a grace period when a thread calls
> > > > 	rcu_defer_unregister_thread() while exiting.  In the kernel,
> > > > 	this is roughly equivalent to the CPU_DYING notifier, which
> > > > 	cannot block, thus cannot wait for a grace period.
> > > > 
> > > > 	I could imagine copying the per-CPU buffer somewhere, though
> > > > 	my experience with the RCU/CPU-hotplug interface does not
> > > > 	encourage me in this direction.  ;-)
> > > 
> > > As you say, we don't _have_ to empty the queue before putting a
> > > thread/cpu offline. We could simply copy the unplugged cpu queue to an
> > > orphan queue, as you currently do in your implementation. I agree that
> > > it would be more suitable to the cpu hotplug CPU_DYING execution
> > > context, due to its inherent trickiness.
> > 
> > Especially if you want something like rcu_barrier() to continue working.
> > 
> > Hmmm...  Can user applications unload dynamically linked libraries?  ;-)
> 
> Yes. With dlclose(). But I expect library developers to call
> rcu_defer_barrier() in their destructor if they have queued any work.
> (/me sneaks a README update in the git tree to that effect) ;)

;-)

							Thanx, Paul

> Thanks,
> 
> Mathieu
> 
> > 
> > 							Thanx, Paul
> > 
> > > Thanks,
> > > 
> > > Mathieu
> > > 
> > > > 
> > > > 							Thanx, Paul
> > > 
> > > -- 
> > > Mathieu Desnoyers
> > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> 
> -- 
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

* Re: Kernel RCU: shrink the size of the struct rcu_head
  2009-10-23 16:22         ` Paul E. McKenney
@ 2009-10-23 17:41           ` Mathieu Desnoyers
  0 siblings, 0 replies; 7+ messages in thread
From: Mathieu Desnoyers @ 2009-10-23 17:41 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: linux-kernel

* Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> On Fri, Oct 23, 2009 at 08:29:38AM -0400, Mathieu Desnoyers wrote:
> > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > On Wed, Oct 21, 2009 at 10:53:15AM -0400, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney (paulmck@linux.vnet.ibm.com) wrote:
> > > > > On Sun, Oct 18, 2009 at 07:29:18PM -0400, Mathieu Desnoyers wrote:
> > > > > > Hi Paul,
> > > > > > 
> > > > > > I noticed that you already discussed the possibility of shrinking the
> > > > > > struct rcu_head by removing the function pointer.
> > > > > > (http://kernel.org/pub/linux/kernel/people/paulmck/rcutodo.html)
> > > > > > 
> > > > > > The ideas brought in so far require having per-callback lists, which
> > > > > > involves a bit of management overhead and don't permit keeping the
> > > > > > call_rcu() in cpu order.
> > > > > 
> > > > > But please note that this is on the "Possibly Dubious Changes" list.  ;-)
> > > > > 
> > > > > > You might want to look into the Userspace RCU urcu-defer.c
> > > > > > implementation, where I perform pointer encoding to compact the usual
> > > > > > case, expected to be the same callback passed as parameter multiple
> > > > > > times in a row to call_rcu(). This is very typical with multiple free()
> > > > > > calls for different data structures next to each other.
> > > > > > 
> > > > > > This typically keeps the size of the information to encode per callback
> > > > > > down to a minimum: the size of a single pointer. It would be good to
> > > > > > trace the kernel usage of call_rcu() to see if my assumption holds.
> > > > > > 
> > > > > > I just thought I should tell you before you start looking at this
> > > > > > issue further.
> > > > > 
> > > > > So the idea is to maintain a per-CPU queue of function pointers, but
> > > > > with the pointers on this queue encoded to save space, correct?
> > > > 
> > > > Yes, exactly.
> > > 
> > > OK, I will add something to this effect on my rcutodo page.
> > > 
> > > > >  If I
> > > > > understand correctly, the user-level rcu-defer implementation relies on
> > > > > the following:
> > > > > 
> > > > > 1.	It is illegal to call _rcu_defer_queue() within an RCU read-side
> > > > > 	critical section (due to the call to rcu_defer_barrier_thread()
> > > > > 	which in turn calls synchronize_rcu().  This is necessary to
> > > > > 	handle queue overflow.  (Which appears to be why you introduce
> > > > > 	a new API, as it is legal to invoke call_rcu() from within an
> > > > > 	RCU read-side critical section.)
> > > > 
> > > > When dealing with queue overflow, I figured we have 4 alternatives.
> > > > Either:
> > > > 
> > > > 1, 2, 3) We proceed to execution of {the single, all, thread local}
> > > >          callback(s) on the spot after a synchronize_rcu().
> > > > 
> > > > 4) We expand the queue by allocating more memory.
> > > > 
> > > > The idea of pointer encoding to save space could be used with any of 1,
> > > > 2, 3, or 4. As you say, call_rcu() requires (4), because it tolerates
> > > > being called from an rcu read-side C.S.. 1, 2, 3 are incompatible with
> > > > read-side C.S. context because they require to use synchronize_rcu()
> > > > within the C.S., which would deadlock on its calling context.
> > > > 
> > > > Now, there is a rationale for the choice of (3) in my urcu-defer
> > > > implementation:
> > > > 
> > > > * It's how I can deal with memory full (-ENOMEM) without letting the
> > > >   system die with exit(). How does the kernel call_rcu() deal with this
> > > >   currently ? BUG_ON, WARN_ON ?
> > > 
> > > It doesn't have to do anything -- the caller of call_rcu() is responsible
> > > for allocating any required memory.  So call_rcu() never allocates memory
> > > and thus never needs to worry about a memory-allocation failure.
> > 
> > I see. So it puts the burden of memory allocation onto the creation of
> > the original object we are dealing with. I understand why it's done like
> > that now: it does not add another memory allocation failure point.
> > 
> > > > * It acts as a rate limiter for urcu_defer_queue(). Basically, if a
> > > >   thread starts enqueuing callbacks too fast, it will eventually fill its
> > > >   queue and have to empty it itself. AFAIK, It's not possible to do that
> > > >   if you allow call_rcu() to be called from read-side C.S..
> > > 
> > > Yep!  ;-)
> > 
> > From userland point of view, it would be good to provide an API that
> > allows rate-limiting, especially to deal with DoS types of attacks. I
> > I just renamed rcu_defer_queue() to defer_rcu() and removed the define
> > for call_rcu(), because this last mapping introduced an API with the
> > same name as the kernel API, but with different parameters.
> > 
> > So the idea is to have:
> > 
> > * defer_rcu(fct, p): fixed-sized queue rcu work deferral. Cannot be
> >   called from rcu read-side C.S..
> > * defer_rcu_ratelimit(fct, p, rl): same as above, but with added rate
> >   limiter callback.
> > 
> > and, eventually, to add back a standard call_rcu(), which can be called
> > from within RCU read-side C.S., but which provides no rate limiting
> > whatsoever.
> 
> These sound to me like interesting experiments, but I strongly
> suggest publishing them as "experimental, may change" or whatever.
> One thing to keep in mind is that the rate-limiting should not be used
> in real-time applications.  Alternatively, you could allow defer_rcu()
> and defer_rcu_ratelimit() to return failure if the callback could not
> be queued immediately.  Not that figuring out what to do in the failure
> case is particularly attractive...
> 
> Of course, the memory savings will depend on the workload.  The
> defer_rcu*() primitives eliminate the need for a struct rcu_head in
> each data structure, but requires that the per-CPU/task callback queues
> be present.

Yep. Will mark defer_rcu() as experimental.

> 
> > > > I could even extend rcu_defer_queue() to take a second rate-limiter
> > > > callback, which would check if the thread went over some threshold and
> > > > give a more precise limit (e.g. amount of memory to be freed) on the
> > > > rate than the "4096 callbacks in flight max", which have been chosen by
> > > > benchmarks, but is a bit arbitrary in terms of overall callback effect.
> > > > 
> > > > How important is it to permit enqueuing callbacks from within rcu
> > > > read-side C.S. in terms of real-life usage ? If it is really that
> > > > important to fill this use-case, then I could have a mode for call_rcu()
> > > > that expands the RCU callback queue upon overflow. But as I argue above,
> > > > I really prefer the control we have with a fixed-sized queue.
> > > 
> > > There are occurrences of this in the Linux kernel.  In theory, you could
> > > always hang onto the object until leaving the outermost RCU read-side
> > > critical section, but in practice this is not always consistent with
> > > good software-engineering practice.
> > > 
> > > One use case is when you have an RCU-protected list, each element of
> > > which has an RCU-protected list hanging off it.  In this case, you might
> > > scan the upper-level list under RCU protection, but during the scan you
> > > might need to remove elements from the lower-level list and pass them
> > > to call_rcu().
> > > 
> > > So it really needs to be legal for call_rcu() to be invoked from within
> > > an RCU read-side critical section.
> > 
> > I agree that not being able to use call_rcu() from a read-side C.S.
> > could really be a problem here.
> > 
> > This is why I think aving two interfaces, one permitting calling
> > call_rcu() from within C.S., but requiring the added struct rcu_head,
> > and the other using per-thread queues with maximum size, rate limiting,
> > but which cannot be used in read-side C.S. seems like a good tradeoff.
> 
> The defer_rcu() interfaces might work out quite well, but it would be
> good to see how they do in real life.  ;-)

Indeed, usage will tell.

Thanks,

Mathieu

> 
> > > > > 2.	It is OK to wait for a grace period when a thread calls
> > > > > 	rcu_defer_unregister_thread() while exiting.  In the kernel,
> > > > > 	this is roughly equivalent to the CPU_DYING notifier, which
> > > > > 	cannot block, thus cannot wait for a grace period.
> > > > > 
> > > > > 	I could imagine copying the per-CPU buffer somewhere, though
> > > > > 	my experience with the RCU/CPU-hotplug interface does not
> > > > > 	encourage me in this direction.  ;-)
> > > > 
> > > > As you say, we don't _have_ to empty the queue before putting a
> > > > thread/cpu offline. We could simply copy the unplugged cpu queue to an
> > > > orphan queue, as you currently do in your implementation. I agree that
> > > > it would be more suitable to the cpu hotplug CPU_DYING execution
> > > > context, due to its inherent trickiness.
> > > 
> > > Especially if you want something like rcu_barrier() to continue working.
> > > 
> > > Hmmm...  Can user applications unload dynamically linked libraries?  ;-)
> > 
> > Yes. With dlclose(). But I expect library developers to call
> > rcu_defer_barrier() in their destructor if they have queued any work.
> > (/me sneaks a README update in the git tree to that effect) ;)
> 
> ;-)
> 
> 							Thanx, Paul
> 
> > Thanks,
> > 
> > Mathieu
> > 
> > > 
> > > 							Thanx, Paul
> > > 
> > > > Thanks,
> > > > 
> > > > Mathieu
> > > > 
> > > > > 
> > > > > 							Thanx, Paul
> > > > 
> > > > -- 
> > > > Mathieu Desnoyers
> > > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
> > 
> > -- 
> > Mathieu Desnoyers
> > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68

-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
