rcu.vger.kernel.org archive mirror
* RCU ideas discussed at LPC
@ 2019-12-25 22:41 Joel Fernandes
  2019-12-26  1:05 ` Paul E. McKenney
  0 siblings, 1 reply; 6+ messages in thread
From: Joel Fernandes @ 2019-12-25 22:41 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Daniel Bristot de Oliveira, Peter Zijlstra, Steven Rostedt, rcu,
	Madhuparna Bhowmik, Amol Grover

Hi Paul,
We were discussing some ideas on Facebook, so I wanted to post them
here as well. This is in the context of the RCU section of the RT MC:
https://www.youtube.com/watch?v=bpyFQJV5gCI

Detecting high kfree_rcu() load
----------
You mentioned this. As I understand it, we did the kfree_rcu()
batching so that the system does not do any RCU-related work until a
batch has filled up enough or a timeout has occurred. This makes the
GP thread and the rest of the system do less work.
The problem you are raising in our Facebook thread is that during
heavy load the "batch" can be large and eventually be dumped into
call_rcu(). Wouldn't this be better handled generically within
call_rcu() itself, for the benefit of other non-kfree_rcu() workloads
as well? That is, if a large number of callbacks is dumped, then try
to end the GP more quickly. This likely doesn't need a signal from
kfree_rcu(), since call_rcu() knows that it is being hammered.
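To make this concrete, here is roughly the kind of check I am imagining
inside call_rcu(). This is only a sketch: the threshold is arbitrary and
rcu_request_urgent_gp() is an invented name for whatever mechanism would
shorten the current grace period.

#include <linux/percpu.h>

/*
 * Sketch only: count callbacks queued on this CPU and, past some
 * threshold, ask the grace-period machinery to hurry.  The counter
 * would be reset when a grace period completes (not shown), and
 * rcu_request_urgent_gp() is a hypothetical hook, not an existing
 * kernel function.
 */
#define CB_OVERLOAD_THRESHOLD	10000

static DEFINE_PER_CPU(unsigned long, cbs_since_last_gp);

static void note_call_rcu_hammered(void)
{
	if (this_cpu_inc_return(cbs_since_last_gp) > CB_OVERLOAD_THRESHOLD)
		rcu_request_urgent_gp();	/* hypothetical */
}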

Detecting recursive call_rcu() within call_rcu()
---------
We could use a per-cpu variable to detect a scenario like this, though
I am not sure if preemption during call_rcu() itself would cause false
positives.
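Roughly what I mean, as a sketch only: call_rcu_checked() is an invented
wrapper name, and the window between the increment and the decrement is
exactly where I worry that preemption or interrupts could give false
positives.

#include <linux/percpu.h>
#include <linux/rcupdate.h>

/* Sketch: per-CPU nesting count to flag call_rcu() entered again on
 * the same CPU before the outer invocation has returned.
 */
static DEFINE_PER_CPU(int, call_rcu_nesting);

static void call_rcu_checked(struct rcu_head *head, rcu_callback_t func)
{
	WARN_ON_ONCE(this_cpu_inc_return(call_rcu_nesting) > 1);
	call_rcu(head, func);
	this_cpu_dec(call_rcu_nesting);
}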

All rcuogp and rcuop threads tied to a housekeeping CPU
---
At LPC you mentioned the problem of OOM when all the rcuo* threads,
including the GP one, are not able to keep up with a heavy load. On
Facebook I had proposed something like this: What about making the
affinity setting a "soft affinity", that is, respect it always except
in the uncommon case? In the uncommon case of heavy load, let the
threads run wherever they need to in order to prevent OOM. Sure, that
might make the system a little more disruptive, but if we are
approaching OOM we have bigger problems, right?

Peter mentioned that rcuogp0 should have a slightly higher priority than rcuop0
---------
You mentioned this is something to look into, but I am not sure
whether we have looked into it yet.

A "heavy" call_rcu() caller using synchronize_rcu() if too many
callbacks are dumped
---------
How about doing this kind of call_rcu() to synchronize_rcu()
transition automatically when the context allows it? That is, detect
the context and, if sleeping is allowed, wait for the grace period
synchronously in call_rcu(). I am not sure about deadlocks and the
like from this kind of waiting and have to think about it more.
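Something like the following is what I have in mind, sketch only:
rcu_cbs_overloaded() is an invented predicate, and the "is it safe to
sleep here" test is the part I am least sure about.

#include <linux/preempt.h>
#include <linux/rcupdate.h>

/*
 * Sketch: fall back to a synchronous grace period when the caller can
 * sleep and callbacks are piling up.  rcu_cbs_overloaded() is a
 * hypothetical predicate; the context check below is simplistic and
 * is where the deadlock worries come in.
 */
static void call_rcu_auto(struct rcu_head *head, rcu_callback_t func)
{
	if (preemptible() && !in_interrupt() && rcu_cbs_overloaded()) {
		synchronize_rcu();	/* wait for the grace period here */
		func(head);		/* then invoke the callback directly */
	} else {
		call_rcu(head, func);
	}
}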

Is square root of N rcuogp0 threads the right optimization?
---------
The question raised was: can we do with fewer threads, or even just
one? You mentioned the square root might not be the right choice. How
do we test how well the system is doing? Are you running rcutorture
with a certain tree configuration and monitoring memory footprint /
performance?

BTW, I have two interns working on RCU (Amol and Madhuparna, also on CC).
They were selected from among several others as part of the
Linux Foundation mentorship program. They are familiar with RCU. I have
asked them to look at some RCU-list work and RCU sparse work. However,
I can also have them look into a few other things as time permits and
depending on what interests them.

Thanks, Merry Christmas!

 - Joel


* Re: RCU ideas discussed at LPC
  2019-12-25 22:41 RCU ideas discussed at LPC Joel Fernandes
@ 2019-12-26  1:05 ` Paul E. McKenney
  2020-01-04  1:56   ` Joel Fernandes
  0 siblings, 1 reply; 6+ messages in thread
From: Paul E. McKenney @ 2019-12-26  1:05 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Daniel Bristot de Oliveira, Peter Zijlstra, Steven Rostedt, rcu,
	Madhuparna Bhowmik, Amol Grover

On Wed, Dec 25, 2019 at 05:41:04PM -0500, Joel Fernandes wrote:
> Hi Paul,
> We were discussing some ideas on facebook so I wanted to just post
> them here as well. This is in the context of the RCU section of RT MC
> https://www.youtube.com/watch?v=bpyFQJV5gCI
> 
> Detecting high kfree_rcu() load
> ----------
> You mentioned about this. As I understand it, we did the kfree_rcu()
> batching to let the system not do anything RCU related until a batch
> has filled up enough or a timeout has occurred. This makes the GP
> thread and the system do less work.
> The problem you are raising in our facebook thread is, that during
> heavy load the "batch" can be large and be dumped into call_rcu()
> eventually. Wouldn't this be better handled generically within
> call_rcu() itself, for the benefit of other non-kfree_rcu workloads?
> That is if a large number of callbacks is dumped, then try to end the
> GP more quickly. This likely doesn't need a signal from kfree_rcu()
> since call_rcu() knows that it is being hammered.

Except that call_rcu() currently has no idea how many parcels of memory
a given request from kfree_rcu() represents.

> Detecting recursive call_rcu() within call_rcu()
> ---------
> We could use a per-cpu variable to detect a scenario like this, though
> I am not sure if preemption during call_rcu() itself would cause false
> positives.

A call_rcu() from within an RCU callback function is legal and is
sometimes done.  Or are you thinking of a call_rcu() from an interrupt
handler interrupting another call_rcu()?

> All rcuogp and rcuop threads tied to a house keeping CPU
> ---
> In LPC you mentioned about the problem of OOM if all rcuo* threads
> including the GP one are not able to keep up with heavy load. On
> Facebook I had proposed something like this: What about making the
> affinity setting to be a "soft affinity", that is respect it always
> expect in the uncommon case. In the uncommon case of heavy load, let
> the threads run wherever to prevent OOM. Sure that might make the
> system a little more disruptive, but if we are approaching OOM we have
> bigger problems right?

The problem is that there are a rather large number of ways to force
a given kthread to execute only on a given CPU, and reverse-engineering
all that within call_rcu() isn't reasonable.  An alternative is to
disable offloading, wait for the offloaded callbacks to drain, then
start up the usual softirq approach (or per-CPU kthread, as the case
may be).  This self-throttles because whatever is generating callbacks
gets preempted by softirq invocation.

Give or take real-time priority settings, but beyond a certain point
I start quoting Peter Parker's uncle.

> Peter mentioned about rcuogp0 should have slightly higher prio than rcuop0

Assuming no strange cases with extremely short grace periods, agreed.

> ---------
> You mentioned this is something to look into but not sure if we looked
> into it yet.
> 
> A "heavy" call_rcu() caller using synchronize_rcu() if too many
> callbacks are dumped

This is actually done in some parts of the kernel, though I would
be happier with rcu_barrier() at least some of the time, either
in addition to or in place of synchronize_rcu().  (In fairness,
some of the use cases pre-date rcu_barrier().)
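For example, in a typical teardown path the distinction matters along
these lines (a sketch only; example_cache is a hypothetical cache, not
taken from any particular caller):

#include <linux/rcupdate.h>
#include <linux/slab.h>

static struct kmem_cache *example_cache;	/* hypothetical */

/* Sketch: a grace period alone is not enough before freeing the cache;
 * the already-queued callbacks must also have been invoked, which is
 * what rcu_barrier() waits for.
 */
static void example_teardown(void)
{
	/* First stop queueing new callbacks (not shown). */
	synchronize_rcu();	/* wait for pre-existing readers */
	rcu_barrier();		/* wait for queued callbacks to be invoked */
	kmem_cache_destroy(example_cache);
}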

> ---------
> How about doing this kind of call_rcu() to synchronize_rcu()
> transition automatically if the context allows it? I.e. Detect the
> context and if sleeping is allowed, then wait for the grace period
> synchronously in call_rcu(). Not sure about deadlocks and the like
> from this kind of waiting and have to think more.

This gets rather strange in a production PREEMPT=n build, so not a
fan, actually.  And in real-time systems, I pretty much have to splat
anyway if I slow down call_rcu() by that much.

So the preference is instead detecting such misconfiguration and issuing
appropriate diagnostics.  And making RCU more able to keep up when not
grossly misconfigured, hence the kfree_rcu() memory footprint being
fed into core RCU.

> is square root of N number of rcuogp0 threads - the right optimization?

If there were enough CPUs, it would be necessary to have three levels
of hierarchy and to go to the cube root, but that would be more CPUs
than I have seen used.

> ---------
> The question raised was can we do with fewer threads, or even just
> one? You mentioned the square root might not be the right choice. How
> do we test how well the system is doing. Are you running rcutorture
> with a certain tree configuration and monitor memory footprint /
> performance?

The issue prompting the hierarchy was wakeup overhead on the grace-period
kthread.  Going to a hierarchy reduced the load on that single thread
(which could otherwise become a bottleneck on large systems), and also
reduced the absolute number of wakeups by up to almost a factor of two.
Deepening the hierarchy would further reduce the wakeup load on the
grace-period kthread, but would increase the total number of wakeups.

So this is not a matter of tweaks and optimizations.  I would need to
see some horrible problem with the current setup to even consider
making a change.

> BTW, I have 2 interns working on RCU (Amol and Madupharna also on CC).
> They were selected among several others as a part of the
> LinuxFoundation mentorship program. They are familiar with RCU. I have
> asked them to look at some RCU-list work and RCU sparse work. However,
> I can also have them look into a few other things as time permits and
> depending on what interests them.

Dog paddling before cliff diving, please!  ;-)

> Thanks, Merry Christmas!

And to you and yours as well!

							Thanx, Paul


* Re: RCU ideas discussed at LPC
  2019-12-26  1:05 ` Paul E. McKenney
@ 2020-01-04  1:56   ` Joel Fernandes
  2020-01-04  2:31     ` Paul E. McKenney
  0 siblings, 1 reply; 6+ messages in thread
From: Joel Fernandes @ 2020-01-04  1:56 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Daniel Bristot de Oliveira, Peter Zijlstra, Steven Rostedt, rcu,
	Madhuparna Bhowmik, Amol Grover

On Wed, Dec 25, 2019 at 05:05:32PM -0800, Paul E. McKenney wrote:
> On Wed, Dec 25, 2019 at 05:41:04PM -0500, Joel Fernandes wrote:
> > Hi Paul,
> > We were discussing some ideas on facebook so I wanted to just post
> > them here as well. This is in the context of the RCU section of RT MC
> > https://www.youtube.com/watch?v=bpyFQJV5gCI
> > 
> > Detecting high kfree_rcu() load
> > ----------
> > You mentioned about this. As I understand it, we did the kfree_rcu()
> > batching to let the system not do anything RCU related until a batch
> > has filled up enough or a timeout has occurred. This makes the GP
> > thread and the system do less work.
> > The problem you are raising in our facebook thread is, that during
> > heavy load the "batch" can be large and be dumped into call_rcu()
> > eventually. Wouldn't this be better handled generically within
> > call_rcu() itself, for the benefit of other non-kfree_rcu workloads?
> > That is if a large number of callbacks is dumped, then try to end the
> > GP more quickly. This likely doesn't need a signal from kfree_rcu()
> > since call_rcu() knows that it is being hammered.
> 
> Except that call_rcu() currently has no idea how many parcels of memory
> a given request from kfree_rcu() represents.

True. At the moment, neither does kfree_rcu(), since we store only the
pointer. We could consult the low-level allocator if it has this
information. If you could let me know how to make RCU more aggressive in this
case (once we know there's a problem), I could work on something like this. I
did have OOM issues in earlier versions of the kfree_rcu() patch, and I can
still OOM a system booted with less memory using the tests even now.

> > Detecting recursive call_rcu() within call_rcu()
> > ---------
> > We could use a per-cpu variable to detect a scenario like this, though
> > I am not sure if preemption during call_rcu() itself would cause false
> > positives.
> 
> A call_rcu() from within an RCU callback function is legal and is
> sometimes done.  Or are you thinking of a call_rcu() from an interrupt
> handler interrupting another call_rcu()?

Oh, I did not know that. I thought this was the point heavily discussed in the
LPC talk, but I must have misunderstood when you said you hoped no one was
doing precisely this...

> > All rcuogp and rcuop threads tied to a house keeping CPU
> > ---
> > In LPC you mentioned about the problem of OOM if all rcuo* threads
> > including the GP one are not able to keep up with heavy load. On
> > Facebook I had proposed something like this: What about making the
> > affinity setting to be a "soft affinity", that is respect it always
> > expect in the uncommon case. In the uncommon case of heavy load, let
> > the threads run wherever to prevent OOM. Sure that might make the
> > system a little more disruptive, but if we are approaching OOM we have
> > bigger problems right?
> 
> The problem is that there are a rather large number of ways to force
> a given kthread to execute only on a given CPU, and reverse-engineering
> all that within call_rcu() isn't reasonable.  An alternative is to
> disable offloading, wait for the offloaded callbacks to drain, then
> start up the usual softirq approach (or per-CPU kthread, as the case
> may be).  This self-throttles because whatever is generating callbacks
> gets preempted by softirq invocation.

Ok, agreed. Have you already implemented the "disable offloading" code?

> > ---------
> > How about doing this kind of call_rcu() to synchronize_rcu()
> > transition automatically if the context allows it? I.e. Detect the
> > context and if sleeping is allowed, then wait for the grace period
> > synchronously in call_rcu(). Not sure about deadlocks and the like
> > from this kind of waiting and have to think more.
> 
> This gets rather strange in a production PREEMPT=n build, so not a
> fan, actually.  And in real-time systems, I pretty much have to splat
> anyway if I slow down call_rcu() by that much.
> 
> So the preference is instead detecting such misconfiguration and issuing
> appropriate diagnostics.  And making RCU more able to keep up when not
> grossly misconfigured, hence the kfree_rcu() memory footprint being
> fed into core RCU.

Ok. Is it not OK to simply assume that a large number of queued callbacks,
combined with high memory pressure, means RCU should be more aggressive
anyway, since whatever memory can be freed by invoking callbacks should be
helpful? Or were you thinking that making RCU more aggressive when there is a
lot of memory pressure is not worth it without knowing that RCU is the cause
of that pressure?

> > is square root of N number of rcuogp0 threads - the right optimization?
> 
> If there were enough CPUs, it would be necessary to have three levels
> of hierarchy and to go to the cube root, but that would be more CPUs
> than I have seen used.
> > ---------
> > The question raised was can we do with fewer threads, or even just
> > one? You mentioned the square root might not be the right choice. How
> > do we test how well the system is doing. Are you running rcutorture
> > with a certain tree configuration and monitor memory footprint /
> > performance?
> 
> The issue prompting the hierarcy was wakeup overhead on the grace-period
> kthread.  Going to a hierarchy reduced the load on that single thread
> (which could otherwise become a bottleneck on large systems, and also
> reduced the absolute number of wakeups by up to almost a factor of two.
> Deepening the hierarchy would further reduce the wakeup load on the
> grace-period kthread, but would increase the total number of wakeups.
> 
> So this is not a matter of tweaks and optimizations.  I would need to
> see some horrible problem with the current setup to even consider
> making a change.

Ok, I only raised this because in the LPC talk you mentioned that you were not
sure whether this is the right optimization. But I understand the rationale for
choosing some hierarchy in light of the wakeup performance improvements (I
already knew that this is why you had a hierarchy).

> > BTW, I have 2 interns working on RCU (Amol and Madupharna also on CC).
> > They were selected among several others as a part of the
> > LinuxFoundation mentorship program. They are familiar with RCU. I have
> > asked them to look at some RCU-list work and RCU sparse work. However,
> > I can also have them look into a few other things as time permits and
> > depending on what interests them.
> 
> Dog paddling before cliff diving, please!  ;-)

Sure. They are working on relatively simple things for their internship, but
I just put these ideas out there with them on CC so they can pick something
else as well if they have time and interest ;-)

> > Thanks, Merry Christmas!
> 
> And to you and yours as well!

Hope you had a good holiday season!

thanks,

 - Joel



* Re: RCU ideas discussed at LPC
  2020-01-04  1:56   ` Joel Fernandes
@ 2020-01-04  2:31     ` Paul E. McKenney
  2020-01-04 21:21       ` Joel Fernandes
  0 siblings, 1 reply; 6+ messages in thread
From: Paul E. McKenney @ 2020-01-04  2:31 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Daniel Bristot de Oliveira, Peter Zijlstra, Steven Rostedt, rcu,
	Madhuparna Bhowmik, Amol Grover

On Fri, Jan 03, 2020 at 08:56:17PM -0500, Joel Fernandes wrote:
> On Wed, Dec 25, 2019 at 05:05:32PM -0800, Paul E. McKenney wrote:
> > On Wed, Dec 25, 2019 at 05:41:04PM -0500, Joel Fernandes wrote:
> > > Hi Paul,
> > > We were discussing some ideas on facebook so I wanted to just post
> > > them here as well. This is in the context of the RCU section of RT MC
> > > https://www.youtube.com/watch?v=bpyFQJV5gCI
> > > 
> > > Detecting high kfree_rcu() load
> > > ----------
> > > You mentioned about this. As I understand it, we did the kfree_rcu()
> > > batching to let the system not do anything RCU related until a batch
> > > has filled up enough or a timeout has occurred. This makes the GP
> > > thread and the system do less work.
> > > The problem you are raising in our facebook thread is, that during
> > > heavy load the "batch" can be large and be dumped into call_rcu()
> > > eventually. Wouldn't this be better handled generically within
> > > call_rcu() itself, for the benefit of other non-kfree_rcu workloads?
> > > That is if a large number of callbacks is dumped, then try to end the
> > > GP more quickly. This likely doesn't need a signal from kfree_rcu()
> > > since call_rcu() knows that it is being hammered.
> > 
> > Except that call_rcu() currently has no idea how many parcels of memory
> > a given request from kfree_rcu() represents.
> 
> True. At the moment, neither does kfree_rcu() since we store only the
> pointer. We could consult the low level allocator if they have this
> information. If you could let me know how to make RCU more aggressive in this
> case (once we know there's a problem), I could work on something like this. I
> did have OOM issues in earlier versions of the kfree_rcu() patch. I could
> boot a system with less memory and OOM it too with the tests even now.

Let's keep things simple, at first at least!  ;-)

Currently, call_rcu() has no idea how much memory is tied up by a normal
callback, either.  But just counting the callbacks (or, in the case of
kfree_rcu(), counting the blocks of memory, independent of size) is at
least correlated with the memory footprint.  Plus that is what has been
used in the past, so it should be a good place to start.

Besides, how many call_rcu() invocations is a 1K kfree_rcu() invocation
worth?  An 8K kfree_rcu() invocation?  A 64-byte kfree_rcu() invocation?

We might need to answer those questions over time, but again, let's start
simple.

> > > Detecting recursive call_rcu() within call_rcu()
> > > ---------
> > > We could use a per-cpu variable to detect a scenario like this, though
> > > I am not sure if preemption during call_rcu() itself would cause false
> > > positives.
> > 
> > A call_rcu() from within an RCU callback function is legal and is
> > sometimes done.  Or are you thinking of a call_rcu() from an interrupt
> > handler interrupting another call_rcu()?
> 
> Oh, did not know this. I thought this was the point heavily discussed in the
> LPC talk but must have misunderstood when you said you hoped no one was
> precisely doing this..

What I hoped they would avoid is a call_rcu() bomb, where each callback does
several call_rcu() invocations.  Just as with child processes invoking
fork(), within broad limits it is OK for callback functions to invoke
call_rcu().  There is at least one in rcutorture, for example, but it
does just one call_rcu() and also checks a time-to-stop flag.
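In other words, something of the following general shape is fine (a
sketch, not the actual rcutorture code):

#include <linux/rcupdate.h>

/* Sketch: each invocation queues at most one follow-on callback and
 * honors a stop flag, so the number of outstanding callbacks stays
 * bounded -- unlike a call_rcu() bomb.
 */
static bool stop_requeueing;

static void requeue_cb(struct rcu_head *rhp)
{
	if (READ_ONCE(stop_requeueing))
		return;
	call_rcu(rhp, requeue_cb);	/* exactly one more, not several */
}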

> > > All rcuogp and rcuop threads tied to a house keeping CPU
> > > ---
> > > In LPC you mentioned about the problem of OOM if all rcuo* threads
> > > including the GP one are not able to keep up with heavy load. On
> > > Facebook I had proposed something like this: What about making the
> > > affinity setting to be a "soft affinity", that is respect it always
> > > expect in the uncommon case. In the uncommon case of heavy load, let
> > > the threads run wherever to prevent OOM. Sure that might make the
> > > system a little more disruptive, but if we are approaching OOM we have
> > > bigger problems right?
> > 
> > The problem is that there are a rather large number of ways to force
> > a given kthread to execute only on a given CPU, and reverse-engineering
> > all that within call_rcu() isn't reasonable.  An alternative is to
> > disable offloading, wait for the offloaded callbacks to drain, then
> > start up the usual softirq approach (or per-CPU kthread, as the case
> > may be).  This self-throttles because whatever is generating callbacks
> > gets preempted by softirq invocation.
> 
> Ok, agreed. Did you already implement the "disable offloading" code?

Not yet, and I do agree with the results of the LPC vote, which is to
do the diagnostic first.  Perhaps given a suitable diagnostic strategy,
"disable offloading" never will be needed.

That said, the changes I have made to RCU over the past several years
are within striking distance of "disable offloading" being possible.
There are fewer race conditions than there used to be, but there is
still no shortage.

> > > ---------
> > > How about doing this kind of call_rcu() to synchronize_rcu()
> > > transition automatically if the context allows it? I.e. Detect the
> > > context and if sleeping is allowed, then wait for the grace period
> > > synchronously in call_rcu(). Not sure about deadlocks and the like
> > > from this kind of waiting and have to think more.
> > 
> > This gets rather strange in a production PREEMPT=n build, so not a
> > fan, actually.  And in real-time systems, I pretty much have to splat
> > anyway if I slow down call_rcu() by that much.
> > 
> > So the preference is instead detecting such misconfiguration and issuing
> > appropriate diagnostics.  And making RCU more able to keep up when not
> > grossly misconfigured, hence the kfree_rcu() memory footprint being
> > fed into core RCU.
> 
> Ok. Is it not Ok to simply assume that a large number of callbacks queued
> along with observing high memory pressure, means RCU should be more
> aggressive anyway since whatever memory can be freed by invoking callbacks
> should be helpful anyway? Or were you thinking making RCU aggressive when
> there's a lot of memory pressure is not worth it, without knowing that RCU is
> the cause for it?

I used to have a memory-pressure switch for RCU, but the OOM guys hated
it.  But given a reliable "running short of memory" indicator, I would
be quite happy to use it.  After all, even if RCU is not at fault, it
might still be helpful for it to pull its memory-footprint horns in a bit.
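For example, one form such an indicator could take is a shrinker
registered by RCU, roughly as below.  This is a sketch only:
rcu_footprint_count() and rcu_force_reclaim() are invented placeholders
for "how much is RCU currently holding" and "expedite and flush
callbacks", respectively.

#include <linux/shrinker.h>

static unsigned long rcu_shrink_count(struct shrinker *shrink,
				      struct shrink_control *sc)
{
	return rcu_footprint_count();			/* hypothetical */
}

static unsigned long rcu_shrink_scan(struct shrinker *shrink,
				     struct shrink_control *sc)
{
	return rcu_force_reclaim(sc->nr_to_scan);	/* hypothetical */
}

static struct shrinker rcu_shrinker = {
	.count_objects	= rcu_shrink_count,
	.scan_objects	= rcu_shrink_scan,
	.seeks		= DEFAULT_SEEKS,
};

/* register_shrinker(&rcu_shrinker) would be called during RCU init. */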

> > > is square root of N number of rcuogp0 threads - the right optimization?
> > 
> > If there were enough CPUs, it would be necessary to have three levels
> > of hierarchy and to go to the cube root, but that would be more CPUs
> > than I have seen used.
> > > ---------
> > > The question raised was can we do with fewer threads, or even just
> > > one? You mentioned the square root might not be the right choice. How
> > > do we test how well the system is doing. Are you running rcutorture
> > > with a certain tree configuration and monitor memory footprint /
> > > performance?
> > 
> > The issue prompting the hierarcy was wakeup overhead on the grace-period
> > kthread.  Going to a hierarchy reduced the load on that single thread
> > (which could otherwise become a bottleneck on large systems, and also
> > reduced the absolute number of wakeups by up to almost a factor of two.
> > Deepening the hierarchy would further reduce the wakeup load on the
> > grace-period kthread, but would increase the total number of wakeups.
> > 
> > So this is not a matter of tweaks and optimizations.  I would need to
> > see some horrible problem with the current setup to even consider
> > making a change.
> 
> Ok, I only raised this because in the LPC talk you mentioned that you are not
> sure if this is the right optimization. But I understand the rationale for
> choosing some hierarchy in light of the wakeup performance improvements (I
> already knew that this is why you had a hierarchy).

Very good!  ;-)

> > > BTW, I have 2 interns working on RCU (Amol and Madupharna also on CC).
> > > They were selected among several others as a part of the
> > > LinuxFoundation mentorship program. They are familiar with RCU. I have
> > > asked them to look at some RCU-list work and RCU sparse work. However,
> > > I can also have them look into a few other things as time permits and
> > > depending on what interests them.
> > 
> > Dog paddling before cliff diving, please!  ;-)
> 
> Sure. They are working on relatively simpler things for their internship but
> I just put these ideas out there with them on CC so they can pick something
> else as well if they have time and interest ;-)

I considered pointing them at KCSAN reports, but about 5% of them require
global knowledge.  And it is never clear up front which are the 5%.  And
that 5% of "real bugs" is most of the motivation for things like KCSAN.

> > > Thanks, Merry Christmas!
> > 
> > And to you and yours as well!
> 
> Hope you had a good holiday season!

It did!  First holiday season in quite a few years featuring all
three kids, though not all at once.  Might be awhile until the next
time that happens.  Something about them being about 30 years old and
widely dispersed.  ;-)

As the little one becomes more aware, your holiday seasons should become
quite fun.  Don't miss out!  ;-)

							Thanx, Paul


* Re: RCU ideas discussed at LPC
  2020-01-04  2:31     ` Paul E. McKenney
@ 2020-01-04 21:21       ` Joel Fernandes
  2020-01-06 18:03         ` Paul E. McKenney
  0 siblings, 1 reply; 6+ messages in thread
From: Joel Fernandes @ 2020-01-04 21:21 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Daniel Bristot de Oliveira, Peter Zijlstra, Steven Rostedt, rcu,
	Madhuparna Bhowmik, Amol Grover

On Fri, Jan 03, 2020 at 06:31:33PM -0800, Paul E. McKenney wrote:
> On Fri, Jan 03, 2020 at 08:56:17PM -0500, Joel Fernandes wrote:
> > On Wed, Dec 25, 2019 at 05:05:32PM -0800, Paul E. McKenney wrote:
> > > On Wed, Dec 25, 2019 at 05:41:04PM -0500, Joel Fernandes wrote:
> > > > Hi Paul,
> > > > We were discussing some ideas on facebook so I wanted to just post
> > > > them here as well. This is in the context of the RCU section of RT MC
> > > > https://www.youtube.com/watch?v=bpyFQJV5gCI
> > > > 
> > > > Detecting high kfree_rcu() load
> > > > ----------
> > > > You mentioned about this. As I understand it, we did the kfree_rcu()
> > > > batching to let the system not do anything RCU related until a batch
> > > > has filled up enough or a timeout has occurred. This makes the GP
> > > > thread and the system do less work.
> > > > The problem you are raising in our facebook thread is, that during
> > > > heavy load the "batch" can be large and be dumped into call_rcu()
> > > > eventually. Wouldn't this be better handled generically within
> > > > call_rcu() itself, for the benefit of other non-kfree_rcu workloads?
> > > > That is if a large number of callbacks is dumped, then try to end the
> > > > GP more quickly. This likely doesn't need a signal from kfree_rcu()
> > > > since call_rcu() knows that it is being hammered.
> > > 
> > > Except that call_rcu() currently has no idea how many parcels of memory
> > > a given request from kfree_rcu() represents.
> > 
> > True. At the moment, neither does kfree_rcu() since we store only the
> > pointer. We could consult the low level allocator if they have this
> > information. If you could let me know how to make RCU more aggressive in this
> > case (once we know there's a problem), I could work on something like this. I
> > did have OOM issues in earlier versions of the kfree_rcu() patch. I could
> > boot a system with less memory and OOM it too with the tests even now.
> 
> Let's keep things simple, at first at least!  ;-)
> 
> Currently, call_rcu() has no idea how much memory is tied up by a normal
> callback, either.  But just counting the callbacks (or, in the case of
> kfree_rcu(), counting the block of memory, independent of size) is at
> least correlated with the memory footprint.  Plus that is what has been
> used in the past, so it should be a good place to start.
> 
> Besides, how many call_rcu() invocations is a 1K kfree_rcu() invocation
> worth?  A 8K kfree_rcu() invocation?  A 64-byte kfree_rcu() invocation?
> 
> We might need to answer those questions over time, but again, let's start
> simple.

Sounds great.

> > > > Detecting recursive call_rcu() within call_rcu()
> > > > ---------
> > > > We could use a per-cpu variable to detect a scenario like this, though
> > > > I am not sure if preemption during call_rcu() itself would cause false
> > > > positives.
> > > 
> > > A call_rcu() from within an RCU callback function is legal and is
> > > sometimes done.  Or are you thinking of a call_rcu() from an interrupt
> > > handler interrupting another call_rcu()?
> > 
> > Oh, did not know this. I thought this was the point heavily discussed in the
> > LPC talk but must have misunderstood when you said you hoped no one was
> > precisely doing this..
> 
> What I hoped they avoid is a call_rcu() bomb, where each callback does
> several call_rcu() invocations.  Just as with child processes invoking
> fork(), within broad limits it is OK for callback functions to invoke
> call_rcu().  There is at least one in rcutorture, for example, but it
> does just one call_rcu() and also checks a time-to-stop flag.

Ok, got it now.

> > > > ---------
> > > > How about doing this kind of call_rcu() to synchronize_rcu()
> > > > transition automatically if the context allows it? I.e. Detect the
> > > > context and if sleeping is allowed, then wait for the grace period
> > > > synchronously in call_rcu(). Not sure about deadlocks and the like
> > > > from this kind of waiting and have to think more.
> > > 
> > > This gets rather strange in a production PREEMPT=n build, so not a
> > > fan, actually.  And in real-time systems, I pretty much have to splat
> > > anyway if I slow down call_rcu() by that much.
> > > 
> > > So the preference is instead detecting such misconfiguration and issuing
> > > appropriate diagnostics.  And making RCU more able to keep up when not
> > > grossly misconfigured, hence the kfree_rcu() memory footprint being
> > > fed into core RCU.
> > 
> > Ok. Is it not Ok to simply assume that a large number of callbacks queued
> > along with observing high memory pressure, means RCU should be more
> > aggressive anyway since whatever memory can be freed by invoking callbacks
> > should be helpful anyway? Or were you thinking making RCU aggressive when
> > there's a lot of memory pressure is not worth it, without knowing that RCU is
> > the cause for it?
> 
> I used to have a memory-pressure switch for RCU, but the OOM guys hated
> it.  But given a reliable "running short of memory" indicator, I would
> be quite happy to use it.  After all, even if RCU is not at fault, it
> might still be helpful for it to pull its memory-footprint horns in a bit.

With recent advances in PSI (pressure stall information), I am wondering if
those pressure signals (for memory) can be leveraged to pull in the
memory-footprint horns. I can look more into this; I am also looking into PSI
for other work items.

One thing I am wondering, though, is: say we get a reliable signal -- what
could RCU do? Were you thinking of having the FQS loop set the usual
emergency flags and hoping the "RCU-idle" CPUs enter quiescent states, along
with additional signalling for rcu_read_unlock_special()?  I will think more
about it...

As far as testing goes, I was thinking of initially running rcuperf on a
system with less memory, with never entering OOM being the "test has passed"
indication.

> > > > BTW, I have 2 interns working on RCU (Amol and Madupharna also on
> > > > CC).
> > > > They were selected among several others as a part of the
> > > > LinuxFoundation mentorship program. They are familiar with RCU. I have
> > > > asked them to look at some RCU-list work and RCU sparse work. However,
> > > > I can also have them look into a few other things as time permits and
> > > > depending on what interests them.
> > > 
> > > Dog paddling before cliff diving, please!  ;-)
> > 
> > Sure. They are working on relatively simpler things for their internship but
> > I just put these ideas out there with them on CC so they can pick something
> > else as well if they have time and interest ;-)
> 
> I considered pointing them at KCSAN reports, but about 5% of them require
> global knowledge.  And it is never clear up front which are the 5%.  And
> that 5% of "real bugs" is most of the motivation for things like KCSAN.

Interesting.

> > > > Thanks, Merry Christmas!
> > > 
> > > And to you and yours as well!
> > 
> > Hope you had a good holiday season!
> 
> It did!  First holiday season in quite a few years featuring all
> three kids, though not all at once.  Might be awhile until the next
> time that happens.  Something about them being about 30 years old and
> widely dispersed.  ;-)

Oh nice, happy to hear that, and I hope this year's end brings the same.

> As the little one becomes more aware, your holiday seasons should become
> quite fun.  Don't miss out!  ;-)

Looking forward to it and will do ;)

thanks,

 - Joel



* Re: RCU ideas discussed at LPC
  2020-01-04 21:21       ` Joel Fernandes
@ 2020-01-06 18:03         ` Paul E. McKenney
  0 siblings, 0 replies; 6+ messages in thread
From: Paul E. McKenney @ 2020-01-06 18:03 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Daniel Bristot de Oliveira, Peter Zijlstra, Steven Rostedt, rcu,
	Madhuparna Bhowmik, Amol Grover

On Sat, Jan 04, 2020 at 04:21:08PM -0500, Joel Fernandes wrote:
> On Fri, Jan 03, 2020 at 06:31:33PM -0800, Paul E. McKenney wrote:
> > On Fri, Jan 03, 2020 at 08:56:17PM -0500, Joel Fernandes wrote:
> > > On Wed, Dec 25, 2019 at 05:05:32PM -0800, Paul E. McKenney wrote:

[ . . . ]

> > > > > How about doing this kind of call_rcu() to synchronize_rcu()
> > > > > transition automatically if the context allows it? I.e. Detect the
> > > > > context and if sleeping is allowed, then wait for the grace period
> > > > > synchronously in call_rcu(). Not sure about deadlocks and the like
> > > > > from this kind of waiting and have to think more.
> > > > 
> > > > This gets rather strange in a production PREEMPT=n build, so not a
> > > > fan, actually.  And in real-time systems, I pretty much have to splat
> > > > anyway if I slow down call_rcu() by that much.
> > > > 
> > > > So the preference is instead detecting such misconfiguration and issuing
> > > > appropriate diagnostics.  And making RCU more able to keep up when not
> > > > grossly misconfigured, hence the kfree_rcu() memory footprint being
> > > > fed into core RCU.
> > > 
> > > Ok. Is it not Ok to simply assume that a large number of callbacks queued
> > > along with observing high memory pressure, means RCU should be more
> > > aggressive anyway since whatever memory can be freed by invoking callbacks
> > > should be helpful anyway? Or were you thinking making RCU aggressive when
> > > there's a lot of memory pressure is not worth it, without knowing that RCU is
> > > the cause for it?
> > 
> > I used to have a memory-pressure switch for RCU, but the OOM guys hated
> > it.  But given a reliable "running short of memory" indicator, I would
> > be quite happy to use it.  After all, even if RCU is not at fault, it
> > might still be helpful for it to pull its memory-footprint horns in a bit.
> 
> With recent advances in PSI, I am wondering if those pressure signals (for
> memory) can be leveraged to pull the memory-footprint horns. I can look more
> into this, I am also looking into PSI for other work things.
> 
> One thing I am wondering though is, say we get a reliable signal -- what
> could RCU do? Were you thinking of having the FQS loop set the usual
> emergency flags and hope the "RCU-idle" CPUs enter quiescent states, along
> with additional signalling for rcu_read_unlock_special()?  Will think more
> about it..

I am thinking in terms of it reacting to memory pressure in the same way
that it currently does when it finds a CPU with more RCU callbacks than
it likes.  ;-)

> As far as testing goes, I was thinking of initially running rcuperf on a
> system with less memory and never entering OOM as a "test has passed"
> indication.

Agreed.  I would look at memory-pressure actions as an additional level
of memory-footprint guardrail, and I strongly encourage you to set up
your testing and deployment so as to leave production use at least one
guardrail that normal testing does not slam into.

(Yes, there also needs to be focused testing of the last guardrail, but
that should be separate, probably an rcutorture option where a reader
keeps spinning until memory pressure kicks in.)

> > > > > BTW, I have 2 interns working on RCU (Amol and Madupharna also on
> > > > > CC).
> > > > > They were selected among several others as a part of the
> > > > > LinuxFoundation mentorship program. They are familiar with RCU. I have
> > > > > asked them to look at some RCU-list work and RCU sparse work. However,
> > > > > I can also have them look into a few other things as time permits and
> > > > > depending on what interests them.
> > > > 
> > > > Dog paddling before cliff diving, please!  ;-)
> > > 
> > > Sure. They are working on relatively simpler things for their internship but
> > > I just put these ideas out there with them on CC so they can pick something
> > > else as well if they have time and interest ;-)
> > 
> > I considered pointing them at KCSAN reports, but about 5% of them require
> > global knowledge.  And it is never clear up front which are the 5%.  And
> > that 5% of "real bugs" is most of the motivation for things like KCSAN.
> 
> Interesting.
> 
> > > > > Thanks, Merry Christmas!
> > > > 
> > > > And to you and yours as well!
> > > 
> > > Hope you had a good holiday season!
> > 
> > It did!  First holiday season in quite a few years featuring all
> > three kids, though not all at once.  Might be awhile until the next
> > time that happens.  Something about them being about 30 years old and
> > widely dispersed.  ;-)
> 
> Oh nice, happy to hear that and hope this year end brings the same.
> 
> > As the little one becomes more aware, your holiday seasons should become
> > quite fun.  Don't miss out!  ;-)
> 
> Looking forward to it and will do ;)

;-) ;-) ;-)

								Thanx, Paul



Thread overview: 6 messages
2019-12-25 22:41 RCU ideas discussed at LPC Joel Fernandes
2019-12-26  1:05 ` Paul E. McKenney
2020-01-04  1:56   ` Joel Fernandes
2020-01-04  2:31     ` Paul E. McKenney
2020-01-04 21:21       ` Joel Fernandes
2020-01-06 18:03         ` Paul E. McKenney
