linux-kernel.vger.kernel.org archive mirror
* dyntick-idle CPU and node's qsmask
@ 2018-11-10 21:46 Joel Fernandes
  2018-11-10 23:04 ` Paul E. McKenney
  0 siblings, 1 reply; 14+ messages in thread
From: Joel Fernandes @ 2018-11-10 21:46 UTC (permalink / raw)
  To: linux-kernel, Paul E. McKenney, josh, rostedt, mathieu.desnoyers,
	jiangshanlai

Hi Paul and everyone,

I was tracing/studying the RCU code today on the paul/dev branch and noticed
that for dyntick-idle CPUs, the RCU GP kthread clears the bit in the leaf
node's rnp->qsmask corresponding to the idle CPU and reports a QS on its
behalf.

rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 792 0 dti
rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 801 2 dti
rcu_sched-10    [003]    40.008041: rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0

That's all good, but I was wondering if we can do better for idle CPUs by
somehow not setting their bits in the node's qsmask in the first place. Then
no quiescent-state reporting would be needed for idle CPUs, right? And we
would also not need to acquire the rnp lock, I think.

At least for a single-node Tree RCU system, it seems that would avoid needing
to acquire the lock without complications. Anyway, let me know your thoughts;
I'm also happy to discuss this in the hallways of LPC for folks attending :)

thanks,

- Joel

* Re: dyntick-idle CPU and node's qsmask
  2018-11-10 21:46 dyntick-idle CPU and node's qsmask Joel Fernandes
@ 2018-11-10 23:04 ` Paul E. McKenney
  2018-11-11  3:09   ` Joel Fernandes
  0 siblings, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2018-11-10 23:04 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, josh, rostedt, mathieu.desnoyers, jiangshanlai

On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> Hi Paul and everyone,
> 
> I was tracing/studying the RCU code today in paul/dev branch and noticed that
> for dyntick-idle CPUs, the RCU GP thread is clearing the rnp->qsmask
> corresponding to the leaf node for the idle CPU, and reporting a QS on their
> behalf.
> 
> rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 792 0 dti
> rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 801 2 dti
> rcu_sched-10    [003]    40.008041: rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0
> 
> That's all good but I was wondering if we can do better for the idle CPUs if
> we can some how not set the qsmask of the node in the first place. Then no
> reporting would be needed of quiescent state is needed for idle CPUs right?
> And we would also not need to acquire the rnp lock I think.
> 
> At least for a single node tree RCU system, it seems that would avoid needing
> to acquire the lock without complications. Anyway let me know your thoughts
> and happy to discuss this at the hallways of the LPC as well for folks
> attending :)

We could, but that would require consulting the rcu_data structure for
each CPU while initializing the grace period, thus increasing the number
of cache misses during grace-period initialization and also shortly after
for any non-idle CPUs.  This seems backwards on busy systems where each
CPU will with high probability report its own quiescent state before three
jiffies pass, in which case the cache misses on the rcu_data structures
would be wasted motion.
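
To make that cost concrete, the kind of pre-scan being proposed would look
very roughly like the following.  This is hypothetical pseudocode of the
idea, not anything in tree.c, and the helper names and signatures are only
approximations:

/*
 * Hypothetical sketch of the proposal, not actual tree.c code.  The
 * per-CPU rcu_data/->dynticks read below is the extra communications
 * cache miss at grace-period initialization being discussed.
 */
struct rcu_node *rnp;
int cpu;

rcu_for_each_leaf_node(rnp) {
	unsigned long mask = 0;

	raw_spin_lock_irq_rcu_node(rnp);
	for_each_leaf_node_possible_cpu(rnp, cpu) {
		struct rcu_data *rdp = per_cpu_ptr(&rcu_data, cpu);

		/* Cross-CPU read of the remote CPU's dynticks state. */
		if (!rcu_dynticks_in_eqs(rcu_dynticks_snap(rdp)))
			mask |= leaf_node_cpu_bit(rnp, cpu); /* CPU not idle */
	}
	WRITE_ONCE(rnp->qsmask, mask);	/* instead of ->qsmaskinitnext */
	raw_spin_unlock_irq_rcu_node(rnp);
}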

Now, this does increase overhead on mostly idle systems, but the theory
is that mostly idle systems are most able to absorb this extra overhead.

Thoughts?

							Thanx, Paul


* Re: dyntick-idle CPU and node's qsmask
  2018-11-10 23:04 ` Paul E. McKenney
@ 2018-11-11  3:09   ` Joel Fernandes
  2018-11-11  4:22     ` Paul E. McKenney
  0 siblings, 1 reply; 14+ messages in thread
From: Joel Fernandes @ 2018-11-11  3:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, josh, rostedt, mathieu.desnoyers, jiangshanlai

On Sat, Nov 10, 2018 at 03:04:36PM -0800, Paul E. McKenney wrote:
> On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> > Hi Paul and everyone,
> > 
> > I was tracing/studying the RCU code today in paul/dev branch and noticed that
> > for dyntick-idle CPUs, the RCU GP thread is clearing the rnp->qsmask
> > corresponding to the leaf node for the idle CPU, and reporting a QS on their
> > behalf.
> > 
> > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 792 0 dti
> > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 801 2 dti
> > rcu_sched-10    [003]    40.008041: rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0
> > 
> > That's all good but I was wondering if we can do better for the idle CPUs if
> > we can some how not set the qsmask of the node in the first place. Then no
> > reporting would be needed of quiescent state is needed for idle CPUs right?
> > And we would also not need to acquire the rnp lock I think.
> > 
> > At least for a single node tree RCU system, it seems that would avoid needing
> > to acquire the lock without complications. Anyway let me know your thoughts
> > and happy to discuss this at the hallways of the LPC as well for folks
> > attending :)
> 
> We could, but that would require consulting the rcu_data structure for
> each CPU while initializing the grace period, thus increasing the number
> of cache misses during grace-period initialization and also shortly after
> for any non-idle CPUs.  This seems backwards on busy systems where each

When I traced, it appears to me that the rcu_data structure of a remote CPU
was being consulted anyway by the rcu_sched thread. So it seems like such a
cache miss would happen anyway, whether during grace-period initialization or
during the FQS stage? I guess I'm trying to say that the consultation of the
remote CPU's rcu_data happens anyway.

> CPU will with high probability report its own quiescent state before three
> jiffies pass, in which case the cache misses on the rcu_data structures
> would be wasted motion.

If all the CPUs are busy and reporting their QS themselves, then I think the
qsmask is likely 0, so rcu_implicit_dynticks_qs() (called from force_qs_rnp())
wouldn't be called and there would be no cache misses on rcu_data, right?
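
(That matches my reading of the FQS scan.  A simplified sketch of the
force_qs_rnp() loop -- from memory, not the exact code -- showing how leaf
nodes whose qsmask is already clear are skipped without touching rcu_data:)

/* Simplified sketch of the force_qs_rnp() scan, not the exact code. */
struct rcu_node *rnp;
unsigned long flags;
int cpu;

rcu_for_each_leaf_node(rnp) {
	unsigned long mask = 0;

	raw_spin_lock_irqsave_rcu_node(rnp, flags);
	if (!rnp->qsmask) {
		/* No QS still needed on this leaf: no rcu_data accesses. */
		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
		continue;
	}
	for_each_leaf_node_possible_cpu(rnp, cpu)
		if ((rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)) &&
		    rcu_implicit_dynticks_qs(per_cpu_ptr(&rcu_data, cpu)))
			mask |= leaf_node_cpu_bit(rnp, cpu);
	if (mask)
		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags); /* drops lock */
	else
		raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
}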

> Now, this does increase overhead on mostly idle systems, but the theory
> is that mostly idle systems are most able to absorb this extra overhead.

Yes. Could we use rcuperf to check the impact of such a change?

Anyway, it was just an idea that popped up while I was going through traces :)
Thanks for the discussion; happy to discuss further or try out anything.

- Joel


* Re: dyntick-idle CPU and node's qsmask
  2018-11-11  3:09   ` Joel Fernandes
@ 2018-11-11  4:22     ` Paul E. McKenney
  2018-11-11 18:09       ` Joel Fernandes
  0 siblings, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2018-11-11  4:22 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, josh, rostedt, mathieu.desnoyers, jiangshanlai

On Sat, Nov 10, 2018 at 07:09:25PM -0800, Joel Fernandes wrote:
> On Sat, Nov 10, 2018 at 03:04:36PM -0800, Paul E. McKenney wrote:
> > On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> > > Hi Paul and everyone,
> > > 
> > > I was tracing/studying the RCU code today in paul/dev branch and noticed that
> > > for dyntick-idle CPUs, the RCU GP thread is clearing the rnp->qsmask
> > > corresponding to the leaf node for the idle CPU, and reporting a QS on their
> > > behalf.
> > > 
> > > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 792 0 dti
> > > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 801 2 dti
> > > rcu_sched-10    [003]    40.008041: rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0
> > > 
> > > That's all good but I was wondering if we can do better for the idle CPUs if
> > > we can some how not set the qsmask of the node in the first place. Then no
> > > reporting would be needed of quiescent state is needed for idle CPUs right?
> > > And we would also not need to acquire the rnp lock I think.
> > > 
> > > At least for a single node tree RCU system, it seems that would avoid needing
> > > to acquire the lock without complications. Anyway let me know your thoughts
> > > and happy to discuss this at the hallways of the LPC as well for folks
> > > attending :)
> > 
> > We could, but that would require consulting the rcu_data structure for
> > each CPU while initializing the grace period, thus increasing the number
> > of cache misses during grace-period initialization and also shortly after
> > for any non-idle CPUs.  This seems backwards on busy systems where each
> 
> When I traced, it appears to me that rcu_data structure of a remote CPU was
> being consulted anyway by the rcu_sched thread. So it seems like such cache
> miss would happen anyway whether it is during grace-period initialization or
> during the fqs stage? I guess I'm trying to say, the consultation of remote
> CPU's rcu_data happens anyway.

Hmmm...

The rcu_gp_init() function does access an rcu_data structure, but it is
that of the current CPU, so shouldn't involve a communications cache miss,
at least not in the common case.

Or are you seeing these cross-CPU rcu_data accesses in rcu_gp_fqs() or
functions that it calls?  In that case, please see below.

> > CPU will with high probability report its own quiescent state before three
> > jiffies pass, in which case the cache misses on the rcu_data structures
> > would be wasted motion.
> 
> If all the CPUs are busy and reporting their QS themselves, then I think the
> qsmask is likely 0 so then rcu_implicit_dynticks_qs (called from
> force_qs_rnp) wouldn't be called and so there would no cache misses on
> rcu_data right?

Yes, but that assumes that all CPUs report their quiescent states before
the first call to rcu_gp_fqs().  One exception is when some CPU is
looping in the kernel for many milliseconds without passing through a
quiescent state.  This is because for recent kernels, cond_resched()
is not a quiescent state until the grace period is something like 100
milliseconds old.  (For older kernels, cond_resched() was never an RCU
quiescent state unless it actually scheduled.)

Why wait 100 milliseconds?  Because otherwise the increase in
cond_resched() overhead shows up all too well, causing 0day test robot
to complain bitterly.  Besides, I would expect that in the common case,
CPUs would be executing usermode code.
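
For anyone mapping that onto the code, here is a simplified sketch (not the
exact kernel source) of the PREEMPT=n path: cond_resched() funnels into
rcu_all_qs(), which does essentially nothing until the grace period has aged
enough for the FQS code to set this CPU's ->rcu_urgent_qs flag.

/* Simplified sketch, not the exact kernel source. */
int __sched _cond_resched(void)
{
	if (should_resched(0)) {
		preempt_schedule_common();
		return 1;
	}
	rcu_all_qs();
	return 0;
}

void rcu_all_qs(void)
{
	unsigned long flags;

	if (!raw_cpu_read(rcu_data.rcu_urgent_qs))
		return;	/* Common case: GP not old enough, nearly free. */
	preempt_disable();
	this_cpu_write(rcu_data.rcu_urgent_qs, false);
	if (unlikely(raw_cpu_read(rcu_data.rcu_need_heavy_qs))) {
		/* GP is very old: simulate a dyntick-idle transition. */
		local_irq_save(flags);
		rcu_momentary_dyntick_idle();
		local_irq_restore(flags);
	}
	rcu_qs();
	preempt_enable();
}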

Ah, did you build with NO_HZ_FULL, boot with nohz_full CPUs, and then run
CPU-bound usermode workloads on those CPUs?  Such CPUs would appear to
be idle from an RCU perspective.  But these CPUs would never touch their
rcu_data structures, so they would likely remain in the RCU grace-period
kthread's cache.  So this should work well also.  Give or take that other
work would likely eject them from the cache, but in that case they would
be capacity cache misses rather than the aforementioned communications
cache misses.  Not that this distinction matters to whoever is measuring
performance.  ;-)

> > Now, this does increase overhead on mostly idle systems, but the theory
> > is that mostly idle systems are most able to absorb this extra overhead.
> 
> Yes. Could we use rcuperf to check the impact of such change?

I would be very surprised if the overhead was large enough for rcuperf
to be able to see it.

> Anyway it was just an idea that popped up when I was going through traces :)
> Thanks for the discussion and happy to discuss further or try out anything.

Either way, I do appreciate your going through this.  People have found
RCU bugs this way, one of which involved RCU uselessly calling a particular
function twice in quick succession.  ;-)

							Thanx, Paul


* Re: dyntick-idle CPU and node's qsmask
  2018-11-11  4:22     ` Paul E. McKenney
@ 2018-11-11 18:09       ` Joel Fernandes
  2018-11-11 18:36         ` Paul E. McKenney
  0 siblings, 1 reply; 14+ messages in thread
From: Joel Fernandes @ 2018-11-11 18:09 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, josh, rostedt, mathieu.desnoyers, jiangshanlai

On Sat, Nov 10, 2018 at 08:22:10PM -0800, Paul E. McKenney wrote:
> On Sat, Nov 10, 2018 at 07:09:25PM -0800, Joel Fernandes wrote:
> > On Sat, Nov 10, 2018 at 03:04:36PM -0800, Paul E. McKenney wrote:
> > > On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> > > > Hi Paul and everyone,
> > > > 
> > > > I was tracing/studying the RCU code today in paul/dev branch and noticed that
> > > > for dyntick-idle CPUs, the RCU GP thread is clearing the rnp->qsmask
> > > > corresponding to the leaf node for the idle CPU, and reporting a QS on their
> > > > behalf.
> > > > 
> > > > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 792 0 dti
> > > > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 801 2 dti
> > > > rcu_sched-10    [003]    40.008041: rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0
> > > > 
> > > > That's all good but I was wondering if we can do better for the idle CPUs if
> > > > we can some how not set the qsmask of the node in the first place. Then no
> > > > reporting would be needed of quiescent state is needed for idle CPUs right?
> > > > And we would also not need to acquire the rnp lock I think.
> > > > 
> > > > At least for a single node tree RCU system, it seems that would avoid needing
> > > > to acquire the lock without complications. Anyway let me know your thoughts
> > > > and happy to discuss this at the hallways of the LPC as well for folks
> > > > attending :)
> > > 
> > > We could, but that would require consulting the rcu_data structure for
> > > each CPU while initializing the grace period, thus increasing the number
> > > of cache misses during grace-period initialization and also shortly after
> > > for any non-idle CPUs.  This seems backwards on busy systems where each
> > 
> > When I traced, it appears to me that rcu_data structure of a remote CPU was
> > being consulted anyway by the rcu_sched thread. So it seems like such cache
> > miss would happen anyway whether it is during grace-period initialization or
> > during the fqs stage? I guess I'm trying to say, the consultation of remote
> > CPU's rcu_data happens anyway.
> 
> Hmmm...
> 
> The rcu_gp_init() function does access an rcu_data structure, but it is
> that of the current CPU, so shouldn't involve a communications cache miss,
> at least not in the common case.
> 
> Or are you seeing these cross-CPU rcu_data accesses in rcu_gp_fqs() or
> functions that it calls?  In that case, please see below.

Yes, it was rcu_implicit_dynticks_qs called from rcu_gp_fqs.

> > > CPU will with high probability report its own quiescent state before three
> > > jiffies pass, in which case the cache misses on the rcu_data structures
> > > would be wasted motion.
> > 
> > If all the CPUs are busy and reporting their QS themselves, then I think the
> > qsmask is likely 0 so then rcu_implicit_dynticks_qs (called from
> > force_qs_rnp) wouldn't be called and so there would no cache misses on
> > rcu_data right?
> 
> Yes, but assuming that all CPUs report their quiescent states before
> the first call to rcu_gp_fqs().  One exception is when some CPU is
> looping in the kernel for many milliseconds without passing through a
> quiescent state.  This is because for recent kernels, cond_resched()
> is not a quiescent state until the grace period is something like 100
> milliseconds old.  (For older kernels, cond_resched() was never an RCU
> quiescent state unless it actually scheduled.)
> 
> Why wait 100 milliseconds?  Because otherwise the increase in
> cond_resched() overhead shows up all too well, causing 0day test robot
> to complain bitterly.  Besides, I would expect that in the common case,
> CPUs would be executing usermode code.

Makes sense. I was also wondering about this other thing you mentioned, about
waiting for 3 jiffies before reporting the idle CPU's quiescent state. Does
that mean that even if a single CPU is dyntick-idle for a long period of time,
the minimum grace-period duration would be at least 3 jiffies? In our mobile
embedded devices, the jiffy period is set to 3.33ms (HZ=300) to keep power
consumption low. Not that I'm saying it's an issue or anything (since IIUC if
someone wants shorter grace periods, they should just use expedited GPs), but
it sounds like the GP would be shorter if we somehow set up the qsmask
correctly early on and could manage the overhead of doing so.

> Ah, did you build with NO_HZ_FULL, boot with nohz_full CPUs, and then run
> CPU-bound usermode workloads on those CPUs?  Such CPUs would appear to
> be idle from an RCU perspective.  But these CPUs would never touch their
> rcu_data structures, so they would likely remain in the RCU grace-period
> kthread's cache.  So this should work well also.  Give or take that other
> work would likely eject them from the cache, but in that case they would
> be capacity cache misses rather than the aforementioned communications
> cache misses.  Not that this distinction matters to whoever is measuring
> performance.  ;-)

Ah ok :-) I had booted with !CONFIG_NO_HZ_FULL for this test.

> > Anyway it was just an idea that popped up when I was going through traces :)
> > Thanks for the discussion and happy to discuss further or try out anything.
> 
> Either way, I do appreciate your going through this.  People have found
> RCU bugs this way, one of which involved RCU uselessly calling a particular
> function twice in quick succession.  ;-)
 
Thanks.  It is my pleasure, and I'm happy to help :) I'll keep digging into it.

 - Joel


* Re: dyntick-idle CPU and node's qsmask
  2018-11-11 18:09       ` Joel Fernandes
@ 2018-11-11 18:36         ` Paul E. McKenney
  2018-11-11 21:04           ` Joel Fernandes
  2018-11-20 20:42           ` Joel Fernandes
  0 siblings, 2 replies; 14+ messages in thread
From: Paul E. McKenney @ 2018-11-11 18:36 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, josh, rostedt, mathieu.desnoyers, jiangshanlai

On Sun, Nov 11, 2018 at 10:09:16AM -0800, Joel Fernandes wrote:
> On Sat, Nov 10, 2018 at 08:22:10PM -0800, Paul E. McKenney wrote:
> > On Sat, Nov 10, 2018 at 07:09:25PM -0800, Joel Fernandes wrote:
> > > On Sat, Nov 10, 2018 at 03:04:36PM -0800, Paul E. McKenney wrote:
> > > > On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> > > > > Hi Paul and everyone,
> > > > > 
> > > > > I was tracing/studying the RCU code today in paul/dev branch and noticed that
> > > > > for dyntick-idle CPUs, the RCU GP thread is clearing the rnp->qsmask
> > > > > corresponding to the leaf node for the idle CPU, and reporting a QS on their
> > > > > behalf.
> > > > > 
> > > > > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 792 0 dti
> > > > > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 801 2 dti
> > > > > rcu_sched-10    [003]    40.008041: rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0
> > > > > 
> > > > > That's all good but I was wondering if we can do better for the idle CPUs if
> > > > > we can some how not set the qsmask of the node in the first place. Then no
> > > > > reporting would be needed of quiescent state is needed for idle CPUs right?
> > > > > And we would also not need to acquire the rnp lock I think.
> > > > > 
> > > > > At least for a single node tree RCU system, it seems that would avoid needing
> > > > > to acquire the lock without complications. Anyway let me know your thoughts
> > > > > and happy to discuss this at the hallways of the LPC as well for folks
> > > > > attending :)
> > > > 
> > > > We could, but that would require consulting the rcu_data structure for
> > > > each CPU while initializing the grace period, thus increasing the number
> > > > of cache misses during grace-period initialization and also shortly after
> > > > for any non-idle CPUs.  This seems backwards on busy systems where each
> > > 
> > > When I traced, it appears to me that rcu_data structure of a remote CPU was
> > > being consulted anyway by the rcu_sched thread. So it seems like such cache
> > > miss would happen anyway whether it is during grace-period initialization or
> > > during the fqs stage? I guess I'm trying to say, the consultation of remote
> > > CPU's rcu_data happens anyway.
> > 
> > Hmmm...
> > 
> > The rcu_gp_init() function does access an rcu_data structure, but it is
> > that of the current CPU, so shouldn't involve a communications cache miss,
> > at least not in the common case.
> > 
> > Or are you seeing these cross-CPU rcu_data accesses in rcu_gp_fqs() or
> > functions that it calls?  In that case, please see below.
> 
> Yes, it was rcu_implicit_dynticks_qs called from rcu_gp_fqs.
> 
> > > > CPU will with high probability report its own quiescent state before three
> > > > jiffies pass, in which case the cache misses on the rcu_data structures
> > > > would be wasted motion.
> > > 
> > > If all the CPUs are busy and reporting their QS themselves, then I think the
> > > qsmask is likely 0 so then rcu_implicit_dynticks_qs (called from
> > > force_qs_rnp) wouldn't be called and so there would no cache misses on
> > > rcu_data right?
> > 
> > Yes, but assuming that all CPUs report their quiescent states before
> > the first call to rcu_gp_fqs().  One exception is when some CPU is
> > looping in the kernel for many milliseconds without passing through a
> > quiescent state.  This is because for recent kernels, cond_resched()
> > is not a quiescent state until the grace period is something like 100
> > milliseconds old.  (For older kernels, cond_resched() was never an RCU
> > quiescent state unless it actually scheduled.)
> > 
> > Why wait 100 milliseconds?  Because otherwise the increase in
> > cond_resched() overhead shows up all too well, causing 0day test robot
> > to complain bitterly.  Besides, I would expect that in the common case,
> > CPUs would be executing usermode code.
> 
> Makes sense. I was also wondering about this other thing you mentioned about
> waiting for 3 jiffies before reporting the idle CPU's quiescent state. Does
> that mean that even if a single CPU is dyntick-idle for a long period of
> time, then the minimum grace period duration would be atleast 3 jiffies? In
> our mobile embedded devices, jiffies is set to 3.33ms (HZ=300) to keep power
> consumption low. Not that I'm saying its an issue or anything (since IIUC if
> someone wants shorter grace periods, they should just use expedited GPs), but
> it sounds like it would be shorter GP if we just set the qsmask early on some
> how and we can manage the overhead of doing so.

First, there is some autotuning of the delay based on HZ:

#define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))

So at HZ=300, you should be seeing a two-jiffy delay rather than the
usual HZ=1000 three-jiffy delay.  Of course, this means that the delay
is 6.67ms rather than the usual 3ms, but the theory is that lower HZ
rates often mean slower instruction execution and thus a desire for
lower RCU overhead.  There is further autotuning based on number of
CPUs, but this does not kick in until you have 256 CPUs on your system,
and I bet that smartphones aren't there yet.  Nevertheless, check out
RCU_JIFFIES_FQS_DIV for more info on this.
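
Working that macro through for a few common HZ values, just to make the
numbers concrete:

HZ=100:   1 + 0 + 0 = 1 jiffy   = 10 ms
HZ=250:   1 + 0 + 0 = 1 jiffy   =  4 ms
HZ=300:   1 + 1 + 0 = 2 jiffies ~  6.67 ms
HZ=1000:  1 + 1 + 1 = 3 jiffies =  3 ms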

But you can always override this autotuning using the following kernel
boot parameters:

rcutree.jiffies_till_first_fqs
rcutree.jiffies_till_next_fqs

You can even set the first one to zero if you want the effect of pre-scanning
for idle CPUs.  ;-)

The second must be set to one or greater.

Both are capped at one second (HZ).

> > Ah, did you build with NO_HZ_FULL, boot with nohz_full CPUs, and then run
> > CPU-bound usermode workloads on those CPUs?  Such CPUs would appear to
> > be idle from an RCU perspective.  But these CPUs would never touch their
> > rcu_data structures, so they would likely remain in the RCU grace-period
> > kthread's cache.  So this should work well also.  Give or take that other
> > work would likely eject them from the cache, but in that case they would
> > be capacity cache misses rather than the aforementioned communications
> > cache misses.  Not that this distinction matters to whoever is measuring
> > performance.  ;-)
> 
> Ah ok :-) I had booted with !CONFIG_NO_HZ_FULL for this test.

Never mind, then.  ;-)

> > > Anyway it was just an idea that popped up when I was going through traces :)
> > > Thanks for the discussion and happy to discuss further or try out anything.
> > 
> > Either way, I do appreciate your going through this.  People have found
> > RCU bugs this way, one of which involved RCU uselessly calling a particular
> > function twice in quick succession.  ;-)
>  
> Thanks.  It is my pleasure and happy to help :) I'll keep digging into it.

Looking forward to further questions and patches.  ;-)

							Thanx, Paul


* Re: dyntick-idle CPU and node's qsmask
  2018-11-11 18:36         ` Paul E. McKenney
@ 2018-11-11 21:04           ` Joel Fernandes
  2018-11-20 20:42           ` Joel Fernandes
  1 sibling, 0 replies; 14+ messages in thread
From: Joel Fernandes @ 2018-11-11 21:04 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, josh, rostedt, mathieu.desnoyers, jiangshanlai

On Sun, Nov 11, 2018 at 10:36:18AM -0800, Paul E. McKenney wrote:
[..]
> > > > > CPU will with high probability report its own quiescent state before three
> > > > > jiffies pass, in which case the cache misses on the rcu_data structures
> > > > > would be wasted motion.
> > > > 
> > > > If all the CPUs are busy and reporting their QS themselves, then I think the
> > > > qsmask is likely 0 so then rcu_implicit_dynticks_qs (called from
> > > > force_qs_rnp) wouldn't be called and so there would no cache misses on
> > > > rcu_data right?
> > > 
> > > Yes, but assuming that all CPUs report their quiescent states before
> > > the first call to rcu_gp_fqs().  One exception is when some CPU is
> > > looping in the kernel for many milliseconds without passing through a
> > > quiescent state.  This is because for recent kernels, cond_resched()
> > > is not a quiescent state until the grace period is something like 100
> > > milliseconds old.  (For older kernels, cond_resched() was never an RCU
> > > quiescent state unless it actually scheduled.)
> > > 
> > > Why wait 100 milliseconds?  Because otherwise the increase in
> > > cond_resched() overhead shows up all too well, causing 0day test robot
> > > to complain bitterly.  Besides, I would expect that in the common case,
> > > CPUs would be executing usermode code.
> > 
> > Makes sense. I was also wondering about this other thing you mentioned about
> > waiting for 3 jiffies before reporting the idle CPU's quiescent state. Does
> > that mean that even if a single CPU is dyntick-idle for a long period of
> > time, then the minimum grace period duration would be atleast 3 jiffies? In
> > our mobile embedded devices, jiffies is set to 3.33ms (HZ=300) to keep power
> > consumption low. Not that I'm saying its an issue or anything (since IIUC if
> > someone wants shorter grace periods, they should just use expedited GPs), but
> > it sounds like it would be shorter GP if we just set the qsmask early on some
> > how and we can manage the overhead of doing so.
> 
> First, there is some autotuning of the delay based on HZ:
> 
> #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
> 
> So at HZ=300, you should be seeing a two-jiffy delay rather than the
> usual HZ=1000 three-jiffy delay.  Of course, this means that the delay
> is 6.67ms rather than the usual 3ms, but the theory is that lower HZ
> rates often mean slower instruction execution and thus a desire for
> lower RCU overhead.  There is further autotuning based on number of
> CPUs, but this does not kick in until you have 256 CPUs on your system,
> and I bet that smartphones aren't there yet.  Nevertheless, check out
> RCU_JIFFIES_FQS_DIV for more info on this.

Got it. I agree with that heuristic.

> But you can always override this autotuning using the following kernel
> boot paramters:
> 
> rcutree.jiffies_till_first_fqs
> rcutree.jiffies_till_next_fqs
> 
> You can even set the first one to zero if you want the effect of pre-scanning
> for idle CPUs.  ;-)
> 
> The second must be set to one or greater.
> 
> Both are capped at one second (HZ).

Got it. Thanks a lot for the explanations.

> > > > Anyway it was just an idea that popped up when I was going through traces :)
> > > > Thanks for the discussion and happy to discuss further or try out anything.
> > > 
> > > Either way, I do appreciate your going through this.  People have found
> > > RCU bugs this way, one of which involved RCU uselessly calling a particular
> > > function twice in quick succession.  ;-)
> >  
> > Thanks.  It is my pleasure and happy to help :) I'll keep digging into it.
> 
> Looking forward to further questions and patches.  ;-)

Will do! thanks,

 - Joel


* Re: dyntick-idle CPU and node's qsmask
  2018-11-11 18:36         ` Paul E. McKenney
  2018-11-11 21:04           ` Joel Fernandes
@ 2018-11-20 20:42           ` Joel Fernandes
  2018-11-20 22:28             ` Paul E. McKenney
  1 sibling, 1 reply; 14+ messages in thread
From: Joel Fernandes @ 2018-11-20 20:42 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, josh, rostedt, mathieu.desnoyers, jiangshanlai

On Sun, Nov 11, 2018 at 10:36:18AM -0800, Paul E. McKenney wrote:
> On Sun, Nov 11, 2018 at 10:09:16AM -0800, Joel Fernandes wrote:
> > On Sat, Nov 10, 2018 at 08:22:10PM -0800, Paul E. McKenney wrote:
> > > On Sat, Nov 10, 2018 at 07:09:25PM -0800, Joel Fernandes wrote:
> > > > On Sat, Nov 10, 2018 at 03:04:36PM -0800, Paul E. McKenney wrote:
> > > > > On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> > > > > > Hi Paul and everyone,
> > > > > > 
> > > > > > I was tracing/studying the RCU code today in paul/dev branch and noticed that
> > > > > > for dyntick-idle CPUs, the RCU GP thread is clearing the rnp->qsmask
> > > > > > corresponding to the leaf node for the idle CPU, and reporting a QS on their
> > > > > > behalf.
> > > > > > 
> > > > > > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 792 0 dti
> > > > > > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 801 2 dti
> > > > > > rcu_sched-10    [003]    40.008041: rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0
> > > > > > 
> > > > > > That's all good but I was wondering if we can do better for the idle CPUs if
> > > > > > we can some how not set the qsmask of the node in the first place. Then no
> > > > > > reporting would be needed of quiescent state is needed for idle CPUs right?
> > > > > > And we would also not need to acquire the rnp lock I think.
> > > > > > 
> > > > > > At least for a single node tree RCU system, it seems that would avoid needing
> > > > > > to acquire the lock without complications. Anyway let me know your thoughts
> > > > > > and happy to discuss this at the hallways of the LPC as well for folks
> > > > > > attending :)
> > > > > 
> > > > > We could, but that would require consulting the rcu_data structure for
> > > > > each CPU while initializing the grace period, thus increasing the number
> > > > > of cache misses during grace-period initialization and also shortly after
> > > > > for any non-idle CPUs.  This seems backwards on busy systems where each
> > > > 
> > > > When I traced, it appears to me that rcu_data structure of a remote CPU was
> > > > being consulted anyway by the rcu_sched thread. So it seems like such cache
> > > > miss would happen anyway whether it is during grace-period initialization or
> > > > during the fqs stage? I guess I'm trying to say, the consultation of remote
> > > > CPU's rcu_data happens anyway.
> > > 
> > > Hmmm...
> > > 
> > > The rcu_gp_init() function does access an rcu_data structure, but it is
> > > that of the current CPU, so shouldn't involve a communications cache miss,
> > > at least not in the common case.
> > > 
> > > Or are you seeing these cross-CPU rcu_data accesses in rcu_gp_fqs() or
> > > functions that it calls?  In that case, please see below.
> > 
> > Yes, it was rcu_implicit_dynticks_qs called from rcu_gp_fqs.
> > 
> > > > > CPU will with high probability report its own quiescent state before three
> > > > > jiffies pass, in which case the cache misses on the rcu_data structures
> > > > > would be wasted motion.
> > > > 
> > > > If all the CPUs are busy and reporting their QS themselves, then I think the
> > > > qsmask is likely 0 so then rcu_implicit_dynticks_qs (called from
> > > > force_qs_rnp) wouldn't be called and so there would no cache misses on
> > > > rcu_data right?
> > > 
> > > Yes, but assuming that all CPUs report their quiescent states before
> > > the first call to rcu_gp_fqs().  One exception is when some CPU is
> > > looping in the kernel for many milliseconds without passing through a
> > > quiescent state.  This is because for recent kernels, cond_resched()
> > > is not a quiescent state until the grace period is something like 100
> > > milliseconds old.  (For older kernels, cond_resched() was never an RCU
> > > quiescent state unless it actually scheduled.)
> > > 
> > > Why wait 100 milliseconds?  Because otherwise the increase in
> > > cond_resched() overhead shows up all too well, causing 0day test robot
> > > to complain bitterly.  Besides, I would expect that in the common case,
> > > CPUs would be executing usermode code.
> > 
> > Makes sense. I was also wondering about this other thing you mentioned about
> > waiting for 3 jiffies before reporting the idle CPU's quiescent state. Does
> > that mean that even if a single CPU is dyntick-idle for a long period of
> > time, then the minimum grace period duration would be atleast 3 jiffies? In
> > our mobile embedded devices, jiffies is set to 3.33ms (HZ=300) to keep power
> > consumption low. Not that I'm saying its an issue or anything (since IIUC if
> > someone wants shorter grace periods, they should just use expedited GPs), but
> > it sounds like it would be shorter GP if we just set the qsmask early on some
> > how and we can manage the overhead of doing so.
> 
> First, there is some autotuning of the delay based on HZ:
> 
> #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
> 
> So at HZ=300, you should be seeing a two-jiffy delay rather than the
> usual HZ=1000 three-jiffy delay.  Of course, this means that the delay
> is 6.67ms rather than the usual 3ms, but the theory is that lower HZ
> rates often mean slower instruction execution and thus a desire for
> lower RCU overhead.  There is further autotuning based on number of
> CPUs, but this does not kick in until you have 256 CPUs on your system,
> and I bet that smartphones aren't there yet.  Nevertheless, check out
> RCU_JIFFIES_FQS_DIV for more info on this.
> 
> But you can always override this autotuning using the following kernel
> boot paramters:
> 
> rcutree.jiffies_till_first_fqs
> rcutree.jiffies_till_next_fqs

Slightly related, I was just going through your patch in the dev branch "doc:
Now jiffies_till_sched_qs solicits from cond_resched()".

If I understand correctly, what you're trying to do is set
rcu_data.rcu_urgent_qs from rcu_implicit_dynticks_qs() if you've not heard
from the CPU for long enough.

Then in the other paths, you read this value and simulate a dyntick-idle
transition even though you may not really be going into dyntick-idle.
And in the scheduler tick, you also use it to set NEED_RESCHED
appropriately.

Did I get it right so far?

I was wondering if we could simplify rcu_note_context_switch() (the parts
that call rcu_momentary_dyntick_idle()) if we did the following in
rcu_implicit_dynticks_qs().

Since we already call rcu_qs() in rcu_note_context_switch(), that would clear
the rdp->cpu_no_qs flag. Then there should be no need to call
rcu_momentary_dyntick_idle() from rcu_note_context_switch().

I think this would simplify cond_resched() as well.  Could this avoid the
need for having an rcu_all_qs() at all? Hopefully I didn't miss some
Tasks-RCU corner cases...

Basically, for some background, I was wondering whether we can simplify the
code that calls rcu_momentary_dyntick_idle(), since we already register a QS
in other ways (like by resetting cpu_no_qs).

I should probably start drawing some pictures to make sense of everything,
but do let me know if I have a point ;-) Thanks for your time.

- Joel

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index c818e0c91a81..5aa0259c014d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1063,7 +1063,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
 	 * read-side critical section that started before the beginning
 	 * of the current RCU grace period.
 	 */
-	if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap)) {
+	if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap) || !rdp->cpu_no_qs.b.norm) {
 		trace_rcu_fqs(rcu_state.name, rdp->gp_seq, rdp->cpu, TPS("dti"));
 		rcu_gpnum_ovf(rnp, rdp);
 		return 1;


* Re: dyntick-idle CPU and node's qsmask
  2018-11-20 20:42           ` Joel Fernandes
@ 2018-11-20 22:28             ` Paul E. McKenney
  2018-11-20 22:34               ` Paul E. McKenney
  2018-11-21  2:06               ` Joel Fernandes
  0 siblings, 2 replies; 14+ messages in thread
From: Paul E. McKenney @ 2018-11-20 22:28 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, josh, rostedt, mathieu.desnoyers, jiangshanlai

On Tue, Nov 20, 2018 at 12:42:43PM -0800, Joel Fernandes wrote:
> On Sun, Nov 11, 2018 at 10:36:18AM -0800, Paul E. McKenney wrote:
> > On Sun, Nov 11, 2018 at 10:09:16AM -0800, Joel Fernandes wrote:
> > > On Sat, Nov 10, 2018 at 08:22:10PM -0800, Paul E. McKenney wrote:
> > > > On Sat, Nov 10, 2018 at 07:09:25PM -0800, Joel Fernandes wrote:
> > > > > On Sat, Nov 10, 2018 at 03:04:36PM -0800, Paul E. McKenney wrote:
> > > > > > On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> > > > > > > Hi Paul and everyone,
> > > > > > > 
> > > > > > > I was tracing/studying the RCU code today in paul/dev branch and noticed that
> > > > > > > for dyntick-idle CPUs, the RCU GP thread is clearing the rnp->qsmask
> > > > > > > corresponding to the leaf node for the idle CPU, and reporting a QS on their
> > > > > > > behalf.
> > > > > > > 
> > > > > > > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 792 0 dti
> > > > > > > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 801 2 dti
> > > > > > > rcu_sched-10    [003]    40.008041: rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0
> > > > > > > 
> > > > > > > That's all good but I was wondering if we can do better for the idle CPUs if
> > > > > > > we can some how not set the qsmask of the node in the first place. Then no
> > > > > > > reporting would be needed of quiescent state is needed for idle CPUs right?
> > > > > > > And we would also not need to acquire the rnp lock I think.
> > > > > > > 
> > > > > > > At least for a single node tree RCU system, it seems that would avoid needing
> > > > > > > to acquire the lock without complications. Anyway let me know your thoughts
> > > > > > > and happy to discuss this at the hallways of the LPC as well for folks
> > > > > > > attending :)
> > > > > > 
> > > > > > We could, but that would require consulting the rcu_data structure for
> > > > > > each CPU while initializing the grace period, thus increasing the number
> > > > > > of cache misses during grace-period initialization and also shortly after
> > > > > > for any non-idle CPUs.  This seems backwards on busy systems where each
> > > > > 
> > > > > When I traced, it appears to me that rcu_data structure of a remote CPU was
> > > > > being consulted anyway by the rcu_sched thread. So it seems like such cache
> > > > > miss would happen anyway whether it is during grace-period initialization or
> > > > > during the fqs stage? I guess I'm trying to say, the consultation of remote
> > > > > CPU's rcu_data happens anyway.
> > > > 
> > > > Hmmm...
> > > > 
> > > > The rcu_gp_init() function does access an rcu_data structure, but it is
> > > > that of the current CPU, so shouldn't involve a communications cache miss,
> > > > at least not in the common case.
> > > > 
> > > > Or are you seeing these cross-CPU rcu_data accesses in rcu_gp_fqs() or
> > > > functions that it calls?  In that case, please see below.
> > > 
> > > Yes, it was rcu_implicit_dynticks_qs called from rcu_gp_fqs.
> > > 
> > > > > > CPU will with high probability report its own quiescent state before three
> > > > > > jiffies pass, in which case the cache misses on the rcu_data structures
> > > > > > would be wasted motion.
> > > > > 
> > > > > If all the CPUs are busy and reporting their QS themselves, then I think the
> > > > > qsmask is likely 0 so then rcu_implicit_dynticks_qs (called from
> > > > > force_qs_rnp) wouldn't be called and so there would no cache misses on
> > > > > rcu_data right?
> > > > 
> > > > Yes, but assuming that all CPUs report their quiescent states before
> > > > the first call to rcu_gp_fqs().  One exception is when some CPU is
> > > > looping in the kernel for many milliseconds without passing through a
> > > > quiescent state.  This is because for recent kernels, cond_resched()
> > > > is not a quiescent state until the grace period is something like 100
> > > > milliseconds old.  (For older kernels, cond_resched() was never an RCU
> > > > quiescent state unless it actually scheduled.)
> > > > 
> > > > Why wait 100 milliseconds?  Because otherwise the increase in
> > > > cond_resched() overhead shows up all too well, causing 0day test robot
> > > > to complain bitterly.  Besides, I would expect that in the common case,
> > > > CPUs would be executing usermode code.
> > > 
> > > Makes sense. I was also wondering about this other thing you mentioned about
> > > waiting for 3 jiffies before reporting the idle CPU's quiescent state. Does
> > > that mean that even if a single CPU is dyntick-idle for a long period of
> > > time, then the minimum grace period duration would be atleast 3 jiffies? In
> > > our mobile embedded devices, jiffies is set to 3.33ms (HZ=300) to keep power
> > > consumption low. Not that I'm saying its an issue or anything (since IIUC if
> > > someone wants shorter grace periods, they should just use expedited GPs), but
> > > it sounds like it would be shorter GP if we just set the qsmask early on some
> > > how and we can manage the overhead of doing so.
> > 
> > First, there is some autotuning of the delay based on HZ:
> > 
> > #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
> > 
> > So at HZ=300, you should be seeing a two-jiffy delay rather than the
> > usual HZ=1000 three-jiffy delay.  Of course, this means that the delay
> > is 6.67ms rather than the usual 3ms, but the theory is that lower HZ
> > rates often mean slower instruction execution and thus a desire for
> > lower RCU overhead.  There is further autotuning based on number of
> > CPUs, but this does not kick in until you have 256 CPUs on your system,
> > and I bet that smartphones aren't there yet.  Nevertheless, check out
> > RCU_JIFFIES_FQS_DIV for more info on this.
> > 
> > But you can always override this autotuning using the following kernel
> > boot paramters:
> > 
> > rcutree.jiffies_till_first_fqs
> > rcutree.jiffies_till_next_fqs
> 
> Slightly related, I was just going through your patch in the dev branch "doc:
> Now jiffies_till_sched_qs solicits from cond_resched()".
> 
> If I understand correctly, what you're trying to do is set
> rcu_data.rcu_urgent_qs if you've not heard from the CPU long enough from
> rcu_implicit_dynticks_qs.
> 
> Then in the other paths, you are reading this value and similuating a dyntick
> idle transition even though you may not be really going into dyntick-idle.
> Actually in the scheduler-tick, you are also using it to set NEED_RESCHED
> appropriately.
> 
> Did I get it right so far?

Partially.

The simulated dyntick-idle transition happens if the grace period extends
for even longer, so that ->rcu_need_heavy_qs gets set.  Up to that point,
all that is asked for is a local-to-the-CPU report of a quiescent state.
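
A simplified sketch of that escalation in rcu_implicit_dynticks_qs() -- from
memory, with the surrounding details elided, so treat the exact conditions
and field names as approximations:

static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
{
	...
	/*
	 * Grace period older than roughly jiffies_till_sched_qs: ask only
	 * for a cheap local report, e.g. from the CPU's next cond_resched()
	 * or scheduler-clock tick.
	 */
	if (time_after(jiffies, rcu_state.gp_start + jiffies_to_sched_qs))
		WRITE_ONCE(rdp->rcu_urgent_qs, true);

	/*
	 * Grace period much older still: also request the heavyweight help
	 * (the simulated dyntick-idle transition via
	 * rcu_momentary_dyntick_idle()) and poke the CPU.
	 */
	if (time_after(jiffies, rcu_state.jiffies_resched)) {
		WRITE_ONCE(rdp->rcu_need_heavy_qs, true);
		resched_cpu(rdp->cpu);
	}
	...
}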

> I was thinking if we could simplify rcu_note_context_switch (the parts that
> call rcu_momentary_dyntick_idle), if we did the following in
> rcu_implicit_dynticks_qs.
> 
> Since we already call rcu_qs in rcu_note_context_switch, that would clear the
> rdp->cpu_no_qs flag. Then there should be no need to call
> rcu_momentary_dyntick_idle from rcu_note_context switch.

But does this also work for the rcu_all_qs() code path?

> I think this would simplify cond_resched as well.  Could this avoid the need
> for having an rcu_all_qs at all? Hopefully I didn't some Tasks-RCU corner cases..

There is also the code path from cond_resched() in PREEMPT=n kernels.
This needs rcu_all_qs().  Though it is quite possible that some additional
code collapsing is possible.

> Basically for some background, I was thinking can we simplify the code that
> calls "rcu_momentary_dyntick_idle" since we already register a qs in other
> ways (like by resetting cpu_no_qs).

One complication is that rcu_all_qs() is invoked with interrupts
and preemption enabled, while rcu_note_context_switch() is
invoked with interrupts disabled.  Also, as you say, Tasks RCU.
Plus rcu_all_qs() wants to exit immediately if there is nothing to
do, while rcu_note_context_switch() must unconditionally do rcu_qs()
-- yes, it could check, but that would be redundant with the checks
within rcu_qs().  The one function traces and the other one doesn't,
but it would be OK if both traced.  (I hope, anyway:  The cond_resched()
performance requirements are surprisingly severe.)  Aside from that,
the two functions are quite similar.

It would of course be possible to create a common helper function that
rcu_all_qs() and rcu_note_context_switch() both became simple wrappers
for, but it is not clear that this would actually be shorter or simpler.
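
Purely to illustrate why it might not end up simpler: such a helper would
need flags for most of the differences above, along the lines of the
following hypothetical (and deliberately oversimplified) shape -- not
proposed code, and the Tasks RCU handling is glossed over:

/* Hypothetical common helper; glosses over the constraints listed above. */
static void rcu_qs_common(bool check_urgent, bool irqs_disabled)
{
	unsigned long flags;

	if (check_urgent && !raw_cpu_read(rcu_data.rcu_urgent_qs))
		return;				/* rcu_all_qs() fast path. */
	if (raw_cpu_read(rcu_data.rcu_need_heavy_qs)) {
		if (!irqs_disabled)
			local_irq_save(flags);
		rcu_momentary_dyntick_idle();
		if (!irqs_disabled)
			local_irq_restore(flags);
	}
	rcu_qs();
	/* Real code does this only for voluntary context switches. */
	rcu_tasks_qs(current);
}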

> I should probably start drawing some pictures to make sense of everything,
> but do let me know if I have a point ;-) Thanks for your time.

This stuff is admittedly a bit fiddly.  Again, it took some serious
work to avoid cond_resched() performance regressions.

> - Joel
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index c818e0c91a81..5aa0259c014d 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -1063,7 +1063,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
>  	 * read-side critical section that started before the beginning
>  	 * of the current RCU grace period.
>  	 */
> -	if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap)) {
> +	if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap) || !rdp->cpu_no_qs.b.norm) {

If I am not too confused, this change could cause trouble for
nohz_full CPUs looping in the kernel.  Such CPUs don't necessarily take
scheduler-clock interrupts, last I checked, and this could prevent the
CPU from reporting its quiescent state to core RCU.

Or am I missing something here?

							Thanx, Paul

>  		trace_rcu_fqs(rcu_state.name, rdp->gp_seq, rdp->cpu, TPS("dti"));
>  		rcu_gpnum_ovf(rnp, rdp);
>  		return 1;
> 


* Re: dyntick-idle CPU and node's qsmask
  2018-11-20 22:28             ` Paul E. McKenney
@ 2018-11-20 22:34               ` Paul E. McKenney
  2018-11-21  2:06               ` Joel Fernandes
  1 sibling, 0 replies; 14+ messages in thread
From: Paul E. McKenney @ 2018-11-20 22:34 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, josh, rostedt, mathieu.desnoyers, jiangshanlai

On Tue, Nov 20, 2018 at 02:28:13PM -0800, Paul E. McKenney wrote:
> On Tue, Nov 20, 2018 at 12:42:43PM -0800, Joel Fernandes wrote:
> > On Sun, Nov 11, 2018 at 10:36:18AM -0800, Paul E. McKenney wrote:
> > > On Sun, Nov 11, 2018 at 10:09:16AM -0800, Joel Fernandes wrote:
> > > > On Sat, Nov 10, 2018 at 08:22:10PM -0800, Paul E. McKenney wrote:
> > > > > On Sat, Nov 10, 2018 at 07:09:25PM -0800, Joel Fernandes wrote:
> > > > > > On Sat, Nov 10, 2018 at 03:04:36PM -0800, Paul E. McKenney wrote:
> > > > > > > On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> > > > > > > > Hi Paul and everyone,
> > > > > > > > 
> > > > > > > > I was tracing/studying the RCU code today in paul/dev branch and noticed that
> > > > > > > > for dyntick-idle CPUs, the RCU GP thread is clearing the rnp->qsmask
> > > > > > > > corresponding to the leaf node for the idle CPU, and reporting a QS on their
> > > > > > > > behalf.
> > > > > > > > 
> > > > > > > > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 792 0 dti
> > > > > > > > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 801 2 dti
> > > > > > > > rcu_sched-10    [003]    40.008041: rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0
> > > > > > > > 
> > > > > > > > That's all good but I was wondering if we can do better for the idle CPUs if
> > > > > > > > we can some how not set the qsmask of the node in the first place. Then no
> > > > > > > > reporting would be needed of quiescent state is needed for idle CPUs right?
> > > > > > > > And we would also not need to acquire the rnp lock I think.
> > > > > > > > 
> > > > > > > > At least for a single node tree RCU system, it seems that would avoid needing
> > > > > > > > to acquire the lock without complications. Anyway let me know your thoughts
> > > > > > > > and happy to discuss this at the hallways of the LPC as well for folks
> > > > > > > > attending :)
> > > > > > > 
> > > > > > > We could, but that would require consulting the rcu_data structure for
> > > > > > > each CPU while initializing the grace period, thus increasing the number
> > > > > > > of cache misses during grace-period initialization and also shortly after
> > > > > > > for any non-idle CPUs.  This seems backwards on busy systems where each
> > > > > > 
> > > > > > When I traced, it appears to me that rcu_data structure of a remote CPU was
> > > > > > being consulted anyway by the rcu_sched thread. So it seems like such cache
> > > > > > miss would happen anyway whether it is during grace-period initialization or
> > > > > > during the fqs stage? I guess I'm trying to say, the consultation of remote
> > > > > > CPU's rcu_data happens anyway.
> > > > > 
> > > > > Hmmm...
> > > > > 
> > > > > The rcu_gp_init() function does access an rcu_data structure, but it is
> > > > > that of the current CPU, so shouldn't involve a communications cache miss,
> > > > > at least not in the common case.
> > > > > 
> > > > > Or are you seeing these cross-CPU rcu_data accesses in rcu_gp_fqs() or
> > > > > functions that it calls?  In that case, please see below.
> > > > 
> > > > Yes, it was rcu_implicit_dynticks_qs called from rcu_gp_fqs.
> > > > 
> > > > > > > CPU will with high probability report its own quiescent state before three
> > > > > > > jiffies pass, in which case the cache misses on the rcu_data structures
> > > > > > > would be wasted motion.
> > > > > > 
> > > > > > If all the CPUs are busy and reporting their QS themselves, then I think the
> > > > > > qsmask is likely 0 so then rcu_implicit_dynticks_qs (called from
> > > > > > force_qs_rnp) wouldn't be called and so there would no cache misses on
> > > > > > rcu_data right?
> > > > > 
> > > > > Yes, but assuming that all CPUs report their quiescent states before
> > > > > the first call to rcu_gp_fqs().  One exception is when some CPU is
> > > > > looping in the kernel for many milliseconds without passing through a
> > > > > quiescent state.  This is because for recent kernels, cond_resched()
> > > > > is not a quiescent state until the grace period is something like 100
> > > > > milliseconds old.  (For older kernels, cond_resched() was never an RCU
> > > > > quiescent state unless it actually scheduled.)
> > > > > 
> > > > > Why wait 100 milliseconds?  Because otherwise the increase in
> > > > > cond_resched() overhead shows up all too well, causing 0day test robot
> > > > > to complain bitterly.  Besides, I would expect that in the common case,
> > > > > CPUs would be executing usermode code.
> > > > 
> > > > Makes sense. I was also wondering about this other thing you mentioned about
> > > > waiting for 3 jiffies before reporting the idle CPU's quiescent state. Does
> > > > that mean that even if a single CPU is dyntick-idle for a long period of
> > > > time, then the minimum grace period duration would be atleast 3 jiffies? In
> > > > our mobile embedded devices, jiffies is set to 3.33ms (HZ=300) to keep power
> > > > consumption low. Not that I'm saying its an issue or anything (since IIUC if
> > > > someone wants shorter grace periods, they should just use expedited GPs), but
> > > > it sounds like it would be shorter GP if we just set the qsmask early on some
> > > > how and we can manage the overhead of doing so.
> > > 
> > > First, there is some autotuning of the delay based on HZ:
> > > 
> > > #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
> > > 
> > > So at HZ=300, you should be seeing a two-jiffy delay rather than the
> > > usual HZ=1000 three-jiffy delay.  Of course, this means that the delay
> > > is 6.67ms rather than the usual 3ms, but the theory is that lower HZ
> > > rates often mean slower instruction execution and thus a desire for
> > > lower RCU overhead.  There is further autotuning based on number of
> > > CPUs, but this does not kick in until you have 256 CPUs on your system,
> > > and I bet that smartphones aren't there yet.  Nevertheless, check out
> > > RCU_JIFFIES_FQS_DIV for more info on this.
> > > 
> > > But you can always override this autotuning using the following kernel
> > > boot paramters:
> > > 
> > > rcutree.jiffies_till_first_fqs
> > > rcutree.jiffies_till_next_fqs
> > 
> > Slightly related, I was just going through your patch in the dev branch "doc:
> > Now jiffies_till_sched_qs solicits from cond_resched()".
> > 
> > If I understand correctly, what you're trying to do is set
> > rcu_data.rcu_urgent_qs if you've not heard from the CPU long enough from
> > rcu_implicit_dynticks_qs.
> > 
> > Then in the other paths, you are reading this value and similuating a dyntick
> > idle transition even though you may not be really going into dyntick-idle.
> > Actually in the scheduler-tick, you are also using it to set NEED_RESCHED
> > appropriately.
> > 
> > Did I get it right so far?
> 
> Partially.
> 
> The simulated dyntick-idle transition happens if the grace period extends
> for even longer, so that ->rcu_need_heavy_qs gets set.  Up to that point,
> all that is asked for is a local-to-the-CPU report of a quiescent state.
> 
> > I was thinking if we could simplify rcu_note_context_switch (the parts that
> > call rcu_momentary_dyntick_idle), if we did the following in
> > rcu_implicit_dynticks_qs.
> > 
> > Since we already call rcu_qs in rcu_note_context_switch, that would clear the
> > rdp->cpu_no_qs flag. Then there should be no need to call
> > rcu_momentary_dyntick_idle from rcu_note_context switch.
> 
> But does this also work for the rcu_all_qs() code path?
> 
> > I think this would simplify cond_resched as well.  Could this avoid the need
> > for having an rcu_all_qs at all? Hopefully I didn't some Tasks-RCU corner cases..
> 
> There is also the code path from cond_resched() in PREEMPT=n kernels.
> This needs rcu_all_qs().  Though it is quite possible that some additional
> code collapsing is possible.
> 
> > Basically for some background, I was thinking can we simplify the code that
> > calls "rcu_momentary_dyntick_idle" since we already register a qs in other
> > ways (like by resetting cpu_no_qs).
> 
> One complication is that rcu_all_qs() is invoked with interrupts
> and preemption enabled, while rcu_note_context_switch() is
> invoked with interrupts disabled.  Also, as you say, Tasks RCU.
> Plus rcu_all_qs() wants to exit immediately if there is nothing to
> do, while rcu_note_context_switch() must unconditionally do rcu_qs()
> -- yes, it could check, but that would be redundant with the checks
> within rcu_qs().  The one function traces and the other one doesn't,
> but it would be OK if both traced.  (I hope, anyway:  The cond_resched()
> performance requirements are surprisingly severe.)  Aside from that,
> the two functions are quite similar.

Plus there are two sets of rcu_qs() and rcu_note_context_switch(),
one for PREEMPT=y and the other for PREEMPT=n.  And cond_resched()
is nothingness for PREEMPT=y.  And currently rcu_implicit_dynticks_qs()
needs to work with both sets.
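
For reference, a rough sketch of why the PREEMPT=n case is the one that
involves rcu_all_qs() -- this is paraphrased from memory, so check the
actual kernel/sched/core.c rather than trusting it:

#ifndef CONFIG_PREEMPT
int __sched _cond_resched(void)
{
	if (should_resched(0)) {
		/* We are scheduling, so rcu_note_context_switch() runs. */
		preempt_schedule_common();
		return 1;
	}
	/* No reschedule needed, but still note a quiescent state. */
	rcu_all_qs();
	return 0;
}
#endif

For PREEMPT=y, _cond_resched() is instead a static inline that just
returns 0, hence the "nothingness" above.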

							Thanx, Paul

> It would of course be possible to create a common helper function that
> rcu_all_qs() and rcu_note_context_switch() both became simple wrappers
> for, but it is not clear that this would actually be shorter or simpler.
> 
> > I should probably start drawing some pictures to make sense of everything,
> > but do let me know if I have a point ;-) Thanks for your time.
> 
> This stuff is admittedly a bit fiddly.  Again, it took some serious
> work to avoid cond_resched() performance regressions.
> 
> > - Joel
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index c818e0c91a81..5aa0259c014d 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -1063,7 +1063,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
> >  	 * read-side critical section that started before the beginning
> >  	 * of the current RCU grace period.
> >  	 */
> > -	if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap)) {
> > +	if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap) || !rdp->cpu_no_qs.b.norm) {
> 
> If I am not too confused, this change could cause trouble for
> nohz_full CPUs looping in the kernel.  Such CPUs don't necessarily take
> scheduler-clock interrupts, last I checked, and this could prevent the
> CPU from reporting its quiescent state to core RCU.
> 
> Or am I missing something here?
> 
> 							Thanx, Paul
> 
> >  		trace_rcu_fqs(rcu_state.name, rdp->gp_seq, rdp->cpu, TPS("dti"));
> >  		rcu_gpnum_ovf(rnp, rdp);
> >  		return 1;
> > 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: dyntick-idle CPU and node's qsmask
  2018-11-20 22:28             ` Paul E. McKenney
  2018-11-20 22:34               ` Paul E. McKenney
@ 2018-11-21  2:06               ` Joel Fernandes
  2018-11-21  2:41                 ` Paul E. McKenney
  1 sibling, 1 reply; 14+ messages in thread
From: Joel Fernandes @ 2018-11-21  2:06 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, josh, rostedt, mathieu.desnoyers, jiangshanlai

On Tue, Nov 20, 2018 at 02:28:14PM -0800, Paul E. McKenney wrote:
> On Tue, Nov 20, 2018 at 12:42:43PM -0800, Joel Fernandes wrote:
> > On Sun, Nov 11, 2018 at 10:36:18AM -0800, Paul E. McKenney wrote:
> > > On Sun, Nov 11, 2018 at 10:09:16AM -0800, Joel Fernandes wrote:
> > > > On Sat, Nov 10, 2018 at 08:22:10PM -0800, Paul E. McKenney wrote:
> > > > > On Sat, Nov 10, 2018 at 07:09:25PM -0800, Joel Fernandes wrote:
> > > > > > On Sat, Nov 10, 2018 at 03:04:36PM -0800, Paul E. McKenney wrote:
> > > > > > > On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> > > > > > > > Hi Paul and everyone,
> > > > > > > > 
> > > > > > > > I was tracing/studying the RCU code today in paul/dev branch and noticed that
> > > > > > > > for dyntick-idle CPUs, the RCU GP thread is clearing the rnp->qsmask
> > > > > > > > corresponding to the leaf node for the idle CPU, and reporting a QS on their
> > > > > > > > behalf.
> > > > > > > > 
> > > > > > > > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 792 0 dti
> > > > > > > > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 801 2 dti
> > > > > > > > rcu_sched-10    [003]    40.008041: rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0
> > > > > > > > 
> > > > > > > > That's all good but I was wondering if we can do better for the idle CPUs if
> > > > > > > > we can some how not set the qsmask of the node in the first place. Then no
> > > > > > > > reporting would be needed of quiescent state is needed for idle CPUs right?
> > > > > > > > And we would also not need to acquire the rnp lock I think.
> > > > > > > > 
> > > > > > > > At least for a single node tree RCU system, it seems that would avoid needing
> > > > > > > > to acquire the lock without complications. Anyway let me know your thoughts
> > > > > > > > and happy to discuss this at the hallways of the LPC as well for folks
> > > > > > > > attending :)
> > > > > > > 
> > > > > > > We could, but that would require consulting the rcu_data structure for
> > > > > > > each CPU while initializing the grace period, thus increasing the number
> > > > > > > of cache misses during grace-period initialization and also shortly after
> > > > > > > for any non-idle CPUs.  This seems backwards on busy systems where each
> > > > > > 
> > > > > > When I traced, it appears to me that rcu_data structure of a remote CPU was
> > > > > > being consulted anyway by the rcu_sched thread. So it seems like such cache
> > > > > > miss would happen anyway whether it is during grace-period initialization or
> > > > > > during the fqs stage? I guess I'm trying to say, the consultation of remote
> > > > > > CPU's rcu_data happens anyway.
> > > > > 
> > > > > Hmmm...
> > > > > 
> > > > > The rcu_gp_init() function does access an rcu_data structure, but it is
> > > > > that of the current CPU, so shouldn't involve a communications cache miss,
> > > > > at least not in the common case.
> > > > > 
> > > > > Or are you seeing these cross-CPU rcu_data accesses in rcu_gp_fqs() or
> > > > > functions that it calls?  In that case, please see below.
> > > > 
> > > > Yes, it was rcu_implicit_dynticks_qs called from rcu_gp_fqs.
> > > > 
> > > > > > > CPU will with high probability report its own quiescent state before three
> > > > > > > jiffies pass, in which case the cache misses on the rcu_data structures
> > > > > > > would be wasted motion.
> > > > > > 
> > > > > > If all the CPUs are busy and reporting their QS themselves, then I think the
> > > > > > qsmask is likely 0 so then rcu_implicit_dynticks_qs (called from
> > > > > > force_qs_rnp) wouldn't be called and so there would no cache misses on
> > > > > > rcu_data right?
> > > > > 
> > > > > Yes, but assuming that all CPUs report their quiescent states before
> > > > > the first call to rcu_gp_fqs().  One exception is when some CPU is
> > > > > looping in the kernel for many milliseconds without passing through a
> > > > > quiescent state.  This is because for recent kernels, cond_resched()
> > > > > is not a quiescent state until the grace period is something like 100
> > > > > milliseconds old.  (For older kernels, cond_resched() was never an RCU
> > > > > quiescent state unless it actually scheduled.)
> > > > > 
> > > > > Why wait 100 milliseconds?  Because otherwise the increase in
> > > > > cond_resched() overhead shows up all too well, causing 0day test robot
> > > > > to complain bitterly.  Besides, I would expect that in the common case,
> > > > > CPUs would be executing usermode code.
> > > > 
> > > > Makes sense. I was also wondering about this other thing you mentioned about
> > > > waiting for 3 jiffies before reporting the idle CPU's quiescent state. Does
> > > > that mean that even if a single CPU is dyntick-idle for a long period of
> > > > time, then the minimum grace period duration would be atleast 3 jiffies? In
> > > > our mobile embedded devices, jiffies is set to 3.33ms (HZ=300) to keep power
> > > > consumption low. Not that I'm saying its an issue or anything (since IIUC if
> > > > someone wants shorter grace periods, they should just use expedited GPs), but
> > > > it sounds like it would be shorter GP if we just set the qsmask early on some
> > > > how and we can manage the overhead of doing so.
> > > 
> > > First, there is some autotuning of the delay based on HZ:
> > > 
> > > #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
> > > 
> > > So at HZ=300, you should be seeing a two-jiffy delay rather than the
> > > usual HZ=1000 three-jiffy delay.  Of course, this means that the delay
> > > is 6.67ms rather than the usual 3ms, but the theory is that lower HZ
> > > rates often mean slower instruction execution and thus a desire for
> > > lower RCU overhead.  There is further autotuning based on number of
> > > CPUs, but this does not kick in until you have 256 CPUs on your system,
> > > and I bet that smartphones aren't there yet.  Nevertheless, check out
> > > RCU_JIFFIES_FQS_DIV for more info on this.
> > > 
> > > But you can always override this autotuning using the following kernel
> > > boot paramters:
> > > 
> > > rcutree.jiffies_till_first_fqs
> > > rcutree.jiffies_till_next_fqs
> > 
> > Slightly related, I was just going through your patch in the dev branch "doc:
> > Now jiffies_till_sched_qs solicits from cond_resched()".
> > 
> > If I understand correctly, what you're trying to do is set
> > rcu_data.rcu_urgent_qs if you've not heard from the CPU long enough from
> > rcu_implicit_dynticks_qs.
> > 
> > Then in the other paths, you are reading this value and similuating a dyntick
> > idle transition even though you may not be really going into dyntick-idle.
> > Actually in the scheduler-tick, you are also using it to set NEED_RESCHED
> > appropriately.
> > 
> > Did I get it right so far?
> 
> Partially.
> 
> The simulated dyntick-idle transition happens if the grace period extends
> for even longer, so that ->rcu_need_heavy_qs gets set.  Up to that point,
> all that is asked for is a local-to-the-CPU report of a quiescent state.

Right, that's true. My feeling is that the whole "fake a dyntick-idle
transition" seems to me a bit of a hack, and I was thinking that not
depending on it would simplify the code so we don't need
rcu_momentary_dyntick_idle().
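
Just so we are talking about the same thing, my rough mental model of that
fake transition is the below.  This is a paraphrase rather than the actual
tree.c code, and I have not double-checked whether ->dynticks and
rcu_need_heavy_qs live in rcu_dynticks or in the consolidated rcu_data at
this point of the dev branch:

static void rcu_momentary_dyntick_idle(void)
{
	int special;

	raw_cpu_write(rcu_data.rcu_need_heavy_qs, false);
	/*
	 * Advance this CPU's ->dynticks by a full EQS's worth so that a
	 * later rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap) check
	 * by the GP kthread concludes the CPU passed through an extended
	 * quiescent state, even though it never actually went idle.
	 */
	special = atomic_add_return(2 * RCU_DYNTICK_CTRL_CTR,
				    &this_cpu_ptr(&rcu_data)->dynticks);
	WARN_ON_ONCE(!(special & RCU_DYNTICK_CTRL_CTR));
}

It is that counter trick that feels hack-ish to me.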

> > I was thinking if we could simplify rcu_note_context_switch (the parts that
> > call rcu_momentary_dyntick_idle), if we did the following in
> > rcu_implicit_dynticks_qs.
> > 
> > Since we already call rcu_qs in rcu_note_context_switch, that would clear the
> > rdp->cpu_no_qs flag. Then there should be no need to call
> > rcu_momentary_dyntick_idle from rcu_note_context switch.
> 
> But does this also work for the rcu_all_qs() code path?

Could we not do something like this in rcu_all_qs? as some over-simplified
pseudo code:

rcu_all_qs() {
  if (!urgent_qs || !heavy_qs)
     return;

  rcu_qs();   // This clears the rdp->cpu_no_qs flags which we can monitor in
              //  the diff in my last email (from rcu_implicit_dynticks_qs)
}

> > I think this would simplify cond_resched as well.  Could this avoid the need
> > for having an rcu_all_qs at all? Hopefully I didn't some Tasks-RCU corner cases..
> 
> There is also the code path from cond_resched() in PREEMPT=n kernels.
> This needs rcu_all_qs().  Though it is quite possible that some additional
> code collapsing is possible.
> 
> > Basically for some background, I was thinking can we simplify the code that
> > calls "rcu_momentary_dyntick_idle" since we already register a qs in other
> > ways (like by resetting cpu_no_qs).
> 
> One complication is that rcu_all_qs() is invoked with interrupts
> and preemption enabled, while rcu_note_context_switch() is
> invoked with interrupts disabled.  Also, as you say, Tasks RCU.
> Plus rcu_all_qs() wants to exit immediately if there is nothing to
> do, while rcu_note_context_switch() must unconditionally do rcu_qs()
> -- yes, it could check, but that would be redundant with the checks

This immediate exit is taken care of in the above pseudo code, would that
help the cond_resched performance?

> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index c818e0c91a81..5aa0259c014d 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -1063,7 +1063,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
> >  	 * read-side critical section that started before the beginning
> >  	 * of the current RCU grace period.
> >  	 */
> > -	if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap)) {
> > +	if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap) || !rdp->cpu_no_qs.b.norm) {
> 
> If I am not too confused, this change could cause trouble for
> nohz_full CPUs looping in the kernel.  Such CPUs don't necessarily take
> scheduler-clock interrupts, last I checked, and this could prevent the
> CPU from reporting its quiescent state to core RCU.

Would that still be a problem if rcu_all_qs called rcu_qs? Also the above
diff is an OR condition so it is more relaxed than before.

Assuming the NOHZ_FULL CPUs call cond_resched during their looping, that
would trigger the rcu_all_qs -> rcu_qs path which would clear cpu_no_qs flag
for that CPU right? This would result in the above diff causing a return of 1
for that CPU (from rcu_implicit_dynticks_qs).
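
To spell out the flow I have in mind (just a sketch, assuming PREEMPT=n,
the one-line diff above, and rcu_all_qs() calling rcu_qs() as in my
pseudo code earlier):

/* On the looping nohz_full CPU: */
cond_resched()
  -> rcu_all_qs()
       -> rcu_qs()                     /* clears rdp->cpu_no_qs.b.norm locally */

/* Later, on the grace-period kthread: */
rcu_gp_fqs()
  -> force_qs_rnp(rcu_implicit_dynticks_qs)
       -> rcu_implicit_dynticks_qs()   /* returns 1 since !rdp->cpu_no_qs.b.norm */
       -> rcu_report_qs_rnp()          /* clears that CPU's bit in rnp->qsmask */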

> Or am I missing something here?

I think I might be the one missing something but I'm glad we are almost on
the same page ;-)

thanks,

 -  Joel


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: dyntick-idle CPU and node's qsmask
  2018-11-21  2:06               ` Joel Fernandes
@ 2018-11-21  2:41                 ` Paul E. McKenney
  2018-11-21  4:37                   ` Joel Fernandes
  0 siblings, 1 reply; 14+ messages in thread
From: Paul E. McKenney @ 2018-11-21  2:41 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, josh, rostedt, mathieu.desnoyers, jiangshanlai

On Tue, Nov 20, 2018 at 06:06:12PM -0800, Joel Fernandes wrote:
> On Tue, Nov 20, 2018 at 02:28:14PM -0800, Paul E. McKenney wrote:
> > On Tue, Nov 20, 2018 at 12:42:43PM -0800, Joel Fernandes wrote:
> > > On Sun, Nov 11, 2018 at 10:36:18AM -0800, Paul E. McKenney wrote:
> > > > On Sun, Nov 11, 2018 at 10:09:16AM -0800, Joel Fernandes wrote:
> > > > > On Sat, Nov 10, 2018 at 08:22:10PM -0800, Paul E. McKenney wrote:
> > > > > > On Sat, Nov 10, 2018 at 07:09:25PM -0800, Joel Fernandes wrote:
> > > > > > > On Sat, Nov 10, 2018 at 03:04:36PM -0800, Paul E. McKenney wrote:
> > > > > > > > On Sat, Nov 10, 2018 at 01:46:59PM -0800, Joel Fernandes wrote:
> > > > > > > > > Hi Paul and everyone,
> > > > > > > > > 
> > > > > > > > > I was tracing/studying the RCU code today in paul/dev branch and noticed that
> > > > > > > > > for dyntick-idle CPUs, the RCU GP thread is clearing the rnp->qsmask
> > > > > > > > > corresponding to the leaf node for the idle CPU, and reporting a QS on their
> > > > > > > > > behalf.
> > > > > > > > > 
> > > > > > > > > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 792 0 dti
> > > > > > > > > rcu_sched-10    [003]    40.008039: rcu_fqs:              rcu_sched 801 2 dti
> > > > > > > > > rcu_sched-10    [003]    40.008041: rcu_quiescent_state_report: rcu_sched 805 5>0 0 0 3 0
> > > > > > > > > 
> > > > > > > > > That's all good but I was wondering if we can do better for the idle CPUs if
> > > > > > > > > we can some how not set the qsmask of the node in the first place. Then no
> > > > > > > > > reporting would be needed of quiescent state is needed for idle CPUs right?
> > > > > > > > > And we would also not need to acquire the rnp lock I think.
> > > > > > > > > 
> > > > > > > > > At least for a single node tree RCU system, it seems that would avoid needing
> > > > > > > > > to acquire the lock without complications. Anyway let me know your thoughts
> > > > > > > > > and happy to discuss this at the hallways of the LPC as well for folks
> > > > > > > > > attending :)
> > > > > > > > 
> > > > > > > > We could, but that would require consulting the rcu_data structure for
> > > > > > > > each CPU while initializing the grace period, thus increasing the number
> > > > > > > > of cache misses during grace-period initialization and also shortly after
> > > > > > > > for any non-idle CPUs.  This seems backwards on busy systems where each
> > > > > > > 
> > > > > > > When I traced, it appears to me that rcu_data structure of a remote CPU was
> > > > > > > being consulted anyway by the rcu_sched thread. So it seems like such cache
> > > > > > > miss would happen anyway whether it is during grace-period initialization or
> > > > > > > during the fqs stage? I guess I'm trying to say, the consultation of remote
> > > > > > > CPU's rcu_data happens anyway.
> > > > > > 
> > > > > > Hmmm...
> > > > > > 
> > > > > > The rcu_gp_init() function does access an rcu_data structure, but it is
> > > > > > that of the current CPU, so shouldn't involve a communications cache miss,
> > > > > > at least not in the common case.
> > > > > > 
> > > > > > Or are you seeing these cross-CPU rcu_data accesses in rcu_gp_fqs() or
> > > > > > functions that it calls?  In that case, please see below.
> > > > > 
> > > > > Yes, it was rcu_implicit_dynticks_qs called from rcu_gp_fqs.
> > > > > 
> > > > > > > > CPU will with high probability report its own quiescent state before three
> > > > > > > > jiffies pass, in which case the cache misses on the rcu_data structures
> > > > > > > > would be wasted motion.
> > > > > > > 
> > > > > > > If all the CPUs are busy and reporting their QS themselves, then I think the
> > > > > > > qsmask is likely 0 so then rcu_implicit_dynticks_qs (called from
> > > > > > > force_qs_rnp) wouldn't be called and so there would no cache misses on
> > > > > > > rcu_data right?
> > > > > > 
> > > > > > Yes, but assuming that all CPUs report their quiescent states before
> > > > > > the first call to rcu_gp_fqs().  One exception is when some CPU is
> > > > > > looping in the kernel for many milliseconds without passing through a
> > > > > > quiescent state.  This is because for recent kernels, cond_resched()
> > > > > > is not a quiescent state until the grace period is something like 100
> > > > > > milliseconds old.  (For older kernels, cond_resched() was never an RCU
> > > > > > quiescent state unless it actually scheduled.)
> > > > > > 
> > > > > > Why wait 100 milliseconds?  Because otherwise the increase in
> > > > > > cond_resched() overhead shows up all too well, causing 0day test robot
> > > > > > to complain bitterly.  Besides, I would expect that in the common case,
> > > > > > CPUs would be executing usermode code.
> > > > > 
> > > > > Makes sense. I was also wondering about this other thing you mentioned about
> > > > > waiting for 3 jiffies before reporting the idle CPU's quiescent state. Does
> > > > > that mean that even if a single CPU is dyntick-idle for a long period of
> > > > > time, then the minimum grace period duration would be atleast 3 jiffies? In
> > > > > our mobile embedded devices, jiffies is set to 3.33ms (HZ=300) to keep power
> > > > > consumption low. Not that I'm saying its an issue or anything (since IIUC if
> > > > > someone wants shorter grace periods, they should just use expedited GPs), but
> > > > > it sounds like it would be shorter GP if we just set the qsmask early on some
> > > > > how and we can manage the overhead of doing so.
> > > > 
> > > > First, there is some autotuning of the delay based on HZ:
> > > > 
> > > > #define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
> > > > 
> > > > So at HZ=300, you should be seeing a two-jiffy delay rather than the
> > > > usual HZ=1000 three-jiffy delay.  Of course, this means that the delay
> > > > is 6.67ms rather than the usual 3ms, but the theory is that lower HZ
> > > > rates often mean slower instruction execution and thus a desire for
> > > > lower RCU overhead.  There is further autotuning based on number of
> > > > CPUs, but this does not kick in until you have 256 CPUs on your system,
> > > > and I bet that smartphones aren't there yet.  Nevertheless, check out
> > > > RCU_JIFFIES_FQS_DIV for more info on this.
> > > > 
> > > > But you can always override this autotuning using the following kernel
> > > > boot paramters:
> > > > 
> > > > rcutree.jiffies_till_first_fqs
> > > > rcutree.jiffies_till_next_fqs
> > > 
> > > Slightly related, I was just going through your patch in the dev branch "doc:
> > > Now jiffies_till_sched_qs solicits from cond_resched()".
> > > 
> > > If I understand correctly, what you're trying to do is set
> > > rcu_data.rcu_urgent_qs if you've not heard from the CPU long enough from
> > > rcu_implicit_dynticks_qs.
> > > 
> > > Then in the other paths, you are reading this value and similuating a dyntick
> > > idle transition even though you may not be really going into dyntick-idle.
> > > Actually in the scheduler-tick, you are also using it to set NEED_RESCHED
> > > appropriately.
> > > 
> > > Did I get it right so far?
> > 
> > Partially.
> > 
> > The simulated dyntick-idle transition happens if the grace period extends
> > for even longer, so that ->rcu_need_heavy_qs gets set.  Up to that point,
> > all that is asked for is a local-to-the-CPU report of a quiescent state.
> 
> Right, that's true. My feeling is that the whole "fake a dyntick-idle
> transition" seems to me a bit of a hack, and I was thinking that not
> depending on it would simplify the code so we don't need
> rcu_momentary_dyntick_idle().

It is there in case scheduling-clock interrupts are for whatever reason
not happening on that CPU.

> > > I was thinking if we could simplify rcu_note_context_switch (the parts that
> > > call rcu_momentary_dyntick_idle), if we did the following in
> > > rcu_implicit_dynticks_qs.
> > > 
> > > Since we already call rcu_qs in rcu_note_context_switch, that would clear the
> > > rdp->cpu_no_qs flag. Then there should be no need to call
> > > rcu_momentary_dyntick_idle from rcu_note_context switch.
> > 
> > But does this also work for the rcu_all_qs() code path?
> 
> Could we not do something like this in rcu_all_qs? as some over-simplified
> pseudo code:
> 
> rcu_all_qs() {
>   if (!urgent_qs || !heavy_qs)
>      return;
> 
>   rcu_qs();   // This clears the rdp->cpu_no_qs flags which we can monitor in
>               //  the diff in my last email (from rcu_implicit_dynticks_qs)
> }

Except that rcu_qs() doesn't necessarily report the quiescent state to
the RCU core.  Keeping down context-switch overhead and all that.
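
To make that distinction concrete, a rough sketch (not the exact source):

/*
 * rcu_qs() only *records* the quiescent state, roughly:
 *
 *	__this_cpu_write(rcu_data.cpu_no_qs.b.norm, false);
 *
 * *Reporting* it to the RCU core is a separate step, normally driven
 * from the scheduling-clock/softirq path and ending up in
 * rcu_report_qs_rdp(), which acquires the leaf rcu_node ->lock and
 * calls rcu_report_qs_rnp() to clear this CPU's bit in ->qsmask,
 * propagating up the tree if that was the leaf's last bit.
 */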

> > > I think this would simplify cond_resched as well.  Could this avoid the need
> > > for having an rcu_all_qs at all? Hopefully I didn't some Tasks-RCU corner cases..
> > 
> > There is also the code path from cond_resched() in PREEMPT=n kernels.
> > This needs rcu_all_qs().  Though it is quite possible that some additional
> > code collapsing is possible.
> > 
> > > Basically for some background, I was thinking can we simplify the code that
> > > calls "rcu_momentary_dyntick_idle" since we already register a qs in other
> > > ways (like by resetting cpu_no_qs).
> > 
> > One complication is that rcu_all_qs() is invoked with interrupts
> > and preemption enabled, while rcu_note_context_switch() is
> > invoked with interrupts disabled.  Also, as you say, Tasks RCU.
> > Plus rcu_all_qs() wants to exit immediately if there is nothing to
> > do, while rcu_note_context_switch() must unconditionally do rcu_qs()
> > -- yes, it could check, but that would be redundant with the checks
> 
> This immediate exit is taken care of in the above pseudo code, would that
> help the cond_resched performance?

It looks like you are cautiously edging towards the two wrapper functions
calling common code, relying on inlining and simplification.  Why not just
try doing it?  ;-)

> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index c818e0c91a81..5aa0259c014d 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -1063,7 +1063,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
> > >  	 * read-side critical section that started before the beginning
> > >  	 * of the current RCU grace period.
> > >  	 */
> > > -	if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap)) {
> > > +	if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap) || !rdp->cpu_no_qs.b.norm) {
> > 
> > If I am not too confused, this change could cause trouble for
> > nohz_full CPUs looping in the kernel.  Such CPUs don't necessarily take
> > scheduler-clock interrupts, last I checked, and this could prevent the
> > CPU from reporting its quiescent state to core RCU.
> 
> Would that still be a problem if rcu_all_qs called rcu_qs? Also the above
> diff is an OR condition so it is more relaxed than before.

Yes, because rcu_qs() is only guaranteed to capture the quiescent
state on the current CPU, not necessarily report it to the RCU core.

> Assuming the NOHZ_FULL CPUs call cond_resched during their looping, that
> would trigger the rcu_all_qs -> rcu_qs path which would clear cpu_no_qs flag
> for that CPU right? This would result in the above diff causing a return of 1
> for that CPU (from rcu_implicit_dynticks_qs).

One problem is that cond_resched() is a complete no-op in PREEMPT=y
kernels.

> > Or am I missing something here?
> 
> I think I might be the one missing something but I'm glad we are almost on
> the same page ;-)

There are a lot of moving parts in this area, to be sure.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: dyntick-idle CPU and node's qsmask
  2018-11-21  2:41                 ` Paul E. McKenney
@ 2018-11-21  4:37                   ` Joel Fernandes
  2018-11-21 14:39                     ` Paul E. McKenney
  0 siblings, 1 reply; 14+ messages in thread
From: Joel Fernandes @ 2018-11-21  4:37 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: linux-kernel, josh, rostedt, mathieu.desnoyers, jiangshanlai

On Tue, Nov 20, 2018 at 06:41:07PM -0800, Paul E. McKenney wrote:
[...] 
> > > > I was thinking if we could simplify rcu_note_context_switch (the parts that
> > > > call rcu_momentary_dyntick_idle), if we did the following in
> > > > rcu_implicit_dynticks_qs.
> > > > 
> > > > Since we already call rcu_qs in rcu_note_context_switch, that would clear the
> > > > rdp->cpu_no_qs flag. Then there should be no need to call
> > > > rcu_momentary_dyntick_idle from rcu_note_context switch.
> > > 
> > > But does this also work for the rcu_all_qs() code path?
> > 
> > Could we not do something like this in rcu_all_qs? as some over-simplified
> > pseudo code:
> > 
> > rcu_all_qs() {
> >   if (!urgent_qs || !heavy_qs)
> >      return;
> > 
> >   rcu_qs();   // This clears the rdp->cpu_no_qs flags which we can monitor in
> >               //  the diff in my last email (from rcu_implicit_dynticks_qs)
> > }
> 
> Except that rcu_qs() doesn't necessarily report the quiescent state to
> the RCU core.  Keeping down context-switch overhead and all that.

Sure yeah, but I think the QS will be reported indirectly anyway by the
force_qs_rnp() path if we detect that rcu_qs() happened on the CPU?

> > > > I think this would simplify cond_resched as well.  Could this avoid the need
> > > > for having an rcu_all_qs at all? Hopefully I didn't some Tasks-RCU corner cases..
> > > 
> > > There is also the code path from cond_resched() in PREEMPT=n kernels.
> > > This needs rcu_all_qs().  Though it is quite possible that some additional
> > > code collapsing is possible.
> > > 
> > > > Basically for some background, I was thinking can we simplify the code that
> > > > calls "rcu_momentary_dyntick_idle" since we already register a qs in other
> > > > ways (like by resetting cpu_no_qs).
> > > 
> > > One complication is that rcu_all_qs() is invoked with interrupts
> > > and preemption enabled, while rcu_note_context_switch() is
> > > invoked with interrupts disabled.  Also, as you say, Tasks RCU.
> > > Plus rcu_all_qs() wants to exit immediately if there is nothing to
> > > do, while rcu_note_context_switch() must unconditionally do rcu_qs()
> > > -- yes, it could check, but that would be redundant with the checks
> > 
> > This immediate exit is taken care of in the above pseudo code, would that
> > help the cond_resched performance?
> 
> It looks like you are cautiously edging towards the two wrapper functions
> calling common code, relying on inlining and simplification.  Why not just
> try doing it?  ;-)

Sure yeah. I was more thinking of the ambitious goal of getting rid of the
complexity and exploring the general design idea, than containing/managing
the complexity by reducing code duplication. :D

> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index c818e0c91a81..5aa0259c014d 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -1063,7 +1063,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
> > > >  	 * read-side critical section that started before the beginning
> > > >  	 * of the current RCU grace period.
> > > >  	 */
> > > > -	if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap)) {
> > > > +	if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap) || !rdp->cpu_no_qs.b.norm) {
> > > 
> > > If I am not too confused, this change could cause trouble for
> > > nohz_full CPUs looping in the kernel.  Such CPUs don't necessarily take
> > > scheduler-clock interrupts, last I checked, and this could prevent the
> > > CPU from reporting its quiescent state to core RCU.
> > 
> > Would that still be a problem if rcu_all_qs called rcu_qs? Also the above
> > diff is an OR condition so it is more relaxed than before.
> 
> Yes, because rcu_qs() is only guaranteed to capture the quiescent
> state on the current CPU, not necessarily report it to the RCU core.

The reporting to the core is necessary to call rcu_report_qs_rnp so that the
QS information is propagated up the tree, right?

Wouldn't that reporting be done anyway by:

force_qs_rnp
  -> rcu_implicit_dynticks_qs  (which returns 1 because rdp->cpu_no_qs.b.norm
				was cleared by rcu_qs() and we detect that
				with help of above diff)

  -> rcu_report_qs_rnp is called with mask bit set for corresponding CPU that
  				has the !rdp->cpu_no_qs.b.norm


I think that's what I am missing - why the above scheme wouldn't work.
The only difference is that reporting to the RCU core might invoke pending
callbacks, but I'm not sure if that matters for this. I'll make these changes,
and try tracing it out and study it more.  Thanks for the patience,

 - Joel
 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: dyntick-idle CPU and node's qsmask
  2018-11-21  4:37                   ` Joel Fernandes
@ 2018-11-21 14:39                     ` Paul E. McKenney
  0 siblings, 0 replies; 14+ messages in thread
From: Paul E. McKenney @ 2018-11-21 14:39 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: linux-kernel, josh, rostedt, mathieu.desnoyers, jiangshanlai

On Tue, Nov 20, 2018 at 08:37:22PM -0800, Joel Fernandes wrote:
> On Tue, Nov 20, 2018 at 06:41:07PM -0800, Paul E. McKenney wrote:
> [...] 
> > > > > I was thinking if we could simplify rcu_note_context_switch (the parts that
> > > > > call rcu_momentary_dyntick_idle), if we did the following in
> > > > > rcu_implicit_dynticks_qs.
> > > > > 
> > > > > Since we already call rcu_qs in rcu_note_context_switch, that would clear the
> > > > > rdp->cpu_no_qs flag. Then there should be no need to call
> > > > > rcu_momentary_dyntick_idle from rcu_note_context switch.
> > > > 
> > > > But does this also work for the rcu_all_qs() code path?
> > > 
> > > Could we not do something like this in rcu_all_qs? as some over-simplified
> > > pseudo code:
> > > 
> > > rcu_all_qs() {
> > >   if (!urgent_qs || !heavy_qs)
> > >      return;
> > > 
> > >   rcu_qs();   // This clears the rdp->cpu_no_qs flags which we can monitor in
> > >               //  the diff in my last email (from rcu_implicit_dynticks_qs)
> > > }
> > 
> > Except that rcu_qs() doesn't necessarily report the quiescent state to
> > the RCU core.  Keeping down context-switch overhead and all that.
> 
> Sure yeah, but I think the QS will be reported indirectly anyway by the
> force_qs_rnp() path if we detect that rcu_qs() happened on the CPU?

The force_qs_rnp() path won't see anything that has not already been
reported to the RCU core.
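
As a rough paraphrase of what force_qs_rnp() consults (a sketch, not the
exact dev-branch code, so take the signatures with a grain of salt):

	rcu_for_each_leaf_node(rnp) {
		raw_spin_lock_irqsave_rcu_node(rnp, flags);
		mask = 0;
		for_each_leaf_node_possible_cpu(rnp, cpu) {
			if (!(rnp->qsmask & leaf_node_cpu_bit(rnp, cpu)))
				continue;	/* Already reported. */
			if (f(per_cpu_ptr(&rcu_data, cpu)))
				mask |= leaf_node_cpu_bit(rnp, cpu);
		}
		if (mask)
			rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
		else
			raw_spin_unlock_irqrestore_rcu_node(rnp, flags);
	}

where f() is rcu_implicit_dynticks_qs().  The per-CPU state consulted
there is the ->dynticks snapshot comparison, not the cpu_no_qs flags
that rcu_qs() updates locally.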

> > > > > I think this would simplify cond_resched as well.  Could this avoid the need
> > > > > for having an rcu_all_qs at all? Hopefully I didn't some Tasks-RCU corner cases..
> > > > 
> > > > There is also the code path from cond_resched() in PREEMPT=n kernels.
> > > > This needs rcu_all_qs().  Though it is quite possible that some additional
> > > > code collapsing is possible.
> > > > 
> > > > > Basically for some background, I was thinking can we simplify the code that
> > > > > calls "rcu_momentary_dyntick_idle" since we already register a qs in other
> > > > > ways (like by resetting cpu_no_qs).
> > > > 
> > > > One complication is that rcu_all_qs() is invoked with interrupts
> > > > and preemption enabled, while rcu_note_context_switch() is
> > > > invoked with interrupts disabled.  Also, as you say, Tasks RCU.
> > > > Plus rcu_all_qs() wants to exit immediately if there is nothing to
> > > > do, while rcu_note_context_switch() must unconditionally do rcu_qs()
> > > > -- yes, it could check, but that would be redundant with the checks
> > > 
> > > This immediate exit is taken care of in the above pseudo code, would that
> > > help the cond_resched performance?
> > 
> > It looks like you are cautiously edging towards the two wrapper functions
> > calling common code, relying on inlining and simplification.  Why not just
> > try doing it?  ;-)
> 
> Sure yeah. I was more thinking of the ambitious goal of getting rid of the
> complexity and exploring the general design idea, than containing/managing
> > the complexity by reducing code duplication. :D
> 
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index c818e0c91a81..5aa0259c014d 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -1063,7 +1063,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
> > > > >  	 * read-side critical section that started before the beginning
> > > > >  	 * of the current RCU grace period.
> > > > >  	 */
> > > > > -	if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap)) {
> > > > > +	if (rcu_dynticks_in_eqs_since(rdp, rdp->dynticks_snap) || !rdp->cpu_no_qs.b.norm) {
> > > > 
> > > > If I am not too confused, this change could cause trouble for
> > > > nohz_full CPUs looping in the kernel.  Such CPUs don't necessarily take
> > > > scheduler-clock interrupts, last I checked, and this could prevent the
> > > > CPU from reporting its quiescent state to core RCU.
> > > 
> > > Would that still be a problem if rcu_all_qs called rcu_qs? Also the above
> > > diff is an OR condition so it is more relaxed than before.
> > 
> > Yes, because rcu_qs() is only guaranteed to capture the quiescent
> > state on the current CPU, not necessarily report it to the RCU core.
> 
> The reporting to the core is necessary to call rcu_report_qs_rnp so that the
> QS information is propagated up the tree, right?
> 
> Wouldn't that reporting be done anyway by:
> 
> force_qs_rnp
>   -> rcu_implicit_dynticks_qs  (which returns 1 because rdp->cpu_no_qs.b.norm
> 				was cleared by rcu_qs() and we detect that
> 				with help of above diff)

Ah.  It is not safe to sample rdp->cpu_no_qs.b.norm off-CPU, and that
is what your patch would do.  This is intentional -- if it were safe to
sample off-CPU, then it would be more expensive to read/update on-CPU.
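
To illustrate the tradeoff (sketch only, not actual code):

/*
 * Today the on-CPU fast path is a plain per-CPU store with no ordering:
 *
 *	__this_cpu_write(rcu_data.cpu_no_qs.b.norm, false);
 *
 * If the grace-period kthread were allowed to sample that field remotely
 * and report a quiescent state based on it, the store and the remote
 * load would need READ_ONCE()/WRITE_ONCE() plus memory ordering to
 * guarantee that the flag is seen only after the CPU's earlier RCU
 * read-side critical sections have completed -- exactly the overhead we
 * work hard to keep out of rcu_qs() and the cond_resched() fast path.
 */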

>   -> rcu_report_qs_rnp is called with mask bit set for corresponding CPU that
>   				has the !rdp->cpu_no_qs.b.norm
> 
> 
> I think that's what I am missing - why the above scheme wouldn't work.
> The only difference is that reporting to the RCU core might invoke pending
> callbacks, but I'm not sure if that matters for this. I'll make these changes,
> and try tracing it out and study it more.  Thanks for the patience,

There are a lot of moving parts and you have not yet gotten to all
of them.  I suggest next taking a look at the relationship between
rcu_check_callbacks() and rcu_process_callbacks(), including the
open_softirq().  These have old names -- they handle the interface
between the CPU and RCU code, among other things.  Including invoking
callbacks, but only for some configurations.  :-/
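
As a starting point, the plumbing is roughly as follows -- again
paraphrasing from memory rather than quoting the source, so verify
against the tree:

	/* At boot, from rcu_init(): */
	open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);

	/* From the scheduling-clock interrupt, via update_process_times(): */
	rcu_check_callbacks(user_tick);
	  -> notes a quiescent state if the tick interrupted userspace or idle
	  -> if (rcu_pending())
		invoke_rcu_core();	/* raise_softirq(RCU_SOFTIRQ) */

	/* Later, in softirq context: */
	rcu_process_callbacks()
	  -> notes grace-period changes, reports deferred quiescent states,
	     and invokes a batch of ready callbacks via rcu_do_batch() --
	     except in those configurations where callback invocation
	     happens elsewhere.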

							Thanx, Paul


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2018-11-21 14:40 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
2018-11-10 21:46 dyntick-idle CPU and node's qsmask Joel Fernandes
2018-11-10 23:04 ` Paul E. McKenney
2018-11-11  3:09   ` Joel Fernandes
2018-11-11  4:22     ` Paul E. McKenney
2018-11-11 18:09       ` Joel Fernandes
2018-11-11 18:36         ` Paul E. McKenney
2018-11-11 21:04           ` Joel Fernandes
2018-11-20 20:42           ` Joel Fernandes
2018-11-20 22:28             ` Paul E. McKenney
2018-11-20 22:34               ` Paul E. McKenney
2018-11-21  2:06               ` Joel Fernandes
2018-11-21  2:41                 ` Paul E. McKenney
2018-11-21  4:37                   ` Joel Fernandes
2018-11-21 14:39                     ` Paul E. McKenney
