RCU Archive on lore.kernel.org
* need_heavy_qs flag for PREEMPT=y kernels
@ 2019-08-11 18:08 Joel Fernandes
  2019-08-11 18:34 ` Joel Fernandes
  2019-08-11 21:13 ` Paul E. McKenney
  0 siblings, 2 replies; 31+ messages in thread
From: Joel Fernandes @ 2019-08-11 18:08 UTC (permalink / raw)
  To: rcu, paulmck

Hi Paul, everyone,

I noticed on reading the code that the need_heavy_qs check and the
rcu_momentary_dyntick_idle() call happen only for !PREEMPT kernels. Don't we
need to call this for PREEMPT kernels too, for the benefit of nohz_full CPUs?

Consider the following events:
1. Kernel is PREEMPT=y configuration.
2. CPU 2 is a nohz_full CPU running only a single task and the tick is off.
3. CPU 2 is running only in kernel mode and does not enter user mode or idle.
4. The grace-period thread running on CPU 3 enters the fqs loop.
5. Enough time passes and it sets need_heavy_qs for CPU 2.
6. CPU 2 is still in kernel mode but does cond_resched().
7. cond_resched() does not call rcu_momentary_dyntick_idle() because PREEMPT=y.

Isn't step 7's skipping of rcu_momentary_dyntick_idle() a lost opportunity for
the FQS loop to detect that the CPU has passed through a quiescent state?

Is this done so that cond_resched() is fast for PREEMPT=y kernels?

thanks,

 - Joel



^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-11 18:08 need_heavy_qs flag for PREEMPT=y kernels Joel Fernandes
@ 2019-08-11 18:34 ` Joel Fernandes
  2019-08-11 21:16   ` Paul E. McKenney
  2019-08-11 21:13 ` Paul E. McKenney
  1 sibling, 1 reply; 31+ messages in thread
From: Joel Fernandes @ 2019-08-11 18:34 UTC (permalink / raw)
  To: rcu, Paul E. McKenney

On Sun, Aug 11, 2019 at 2:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
>
> Hi Paul, everyone,
>
> I noticed on reading code that the need_heavy_qs check and
> rcu_momentary_dyntick_idle() is only called for !PREEMPT kernels. Don't we
> need to call this for PREEMPT kernels for the benefit of nohz_full CPUs?
>
> Consider the following events:
> 1. Kernel is PREEMPT=y configuration.
> 2. CPU 2 is a nohz_full CPU running only a single task and the tick is off.
> 3. CPU 2 is running only in kernel mode and does not enter user mode or idle.
> 4. Grace period thread running on CPU 3 enter the fqs loop.
> 5. Enough time passes and it sets the need_heavy_qs for CPU2.
> 6. CPU 2 is still in kernel mode but does cond_resched().
> 7. cond_resched() does not call rcu_momentary_dyntick_idle() because PREEMPT=y.
>
> Is 7. not calling rcu_momentary_dyntick_idle() a lost opportunity for the FQS
> loop to detect that the CPU has crossed a quiescent point?
>
> Is this done so that cond_resched() is fast for PREEMPT=y kernels?

Oh, so I take it this bit of code in rcu_implicit_dynticks_qs(), with
the accompanying comments, takes care of the scenario I describe?
Another way could be to just call rcu_momentary_dyntick_idle() during
cond_resched() for nohz_full CPUs? Is that pricey?
        /*
         * NO_HZ_FULL CPUs can run in-kernel without rcu_sched_clock_irq!
         * The above code handles this, but only for straight cond_resched().
         * And some in-kernel loops check need_resched() before calling
         * cond_resched(), which defeats the above code for CPUs that are
         * running in-kernel with scheduling-clock interrupts disabled.
         * So hit them over the head with the resched_cpu() hammer!
         */
        if (tick_nohz_full_cpu(rdp->cpu) &&
                   time_after(jiffies,
                              READ_ONCE(rdp->last_fqs_resched) + jtsq * 3)) {
                resched_cpu(rdp->cpu);
                WRITE_ONCE(rdp->last_fqs_resched, jiffies);
        }


* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-11 18:08 need_heavy_qs flag for PREEMPT=y kernels Joel Fernandes
  2019-08-11 18:34 ` Joel Fernandes
@ 2019-08-11 21:13 ` Paul E. McKenney
  2019-08-12  3:21   ` Joel Fernandes
  1 sibling, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-08-11 21:13 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: rcu

On Sun, Aug 11, 2019 at 02:08:52PM -0400, Joel Fernandes wrote:
> Hi Paul, everyone,
> 
> I noticed on reading code that the need_heavy_qs check and
> rcu_momentary_dyntick_idle() is only called for !PREEMPT kernels. Don't we
> need to call this for PREEMPT kernels for the benefit of nohz_full CPUs?
> 
> Consider the following events:
> 1. Kernel is PREEMPT=y configuration.
> 2. CPU 2 is a nohz_full CPU running only a single task and the tick is off.
> 3. CPU 2 is running only in kernel mode and does not enter user mode or idle.
> 4. Grace period thread running on CPU 3 enter the fqs loop.
> 5. Enough time passes and it sets the need_heavy_qs for CPU2.
> 6. CPU 2 is still in kernel mode but does cond_resched().
> 7. cond_resched() does not call rcu_momentary_dyntick_idle() because PREEMPT=y.
> 
> Is 7. not calling rcu_momentary_dyntick_idle() a lost opportunity for the FQS
> loop to detect that the CPU has crossed a quiescent point?
> 
> Is this done so that cond_resched() is fast for PREEMPT=y kernels?

The problem is that the definition of cond_resched() in PREEMPT=y
kernels is as follows:

	static inline int _cond_resched(void) { return 0; }

If (but only if!) someone shows a problem in a PREEMPT=y kernel, the
code could be updated to do something like a resched_cpu() earlier
rather than later.

The reason that I do not expect a problem in NO_HZ_FULL=n&&PREEMPT=y
kernels is that the scheduling-clock tick will with high probability
happen when the CPU is not in an RCU read-side critical section, and
this quiescent state will be reported reasonably quickly.

This leaves NO_HZ_FULL=y&&PREEMPT=y kernels.  In that case, RCU is
more aggressive about using resched_cpu() on CPUs that have not yet
reported a quiescent state for the current grace period.

So we should be good as is.

Or am I missing a key corner case here?

							Thanx, Paul


* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-11 18:34 ` Joel Fernandes
@ 2019-08-11 21:16   ` Paul E. McKenney
  2019-08-11 21:25     ` Joel Fernandes
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-08-11 21:16 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: rcu

On Sun, Aug 11, 2019 at 02:34:08PM -0400, Joel Fernandes wrote:
> On Sun, Aug 11, 2019 at 2:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> >
> > Hi Paul, everyone,
> >
> > I noticed on reading code that the need_heavy_qs check and
> > rcu_momentary_dyntick_idle() is only called for !PREEMPT kernels. Don't we
> > need to call this for PREEMPT kernels for the benefit of nohz_full CPUs?
> >
> > Consider the following events:
> > 1. Kernel is PREEMPT=y configuration.
> > 2. CPU 2 is a nohz_full CPU running only a single task and the tick is off.
> > 3. CPU 2 is running only in kernel mode and does not enter user mode or idle.
> > 4. Grace period thread running on CPU 3 enter the fqs loop.
> > 5. Enough time passes and it sets the need_heavy_qs for CPU2.
> > 6. CPU 2 is still in kernel mode but does cond_resched().
> > 7. cond_resched() does not call rcu_momentary_dyntick_idle() because PREEMPT=y.
> >
> > Is 7. not calling rcu_momentary_dyntick_idle() a lost opportunity for the FQS
> > loop to detect that the CPU has crossed a quiescent point?
> >
> > Is this done so that cond_resched() is fast for PREEMPT=y kernels?
> 
> Oh, so I take it this bit of code in rcu_implicit_dynticks_qs(), with
> the accompanying comments, takes care of the scenario I describe?
> Another way could be just call rcu_momentary_dyntick_idle() during
> cond_resched() for nohz_full CPUs? Is that pricey?
>         /*
>          * NO_HZ_FULL CPUs can run in-kernel without rcu_sched_clock_irq!
>          * The above code handles this, but only for straight cond_resched().
>          * And some in-kernel loops check need_resched() before calling
>          * cond_resched(), which defeats the above code for CPUs that are
>          * running in-kernel with scheduling-clock interrupts disabled.
>          * So hit them over the head with the resched_cpu() hammer!
>          */
>         if (tick_nohz_full_cpu(rdp->cpu) &&
>                    time_after(jiffies,
>                               READ_ONCE(rdp->last_fqs_resched) + jtsq * 3)) {
>                 resched_cpu(rdp->cpu);
>                 WRITE_ONCE(rdp->last_fqs_resched, jiffies);
>         }

Yes, for NO_HZ_FULL=y&&PREEMPT=y kernels.

Your thought of including rcu_momentary_dyntick_idle() would function
correctly, but would cause performance issues.  Even adding additional
compares and branches in that hot codepath is visible to 0day test robot!
So adding a read-modify-write atomic operation to that code path would
get attention of the wrong kind.  ;-)

But please see my earlier email on how things work out for kernels
built with NO_HZ_FULL=n&&PREEMPT=y.

							Thanx, Paul


* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-11 21:16   ` Paul E. McKenney
@ 2019-08-11 21:25     ` Joel Fernandes
  2019-08-11 23:30       ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Joel Fernandes @ 2019-08-11 21:25 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: rcu

On Sun, Aug 11, 2019 at 02:16:46PM -0700, Paul E. McKenney wrote:
> On Sun, Aug 11, 2019 at 02:34:08PM -0400, Joel Fernandes wrote:
> > On Sun, Aug 11, 2019 at 2:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > >
> > > Hi Paul, everyone,
> > >
> > > I noticed on reading code that the need_heavy_qs check and
> > > rcu_momentary_dyntick_idle() is only called for !PREEMPT kernels. Don't we
> > > need to call this for PREEMPT kernels for the benefit of nohz_full CPUs?
> > >
> > > Consider the following events:
> > > 1. Kernel is PREEMPT=y configuration.
> > > 2. CPU 2 is a nohz_full CPU running only a single task and the tick is off.
> > > 3. CPU 2 is running only in kernel mode and does not enter user mode or idle.
> > > 4. Grace period thread running on CPU 3 enter the fqs loop.
> > > 5. Enough time passes and it sets the need_heavy_qs for CPU2.
> > > 6. CPU 2 is still in kernel mode but does cond_resched().
> > > 7. cond_resched() does not call rcu_momentary_dyntick_idle() because PREEMPT=y.
> > >
> > > Is 7. not calling rcu_momentary_dyntick_idle() a lost opportunity for the FQS
> > > loop to detect that the CPU has crossed a quiescent point?
> > >
> > > Is this done so that cond_resched() is fast for PREEMPT=y kernels?
> > 
> > Oh, so I take it this bit of code in rcu_implicit_dynticks_qs(), with
> > the accompanying comments, takes care of the scenario I describe?
> > Another way could be just call rcu_momentary_dyntick_idle() during
> > cond_resched() for nohz_full CPUs? Is that pricey?
> >         /*
> >          * NO_HZ_FULL CPUs can run in-kernel without rcu_sched_clock_irq!
> >          * The above code handles this, but only for straight cond_resched().
> >          * And some in-kernel loops check need_resched() before calling
> >          * cond_resched(), which defeats the above code for CPUs that are
> >          * running in-kernel with scheduling-clock interrupts disabled.
> >          * So hit them over the head with the resched_cpu() hammer!
> >          */
> >         if (tick_nohz_full_cpu(rdp->cpu) &&
> >                    time_after(jiffies,
> >                               READ_ONCE(rdp->last_fqs_resched) + jtsq * 3)) {
> >                 resched_cpu(rdp->cpu);
> >                 WRITE_ONCE(rdp->last_fqs_resched, jiffies);
> >         }
> 
> Yes, for NO_HZ_FULL=y&&PREEMPT=y kernels.

Actually, I was only referring to the case of NO_HZ_FULL=y being the
troublesome one (i.e. rcu_need_heavy_qs flag would have no effect).

For NO_HZ_FULL=n, I have full confidence the scheduler tick will notice
rcu_urgent_qs and do a reschedule. The ensuing softirq then does the needful
to help end the grace period.

> Your thought of including rcu_momentary_dyntick_idle() would function
> correctly, but would cause performance issues.  Even adding additional
> compares and branches in that hot codepath is visible to 0day test robot!
> So adding a read-modify-write atomic operation to that code path would
> get attention of the wrong kind.  ;-)

But wouldn't these performance issues also be visible with
NO_HZ_FULL=y && PREEMPT=n?  Why is PREEMPT=n made an exception? Is it that
0day doesn't test this combination much? :-D

thanks,

 - Joel



* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-11 21:25     ` Joel Fernandes
@ 2019-08-11 23:30       ` Paul E. McKenney
  2019-08-12  1:24         ` Joel Fernandes
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-08-11 23:30 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: rcu

On Sun, Aug 11, 2019 at 05:25:05PM -0400, Joel Fernandes wrote:
> On Sun, Aug 11, 2019 at 02:16:46PM -0700, Paul E. McKenney wrote:
> > On Sun, Aug 11, 2019 at 02:34:08PM -0400, Joel Fernandes wrote:
> > > On Sun, Aug 11, 2019 at 2:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > >
> > > > Hi Paul, everyone,
> > > >
> > > > I noticed on reading code that the need_heavy_qs check and
> > > > rcu_momentary_dyntick_idle() is only called for !PREEMPT kernels. Don't we
> > > > need to call this for PREEMPT kernels for the benefit of nohz_full CPUs?
> > > >
> > > > Consider the following events:
> > > > 1. Kernel is PREEMPT=y configuration.
> > > > 2. CPU 2 is a nohz_full CPU running only a single task and the tick is off.
> > > > 3. CPU 2 is running only in kernel mode and does not enter user mode or idle.
> > > > 4. Grace period thread running on CPU 3 enter the fqs loop.
> > > > 5. Enough time passes and it sets the need_heavy_qs for CPU2.
> > > > 6. CPU 2 is still in kernel mode but does cond_resched().
> > > > 7. cond_resched() does not call rcu_momentary_dyntick_idle() because PREEMPT=y.
> > > >
> > > > Is 7. not calling rcu_momentary_dyntick_idle() a lost opportunity for the FQS
> > > > loop to detect that the CPU has crossed a quiescent point?
> > > >
> > > > Is this done so that cond_resched() is fast for PREEMPT=y kernels?
> > > 
> > > Oh, so I take it this bit of code in rcu_implicit_dynticks_qs(), with
> > > the accompanying comments, takes care of the scenario I describe?
> > > Another way could be just call rcu_momentary_dyntick_idle() during
> > > cond_resched() for nohz_full CPUs? Is that pricey?
> > >         /*
> > >          * NO_HZ_FULL CPUs can run in-kernel without rcu_sched_clock_irq!
> > >          * The above code handles this, but only for straight cond_resched().
> > >          * And some in-kernel loops check need_resched() before calling
> > >          * cond_resched(), which defeats the above code for CPUs that are
> > >          * running in-kernel with scheduling-clock interrupts disabled.
> > >          * So hit them over the head with the resched_cpu() hammer!
> > >          */
> > >         if (tick_nohz_full_cpu(rdp->cpu) &&
> > >                    time_after(jiffies,
> > >                               READ_ONCE(rdp->last_fqs_resched) + jtsq * 3)) {
> > >                 resched_cpu(rdp->cpu);
> > >                 WRITE_ONCE(rdp->last_fqs_resched, jiffies);
> > >         }
> > 
> > Yes, for NO_HZ_FULL=y&&PREEMPT=y kernels.
> 
> Actually, I was only referring to the case of NO_HZ_FULL=y being the
> troublesome one (i.e. rcu_need_heavy_qs flag would have no effect).
> 
> For NO_HZ_FULL=n, I have full confidence the scheduler tick will notice
> rcu_urgent_qs and do a reschedule. The ensuing softirq then does the needful
> to help end the grace period.

Whew!

That confidence was not at all apparent in your initial email.

> > Your thought of including rcu_momentary_dyntick_idle() would function
> > correctly, but would cause performance issues.  Even adding additional
> > compares and branches in that hot codepath is visible to 0day test robot!
> > So adding a read-modify-write atomic operation to that code path would
> > get attention of the wrong kind.  ;-)
> 
> But wouldn't these performance issues also be visible with
> NO_HZ_FULL=y && PREEMPT=n?

In PREEMPT=n, cond_resched() already has a check, and with quite a bit
of care it is possible to introduce another.

>                             Why is PREEMPT=n made an exception?

The exception is actually CONFIG_NO_HZ_FULL=y && CONFIG_PREEMPT=y.
In that case, we can rely on neither the scheduling-clock interrupt
nor on cond_resched().  In the other three cases, we have one or both.

>                                                                 Is it that
> 0day doesn't test this combination much? :-D

Might be, but it sure tests the other combinations!

Next question:  Why does rcu_implicit_dynticks_qs() check only for
tick_nohz_full_cpu() and not also IS_ENABLED(CONFIG_PREEMPT)?  After
all, a nohz_full CPU in a !CONFIG_PREEMPT kernel should be able to
rely on cond_resched(), right?

Should this change?  Why or why not?

							Thanx, Paul


* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-11 23:30       ` Paul E. McKenney
@ 2019-08-12  1:24         ` Joel Fernandes
  2019-08-12  1:40           ` Joel Fernandes
  0 siblings, 1 reply; 31+ messages in thread
From: Joel Fernandes @ 2019-08-12  1:24 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: rcu

On Sun, Aug 11, 2019 at 04:30:24PM -0700, Paul E. McKenney wrote:
> On Sun, Aug 11, 2019 at 05:25:05PM -0400, Joel Fernandes wrote:
> > On Sun, Aug 11, 2019 at 02:16:46PM -0700, Paul E. McKenney wrote:
> > > On Sun, Aug 11, 2019 at 02:34:08PM -0400, Joel Fernandes wrote:
> > > > On Sun, Aug 11, 2019 at 2:08 PM Joel Fernandes <joel@joelfernandes.org> wrote:
> > > > >
> > > > > Hi Paul, everyone,
> > > > >
> > > > > I noticed on reading code that the need_heavy_qs check and
> > > > > rcu_momentary_dyntick_idle() is only called for !PREEMPT kernels. Don't we
> > > > > need to call this for PREEMPT kernels for the benefit of nohz_full CPUs?
> > > > >
> > > > > Consider the following events:
> > > > > 1. Kernel is PREEMPT=y configuration.
> > > > > 2. CPU 2 is a nohz_full CPU running only a single task and the tick is off.
> > > > > 3. CPU 2 is running only in kernel mode and does not enter user mode or idle.
> > > > > 4. Grace period thread running on CPU 3 enter the fqs loop.
> > > > > 5. Enough time passes and it sets the need_heavy_qs for CPU2.
> > > > > 6. CPU 2 is still in kernel mode but does cond_resched().
> > > > > 7. cond_resched() does not call rcu_momentary_dyntick_idle() because PREEMPT=y.
> > > > >
> > > > > Is 7. not calling rcu_momentary_dyntick_idle() a lost opportunity for the FQS
> > > > > loop to detect that the CPU has crossed a quiescent point?
> > > > >
> > > > > Is this done so that cond_resched() is fast for PREEMPT=y kernels?
> > > > 
> > > > Oh, so I take it this bit of code in rcu_implicit_dynticks_qs(), with
> > > > the accompanying comments, takes care of the scenario I describe?
> > > > Another way could be just call rcu_momentary_dyntick_idle() during
> > > > cond_resched() for nohz_full CPUs? Is that pricey?
> > > >         /*
> > > >          * NO_HZ_FULL CPUs can run in-kernel without rcu_sched_clock_irq!
> > > >          * The above code handles this, but only for straight cond_resched().
> > > >          * And some in-kernel loops check need_resched() before calling
> > > >          * cond_resched(), which defeats the above code for CPUs that are
> > > >          * running in-kernel with scheduling-clock interrupts disabled.
> > > >          * So hit them over the head with the resched_cpu() hammer!
> > > >          */
> > > >         if (tick_nohz_full_cpu(rdp->cpu) &&
> > > >                    time_after(jiffies,
> > > >                               READ_ONCE(rdp->last_fqs_resched) + jtsq * 3)) {
> > > >                 resched_cpu(rdp->cpu);
> > > >                 WRITE_ONCE(rdp->last_fqs_resched, jiffies);
> > > >         }
> > > 
> > > Yes, for NO_HZ_FULL=y&&PREEMPT=y kernels.
> > 
> > Actually, I was only referring to the case of NO_HZ_FULL=y being the
> > troublesome one (i.e. rcu_need_heavy_qs flag would have no effect).
> > 
> > For NO_HZ_FULL=n, I have full confidence the scheduler tick will notice
> > rcu_urgent_qs and do a reschedule. The ensuing softirq then does the needful
> > to help end the grace period.
> 
> Whew!
> 
> That confidence was not at all apparent in your initial email.

Sorry, I should improve the quality of my emails for sure.

> > > Your thought of including rcu_momentary_dyntick_idle() would function
> > > correctly, but would cause performance issues.  Even adding additional
> > > compares and branches in that hot codepath is visible to 0day test robot!
> > > So adding a read-modify-write atomic operation to that code path would
> > > get attention of the wrong kind.  ;-)
> > 
> > But wouldn't these performance issues also be visible with
> > NO_HZ_FULL=y && PREEMPT=n?
> 
> In PREEMPT=n, cond_resched() already has a check, and with quite a bit
> of care it is possible to introduce another.

Actually, maybe I did not express myself properly. I mean that the
performance issues 0day found (that you mentioned above) with invoking
rcu_momentary_dyntick_idle() from hot paths should also have been found with
PREEMPT=n. However, it sounds like it found these issues only when invoking
rcu_momentary_dyntick_idle() with PREEMPT=y - which, as you said, is the reason
rcu_momentary_dyntick_idle() is not invoked in PREEMPT=y. So I was asking:
why are these same performance issues not seen with PREEMPT=n? And if they
are seen, why do we invoke rcu_momentary_dyntick_idle() in mainline for
PREEMPT=n kernels?


> >                             Why is PREEMPT=n made an exception?
> 
> The exception is actually CONFIG_NO_HZ_FULL=y && CONFIG_PREEMPT=y.
> In that case, we can rely on neither the scheduling-clock interrupt
> nor on cond_resched().  In the other three cases, we have one or both.

Agreed, that's what I found weird: PREEMPT=y with NOHZ_FULL=y has no support
to rely on, while PREEMPT=n with NOHZ_FULL=y does. So my question was about
the rationale for this difference. Either we invoke
rcu_momentary_dyntick_idle() for both PREEMPT options, or we don't invoke it
for either. Why invoke it for one but not the other?

> Next question:  Why does rcu_implicit_dynticks_qs() check only for
> tick_nohz_full_cpu() and not also IS_ENABLED(CONFIG_PREEMPT)?  After
> all, a nohz_full CPU in a !CONFIG_PREEMPT kernel should be able to
> rely on cond_resched(), right?
> 
> Should this change?  Why or why not?

Let me think more about this :) I have an answer in mind, but I will think a
bit more about it and respond :)

thanks,

 - Joel



* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-12  1:24         ` Joel Fernandes
@ 2019-08-12  1:40           ` Joel Fernandes
  2019-08-12  3:57             ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Joel Fernandes @ 2019-08-12  1:40 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: rcu

On Sun, Aug 11, 2019 at 09:24:31PM -0400, Joel Fernandes wrote:
> On Sun, Aug 11, 2019 at 04:30:24PM -0700, Paul E. McKenney wrote:
[snip]
> > Next question:  Why does rcu_implicit_dynticks_qs() check only for
> > tick_nohz_full_cpu() and not also IS_ENABLED(CONFIG_PREEMPT)?  After
> > all, a nohz_full CPU in a !CONFIG_PREEMPT kernel should be able to
> > rely on cond_resched(), right?
> > 
> > Should this change?  Why or why not?
> 
> Let me think more about this :) I have an answer in mind but I will think a
> bit more about it and responsd :)

It should not change. That's because (as somewhat mentioned in the comments)
some code paths in the kernel check need_resched() before invoking
cond_resched(). So even with PREEMPT=n having the help of cond_resched(), the
cond_resched() may actually never be invoked.  In this case, the
resched_cpu() from rcu_implicit_dynticks_qs() does the needful by setting the
rescheduling flags on the CPU, so that cond_resched() on those CPUs actually
gets called. Is that a correct analysis?

thanks,

 - Joel



* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-11 21:13 ` Paul E. McKenney
@ 2019-08-12  3:21   ` Joel Fernandes
  2019-08-12  3:53     ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Joel Fernandes @ 2019-08-12  3:21 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: rcu

On Sun, Aug 11, 2019 at 02:13:18PM -0700, Paul E. McKenney wrote:
[snip]
> This leaves NO_HZ_FULL=y&&PREEMPT=y kernels.  In that case, RCU is
> more aggressive about using resched_cpu() on CPUs that have not yet
> reported a quiescent state for the current grace period.

Just wanted to ask something - how does resched_cpu() help for this case?

Consider a nohz_full CPU and a PREEMPT=y kernel. Say a single task is running
in kernel mode with the scheduler tick off. As we discussed, we have no help
from cond_resched() (since it's a PREEMPT=y kernel).  Because enough time has
passed (jtsq*3), we send the CPU a rescheduling IPI.

This seems not that useful. Even if we enter the scheduler due to the
rescheduling flags set on that CPU, nothing will do the rcu_report_qs_rdp()
or rcu_report_qs_rnp() on those CPUs, which are needed to propagate the
quiescent state to the leaf node. Nor will anything do a
rcu_momentary_dyntick_idle() for that CPU. Without these, the grace period
will still end up getting blocked.

Could you clarify which code paths on a nohz_full CPU running a PREEMPT=y
kernel actually help to end the grace period when we call resched_cpu() on
it?  Don't we at least need to do a rcu_momentary_dyntick_idle() from the
scheduler IPI handler or from resched_cpu() for the benefit of a nohz_full
CPU? Maybe I should do an experiment to see this all play out.

And I need to write down everything I learnt today before I go crazy... ;-)


* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-12  3:21   ` Joel Fernandes
@ 2019-08-12  3:53     ` Paul E. McKenney
  2019-08-12 21:20       ` Joel Fernandes
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-08-12  3:53 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: rcu

On Sun, Aug 11, 2019 at 11:21:42PM -0400, Joel Fernandes wrote:
> On Sun, Aug 11, 2019 at 02:13:18PM -0700, Paul E. McKenney wrote:
> [snip]
> > This leaves NO_HZ_FULL=y&&PREEMPT=y kernels.  In that case, RCU is
> > more aggressive about using resched_cpu() on CPUs that have not yet
> > reported a quiescent state for the current grace period.
> 
> Just wanted to ask something - how does resched_cpu() help for this case?
> 
> Consider a nohz_full CPU and a PREEMPT=y kernel. Say a single task is running
> in kernel mode with scheduler tick off. As we discussed, we have no help from
> cond_resched() (since its a PREEMPT=y kernel).  Because enough time has
> passed (jtsq*3), we send the CPU a re-scheduling IPI.
> 
> This seems not that useful. Even if we enter the scheduler due to the
> rescheduling flags set on that CPU, nothing will do the rcu_report_qs_rdp()
> or rcu_report_qs_rnp() on those CPUs, which are needed to propagate the
> quiescent state to the leaf node. Neither will anything to do a
> rcu_momentary_dyntick_idle() for that CPU. Without this, the grace period
> will still end up getting blocked.
> 
> Could you clarify which code paths on a nohz_full CPU running PREEMPT=y
> kernel actually helps to end the grace period when we call resched_cpu() on
> it?  Don't we need atleast do a rcu_momentary_dyntick_idle() from the
> scheduler IPI handler or from resched_cpu() for the benefit of a nohz_full
> CPU? Maybe I should do an experiment to see this all play out.

An experiment would be good!

> And I need to write down everything I learnt today before I go crazy... ;-)

I know that feeling!  ;-)

							Thanx, Paul


* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-12  1:40           ` Joel Fernandes
@ 2019-08-12  3:57             ` Paul E. McKenney
  0 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2019-08-12  3:57 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: rcu

On Sun, Aug 11, 2019 at 09:40:53PM -0400, Joel Fernandes wrote:
> On Sun, Aug 11, 2019 at 09:24:31PM -0400, Joel Fernandes wrote:
> > On Sun, Aug 11, 2019 at 04:30:24PM -0700, Paul E. McKenney wrote:
> [snip]
> > > Next question:  Why does rcu_implicit_dynticks_qs() check only for
> > > tick_nohz_full_cpu() and not also IS_ENABLED(CONFIG_PREEMPT)?  After
> > > all, a nohz_full CPU in a !CONFIG_PREEMPT kernel should be able to
> > > rely on cond_resched(), right?
> > > 
> > > Should this change?  Why or why not?
> > 
> > Let me think more about this :) I have an answer in mind but I will think a
> > bit more about it and responsd :)
> 
> It should not change. That's because (as somewhat mentioned in the comments),
> some code paths in the kernel check need_resched() before invoking
> cond_resched(). So even with PREEMPT=n having the help of cond_resched(), the
> cond_resched() may actually not even be invoked.  So in this case, the
> resched_cpu() from rcu_implicit_dynticks_qs() does the needful by setting the
> rescheduling flags on the CPU, so that cond_resched() on those CPUs actually
> get called. Is that a correct analysis?

Looks valid to me!  There might well be other scenarios as well, but
only one is required to justify the resched_cpu().

							Thanx, Paul


* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-12  3:53     ` Paul E. McKenney
@ 2019-08-12 21:20       ` Joel Fernandes
  2019-08-12 23:01         ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Joel Fernandes @ 2019-08-12 21:20 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: rcu

On Sun, Aug 11, 2019 at 08:53:06PM -0700, Paul E. McKenney wrote:
> On Sun, Aug 11, 2019 at 11:21:42PM -0400, Joel Fernandes wrote:
> > On Sun, Aug 11, 2019 at 02:13:18PM -0700, Paul E. McKenney wrote:
> > [snip]
> > > This leaves NO_HZ_FULL=y&&PREEMPT=y kernels.  In that case, RCU is
> > > more aggressive about using resched_cpu() on CPUs that have not yet
> > > reported a quiescent state for the current grace period.
> > 
> > Just wanted to ask something - how does resched_cpu() help for this case?
> > 
> > Consider a nohz_full CPU and a PREEMPT=y kernel. Say a single task is running
> > in kernel mode with scheduler tick off. As we discussed, we have no help from
> > cond_resched() (since its a PREEMPT=y kernel).  Because enough time has
> > passed (jtsq*3), we send the CPU a re-scheduling IPI.
> > 
> > This seems not that useful. Even if we enter the scheduler due to the
> > rescheduling flags set on that CPU, nothing will do the rcu_report_qs_rdp()
> > or rcu_report_qs_rnp() on those CPUs, which are needed to propagate the
> > quiescent state to the leaf node. Neither will anything to do a
> > rcu_momentary_dyntick_idle() for that CPU. Without this, the grace period
> > will still end up getting blocked.
> > 
> > Could you clarify which code paths on a nohz_full CPU running PREEMPT=y
> > kernel actually helps to end the grace period when we call resched_cpu() on
> > it?  Don't we need atleast do a rcu_momentary_dyntick_idle() from the
> > scheduler IPI handler or from resched_cpu() for the benefit of a nohz_full
> > CPU? Maybe I should do an experiment to see this all play out.
> 
> An experiment would be good!

Hi Paul,
Some very interesting findings!

Experiment: a tight loop rcuperf thread bound to a nohz_full CPU with
CONFIG_PREEMPT=y and CONFIG_NO_HZ_FULL=y. Executes for 5000 jiffies and
exits. Diff for test is appended.

Inference: I see that the tick is off on the nohz_full CPU 3 (the looping
thread is affined to CPU 3). The FQS loop does resched_cpu on 3, but the
grace period is unable to come to an end with the hold up seemingly due to
CPU 3.

I see that the scheduler tick is mostly off, but occasionally it is turned
back on during the test loop. However, it has no effect, and the grace period
is stuck on the same rcu_state.gp_seq value for the duration of the test.
Could the scheduler-tick ineffectiveness be because of this patch?
https://lore.kernel.org/patchwork/patch/446044/

Relevant traces; sorry I did not wrap them for better readability:

Here I marked the start of the test:

[   24.852639] rcu_perf-163    13.... 1828278us : rcu_perf_writer: Start of rcuperf test
[   24.853651] rcu_perf-163    13dN.1 1828284us : rcu_grace_period: rcu_preempt 18446744073709550677 cpuqs
[   24.971709] rcu_pree-10      0d..1 1833228us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   26.493607] rcu_pree-10      0d..1 2137286us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   28.090618] rcu_pree-10      0d..1 2441290us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3

Scheduler tick came in, but almost 6 seconds after the start of the test, probably because the migration thread increased the number of tasks queued on the RQ:

[   28.355513] rcu_perf-163     3d.h. 2487447us : rcu_sched_clock_irq: sched-tick
[   29.592912] rcu_pree-10      0d..1 2745293us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   30.238041] rcu_perf-163     3d..2 2879562us : sched_switch: prev_comm=rcu_perf_writer prev_pid=163 prev_prio=98 prev_state=R+ ==> next_comm=migration/3 next_pid=26 next_prio=0
[   30.269203] migratio-26      3d..2 2879745us : sched_switch: prev_comm=migration/3 prev_pid=26 prev_prio=0 prev_state=S ==> next_comm=rcu_perf_writer next_pid=163 next_prio=98
[   31.109635] rcu_pree-10      0d..1 3049301us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   32.627686] rcu_pree-10      0d..1 3353298us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   34.071163] rcu_pree-10      0d..1 3657299us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3

Several context-switches on CPU 3, but grace-period is still extended:

[   34.634814] rcu_perf-163     3d..2 3775310us : sched_switch: prev_comm=rcu_perf_writer prev_pid=163 prev_prio=98 prev_state=R+ ==> next_comm=kworker/3:0 next_pid=28 next_prio=120
[   34.638532] kworker/-28      3d..2 3775343us : sched_switch: prev_comm=kworker/3:0 prev_pid=28 prev_prio=120 prev_state=D ==> next_comm=cpuhp/3 next_pid=25 next_prio=120
[   34.640338]  cpuhp/3-25      3d..2 3775348us : sched_switch: prev_comm=cpuhp/3 prev_pid=25 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
[   34.653831]   <idle>-0       3d..2 3775522us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/3:0 next_pid=28 next_prio=120
[   34.657770] kworker/-28      3d..2 3775536us : sched_switch: prev_comm=kworker/3:0 prev_pid=28 prev_prio=120 prev_state=I ==> next_comm=kworker/3:1 next_pid=177 next_prio=120
[   34.661758] kworker/-177     3d..2 3775543us : sched_switch: prev_comm=kworker/3:1 prev_pid=177 prev_prio=120 prev_state=I ==> next_comm=swapper/3 next_pid=0 next_prio=120
[   34.866007]   <idle>-0       3d..2 3789967us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/3:0H next_pid=29 next_prio=100
[   34.874200] kworker/-29      3d..2 3789988us : sched_switch: prev_comm=kworker/3:0H prev_pid=29 prev_prio=100 prev_state=I ==> next_comm=swapper/3 next_pid=0 next_prio=120
[   35.128506]   <idle>-0       3d..2 3816391us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=cpuhp/3 next_pid=25 next_prio=120
[   35.130481]  cpuhp/3-25      3d..2 3816503us : sched_switch: prev_comm=cpuhp/3 prev_pid=25 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
[   35.509469]   <idle>-0       3d..2 3828462us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_perf_writer next_pid=163 next_prio=98

Scheduler tick doesn't do much to help:

[   35.512436] rcu_perf-163     3d.h. 3829210us : rcu_sched_clock_irq: sched-tick

Now the scheduler tick is completely off for the next 29 seconds. The FQS loop
keeps doing resched_cpu() on CPU 3 at 300-jiffy intervals (my HZ=1000):

[   36.160145] rcu_pree-10      0d..1 3958214us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   38.778574] rcu_pree-10      0d..1 4262288us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   40.604108] rcu_pree-10      0d..1 4566246us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   42.565345] rcu_pree-10      0d..1 4870294us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   44.322041] rcu_pree-10      0d..1 5174293us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   46.083710] rcu_pree-10      0d..1 5478290us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   47.851449] rcu_pree-10      0d..1 5782289us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   49.693127] rcu_pree-10      0d..1 6086293us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   51.432018] rcu_pree-10      0d..1 6390283us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   53.187991] rcu_pree-10      0d..1 6694294us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
[   53.970850] rcu_perf-163     3.... 6828237us : rcu_perf_writer: End of rcuperf test

I feel we could do one of:
1. Call rcu_momentary_dyntick_idle() from the re-schedule IPI handler
2. Raise the RCU softirq for NOHZ_FULL CPUs from re-schedule IPI handler or timer tick.

What do you think?  Also, the test diff is below.

thanks,

 - Joel

---8<-----------------------

diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c
index 70d6ac19cbff..181cd9ce733a 100644
--- a/kernel/rcu/rcuperf.c
+++ b/kernel/rcu/rcuperf.c
@@ -366,17 +366,15 @@ rcu_perf_writer(void *arg)
 	u64 t;
 	u64 *wdp;
 	u64 *wdpp = writer_durations[me];
+	unsigned long jiffies_init;
 
 	VERBOSE_PERFOUT_STRING("rcu_perf_writer task started");
+	trace_printk("Start of rcuperf test\n");
 	WARN_ON(!wdpp);
-	set_cpus_allowed_ptr(current, cpumask_of(me % nr_cpu_ids));
+	set_cpus_allowed_ptr(current, cpumask_of(3));
 	sp.sched_priority = 1;
 	sched_setscheduler_nocheck(current, SCHED_FIFO, &sp);
 
-	if (holdoff)
-		schedule_timeout_uninterruptible(holdoff * HZ);
-
-	t = ktime_get_mono_fast_ns();
 	if (atomic_inc_return(&n_rcu_perf_writer_started) >= nrealwriters) {
 		t_rcu_perf_writer_started = t;
 		if (gp_exp) {
@@ -387,80 +385,21 @@ rcu_perf_writer(void *arg)
 		}
 	}
 
+	jiffies_init = jiffies;
 	do {
-		if (writer_holdoff)
-			udelay(writer_holdoff);
-		wdp = &wdpp[i];
-		*wdp = ktime_get_mono_fast_ns();
-		if (gp_async) {
-retry:
-			if (!rhp)
-				rhp = kmalloc(sizeof(*rhp), GFP_KERNEL);
-			if (rhp && atomic_read(this_cpu_ptr(&n_async_inflight)) < gp_async_max) {
-				rcu_perf_writer_state = RTWS_ASYNC;
-				atomic_inc(this_cpu_ptr(&n_async_inflight));
-				cur_ops->async(rhp, rcu_perf_async_cb);
-				rhp = NULL;
-			} else if (!kthread_should_stop()) {
-				rcu_perf_writer_state = RTWS_BARRIER;
-				cur_ops->gp_barrier();
-				goto retry;
-			} else {
-				kfree(rhp); /* Because we are stopping. */
-			}
-		} else if (gp_exp) {
-			rcu_perf_writer_state = RTWS_EXP_SYNC;
-			cur_ops->exp_sync();
-		} else {
-			rcu_perf_writer_state = RTWS_SYNC;
-			cur_ops->sync();
-		}
-		rcu_perf_writer_state = RTWS_IDLE;
-		t = ktime_get_mono_fast_ns();
-		*wdp = t - *wdp;
-		i_max = i;
-		if (!started &&
-		    atomic_read(&n_rcu_perf_writer_started) >= nrealwriters)
-			started = true;
-		if (!done && i >= MIN_MEAS) {
-			done = true;
-			sp.sched_priority = 0;
-			sched_setscheduler_nocheck(current,
-						   SCHED_NORMAL, &sp);
-			pr_alert("%s%s rcu_perf_writer %ld has %d measurements\n",
-				 perf_type, PERF_FLAG, me, MIN_MEAS);
-			if (atomic_inc_return(&n_rcu_perf_writer_finished) >=
-			    nrealwriters) {
-				schedule_timeout_interruptible(10);
-				rcu_ftrace_dump(DUMP_ALL);
-				PERFOUT_STRING("Test complete");
-				t_rcu_perf_writer_finished = t;
-				if (gp_exp) {
-					b_rcu_gp_test_finished =
-						cur_ops->exp_completed() / 2;
-				} else {
-					b_rcu_gp_test_finished =
-						cur_ops->get_gp_seq();
-				}
-				if (shutdown) {
-					smp_mb(); /* Assign before wake. */
-					wake_up(&shutdown_wq);
-				}
-			}
-		}
-		if (done && !alldone &&
-		    atomic_read(&n_rcu_perf_writer_finished) >= nrealwriters)
-			alldone = true;
-		if (started && !alldone && i < MAX_MEAS - 1)
-			i++;
-		rcu_perf_wait_shutdown();
-	} while (!torture_must_stop());
-	if (gp_async) {
-		rcu_perf_writer_state = RTWS_BARRIER;
-		cur_ops->gp_barrier();
+		i++;
+		cond_resched();
+	} while ((jiffies < jiffies_init + 5000) && !torture_must_stop());
+
+	trace_printk("End of rcuperf test\n");
+	schedule_timeout_interruptible(10);
+	rcu_ftrace_dump(DUMP_ALL);
+	PERFOUT_STRING("Test complete");
+	if (shutdown) {
+		smp_mb(); /* Assign before wake. */
+		wake_up(&shutdown_wq);
 	}
-	rcu_perf_writer_state = RTWS_STOPPING;
-	writer_n_durations[me] = i_max;
+
 	torture_kthread_stopping("rcu_perf_writer");
 	return 0;
 }
@@ -797,7 +736,7 @@ rcu_perf_init(void)
 	if (kfree_rcu_test)
 		return kfree_perf_init();
 
-	nrealwriters = compute_real(nwriters);
+	nrealwriters = 1;
 	nrealreaders = compute_real(nreaders);
 	atomic_set(&n_rcu_perf_reader_started, 0);
 	atomic_set(&n_rcu_perf_writer_started, 0);
@@ -814,6 +753,8 @@ rcu_perf_init(void)
 			goto unwind;
 		schedule_timeout_uninterruptible(1);
 	}
+
+#if 0
 	reader_tasks = kcalloc(nrealreaders, sizeof(reader_tasks[0]),
 			       GFP_KERNEL);
 	if (reader_tasks == NULL) {
@@ -829,6 +770,7 @@ rcu_perf_init(void)
 	}
 	while (atomic_read(&n_rcu_perf_reader_started) < nrealreaders)
 		schedule_timeout_uninterruptible(1);
+#endif
 	writer_tasks = kcalloc(nrealwriters, sizeof(reader_tasks[0]),
 			       GFP_KERNEL);
 	writer_durations = kcalloc(nrealwriters, sizeof(*writer_durations),
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 102a5f606d78..e3deb9431544 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1066,6 +1066,7 @@ static int rcu_implicit_dynticks_qs(struct rcu_data *rdp)
 	if (tick_nohz_full_cpu(rdp->cpu) &&
 		   time_after(jiffies,
 			      READ_ONCE(rdp->last_fqs_resched) + jtsq * 3)) {
+		trace_printk("Sending urgent resched to cpu %d\n", rdp->cpu);
 		resched_cpu(rdp->cpu);
 		WRITE_ONCE(rdp->last_fqs_resched, jiffies);
 	}
@@ -2168,6 +2169,8 @@ static void rcu_do_batch(struct rcu_data *rdp)
  */
 void rcu_sched_clock_irq(int user)
 {
+	trace_printk("sched-tick\n");
+
 	trace_rcu_utilization(TPS("Start scheduler-tick"));
 	raw_cpu_inc(rcu_data.ticks_this_gp);
 	/* The load-acquire pairs with the store-release setting to true. */

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-12 21:20       ` Joel Fernandes
@ 2019-08-12 23:01         ` Paul E. McKenney
  2019-08-13  1:02           ` Joel Fernandes
  0 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-08-12 23:01 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: rcu

On Mon, Aug 12, 2019 at 05:20:13PM -0400, Joel Fernandes wrote:
> On Sun, Aug 11, 2019 at 08:53:06PM -0700, Paul E. McKenney wrote:
> > On Sun, Aug 11, 2019 at 11:21:42PM -0400, Joel Fernandes wrote:
> > > On Sun, Aug 11, 2019 at 02:13:18PM -0700, Paul E. McKenney wrote:
> > > [snip]
> > > > This leaves NO_HZ_FULL=y&&PREEMPT=y kernels.  In that case, RCU is
> > > > more aggressive about using resched_cpu() on CPUs that have not yet
> > > > reported a quiescent state for the current grace period.
> > > 
> > > Just wanted to ask something - how does resched_cpu() help for this case?
> > > 
> > > Consider a nohz_full CPU and a PREEMPT=y kernel. Say a single task is running
> > > in kernel mode with scheduler tick off. As we discussed, we have no help from
> > > cond_resched() (since it's a PREEMPT=y kernel).  Because enough time has
> > > passed (jtsq*3), we send the CPU a re-scheduling IPI.
> > > 
> > > This seems not that useful. Even if we enter the scheduler due to the
> > > rescheduling flags set on that CPU, nothing will do the rcu_report_qs_rdp()
> > > or rcu_report_qs_rnp() on those CPUs, which are needed to propagate the
> > > quiescent state to the leaf node. Neither will anything to do a
> > > rcu_momentary_dyntick_idle() for that CPU. Without this, the grace period
> > > will still end up getting blocked.
> > > 
> > > Could you clarify which code paths on a nohz_full CPU running PREEMPT=y
> > > kernel actually helps to end the grace period when we call resched_cpu() on
> > > it?  Don't we need to at least do a rcu_momentary_dyntick_idle() from the
> > > scheduler IPI handler or from resched_cpu() for the benefit of a nohz_full
> > > CPU? Maybe I should do an experiment to see this all play out.
> > 
> > An experiment would be good!
> 
> Hi Paul,
> Some very interesting findings!
> 
> Experiment: a tight loop rcuperf thread bound to a nohz_full CPU with
> CONFIG_PREEMPT=y and CONFIG_NO_HZ_FULL=y. Executes for 5000 jiffies and
> exits. Diff for test is appended.
> 
> Inference: I see that the tick is off on the nohz_full CPU 3 (the looping
> thread is affined to CPU 3). The FQS loop does resched_cpu on 3, but the
> grace period is unable to come to an end with the hold up seemingly due to
> CPU 3.

Good catch!

> I see that the scheduler tick is off mostly, but occasionally is turned back
> on during the test loop. However it has no effect and the grace period is
> stuck on the same rcu_state.gp_seq value for the duration of the test. I
> think the scheduler-tick ineffectiveness could be because of this patch?
> https://lore.kernel.org/patchwork/patch/446044/

Unlikely, given that __rcu_pending(), which is now just rcu_pending(),
is only invoked from the scheduler-clock interrupt.

But from your traces below, clearly something is in need of repair.

> Relevant traces, sorry I did not wrap it for better readability:
> 
> Here I marked the start of the test:
> 
> [   24.852639] rcu_perf-163    13.... 1828278us : rcu_perf_writer: Start of rcuperf test
> [   24.853651] rcu_perf-163    13dN.1 1828284us : rcu_grace_period: rcu_preempt 18446744073709550677 cpuqs
> [   24.971709] rcu_pree-10      0d..1 1833228us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   26.493607] rcu_pree-10      0d..1 2137286us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   28.090618] rcu_pree-10      0d..1 2441290us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> 
> Scheduler tick came in, but almost 6 seconds after the start of the test, probably because the migration thread increased the number of tasks queued on the RQ:
> 
> [   28.355513] rcu_perf-163     3d.h. 2487447us : rcu_sched_clock_irq: sched-tick
> [   29.592912] rcu_pree-10      0d..1 2745293us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   30.238041] rcu_perf-163     3d..2 2879562us : sched_switch: prev_comm=rcu_perf_writer prev_pid=163 prev_prio=98 prev_state=R+ ==> next_comm=migration/3 next_pid=26 next_prio=0
> [   30.269203] migratio-26      3d..2 2879745us : sched_switch: prev_comm=migration/3 prev_pid=26 prev_prio=0 prev_state=S ==> next_comm=rcu_perf_writer next_pid=163 next_prio=98
> [   31.109635] rcu_pree-10      0d..1 3049301us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   32.627686] rcu_pree-10      0d..1 3353298us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   34.071163] rcu_pree-10      0d..1 3657299us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> 
> Several context-switches on CPU 3, but grace-period is still extended:
> 
> [   34.634814] rcu_perf-163     3d..2 3775310us : sched_switch: prev_comm=rcu_perf_writer prev_pid=163 prev_prio=98 prev_state=R+ ==> next_comm=kworker/3:0 next_pid=28 next_prio=120
> [   34.638532] kworker/-28      3d..2 3775343us : sched_switch: prev_comm=kworker/3:0 prev_pid=28 prev_prio=120 prev_state=D ==> next_comm=cpuhp/3 next_pid=25 next_prio=120
> [   34.640338]  cpuhp/3-25      3d..2 3775348us : sched_switch: prev_comm=cpuhp/3 prev_pid=25 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
> [   34.653831]   <idle>-0       3d..2 3775522us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/3:0 next_pid=28 next_prio=120
> [   34.657770] kworker/-28      3d..2 3775536us : sched_switch: prev_comm=kworker/3:0 prev_pid=28 prev_prio=120 prev_state=I ==> next_comm=kworker/3:1 next_pid=177 next_prio=120
> [   34.661758] kworker/-177     3d..2 3775543us : sched_switch: prev_comm=kworker/3:1 prev_pid=177 prev_prio=120 prev_state=I ==> next_comm=swapper/3 next_pid=0 next_prio=120
> [   34.866007]   <idle>-0       3d..2 3789967us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/3:0H next_pid=29 next_prio=100
> [   34.874200] kworker/-29      3d..2 3789988us : sched_switch: prev_comm=kworker/3:0H prev_pid=29 prev_prio=100 prev_state=I ==> next_comm=swapper/3 next_pid=0 next_prio=120
> [   35.128506]   <idle>-0       3d..2 3816391us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=cpuhp/3 next_pid=25 next_prio=120
> [   35.130481]  cpuhp/3-25      3d..2 3816503us : sched_switch: prev_comm=cpuhp/3 prev_pid=25 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120
> [   35.509469]   <idle>-0       3d..2 3828462us : sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_perf_writer next_pid=163 next_prio=98
> 
> Scheduler tick doesn't do much to help:
> 
> [   35.512436] rcu_perf-163     3d.h. 3829210us : rcu_sched_clock_irq: sched-tick
> 
> Now the scheduler tick is completely off for the next 29 seconds. The FQS loop
> keeps doing resched_cpu() on CPU 3 at 300-jiffy intervals (my HZ=1000):
> 
> [   36.160145] rcu_pree-10      0d..1 3958214us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   38.778574] rcu_pree-10      0d..1 4262288us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   40.604108] rcu_pree-10      0d..1 4566246us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   42.565345] rcu_pree-10      0d..1 4870294us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   44.322041] rcu_pree-10      0d..1 5174293us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   46.083710] rcu_pree-10      0d..1 5478290us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   47.851449] rcu_pree-10      0d..1 5782289us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   49.693127] rcu_pree-10      0d..1 6086293us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   51.432018] rcu_pree-10      0d..1 6390283us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   53.187991] rcu_pree-10      0d..1 6694294us : rcu_implicit_dynticks_qs: Sending urgent resched to cpu 3
> [   53.970850] rcu_perf-163     3.... 6828237us : rcu_perf_writer: End of rcuperf test
> 
> I feel we could do one of:
> 1. Call rcu_momentary_dyntick_idle() from the re-schedule IPI handler

This would not be so good in the case where said handler interrupted
an RCU read-side critical section.

> 2. Raise the RCU softirq for NOHZ_FULL CPUs from re-schedule IPI handler or timer tick.

This could work, but we normally need multiple softirq invocations to
get to a quiescent state.  So we really need to turn the scheduler-clock
interrupt on.

We do have a hook into the interrupt handlers in the guise of the
irq==true case in rcu_nmi_exit_common().  When switching to an extended
quiescent state, there is nothing to do because FQS will detect the
quiescent state on the next pass.  Otherwise, the scheduler-clock
tick could be turned on.  But only if this is a nohz_full CPU and the
rdp->rcu_urgent_qs flag is set and the rdp->dynticks_nmi_nesting value
is 2.

We also would need to turn the scheduler-clock interrupt off at some
point, presumably when a CPU reports its quiescent state to RCU core.

Interestingly enough, rcu_torture_fwd_prog_nr() detected this, but
my response was to add this to the beginning and end of it:

	if (IS_ENABLED(CONFIG_NO_HZ_FULL))
		tick_dep_set_task(current, TICK_DEP_MASK_RCU);

	...

	if (IS_ENABLED(CONFIG_NO_HZ_FULL))
		tick_dep_clear_task(current, TICK_DEP_MASK_RCU);

Thus, one could argue that this functionality should instead be in the
nohz_full code, but one has to start somewhere.  The goal would be to
be able to remove these statements from rcu_torture_fwd_prog_nr().

> What do you think?  Also test diff is below.

Looks reasonable, though the long-term approach is to remove the above
lines from rcu_torture_fwd_prog_nr().  :-/

Untested patch below that includes some inadvertent whitespace fixes,
probably does not even build.  Thoughts?

							Thanx, Paul

------------------------------------------------------------------------

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 8c494a692728..ad906d6a74fb 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -651,6 +651,12 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
 	 */
 	if (rdp->dynticks_nmi_nesting != 1) {
 		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks);
+		if (tick_nohz_full_cpu(rdp->cpu) &&
+		    rdp->dynticks_nmi_nesting == 2 &&
+		    rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
+			rdp->rcu_forced_tick = true;
+			tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
+		}
 		WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */
 			   rdp->dynticks_nmi_nesting - 2);
 		return;
@@ -886,6 +892,16 @@ void rcu_irq_enter_irqson(void)
 	local_irq_restore(flags);
 }
 
+/*
+ * If the scheduler-clock interrupt was enabled on a nohz_full CPU
+ * in order to get to a quiescent state, disable it.
+ */
+void rcu_disable_tick_upon_qs(struct rcu_data *rdp)
+{
+	if (tick_nohz_full_cpu(rdp->cpu) && rdp->rcu_forced_tick)
+		tick_dep_clear_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
+}
+
 /**
  * rcu_is_watching - see if RCU thinks that the current CPU is not idle
  *
@@ -1980,6 +1996,7 @@ rcu_report_qs_rdp(int cpu, struct rcu_data *rdp)
 		if (!offloaded)
 			needwake = rcu_accelerate_cbs(rnp, rdp);
 
+		rcu_disable_tick_upon_qs(rdp);
 		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
 		/* ^^^ Released rnp->lock */
 		if (needwake)
@@ -2269,6 +2286,7 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
 	int cpu;
 	unsigned long flags;
 	unsigned long mask;
+	struct rcu_data *rdp;
 	struct rcu_node *rnp;
 
 	rcu_for_each_leaf_node(rnp) {
@@ -2293,8 +2311,10 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
 		for_each_leaf_node_possible_cpu(rnp, cpu) {
 			unsigned long bit = leaf_node_cpu_bit(rnp, cpu);
 			if ((rnp->qsmask & bit) != 0) {
-				if (f(per_cpu_ptr(&rcu_data, cpu)))
-					mask |= bit;
+				rdp = per_cpu_ptr(&rcu_data, cpu);
+				if (f(rdp))
+					rcu_disable_tick_upon_qs(rdp);
+				mask |= bit;
 			}
 		}
 		if (mask != 0) {
@@ -2322,7 +2342,7 @@ void rcu_force_quiescent_state(void)
 	rnp = __this_cpu_read(rcu_data.mynode);
 	for (; rnp != NULL; rnp = rnp->parent) {
 		ret = (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) ||
-		      !raw_spin_trylock(&rnp->fqslock);
+		       !raw_spin_trylock(&rnp->fqslock);
 		if (rnp_old != NULL)
 			raw_spin_unlock(&rnp_old->fqslock);
 		if (ret)
@@ -2855,7 +2875,7 @@ static void rcu_barrier_callback(struct rcu_head *rhp)
 {
 	if (atomic_dec_and_test(&rcu_state.barrier_cpu_count)) {
 		rcu_barrier_trace(TPS("LastCB"), -1,
-				   rcu_state.barrier_sequence);
+				  rcu_state.barrier_sequence);
 		complete(&rcu_state.barrier_completion);
 	} else {
 		rcu_barrier_trace(TPS("CB"), -1, rcu_state.barrier_sequence);
@@ -2879,7 +2899,7 @@ static void rcu_barrier_func(void *unused)
 	} else {
 		debug_rcu_head_unqueue(&rdp->barrier_head);
 		rcu_barrier_trace(TPS("IRQNQ"), -1,
-				   rcu_state.barrier_sequence);
+				  rcu_state.barrier_sequence);
 	}
 	rcu_nocb_unlock(rdp);
 }
@@ -2906,7 +2926,7 @@ void rcu_barrier(void)
 	/* Did someone else do our work for us? */
 	if (rcu_seq_done(&rcu_state.barrier_sequence, s)) {
 		rcu_barrier_trace(TPS("EarlyExit"), -1,
-				   rcu_state.barrier_sequence);
+				  rcu_state.barrier_sequence);
 		smp_mb(); /* caller's subsequent code after above check. */
 		mutex_unlock(&rcu_state.barrier_mutex);
 		return;
@@ -2938,11 +2958,11 @@ void rcu_barrier(void)
 			continue;
 		if (rcu_segcblist_n_cbs(&rdp->cblist)) {
 			rcu_barrier_trace(TPS("OnlineQ"), cpu,
-					   rcu_state.barrier_sequence);
+					  rcu_state.barrier_sequence);
 			smp_call_function_single(cpu, rcu_barrier_func, NULL, 1);
 		} else {
 			rcu_barrier_trace(TPS("OnlineNQ"), cpu,
-					   rcu_state.barrier_sequence);
+					  rcu_state.barrier_sequence);
 		}
 	}
 	put_online_cpus();
@@ -3168,6 +3188,7 @@ void rcu_cpu_starting(unsigned int cpu)
 	rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq);
 	rdp->rcu_onl_gp_flags = READ_ONCE(rcu_state.gp_flags);
 	if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
+		rcu_disable_tick_upon_qs(rdp);
 		/* Report QS -after- changing ->qsmaskinitnext! */
 		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
 	} else {
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index c612f306fe89..055c31781d3a 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -181,6 +181,7 @@ struct rcu_data {
 	atomic_t dynticks;		/* Even value for idle, else odd. */
 	bool rcu_need_heavy_qs;		/* GP old, so heavy quiescent state! */
 	bool rcu_urgent_qs;		/* GP old need light quiescent state. */
+	bool rcu_forced_tick;		/* Forced tick to provide QS. */
 #ifdef CONFIG_RCU_FAST_NO_HZ
 	bool all_lazy;			/* All CPU's CBs lazy at idle start? */
 	unsigned long last_accelerate;	/* Last jiffy CBs were accelerated. */

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-12 23:01         ` Paul E. McKenney
@ 2019-08-13  1:02           ` Joel Fernandes
  2019-08-13  1:05             ` Joel Fernandes
                               ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Joel Fernandes @ 2019-08-13  1:02 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: rcu

On Mon, Aug 12, 2019 at 04:01:38PM -0700, Paul E. McKenney wrote:
> On Mon, Aug 12, 2019 at 05:20:13PM -0400, Joel Fernandes wrote:
> > On Sun, Aug 11, 2019 at 08:53:06PM -0700, Paul E. McKenney wrote:
> > > On Sun, Aug 11, 2019 at 11:21:42PM -0400, Joel Fernandes wrote:
> > > > On Sun, Aug 11, 2019 at 02:13:18PM -0700, Paul E. McKenney wrote:
> > > > [snip]
> > > > > This leaves NO_HZ_FULL=y&&PREEMPT=y kernels.  In that case, RCU is
> > > > > more aggressive about using resched_cpu() on CPUs that have not yet
> > > > > reported a quiescent state for the current grace period.
> > > > 
> > > > Just wanted to ask something - how does resched_cpu() help for this case?
> > > > 
> > > > Consider a nohz_full CPU and a PREEMPT=y kernel. Say a single task is running
> > > > in kernel mode with scheduler tick off. As we discussed, we have no help from
> > > > cond_resched() (since it's a PREEMPT=y kernel).  Because enough time has
> > > > passed (jtsq*3), we send the CPU a re-scheduling IPI.
> > > > 
> > > > This seems not that useful. Even if we enter the scheduler due to the
> > > > rescheduling flags set on that CPU, nothing will do the rcu_report_qs_rdp()
> > > > or rcu_report_qs_rnp() on those CPUs, which are needed to propagate the
> > > > quiescent state to the leaf node. Neither will anything to do a
> > > > rcu_momentary_dyntick_idle() for that CPU. Without this, the grace period
> > > > will still end up getting blocked.
> > > > 
> > > > Could you clarify which code paths on a nohz_full CPU running PREEMPT=y
> > > > kernel actually helps to end the grace period when we call resched_cpu() on
> > > > it?  Don't we need to at least do a rcu_momentary_dyntick_idle() from the
> > > > scheduler IPI handler or from resched_cpu() for the benefit of a nohz_full
> > > > CPU? Maybe I should do an experiment to see this all play out.
> > > 
> > > An experiment would be good!
> > 
> > Hi Paul,
> > Some very interesting findings!
> > 
> > Experiment: a tight loop rcuperf thread bound to a nohz_full CPU with
> > CONFIG_PREEMPT=y and CONFIG_NO_HZ_FULL=y. Executes for 5000 jiffies and
> > exits. Diff for test is appended.
> > 
> > Inference: I see that the tick is off on the nohz_full CPU 3 (the looping
> > thread is affined to CPU 3). The FQS loop does resched_cpu on 3, but the
> > grace period is unable to come to an end with the hold up seemingly due to
> > CPU 3.
> 
> Good catch!

Thanks!

> > I see that the scheduler tick is off mostly, but occasionally is turned back
> > on during the test loop. However it has no effect and the grace period is
> > stuck on the same rcu_state.gp_seq value for the duration of the test. I
> > think the scheduler-tick ineffectiveness could be because of this patch?
> > https://lore.kernel.org/patchwork/patch/446044/
> 
> Unlikely, given that __rcu_pending(), which is now just rcu_pending(),
> is only invoked from the scheduler-clock interrupt.

True, plus I missed that after a second the RCU core is invoked from the
tick (via rcu_pending()) even on a CPU designated as nohz_full, because
rcu_nohz_full_cpu() then returns false.

> But from your traces below, clearly something is in need of repair.

Ok.

> > Relevant traces, sorry I did not wrap it for better readability:
[snip]
> > I feel we could do one of:
> > 1. Call rcu_momentary_dyntick_idle() from the re-schedule IPI handler
> 
> This would not be so good in the case where said handler interrupted
> an RCU read-side critical section.

Agreed, let us scratch that idea.

> > 2. Raise the RCU softirq for NOHZ_FULL CPUs from re-schedule IPI handler or timer tick.
> 
> This could work, but we normally need multiple softirq invocations to
> get to a quiescent state.  So we really need to turn the scheduler-clock
> interrupt on.

I did not fully understand why multiple softirq invocations would be needed.
But I think it would mean moving the complexity of the tick into the IPI
handler, which could cause code duplication and the loss of other features
that the tick already has. So it is better to turn on the tick, as you
pointed out.

> We do have a hook into the interrupt handlers in the guise of the
> irq==true case in rcu_nmi_exit_common().  When switching to an extended
> quiescent state, there is nothing to do because FQS will detect the
> quiescent state on the next pass.  Otherwise, the scheduler-clock
> tick could be turned on.  But only if this is a nohz_full CPU and the
> rdp->rcu_urgent_qs flag is set and the rdp->dynticks_nmi_nesting value
> is 2.

dynticks_nmi_nesting == 2 means it is the outermost IRQ that interrupted
the CPU when it was not in a dynticks EQS state. Right?

> We also would need to turn the scheduler-clock interrupt off at some
> point, presumably when a CPU reports its quiescent state to RCU core.

Makes sense, both the sites you add this to look like the right places.

> Interestingly enough, rcu_torture_fwd_prog_nr() detected this, but
> my response was to add this to the beginning and end of it:
> 
> 	if (IS_ENABLED(CONFIG_NO_HZ_FULL))
> 		tick_dep_set_task(current, TICK_DEP_MASK_RCU);

Ok, not sure about the internals of this. But I am guessing this has the
effect of immediately turning on the tick.

> 
> 	...
> 
> 	if (IS_ENABLED(CONFIG_NO_HZ_FULL))
> 		tick_dep_clear_task(current, TICK_DEP_MASK_RCU);
> 
> Thus, one could argue that this functionality should instead be in the
> nohz_full code, but one has to start somewhere.  The goal would be to
> be able to remove these statements from rcu_torture_fwd_prog_nr().

Agreed.

> > What do you think?  Also test diff is below.
> 
> Looks reasonable, though the long-term approach is to remove the above
> lines from rcu_torture_fwd_prog_nr().  :-/
> 
> Untested patch below that includes some inadvertent whitespace fixes,
> probably does not even build.  Thoughts?

It looks correct to me; one comment below:

> 							Thanx, Paul
> 
> ------------------------------------------------------------------------
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index 8c494a692728..ad906d6a74fb 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -651,6 +651,12 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
>  	 */
>  	if (rdp->dynticks_nmi_nesting != 1) {
>  		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks);
> +		if (tick_nohz_full_cpu(rdp->cpu) &&
> +		    rdp->dynticks_nmi_nesting == 2 &&
> +		    rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
> +			rdp->rcu_forced_tick = true;
> +			tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> +		}


Instead of checking dynticks_nmi_nesting == 2 in rcu_nmi_exit_common(), can
we do the tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU) from
rcu_nmi_enter_common()? We could add this code there, under the "if
(rcu_dynticks_curr_cpu_in_eqs())" check.

I will test this patch tomorrow and let you know how it goes.

thanks,

 - Joel




>  		WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */
>  			   rdp->dynticks_nmi_nesting - 2);
>  		return;
> @@ -886,6 +892,16 @@ void rcu_irq_enter_irqson(void)
>  	local_irq_restore(flags);
>  }
>  
> +/*
> + * If the scheduler-clock interrupt was enabled on a nohz_full CPU
> + * in order to get to a quiescent state, disable it.
> + */
> +void rcu_disable_tick_upon_qs(struct rcu_data *rdp)
> +{
> +	if (tick_nohz_full_cpu(rdp->cpu) && rdp->rcu_forced_tick)
> +		tick_dep_clear_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> +}
> +
>  /**
>   * rcu_is_watching - see if RCU thinks that the current CPU is not idle
>   *
> @@ -1980,6 +1996,7 @@ rcu_report_qs_rdp(int cpu, struct rcu_data *rdp)
>  		if (!offloaded)
>  			needwake = rcu_accelerate_cbs(rnp, rdp);
>  
> +		rcu_disable_tick_upon_qs(rdp);
>  		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
>  		/* ^^^ Released rnp->lock */
>  		if (needwake)
> @@ -2269,6 +2286,7 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
>  	int cpu;
>  	unsigned long flags;
>  	unsigned long mask;
> +	struct rcu_data *rdp;
>  	struct rcu_node *rnp;
>  
>  	rcu_for_each_leaf_node(rnp) {
> @@ -2293,8 +2311,10 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
>  		for_each_leaf_node_possible_cpu(rnp, cpu) {
>  			unsigned long bit = leaf_node_cpu_bit(rnp, cpu);
>  			if ((rnp->qsmask & bit) != 0) {
> -				if (f(per_cpu_ptr(&rcu_data, cpu)))
> -					mask |= bit;
> +				rdp = per_cpu_ptr(&rcu_data, cpu);
> +				if (f(rdp))
> +					rcu_disable_tick_upon_qs(rdp);
> +				mask |= bit;
>  			}
>  		}
>  		if (mask != 0) {
> @@ -2322,7 +2342,7 @@ void rcu_force_quiescent_state(void)
>  	rnp = __this_cpu_read(rcu_data.mynode);
>  	for (; rnp != NULL; rnp = rnp->parent) {
>  		ret = (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) ||
> -		      !raw_spin_trylock(&rnp->fqslock);
> +		       !raw_spin_trylock(&rnp->fqslock);
>  		if (rnp_old != NULL)
>  			raw_spin_unlock(&rnp_old->fqslock);
>  		if (ret)
> @@ -2855,7 +2875,7 @@ static void rcu_barrier_callback(struct rcu_head *rhp)
>  {
>  	if (atomic_dec_and_test(&rcu_state.barrier_cpu_count)) {
>  		rcu_barrier_trace(TPS("LastCB"), -1,
> -				   rcu_state.barrier_sequence);
> +				  rcu_state.barrier_sequence);
>  		complete(&rcu_state.barrier_completion);
>  	} else {
>  		rcu_barrier_trace(TPS("CB"), -1, rcu_state.barrier_sequence);
> @@ -2879,7 +2899,7 @@ static void rcu_barrier_func(void *unused)
>  	} else {
>  		debug_rcu_head_unqueue(&rdp->barrier_head);
>  		rcu_barrier_trace(TPS("IRQNQ"), -1,
> -				   rcu_state.barrier_sequence);
> +				  rcu_state.barrier_sequence);
>  	}
>  	rcu_nocb_unlock(rdp);
>  }
> @@ -2906,7 +2926,7 @@ void rcu_barrier(void)
>  	/* Did someone else do our work for us? */
>  	if (rcu_seq_done(&rcu_state.barrier_sequence, s)) {
>  		rcu_barrier_trace(TPS("EarlyExit"), -1,
> -				   rcu_state.barrier_sequence);
> +				  rcu_state.barrier_sequence);
>  		smp_mb(); /* caller's subsequent code after above check. */
>  		mutex_unlock(&rcu_state.barrier_mutex);
>  		return;
> @@ -2938,11 +2958,11 @@ void rcu_barrier(void)
>  			continue;
>  		if (rcu_segcblist_n_cbs(&rdp->cblist)) {
>  			rcu_barrier_trace(TPS("OnlineQ"), cpu,
> -					   rcu_state.barrier_sequence);
> +					  rcu_state.barrier_sequence);
>  			smp_call_function_single(cpu, rcu_barrier_func, NULL, 1);
>  		} else {
>  			rcu_barrier_trace(TPS("OnlineNQ"), cpu,
> -					   rcu_state.barrier_sequence);
> +					  rcu_state.barrier_sequence);
>  		}
>  	}
>  	put_online_cpus();
> @@ -3168,6 +3188,7 @@ void rcu_cpu_starting(unsigned int cpu)
>  	rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq);
>  	rdp->rcu_onl_gp_flags = READ_ONCE(rcu_state.gp_flags);
>  	if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
> +		rcu_disable_tick_upon_qs(rdp);
>  		/* Report QS -after- changing ->qsmaskinitnext! */
>  		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
>  	} else {
> diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> index c612f306fe89..055c31781d3a 100644
> --- a/kernel/rcu/tree.h
> +++ b/kernel/rcu/tree.h
> @@ -181,6 +181,7 @@ struct rcu_data {
>  	atomic_t dynticks;		/* Even value for idle, else odd. */
>  	bool rcu_need_heavy_qs;		/* GP old, so heavy quiescent state! */
>  	bool rcu_urgent_qs;		/* GP old need light quiescent state. */
> +	bool rcu_forced_tick;		/* Forced tick to provide QS. */
>  #ifdef CONFIG_RCU_FAST_NO_HZ
>  	bool all_lazy;			/* All CPU's CBs lazy at idle start? */
>  	unsigned long last_accelerate;	/* Last jiffy CBs were accelerated. */

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-13  1:02           ` Joel Fernandes
@ 2019-08-13  1:05             ` Joel Fernandes
  2019-08-13  2:28               ` Paul E. McKenney
  2019-08-13  2:27             ` Paul E. McKenney
  2019-08-15 17:17             ` Paul E. McKenney
  2 siblings, 1 reply; 31+ messages in thread
From: Joel Fernandes @ 2019-08-13  1:05 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: rcu

On Mon, Aug 12, 2019 at 09:02:49PM -0400, Joel Fernandes wrote:
[snip]
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 8c494a692728..ad906d6a74fb 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -651,6 +651,12 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
> >  	 */
> >  	if (rdp->dynticks_nmi_nesting != 1) {
> >  		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks);
> > +		if (tick_nohz_full_cpu(rdp->cpu) &&
> > +		    rdp->dynticks_nmi_nesting == 2 &&
> > +		    rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
> > +			rdp->rcu_forced_tick = true;
> > +			tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> > +		}
> 
> 
> Instead of checking dynticks_nmi_nesting == 2 in rcu_nmi_exit_common(), can
> we do the tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU)  from
> rcu_nmi_enter_common() ? We could add this code there, under the "if
> (rcu_dynticks_curr_cpu_in_eqs())".

Actually, it is the other way around: if (!rcu_dynticks_curr_cpu_in_eqs()),
then do the tick_dep_set_cpu().

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-13  1:02           ` Joel Fernandes
  2019-08-13  1:05             ` Joel Fernandes
@ 2019-08-13  2:27             ` Paul E. McKenney
  2019-08-13  2:50               ` Paul E. McKenney
  2019-08-15 17:17             ` Paul E. McKenney
  2 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-08-13  2:27 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: rcu

On Mon, Aug 12, 2019 at 09:02:49PM -0400, Joel Fernandes wrote:
> On Mon, Aug 12, 2019 at 04:01:38PM -0700, Paul E. McKenney wrote:
> > On Mon, Aug 12, 2019 at 05:20:13PM -0400, Joel Fernandes wrote:
> > > On Sun, Aug 11, 2019 at 08:53:06PM -0700, Paul E. McKenney wrote:
> > > > On Sun, Aug 11, 2019 at 11:21:42PM -0400, Joel Fernandes wrote:
> > > > > On Sun, Aug 11, 2019 at 02:13:18PM -0700, Paul E. McKenney wrote:
> > > > > [snip]
> > > > > > This leaves NO_HZ_FULL=y&&PREEMPT=y kernels.  In that case, RCU is
> > > > > > more aggressive about using resched_cpu() on CPUs that have not yet
> > > > > > reported a quiescent state for the current grace period.
> > > > > 
> > > > > Just wanted to ask something - how does resched_cpu() help for this case?
> > > > > 
> > > > > Consider a nohz_full CPU and a PREEMPT=y kernel. Say a single task is running
> > > > > in kernel mode with scheduler tick off. As we discussed, we have no help from
> > > > > cond_resched() (since its a PREEMPT=y kernel).  Because enough time has
> > > > > passed (jtsq*3), we send the CPU a re-scheduling IPI.
> > > > > 
> > > > > This seems not that useful. Even if we enter the scheduler due to the
> > > > > rescheduling flags set on that CPU, nothing will do the rcu_report_qs_rdp()
> > > > > or rcu_report_qs_rnp() on those CPUs, which are needed to propagate the
> > > > > quiescent state to the leaf node. Neither will anything to do a
> > > > > rcu_momentary_dyntick_idle() for that CPU. Without this, the grace period
> > > > > will still end up getting blocked.
> > > > > 
> > > > > Could you clarify which code paths on a nohz_full CPU running PREEMPT=y
> > > > > kernel actually helps to end the grace period when we call resched_cpu() on
> > > > > it?  Don't we need atleast do a rcu_momentary_dyntick_idle() from the
> > > > > scheduler IPI handler or from resched_cpu() for the benefit of a nohz_full
> > > > > CPU? Maybe I should do an experiment to see this all play out.
> > > > 
> > > > An experiment would be good!
> > > 
> > > Hi Paul,
> > > Some very interesting findings!
> > > 
> > > Experiment: a tight loop rcuperf thread bound to a nohz_full CPU with
> > > CONFIG_PREEMPT=y and CONFIG_NO_HZ_FULL=y. Executes for 5000 jiffies and
> > > exits. Diff for test is appended.
> > > 
> > > Inference: I see that the tick is off on the nohz_full CPU 3 (the looping
> > > thread is affined to CPU 3). The FQS loop does resched_cpu on 3, but the
> > > grace period is unable to come to an end with the hold up seemingly due to
> > > CPU 3.
> > 
> > Good catch!
> 
> Thanks!
> 
> > > I see that the scheduler tick is off mostly, but occasionally is turned back
> > > on during the test loop. However it has no effect and the grace period is
> > > stuck on the same rcu_state.gp_seq value for the duration of the test. I
> > > think the scheduler-tick ineffectiveness could be because of this patch?
> > > https://lore.kernel.org/patchwork/patch/446044/
> > 
> > Unlikely, given that __rcu_pending(), which is now just rcu_pending(),
> > is only invoked from the scheduler-clock interrupt.
> 
> True, plus I missed that after a second, the RCU core is invoked from the
> tick (via rcu_pending) even from a CPU designated as nohz_full
> (rcu_nohz_full_cpu returns false).
> 
> > But from your traces below, clearly something is in need of repair.
> 
> Ok.
> 
> > > Relevant traces, sorry I did not wrap it for better readability:
> [snip]
> > > I feel we could do one of:
> > > 1. Call rcu_momentary_dyntick_idle() from the re-schedule IPI handler
> > 
> > This would not be so good in the case where said handler interrupted
> > an RCU read-side critical section.
> 
> Agreed, let us scratch that idea.
> 
> > > 2. Raise the RCU softirq for NOHZ_FULL CPUs from re-schedule IPI handler or timer tick.
> > 
> > This could work, but we normally need multiple softirq invocations to
> > get to a quiescent state.  So we really need to turn the scheduler-clock
> > interrupt on.
> 
> I did not fully understand why multiple softirq invocation would be needed.
> But I think it would mean moving the complexity of the tick to the IPI
> handler, which could possibly cause code duplication and missing of other
> features that the tick already has. So it is better to turn on the tick as
> you pointed.
> 
> > We do have a hook into the interrupt handlers in the guise of the
> > irq==true case in rcu_nmi_exit_common().  When switching to an extended
> > quiescent state, there is nothing to do because FQS will detect the
> > quiescent state on the next pass.  Otherwise, the scheduler-clock
> > tick could be turned on.  But only if this is a nohz_full CPU and the
> > rdp->rcu_urgent_qs flag is set and the rdp->dynticks_nmi_nesting value
> > is 2.
> 
> dynticks_nmi_nesting == 2 means it is an outer-most IRQ that interrupted
> the CPU when it was not in a dynticks EQS state. Right?

Yes, that is the idea.

> > We also would need to turn the scheduler-clock interrupt off at some
> > point, presumably when a CPU reports its quiescent state to RCU core.
> 
> Makes sense, both the sites you add this to look like the right places.
> 
> > Interestingly enough, rcu_torture_fwd_prog_nr() detected this, but
> > my response was to add this to the beginning and end of it:
> > 
> > 	if (IS_ENABLED(CONFIG_NO_HZ_FULL))
> > 		tick_dep_set_task(current, TICK_DEP_MASK_RCU);
> 
> Ok, not sure about the internals of this. But I am guessing this has the
> effect of immediately turning on the tick.

For that task, yes, but I am not so sure about the "immediately" part.

> > 	...
> > 
> > 	if (IS_ENABLED(CONFIG_NO_HZ_FULL))
> > 		tick_dep_clear_task(current, TICK_DEP_MASK_RCU);
> > 
> > Thus, one could argue that this functionality should instead be in the
> > nohz_full code, but one has to start somewhere.  The goal would be to
> > be able to remove these statements from rcu_torture_fwd_prog_nr().
> 
> Agreed.
> 
> > > What do you think?  Also test diff is below.
> > 
> > Looks reasonable, though the long-term approach is to remove the above
> > lines from rcu_torture_fwd_prog_nr().  :-/
> > 
> > Untested patch below that includes some inadvertent whitespace fixes,
> > probably does not even build.  Thoughts?
> 
> It looks correct to me, one comment below:
> 
> > 							Thanx, Paul
> > 
> > ------------------------------------------------------------------------
> > 
> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 8c494a692728..ad906d6a74fb 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -651,6 +651,12 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
> >  	 */
> >  	if (rdp->dynticks_nmi_nesting != 1) {
> >  		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks);
> > +		if (tick_nohz_full_cpu(rdp->cpu) &&
> > +		    rdp->dynticks_nmi_nesting == 2 &&
> > +		    rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
> > +			rdp->rcu_forced_tick = true;
> > +			tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> > +		}
> 
> 
> Instead of checking dynticks_nmi_nesting == 2 in rcu_nmi_exit_common(), can
> we do the tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU)  from
> rcu_nmi_enter_common() ? We could add this code there, under the "if
> (rcu_dynticks_curr_cpu_in_eqs())".

It might well make sense to do this on entry, especially if doing so
might save an instruction or two.  It won't make the tick happen much
faster, given that interrupts are disabled throughout the handler.

> I will test this patch tomorrow and let you know how it goes.

No need, it dies horribly and quickly with rcutorture.  Not exactly
a surprise.  ;-)

							Thanx, Paul

> thanks,
> 
>  - Joel
> 
> 
> 
> 
> >  		WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */
> >  			   rdp->dynticks_nmi_nesting - 2);
> >  		return;
> > @@ -886,6 +892,16 @@ void rcu_irq_enter_irqson(void)
> >  	local_irq_restore(flags);
> >  }
> >  
> > +/*
> > + * If the scheduler-clock interrupt was enabled on a nohz_full CPU
> > + * in order to get to a quiescent state, disable it.
> > + */
> > +void rcu_disable_tick_upon_qs(struct rcu_data *rdp)
> > +{
> > +	if (tick_nohz_full_cpu(rdp->cpu) && rdp->rcu_forced_tick)
> > +		tick_dep_clear_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> > +}
> > +
> >  /**
> >   * rcu_is_watching - see if RCU thinks that the current CPU is not idle
> >   *
> > @@ -1980,6 +1996,7 @@ rcu_report_qs_rdp(int cpu, struct rcu_data *rdp)
> >  		if (!offloaded)
> >  			needwake = rcu_accelerate_cbs(rnp, rdp);
> >  
> > +		rcu_disable_tick_upon_qs(rdp);
> >  		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
> >  		/* ^^^ Released rnp->lock */
> >  		if (needwake)
> > @@ -2269,6 +2286,7 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
> >  	int cpu;
> >  	unsigned long flags;
> >  	unsigned long mask;
> > +	struct rcu_data *rdp;
> >  	struct rcu_node *rnp;
> >  
> >  	rcu_for_each_leaf_node(rnp) {
> > @@ -2293,8 +2311,10 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
> >  		for_each_leaf_node_possible_cpu(rnp, cpu) {
> >  			unsigned long bit = leaf_node_cpu_bit(rnp, cpu);
> >  			if ((rnp->qsmask & bit) != 0) {
> > -				if (f(per_cpu_ptr(&rcu_data, cpu)))
> > -					mask |= bit;
> > +				rdp = per_cpu_ptr(&rcu_data, cpu);
> > +				if (f(rdp))
> > +					rcu_disable_tick_upon_qs(rdp);
> > +				mask |= bit;
> >  			}
> >  		}
> >  		if (mask != 0) {
> > @@ -2322,7 +2342,7 @@ void rcu_force_quiescent_state(void)
> >  	rnp = __this_cpu_read(rcu_data.mynode);
> >  	for (; rnp != NULL; rnp = rnp->parent) {
> >  		ret = (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) ||
> > -		      !raw_spin_trylock(&rnp->fqslock);
> > +		       !raw_spin_trylock(&rnp->fqslock);
> >  		if (rnp_old != NULL)
> >  			raw_spin_unlock(&rnp_old->fqslock);
> >  		if (ret)
> > @@ -2855,7 +2875,7 @@ static void rcu_barrier_callback(struct rcu_head *rhp)
> >  {
> >  	if (atomic_dec_and_test(&rcu_state.barrier_cpu_count)) {
> >  		rcu_barrier_trace(TPS("LastCB"), -1,
> > -				   rcu_state.barrier_sequence);
> > +				  rcu_state.barrier_sequence);
> >  		complete(&rcu_state.barrier_completion);
> >  	} else {
> >  		rcu_barrier_trace(TPS("CB"), -1, rcu_state.barrier_sequence);
> > @@ -2879,7 +2899,7 @@ static void rcu_barrier_func(void *unused)
> >  	} else {
> >  		debug_rcu_head_unqueue(&rdp->barrier_head);
> >  		rcu_barrier_trace(TPS("IRQNQ"), -1,
> > -				   rcu_state.barrier_sequence);
> > +				  rcu_state.barrier_sequence);
> >  	}
> >  	rcu_nocb_unlock(rdp);
> >  }
> > @@ -2906,7 +2926,7 @@ void rcu_barrier(void)
> >  	/* Did someone else do our work for us? */
> >  	if (rcu_seq_done(&rcu_state.barrier_sequence, s)) {
> >  		rcu_barrier_trace(TPS("EarlyExit"), -1,
> > -				   rcu_state.barrier_sequence);
> > +				  rcu_state.barrier_sequence);
> >  		smp_mb(); /* caller's subsequent code after above check. */
> >  		mutex_unlock(&rcu_state.barrier_mutex);
> >  		return;
> > @@ -2938,11 +2958,11 @@ void rcu_barrier(void)
> >  			continue;
> >  		if (rcu_segcblist_n_cbs(&rdp->cblist)) {
> >  			rcu_barrier_trace(TPS("OnlineQ"), cpu,
> > -					   rcu_state.barrier_sequence);
> > +					  rcu_state.barrier_sequence);
> >  			smp_call_function_single(cpu, rcu_barrier_func, NULL, 1);
> >  		} else {
> >  			rcu_barrier_trace(TPS("OnlineNQ"), cpu,
> > -					   rcu_state.barrier_sequence);
> > +					  rcu_state.barrier_sequence);
> >  		}
> >  	}
> >  	put_online_cpus();
> > @@ -3168,6 +3188,7 @@ void rcu_cpu_starting(unsigned int cpu)
> >  	rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq);
> >  	rdp->rcu_onl_gp_flags = READ_ONCE(rcu_state.gp_flags);
> >  	if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
> > +		rcu_disable_tick_upon_qs(rdp);
> >  		/* Report QS -after- changing ->qsmaskinitnext! */
> >  		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
> >  	} else {
> > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > index c612f306fe89..055c31781d3a 100644
> > --- a/kernel/rcu/tree.h
> > +++ b/kernel/rcu/tree.h
> > @@ -181,6 +181,7 @@ struct rcu_data {
> >  	atomic_t dynticks;		/* Even value for idle, else odd. */
> >  	bool rcu_need_heavy_qs;		/* GP old, so heavy quiescent state! */
> >  	bool rcu_urgent_qs;		/* GP old need light quiescent state. */
> > +	bool rcu_forced_tick;		/* Forced tick to provide QS. */
> >  #ifdef CONFIG_RCU_FAST_NO_HZ
> >  	bool all_lazy;			/* All CPU's CBs lazy at idle start? */
> >  	unsigned long last_accelerate;	/* Last jiffy CBs were accelerated. */

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-13  1:05             ` Joel Fernandes
@ 2019-08-13  2:28               ` Paul E. McKenney
  0 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2019-08-13  2:28 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: rcu

On Mon, Aug 12, 2019 at 09:05:25PM -0400, Joel Fernandes wrote:
> On Mon, Aug 12, 2019 at 09:02:49PM -0400, Joel Fernandes wrote:
> [snip]
> > > 							Thanx, Paul
> > > 
> > > ------------------------------------------------------------------------
> > > 
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 8c494a692728..ad906d6a74fb 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -651,6 +651,12 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
> > >  	 */
> > >  	if (rdp->dynticks_nmi_nesting != 1) {
> > >  		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks);
> > > +		if (tick_nohz_full_cpu(rdp->cpu) &&
> > > +		    rdp->dynticks_nmi_nesting == 2 &&
> > > +		    rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
> > > +			rdp->rcu_forced_tick = true;
> > > +			tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> > > +		}
> > 
> > 
> > Instead of checking dynticks_nmi_nesting == 2 in rcu_nmi_exit_common(), can
> > we do the tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU)  from
> > rcu_nmi_enter_common() ? We could add this code there, under the "if
> > (rcu_dynticks_curr_cpu_in_eqs())".
> 
> Actually, the other way. if (!rcu_dynticks_curr_cpu_in_eqs()) then do the
> tick_dep_set_cpu().

That does sound more likely.  Still, we need to work out what the
breakage is.

							Thanx, Paul


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-13  2:27             ` Paul E. McKenney
@ 2019-08-13  2:50               ` Paul E. McKenney
  0 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2019-08-13  2:50 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: rcu

On Mon, Aug 12, 2019 at 07:27:26PM -0700, Paul E. McKenney wrote:
> On Mon, Aug 12, 2019 at 09:02:49PM -0400, Joel Fernandes wrote:
> > On Mon, Aug 12, 2019 at 04:01:38PM -0700, Paul E. McKenney wrote:

[ . . . ]

> > Instead of checking dynticks_nmi_nesting == 2 in rcu_nmi_exit_common(), can
> > we do the tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU)  from
> > rcu_nmi_enter_common() ? We could add this code there, under the "if
> > (rcu_dynticks_curr_cpu_in_eqs())".
> 
> It might well make sense to do this on entry, especially if doing so
> might safe an instruction or two.  It won't make the tick happen much
> faster, given that interrupts are disabled throughout the handler.
> 
> > I will test this patch tomorrow and let you know how it goes.
> 
> No need, it dies horribly and quickly with rcutorture.  Not exactly
> a surprise.  ;-)

And also not a surprise, due to a laughably stupid mistake...

							Thanx, Paul

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-13  1:02           ` Joel Fernandes
  2019-08-13  1:05             ` Joel Fernandes
  2019-08-13  2:27             ` Paul E. McKenney
@ 2019-08-15 17:17             ` Paul E. McKenney
  2019-08-15 20:04               ` Joel Fernandes
  2 siblings, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-08-15 17:17 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: rcu

On Mon, Aug 12, 2019 at 09:02:49PM -0400, Joel Fernandes wrote:
> On Mon, Aug 12, 2019 at 04:01:38PM -0700, Paul E. McKenney wrote:

[ . . . ]

> > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > index 8c494a692728..ad906d6a74fb 100644
> > --- a/kernel/rcu/tree.c
> > +++ b/kernel/rcu/tree.c
> > @@ -651,6 +651,12 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
> >  	 */
> >  	if (rdp->dynticks_nmi_nesting != 1) {
> >  		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks);
> > +		if (tick_nohz_full_cpu(rdp->cpu) &&
> > +		    rdp->dynticks_nmi_nesting == 2 &&
> > +		    rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
> > +			rdp->rcu_forced_tick = true;
> > +			tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> > +		}
> 
> 
> Instead of checking dynticks_nmi_nesting == 2 in rcu_nmi_exit_common(), can
> we do the tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU)  from
> rcu_nmi_enter_common() ? We could add this code there, under the "if
> (rcu_dynticks_curr_cpu_in_eqs())".

This would need to go in an "else" clause, correct?  But we would still
want a check for an interrupt from base level (which would admittedly
be an equality comparison with zero), and we would also still need to
check for rdp->rcu_urgent_qs && !rdp->rcu_forced_tick.

Still, an equal-zero comparison is probably going to be a bit cheaper than
an equals-two comparison, and this is on the interrupt-entry fastpath,
so this change is likely worth making.  Good call!!!

							Thanx, Paul

> I will test this patch tomorrow and let you know how it goes.
> 
> thanks,
> 
>  - Joel
> 
> 
> 
> 
> >  		WRITE_ONCE(rdp->dynticks_nmi_nesting, /* No store tearing. */
> >  			   rdp->dynticks_nmi_nesting - 2);
> >  		return;
> > @@ -886,6 +892,16 @@ void rcu_irq_enter_irqson(void)
> >  	local_irq_restore(flags);
> >  }
> >  
> > +/*
> > + * If the scheduler-clock interrupt was enabled on a nohz_full CPU
> > + * in order to get to a quiescent state, disable it.
> > + */
> > +void rcu_disable_tick_upon_qs(struct rcu_data *rdp)
> > +{
> > +	if (tick_nohz_full_cpu(rdp->cpu) && rdp->rcu_forced_tick)
> > +		tick_dep_clear_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> > +}
> > +
> >  /**
> >   * rcu_is_watching - see if RCU thinks that the current CPU is not idle
> >   *
> > @@ -1980,6 +1996,7 @@ rcu_report_qs_rdp(int cpu, struct rcu_data *rdp)
> >  		if (!offloaded)
> >  			needwake = rcu_accelerate_cbs(rnp, rdp);
> >  
> > +		rcu_disable_tick_upon_qs(rdp);
> >  		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
> >  		/* ^^^ Released rnp->lock */
> >  		if (needwake)
> > @@ -2269,6 +2286,7 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
> >  	int cpu;
> >  	unsigned long flags;
> >  	unsigned long mask;
> > +	struct rcu_data *rdp;
> >  	struct rcu_node *rnp;
> >  
> >  	rcu_for_each_leaf_node(rnp) {
> > @@ -2293,8 +2311,10 @@ static void force_qs_rnp(int (*f)(struct rcu_data *rdp))
> >  		for_each_leaf_node_possible_cpu(rnp, cpu) {
> >  			unsigned long bit = leaf_node_cpu_bit(rnp, cpu);
> >  			if ((rnp->qsmask & bit) != 0) {
> > -				if (f(per_cpu_ptr(&rcu_data, cpu)))
> > -					mask |= bit;
> > +				rdp = per_cpu_ptr(&rcu_data, cpu);
> > +				if (f(rdp))
> > +					rcu_disable_tick_upon_qs(rdp);
> > +				mask |= bit;
> >  			}
> >  		}
> >  		if (mask != 0) {
> > @@ -2322,7 +2342,7 @@ void rcu_force_quiescent_state(void)
> >  	rnp = __this_cpu_read(rcu_data.mynode);
> >  	for (; rnp != NULL; rnp = rnp->parent) {
> >  		ret = (READ_ONCE(rcu_state.gp_flags) & RCU_GP_FLAG_FQS) ||
> > -		      !raw_spin_trylock(&rnp->fqslock);
> > +		       !raw_spin_trylock(&rnp->fqslock);
> >  		if (rnp_old != NULL)
> >  			raw_spin_unlock(&rnp_old->fqslock);
> >  		if (ret)
> > @@ -2855,7 +2875,7 @@ static void rcu_barrier_callback(struct rcu_head *rhp)
> >  {
> >  	if (atomic_dec_and_test(&rcu_state.barrier_cpu_count)) {
> >  		rcu_barrier_trace(TPS("LastCB"), -1,
> > -				   rcu_state.barrier_sequence);
> > +				  rcu_state.barrier_sequence);
> >  		complete(&rcu_state.barrier_completion);
> >  	} else {
> >  		rcu_barrier_trace(TPS("CB"), -1, rcu_state.barrier_sequence);
> > @@ -2879,7 +2899,7 @@ static void rcu_barrier_func(void *unused)
> >  	} else {
> >  		debug_rcu_head_unqueue(&rdp->barrier_head);
> >  		rcu_barrier_trace(TPS("IRQNQ"), -1,
> > -				   rcu_state.barrier_sequence);
> > +				  rcu_state.barrier_sequence);
> >  	}
> >  	rcu_nocb_unlock(rdp);
> >  }
> > @@ -2906,7 +2926,7 @@ void rcu_barrier(void)
> >  	/* Did someone else do our work for us? */
> >  	if (rcu_seq_done(&rcu_state.barrier_sequence, s)) {
> >  		rcu_barrier_trace(TPS("EarlyExit"), -1,
> > -				   rcu_state.barrier_sequence);
> > +				  rcu_state.barrier_sequence);
> >  		smp_mb(); /* caller's subsequent code after above check. */
> >  		mutex_unlock(&rcu_state.barrier_mutex);
> >  		return;
> > @@ -2938,11 +2958,11 @@ void rcu_barrier(void)
> >  			continue;
> >  		if (rcu_segcblist_n_cbs(&rdp->cblist)) {
> >  			rcu_barrier_trace(TPS("OnlineQ"), cpu,
> > -					   rcu_state.barrier_sequence);
> > +					  rcu_state.barrier_sequence);
> >  			smp_call_function_single(cpu, rcu_barrier_func, NULL, 1);
> >  		} else {
> >  			rcu_barrier_trace(TPS("OnlineNQ"), cpu,
> > -					   rcu_state.barrier_sequence);
> > +					  rcu_state.barrier_sequence);
> >  		}
> >  	}
> >  	put_online_cpus();
> > @@ -3168,6 +3188,7 @@ void rcu_cpu_starting(unsigned int cpu)
> >  	rdp->rcu_onl_gp_seq = READ_ONCE(rcu_state.gp_seq);
> >  	rdp->rcu_onl_gp_flags = READ_ONCE(rcu_state.gp_flags);
> >  	if (rnp->qsmask & mask) { /* RCU waiting on incoming CPU? */
> > +		rcu_disable_tick_upon_qs(rdp);
> >  		/* Report QS -after- changing ->qsmaskinitnext! */
> >  		rcu_report_qs_rnp(mask, rnp, rnp->gp_seq, flags);
> >  	} else {
> > diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
> > index c612f306fe89..055c31781d3a 100644
> > --- a/kernel/rcu/tree.h
> > +++ b/kernel/rcu/tree.h
> > @@ -181,6 +181,7 @@ struct rcu_data {
> >  	atomic_t dynticks;		/* Even value for idle, else odd. */
> >  	bool rcu_need_heavy_qs;		/* GP old, so heavy quiescent state! */
> >  	bool rcu_urgent_qs;		/* GP old need light quiescent state. */
> > +	bool rcu_forced_tick;		/* Forced tick to provide QS. */
> >  #ifdef CONFIG_RCU_FAST_NO_HZ
> >  	bool all_lazy;			/* All CPU's CBs lazy at idle start? */
> >  	unsigned long last_accelerate;	/* Last jiffy CBs were accelerated. */


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-15 17:17             ` Paul E. McKenney
@ 2019-08-15 20:04               ` Joel Fernandes
  2019-08-15 20:31                 ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Joel Fernandes @ 2019-08-15 20:04 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: rcu

On Thu, Aug 15, 2019 at 10:17:14AM -0700, Paul E. McKenney wrote:
> On Mon, Aug 12, 2019 at 09:02:49PM -0400, Joel Fernandes wrote:
> > On Mon, Aug 12, 2019 at 04:01:38PM -0700, Paul E. McKenney wrote:
> 
> [ . . . ]
> 
> > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > index 8c494a692728..ad906d6a74fb 100644
> > > --- a/kernel/rcu/tree.c
> > > +++ b/kernel/rcu/tree.c
> > > @@ -651,6 +651,12 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
> > >  	 */
> > >  	if (rdp->dynticks_nmi_nesting != 1) {
> > >  		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks);
> > > +		if (tick_nohz_full_cpu(rdp->cpu) &&
> > > +		    rdp->dynticks_nmi_nesting == 2 &&
> > > +		    rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
> > > +			rdp->rcu_forced_tick = true;
> > > +			tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> > > +		}
> > 
> > 
> > Instead of checking dynticks_nmi_nesting == 2 in rcu_nmi_exit_common(), can
> > we do the tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU)  from
> > rcu_nmi_enter_common() ? We could add this code there, under the "if
> > (rcu_dynticks_curr_cpu_in_eqs())".
> 
> This would need to go in an "else" clause, correct?  But there would still
> want to be a check for interrupt from base level (which would admittedly
> be an equality comparison with zero) and we would also still need to check
> for rdp->rcu_urgent_qs && !rdp->rcu_forced_tick.

True, agreed. I replied to this before saying it should be
!rcu_dynticks_curr_cpu_in_eqs() in the "if" ;) But it seems I could also be
missing the check for TICK_DEP_MASK_RCU in my tree so I think we need this as
well which is below as diff. Testing it more now!

And, with this I do get many more ticks during the test. But there are
intervals where the tick is not seen. Still it is much better than before:

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index be9707f68024..e697c7a2ce67 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -198,6 +198,10 @@ static bool check_tick_dependency(atomic_t *dep)
 		return true;
 	}
 
+	if (val & TICK_DEP_MASK_CLOCK_RCU) {
+		return true;
+	}
+
 	return false;
 }
 
> Still, an equal-zero comparison is probably going to be a bit cheaper than
> an equals-two comparison, and this is on the interrupt-entry fastpath,
> so this change is likely worth making.  Good call!!!

Thanks!!

 - Joel

[snip]


* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-15 20:04               ` Joel Fernandes
@ 2019-08-15 20:31                 ` Paul E. McKenney
  2019-08-15 21:22                   ` Joel Fernandes
                                     ` (2 more replies)
  0 siblings, 3 replies; 31+ messages in thread
From: Paul E. McKenney @ 2019-08-15 20:31 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: rcu, frederic

On Thu, Aug 15, 2019 at 04:04:32PM -0400, Joel Fernandes wrote:
> On Thu, Aug 15, 2019 at 10:17:14AM -0700, Paul E. McKenney wrote:
> > On Mon, Aug 12, 2019 at 09:02:49PM -0400, Joel Fernandes wrote:
> > > On Mon, Aug 12, 2019 at 04:01:38PM -0700, Paul E. McKenney wrote:
> > 
> > [ . . . ]
> > 
> > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > index 8c494a692728..ad906d6a74fb 100644
> > > > --- a/kernel/rcu/tree.c
> > > > +++ b/kernel/rcu/tree.c
> > > > @@ -651,6 +651,12 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
> > > >  	 */
> > > >  	if (rdp->dynticks_nmi_nesting != 1) {
> > > >  		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks);
> > > > +		if (tick_nohz_full_cpu(rdp->cpu) &&
> > > > +		    rdp->dynticks_nmi_nesting == 2 &&
> > > > +		    rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
> > > > +			rdp->rcu_forced_tick = true;
> > > > +			tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> > > > +		}
> > > 
> > > 
> > > Instead of checking dynticks_nmi_nesting == 2 in rcu_nmi_exit_common(), can
> > > we do the tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU)  from
> > > rcu_nmi_enter_common() ? We could add this code there, under the "if
> > > (rcu_dynticks_curr_cpu_in_eqs())".
> > 
> > This would need to go in an "else" clause, correct?  But there would still
> > want to be a check for interrupt from base level (which would admittedly
> > be an equality comparison with zero) and we would also still need to check
> > for rdp->rcu_urgent_qs && !rdp->rcu_forced_tick.
> 
> True, agreed. I replied to this before saying it should be
> !rcu_dynticks_curr_cpu_in_eqs() in the "if" ;) But it seems I could also be
> missing the check for TICK_DEP_MASK_RCU in my tree so I think we need this as
> well which is below as diff. Testing it more now!
> 
> And, with this I do get many more ticks during the test. But there are
> intervals where the tick is not seen. Still it is much better than before:
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index be9707f68024..e697c7a2ce67 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -198,6 +198,10 @@ static bool check_tick_dependency(atomic_t *dep)
>  		return true;
>  	}
>  
> +	if (val & TICK_DEP_MASK_CLOCK_RCU) {
> +		return true;
> +	}
> +
>  	return false;
>  }

That one is not in my tree, either.  Frederic, should I add this to
your patch?  For that matter, may I add your Signed-off-by as well?
Your original is in my -rcu tree at:

0cb41806c799 ("EXP nohz: Add TICK_DEP_BIT_RCU")

I am testing Joel's suggested addition now.

							Thanx, Paul

> > Still, an equal-zero comparison is probably going to be a bit cheaper than
> > an equals-two comparison, and this is on the interrupt-entry fastpath,
> > so this change is likely worth making.  Good call!!!
> 
> Thanks!!
> 
>  - Joel
> 
> [snip]


* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-15 20:31                 ` Paul E. McKenney
@ 2019-08-15 21:22                   ` Joel Fernandes
  2019-08-15 21:27                     ` Joel Fernandes
  2019-08-15 21:45                     ` Paul E. McKenney
  2019-08-19 12:09                   ` Frederic Weisbecker
  2019-08-19 16:57                   ` Frederic Weisbecker
  2 siblings, 2 replies; 31+ messages in thread
From: Joel Fernandes @ 2019-08-15 21:22 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: rcu, frederic

On Thu, Aug 15, 2019 at 01:31:07PM -0700, Paul E. McKenney wrote:
> On Thu, Aug 15, 2019 at 04:04:32PM -0400, Joel Fernandes wrote:
> > On Thu, Aug 15, 2019 at 10:17:14AM -0700, Paul E. McKenney wrote:
> > > On Mon, Aug 12, 2019 at 09:02:49PM -0400, Joel Fernandes wrote:
> > > > On Mon, Aug 12, 2019 at 04:01:38PM -0700, Paul E. McKenney wrote:
> > > 
> > > [ . . . ]
> > > 
> > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > index 8c494a692728..ad906d6a74fb 100644
> > > > > --- a/kernel/rcu/tree.c
> > > > > +++ b/kernel/rcu/tree.c
> > > > > @@ -651,6 +651,12 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
> > > > >  	 */
> > > > >  	if (rdp->dynticks_nmi_nesting != 1) {
> > > > >  		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks);
> > > > > +		if (tick_nohz_full_cpu(rdp->cpu) &&
> > > > > +		    rdp->dynticks_nmi_nesting == 2 &&
> > > > > +		    rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
> > > > > +			rdp->rcu_forced_tick = true;
> > > > > +			tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> > > > > +		}
> > > > 
> > > > 
> > > > Instead of checking dynticks_nmi_nesting == 2 in rcu_nmi_exit_common(), can
> > > > we do the tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU)  from
> > > > rcu_nmi_enter_common() ? We could add this code there, under the "if
> > > > (rcu_dynticks_curr_cpu_in_eqs())".
> > > 
> > > This would need to go in an "else" clause, correct?  But there would still
> > > want to be a check for interrupt from base level (which would admittedly
> > > be an equality comparison with zero) and we would also still need to check
> > > for rdp->rcu_urgent_qs && !rdp->rcu_forced_tick.
> > 
> > True, agreed. I replied to this before saying it should be
> > !rcu_dynticks_curr_cpu_in_eqs() in the "if" ;) But it seems I could also be
> > missing the check for TICK_DEP_MASK_RCU in my tree so I think we need this as
> > well which is below as diff. Testing it more now!
> > 
> > And, with this I do get many more ticks during the test. But there are
> > intervals where the tick is not seen. Still it is much better than before:
> > 
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index be9707f68024..e697c7a2ce67 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -198,6 +198,10 @@ static bool check_tick_dependency(atomic_t *dep)
> >  		return true;
> >  	}
> >  
> > +	if (val & TICK_DEP_MASK_CLOCK_RCU) {
> > +		return true;
> > +	}
> > +
> >  	return false;
> >  }
> 
> That one is not in my tree, either.  Frederic, should I add this to
> your patch?  For that matter, may I add your Signed-off-by as well?
> Your original is in my -rcu tree at:
> 
> 0cb41806c799 ("EXP nohz: Add TICK_DEP_BIT_RCU")
> 
> I am testing Joel's suggested addition now.

Actually there's more addition needed! I found another thing missing:

There's a per-cpu &tick_dep_mask and a per-cpu ts->tick_dep_mask. It seems
RCU is setting the latter.

So I added a check for both, below is the diff:

However, I see in some cases that the tick_dep_mask is just 0 but I have to
debug that tomorrow if that's an issue on the RCU side of things. For now,
below should be the completed Frederic patch which you could squash into his
if he's Ok with it:

---8<-----------------------

diff --git a/include/linux/tick.h b/include/linux/tick.h
index f92a10b5e112..3f476e2a4bf7 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -108,7 +108,8 @@ enum tick_dep_bits {
 	TICK_DEP_BIT_POSIX_TIMER	= 0,
 	TICK_DEP_BIT_PERF_EVENTS	= 1,
 	TICK_DEP_BIT_SCHED		= 2,
-	TICK_DEP_BIT_CLOCK_UNSTABLE	= 3
+	TICK_DEP_BIT_CLOCK_UNSTABLE	= 3,
+	TICK_DEP_BIT_RCU		= 4
 };
 
 #define TICK_DEP_MASK_NONE		0
@@ -116,6 +117,7 @@ enum tick_dep_bits {
 #define TICK_DEP_MASK_PERF_EVENTS	(1 << TICK_DEP_BIT_PERF_EVENTS)
 #define TICK_DEP_MASK_SCHED		(1 << TICK_DEP_BIT_SCHED)
 #define TICK_DEP_MASK_CLOCK_UNSTABLE	(1 << TICK_DEP_BIT_CLOCK_UNSTABLE)
+#define TICK_DEP_MASK_RCU		(1 << TICK_DEP_BIT_RCU)
 
 #ifdef CONFIG_NO_HZ_COMMON
 extern bool tick_nohz_enabled;
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index be9707f68024..a613916cc3f0 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -198,6 +198,11 @@ static bool check_tick_dependency(atomic_t *dep)
 		return true;
 	}
 
+	if (val & TICK_DEP_MASK_RCU) {
+		trace_tick_stop(0, TICK_DEP_MASK_RCU);
+		return true;
+	}
+
 	return false;
 }
 
@@ -208,8 +213,13 @@ static bool can_stop_full_tick(int cpu, struct tick_sched *ts)
 	if (unlikely(!cpu_online(cpu)))
 		return false;
 
-	if (check_tick_dependency(&tick_dep_mask))
+	if (check_tick_dependency(&ts->tick_dep_mask)) {
 		return false;
+	}
+
+	if (check_tick_dependency(&tick_dep_mask)) {
+		return false;
+	}
 
 	if (check_tick_dependency(&ts->tick_dep_mask))
 		return false;
-- 
2.23.0.rc1.153.gdeed80330f-goog



* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-15 21:22                   ` Joel Fernandes
@ 2019-08-15 21:27                     ` Joel Fernandes
  2019-08-15 21:34                       ` Joel Fernandes
  2019-08-15 21:45                     ` Paul E. McKenney
  1 sibling, 1 reply; 31+ messages in thread
From: Joel Fernandes @ 2019-08-15 21:27 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: rcu, frederic

On Thu, Aug 15, 2019 at 05:22:16PM -0400, Joel Fernandes wrote:
> On Thu, Aug 15, 2019 at 01:31:07PM -0700, Paul E. McKenney wrote:
> > On Thu, Aug 15, 2019 at 04:04:32PM -0400, Joel Fernandes wrote:
> > > On Thu, Aug 15, 2019 at 10:17:14AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Aug 12, 2019 at 09:02:49PM -0400, Joel Fernandes wrote:
> > > > > On Mon, Aug 12, 2019 at 04:01:38PM -0700, Paul E. McKenney wrote:
> > > > 
> > > > [ . . . ]
> > > > 
> > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > index 8c494a692728..ad906d6a74fb 100644
> > > > > > --- a/kernel/rcu/tree.c
> > > > > > +++ b/kernel/rcu/tree.c
> > > > > > @@ -651,6 +651,12 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
> > > > > >  	 */
> > > > > >  	if (rdp->dynticks_nmi_nesting != 1) {
> > > > > >  		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks);
> > > > > > +		if (tick_nohz_full_cpu(rdp->cpu) &&
> > > > > > +		    rdp->dynticks_nmi_nesting == 2 &&
> > > > > > +		    rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
> > > > > > +			rdp->rcu_forced_tick = true;
> > > > > > +			tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> > > > > > +		}
> > > > > 
> > > > > 
> > > > > Instead of checking dynticks_nmi_nesting == 2 in rcu_nmi_exit_common(), can
> > > > > we do the tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU)  from
> > > > > rcu_nmi_enter_common() ? We could add this code there, under the "if
> > > > > (rcu_dynticks_curr_cpu_in_eqs())".
> > > > 
> > > > This would need to go in an "else" clause, correct?  But there would still
> > > > want to be a check for interrupt from base level (which would admittedly
> > > > be an equality comparison with zero) and we would also still need to check
> > > > for rdp->rcu_urgent_qs && !rdp->rcu_forced_tick.
> > > 
> > > True, agreed. I replied to this before saying it should be
> > > !rcu_dynticks_curr_cpu_in_eqs() in the "if" ;) But it seems I could also be
> > > missing the check for TICK_DEP_MASK_RCU in my tree so I think we need this as
> > > well which is below as diff. Testing it more now!
> > > 
> > > And, with this I do get many more ticks during the test. But there are
> > > intervals where the tick is not seen. Still it is much better than before:
> > > 
> > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > index be9707f68024..e697c7a2ce67 100644
> > > --- a/kernel/time/tick-sched.c
> > > +++ b/kernel/time/tick-sched.c
> > > @@ -198,6 +198,10 @@ static bool check_tick_dependency(atomic_t *dep)
> > >  		return true;
> > >  	}
> > >  
> > > +	if (val & TICK_DEP_MASK_CLOCK_RCU) {
> > > +		return true;
> > > +	}
> > > +
> > >  	return false;
> > >  }
> > 
> > That one is not in my tree, either.  Frederic, should I add this to
> > your patch?  For that matter, may I add your Signed-off-by as well?
> > Your original is in my -rcu tree at:
> > 
> > 0cb41806c799 ("EXP nohz: Add TICK_DEP_BIT_RCU")
> > 
> > I am testing Joel's suggested addition now.
> 
> Actually there's more addition needed! I found another thing missing:
> 
> There's a per-cpu &tick_dep_mask and a per-cpu ts->tick_dep_mask. It seems
> RCU is setting the latter.
> 
> So I added a check for both, below is the diff:
> 
> However, I see in some cases that the tick_dep_mask is just 0 but I have to
> debug that tomorrow if that's an issue on the RCU side of things. For now,
> below should be the completed Frederic patch which you could squash into his
> if he's Ok with it:
> 
> ---8<-----------------------
> 
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index f92a10b5e112..3f476e2a4bf7 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -108,7 +108,8 @@ enum tick_dep_bits {
>  	TICK_DEP_BIT_POSIX_TIMER	= 0,
>  	TICK_DEP_BIT_PERF_EVENTS	= 1,
>  	TICK_DEP_BIT_SCHED		= 2,
> -	TICK_DEP_BIT_CLOCK_UNSTABLE	= 3
> +	TICK_DEP_BIT_CLOCK_UNSTABLE	= 3,
> +	TICK_DEP_BIT_RCU		= 4
>  };
>  
>  #define TICK_DEP_MASK_NONE		0
> @@ -116,6 +117,7 @@ enum tick_dep_bits {
>  #define TICK_DEP_MASK_PERF_EVENTS	(1 << TICK_DEP_BIT_PERF_EVENTS)
>  #define TICK_DEP_MASK_SCHED		(1 << TICK_DEP_BIT_SCHED)
>  #define TICK_DEP_MASK_CLOCK_UNSTABLE	(1 << TICK_DEP_BIT_CLOCK_UNSTABLE)
> +#define TICK_DEP_MASK_RCU		(1 << TICK_DEP_BIT_RCU)
>  
>  #ifdef CONFIG_NO_HZ_COMMON
>  extern bool tick_nohz_enabled;
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index be9707f68024..a613916cc3f0 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -198,6 +198,11 @@ static bool check_tick_dependency(atomic_t *dep)
>  		return true;
>  	}
>  
> +	if (val & TICK_DEP_MASK_RCU) {
> +		trace_tick_stop(0, TICK_DEP_MASK_RCU);
> +		return true;
> +	}
> +
>  	return false;
>  }
>  
> @@ -208,8 +213,13 @@ static bool can_stop_full_tick(int cpu, struct tick_sched *ts)
>  	if (unlikely(!cpu_online(cpu)))
>  		return false;
>  
> -	if (check_tick_dependency(&tick_dep_mask))
> +	if (check_tick_dependency(&ts->tick_dep_mask)) {
>  		return false;
> +	}
> +
> +	if (check_tick_dependency(&tick_dep_mask)) {
> +		return false;
> +	}
>  
>  	if (check_tick_dependency(&ts->tick_dep_mask))
>  		return false;

Ah, I was being silly... this is already taken care of here. So you could
just drop this hunk and keep the other hunks.

Sorry!

 - Joel



* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-15 21:27                     ` Joel Fernandes
@ 2019-08-15 21:34                       ` Joel Fernandes
  2019-08-15 21:57                         ` Paul E. McKenney
  0 siblings, 1 reply; 31+ messages in thread
From: Joel Fernandes @ 2019-08-15 21:34 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: rcu, frederic

On Thu, Aug 15, 2019 at 05:27:56PM -0400, Joel Fernandes wrote:
> On Thu, Aug 15, 2019 at 05:22:16PM -0400, Joel Fernandes wrote:
> > On Thu, Aug 15, 2019 at 01:31:07PM -0700, Paul E. McKenney wrote:
> > > On Thu, Aug 15, 2019 at 04:04:32PM -0400, Joel Fernandes wrote:
> > > > On Thu, Aug 15, 2019 at 10:17:14AM -0700, Paul E. McKenney wrote:
> > > > > On Mon, Aug 12, 2019 at 09:02:49PM -0400, Joel Fernandes wrote:
> > > > > > On Mon, Aug 12, 2019 at 04:01:38PM -0700, Paul E. McKenney wrote:
> > > > > 
> > > > > [ . . . ]
> > > > > 
> > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > > index 8c494a692728..ad906d6a74fb 100644
> > > > > > > --- a/kernel/rcu/tree.c
> > > > > > > +++ b/kernel/rcu/tree.c
> > > > > > > @@ -651,6 +651,12 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
> > > > > > >  	 */
> > > > > > >  	if (rdp->dynticks_nmi_nesting != 1) {
> > > > > > >  		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks);
> > > > > > > +		if (tick_nohz_full_cpu(rdp->cpu) &&
> > > > > > > +		    rdp->dynticks_nmi_nesting == 2 &&
> > > > > > > +		    rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
> > > > > > > +			rdp->rcu_forced_tick = true;
> > > > > > > +			tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> > > > > > > +		}
> > > > > > 
> > > > > > 
> > > > > > Instead of checking dynticks_nmi_nesting == 2 in rcu_nmi_exit_common(), can
> > > > > > we do the tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU)  from
> > > > > > rcu_nmi_enter_common() ? We could add this code there, under the "if
> > > > > > (rcu_dynticks_curr_cpu_in_eqs())".
> > > > > 
> > > > > This would need to go in an "else" clause, correct?  But there would still
> > > > > want to be a check for interrupt from base level (which would admittedly
> > > > > be an equality comparison with zero) and we would also still need to check
> > > > > for rdp->rcu_urgent_qs && !rdp->rcu_forced_tick.
> > > > 
> > > > True, agreed. I replied to this before saying it should be
> > > > !rcu_dynticks_curr_cpu_in_eqs() in the "if" ;) But it seems I could also be
> > > > missing the check for TICK_DEP_MASK_RCU in my tree so I think we need this as
> > > > well which is below as diff. Testing it more now!
> > > > 
> > > > And, with this I do get many more ticks during the test. But there are
> > > > intervals where the tick is not seen. Still it is much better than before:
> > > > 
> > > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > > index be9707f68024..e697c7a2ce67 100644
> > > > --- a/kernel/time/tick-sched.c
> > > > +++ b/kernel/time/tick-sched.c
> > > > @@ -198,6 +198,10 @@ static bool check_tick_dependency(atomic_t *dep)
> > > >  		return true;
> > > >  	}
> > > >  
> > > > +	if (val & TICK_DEP_MASK_CLOCK_RCU) {
> > > > +		return true;
> > > > +	}
> > > > +
> > > >  	return false;
> > > >  }
> > > 
> > > That one is not in my tree, either.  Frederic, should I add this to
> > > your patch?  For that matter, may I add your Signed-off-by as well?
> > > Your original is in my -rcu tree at:
> > > 
> > > 0cb41806c799 ("EXP nohz: Add TICK_DEP_BIT_RCU")
> > > 
> > > I am testing Joel's suggested addition now.
> > 
> > Actually there's more addition needed! I found another thing missing:
> > 
> > There's a per-cpu &tick_dep_mask and a per-cpu ts->tick_dep_mask. It seems
> > RCU is setting the latter.
> > 
> > So I added a check for both, below is the diff:
> > 
> > However, I see in some cases that the tick_dep_mask is just 0 but I have to
> > debug that tomorrow if that's an issue on the RCU side of things. For now,
> > below should be the completed Frederic patch which you could squash into his
> > if he's Ok with it:
> > 
> > ---8<-----------------------
> > 
> > diff --git a/include/linux/tick.h b/include/linux/tick.h
> > index f92a10b5e112..3f476e2a4bf7 100644
> > --- a/include/linux/tick.h
> > +++ b/include/linux/tick.h
> > @@ -108,7 +108,8 @@ enum tick_dep_bits {
> >  	TICK_DEP_BIT_POSIX_TIMER	= 0,
> >  	TICK_DEP_BIT_PERF_EVENTS	= 1,
> >  	TICK_DEP_BIT_SCHED		= 2,
> > -	TICK_DEP_BIT_CLOCK_UNSTABLE	= 3
> > +	TICK_DEP_BIT_CLOCK_UNSTABLE	= 3,
> > +	TICK_DEP_BIT_RCU		= 4
> >  };
> >  
> >  #define TICK_DEP_MASK_NONE		0
> > @@ -116,6 +117,7 @@ enum tick_dep_bits {
> >  #define TICK_DEP_MASK_PERF_EVENTS	(1 << TICK_DEP_BIT_PERF_EVENTS)
> >  #define TICK_DEP_MASK_SCHED		(1 << TICK_DEP_BIT_SCHED)
> >  #define TICK_DEP_MASK_CLOCK_UNSTABLE	(1 << TICK_DEP_BIT_CLOCK_UNSTABLE)
> > +#define TICK_DEP_MASK_RCU		(1 << TICK_DEP_BIT_RCU)
> >  
> >  #ifdef CONFIG_NO_HZ_COMMON
> >  extern bool tick_nohz_enabled;
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index be9707f68024..a613916cc3f0 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -198,6 +198,11 @@ static bool check_tick_dependency(atomic_t *dep)
> >  		return true;
> >  	}
> >  
> > +	if (val & TICK_DEP_MASK_RCU) {
> > +		trace_tick_stop(0, TICK_DEP_MASK_RCU);
> > +		return true;
> > +	}
> > +
> >  	return false;
> >  }
> >  
> > @@ -208,8 +213,13 @@ static bool can_stop_full_tick(int cpu, struct tick_sched *ts)
> >  	if (unlikely(!cpu_online(cpu)))
> >  		return false;
> >  
> > -	if (check_tick_dependency(&tick_dep_mask))
> > +	if (check_tick_dependency(&ts->tick_dep_mask)) {
> >  		return false;
> > +	}
> > +
> > +	if (check_tick_dependency(&tick_dep_mask)) {
> > +		return false;
> > +	}
> >  
> >  	if (check_tick_dependency(&ts->tick_dep_mask))
> >  		return false;
> 
> Ah, I was being silly... this is already taken care of here. So you could
> just drop this hunk and keep the other hunks.

Sorry for the noise. To truly prevent the tick from getting turned off, I had
to do something like the following. It is a complete hack, but it worked well
for me. I will debug this more and try to come up with a better solution
tomorrow:

---8<-----------------------

From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Subject: [PATCH] HACK: Force tick to not turn off if RCU is in urgent need of QS report

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
---
 kernel/rcu/tree.c        | 12 ++++++++++++
 kernel/time/tick-sched.c | 15 +++++++++++----
 2 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index a86c3c705e4d..60f81e151538 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -215,6 +215,18 @@ static long rcu_get_n_cbs_cpu(int cpu)
 	return rcu_get_n_cbs_nocb_cpu(rdp); /* Works for offline, too. */
 }
 
+int rdp_nhq(void) {
+	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
+
+	return !!rdp->rcu_need_heavy_qs;
+}
+
+int  rdp_uq(void) {
+	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
+
+	return !!rdp->rcu_urgent_qs;
+}
+
 void rcu_softirq_qs(void)
 {
 	rcu_qs();
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index b14fc72c3b31..40df90222e34 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -851,6 +851,9 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
 	tick_nohz_restart(ts, now);
 }
 
+int rdp_nhq(void);
+int  rdp_uq(void);
+
 static void tick_nohz_full_update_tick(struct tick_sched *ts)
 {
 #ifdef CONFIG_NO_HZ_FULL
@@ -862,14 +865,18 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts)
 	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
 		return;
 
-	if (can_stop_full_tick(cpu, ts)) {
-		trace_printk("stopping sched-tick: need_heavy_qs=%d urgent_qs=%d\n", rdp_nhq(), rdp_uq());
-		trace_printk("stopping sched-tick: tick_dep_rcu=%d\n",
-				(atomic_read(&ts->tick_dep_mask) | TICK_DEP_MASK_RCU));
+	if (can_stop_full_tick(cpu, ts) && !rdp_nhq() && !rdp_uq()) {
+#if 0
+		trace_printk("stopping sched-tick: need_heavy_qs=%d urgent_qs=%d\n", );
+		trace_printk("stopping sched-tick: tick_dep_rcu=%d , ts %lu\n",
+				(atomic_read(&ts->tick_dep_mask) & TICK_DEP_MASK_RCU), (unsigned long)(&ts->tick_dep_mask));
+#endif
 		tick_nohz_stop_sched_tick(ts, cpu);
 	}
 	else if (ts->tick_stopped) {
+#if 0
 		trace_printk("restarting sched-tick\n");
+#endif
 		tick_nohz_restart_sched_tick(ts, ktime_get());
 	}
 #endif
-- 
2.23.0.rc1.153.gdeed80330f-goog



* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-15 21:22                   ` Joel Fernandes
  2019-08-15 21:27                     ` Joel Fernandes
@ 2019-08-15 21:45                     ` Paul E. McKenney
  2019-08-16  0:02                       ` Joel Fernandes
  1 sibling, 1 reply; 31+ messages in thread
From: Paul E. McKenney @ 2019-08-15 21:45 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: rcu, frederic

On Thu, Aug 15, 2019 at 05:22:16PM -0400, Joel Fernandes wrote:
> On Thu, Aug 15, 2019 at 01:31:07PM -0700, Paul E. McKenney wrote:
> > On Thu, Aug 15, 2019 at 04:04:32PM -0400, Joel Fernandes wrote:
> > > On Thu, Aug 15, 2019 at 10:17:14AM -0700, Paul E. McKenney wrote:
> > > > On Mon, Aug 12, 2019 at 09:02:49PM -0400, Joel Fernandes wrote:
> > > > > On Mon, Aug 12, 2019 at 04:01:38PM -0700, Paul E. McKenney wrote:
> > > > 
> > > > [ . . . ]
> > > > 
> > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > index 8c494a692728..ad906d6a74fb 100644
> > > > > > --- a/kernel/rcu/tree.c
> > > > > > +++ b/kernel/rcu/tree.c
> > > > > > @@ -651,6 +651,12 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
> > > > > >  	 */
> > > > > >  	if (rdp->dynticks_nmi_nesting != 1) {
> > > > > >  		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks);
> > > > > > +		if (tick_nohz_full_cpu(rdp->cpu) &&
> > > > > > +		    rdp->dynticks_nmi_nesting == 2 &&
> > > > > > +		    rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
> > > > > > +			rdp->rcu_forced_tick = true;
> > > > > > +			tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> > > > > > +		}
> > > > > 
> > > > > 
> > > > > Instead of checking dynticks_nmi_nesting == 2 in rcu_nmi_exit_common(), can
> > > > > we do the tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU)  from
> > > > > rcu_nmi_enter_common() ? We could add this code there, under the "if
> > > > > (rcu_dynticks_curr_cpu_in_eqs())".
> > > > 
> > > > This would need to go in an "else" clause, correct?  But there would still
> > > > want to be a check for interrupt from base level (which would admittedly
> > > > be an equality comparison with zero) and we would also still need to check
> > > > for rdp->rcu_urgent_qs && !rdp->rcu_forced_tick.
> > > 
> > > True, agreed. I replied to this before saying it should be
> > > !rcu_dynticks_curr_cpu_in_eqs() in the "if" ;) But it seems I could also be
> > > missing the check for TICK_DEP_MASK_RCU in my tree so I think we need this as
> > > well which is below as diff. Testing it more now!
> > > 
> > > And, with this I do get many more ticks during the test. But there are
> > > intervals where the tick is not seen. Still it is much better than before:
> > > 
> > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > index be9707f68024..e697c7a2ce67 100644
> > > --- a/kernel/time/tick-sched.c
> > > +++ b/kernel/time/tick-sched.c
> > > @@ -198,6 +198,10 @@ static bool check_tick_dependency(atomic_t *dep)
> > >  		return true;
> > >  	}
> > >  
> > > +	if (val & TICK_DEP_MASK_CLOCK_RCU) {
> > > +		return true;
> > > +	}
> > > +
> > >  	return false;
> > >  }
> > 
> > That one is not in my tree, either.  Frederic, should I add this to
> > your patch?  For that matter, may I add your Signed-off-by as well?
> > Your original is in my -rcu tree at:
> > 
> > 0cb41806c799 ("EXP nohz: Add TICK_DEP_BIT_RCU")
> > 
> > I am testing Joel's suggested addition now.
> 
> Actually there's more addition needed! I found another thing missing:
> 
> There's a per-cpu &tick_dep_mask and a per-cpu ts->tick_dep_mask. It seems
> RCU is setting the latter.

As I understand it, tick_dep_mask forces the tick on globally,
ts->tick_dep_mask forces it on for a specific CPU (which RCU uses when it
needs a quiescent state from that CPU), current->tick_dep_mask forces
it on for a specific task (which RCU uses for callback invocation
and certain rcutorture kthreads), and I don't pretend to understand
current->signal->tick_dep_mask (the comment says something about POSIX
CPU timers).

But it looks to me that can_stop_full_tick() and check_tick_dependency()
already cover all of these.  What am I missing?

> So I added a check for both, below is the diff:
> 
> However, I see in some cases that the tick_dep_mask is just 0 but I have to
> debug that tomorrow if that's an issue on the RCU side of things. For now,
> below should be the completed Frederic patch which you could squash into his
> if he's Ok with it:
> 
> ---8<-----------------------
> 
> diff --git a/include/linux/tick.h b/include/linux/tick.h
> index f92a10b5e112..3f476e2a4bf7 100644
> --- a/include/linux/tick.h
> +++ b/include/linux/tick.h
> @@ -108,7 +108,8 @@ enum tick_dep_bits {
>  	TICK_DEP_BIT_POSIX_TIMER	= 0,
>  	TICK_DEP_BIT_PERF_EVENTS	= 1,
>  	TICK_DEP_BIT_SCHED		= 2,
> -	TICK_DEP_BIT_CLOCK_UNSTABLE	= 3
> +	TICK_DEP_BIT_CLOCK_UNSTABLE	= 3,
> +	TICK_DEP_BIT_RCU		= 4
>  };
>  
>  #define TICK_DEP_MASK_NONE		0
> @@ -116,6 +117,7 @@ enum tick_dep_bits {
>  #define TICK_DEP_MASK_PERF_EVENTS	(1 << TICK_DEP_BIT_PERF_EVENTS)
>  #define TICK_DEP_MASK_SCHED		(1 << TICK_DEP_BIT_SCHED)
>  #define TICK_DEP_MASK_CLOCK_UNSTABLE	(1 << TICK_DEP_BIT_CLOCK_UNSTABLE)
> +#define TICK_DEP_MASK_RCU		(1 << TICK_DEP_BIT_RCU)
>  
>  #ifdef CONFIG_NO_HZ_COMMON
>  extern bool tick_nohz_enabled;
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index be9707f68024..a613916cc3f0 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -198,6 +198,11 @@ static bool check_tick_dependency(atomic_t *dep)
>  		return true;
>  	}
>  
> +	if (val & TICK_DEP_MASK_RCU) {
> +		trace_tick_stop(0, TICK_DEP_MASK_RCU);
> +		return true;
> +	}
> +
>  	return false;
>  }
>  
> @@ -208,8 +213,13 @@ static bool can_stop_full_tick(int cpu, struct tick_sched *ts)
>  	if (unlikely(!cpu_online(cpu)))
>  		return false;
>  
> -	if (check_tick_dependency(&tick_dep_mask))
> +	if (check_tick_dependency(&ts->tick_dep_mask)) {
>  		return false;
> +	}
> +
> +	if (check_tick_dependency(&tick_dep_mask)) {
> +		return false;
> +	}
>  
>  	if (check_tick_dependency(&ts->tick_dep_mask))
>  		return false;

You lost me on this one.  How does it help to check ts->tick_dep_mask
twice?  And why is it important to check it before checking tick_dep_mask,
especially given that the common case of all-zero masks will cause
all to be checked anyway?

							Thanx, Paul


* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-15 21:34                       ` Joel Fernandes
@ 2019-08-15 21:57                         ` Paul E. McKenney
  0 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2019-08-15 21:57 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: rcu, frederic

On Thu, Aug 15, 2019 at 05:34:12PM -0400, Joel Fernandes wrote:
> On Thu, Aug 15, 2019 at 05:27:56PM -0400, Joel Fernandes wrote:
> > On Thu, Aug 15, 2019 at 05:22:16PM -0400, Joel Fernandes wrote:
> > > On Thu, Aug 15, 2019 at 01:31:07PM -0700, Paul E. McKenney wrote:
> > > > On Thu, Aug 15, 2019 at 04:04:32PM -0400, Joel Fernandes wrote:
> > > > > On Thu, Aug 15, 2019 at 10:17:14AM -0700, Paul E. McKenney wrote:
> > > > > > On Mon, Aug 12, 2019 at 09:02:49PM -0400, Joel Fernandes wrote:
> > > > > > > On Mon, Aug 12, 2019 at 04:01:38PM -0700, Paul E. McKenney wrote:
> > > > > > 
> > > > > > [ . . . ]
> > > > > > 
> > > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > > > index 8c494a692728..ad906d6a74fb 100644
> > > > > > > > --- a/kernel/rcu/tree.c
> > > > > > > > +++ b/kernel/rcu/tree.c
> > > > > > > > @@ -651,6 +651,12 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
> > > > > > > >  	 */
> > > > > > > >  	if (rdp->dynticks_nmi_nesting != 1) {
> > > > > > > >  		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks);
> > > > > > > > +		if (tick_nohz_full_cpu(rdp->cpu) &&
> > > > > > > > +		    rdp->dynticks_nmi_nesting == 2 &&
> > > > > > > > +		    rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
> > > > > > > > +			rdp->rcu_forced_tick = true;
> > > > > > > > +			tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> > > > > > > > +		}
> > > > > > > 
> > > > > > > 
> > > > > > > Instead of checking dynticks_nmi_nesting == 2 in rcu_nmi_exit_common(), can
> > > > > > > we do the tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU)  from
> > > > > > > rcu_nmi_enter_common() ? We could add this code there, under the "if
> > > > > > > (rcu_dynticks_curr_cpu_in_eqs())".
> > > > > > 
> > > > > > This would need to go in an "else" clause, correct?  But there would still
> > > > > > want to be a check for interrupt from base level (which would admittedly
> > > > > > be an equality comparison with zero) and we would also still need to check
> > > > > > for rdp->rcu_urgent_qs && !rdp->rcu_forced_tick.
> > > > > 
> > > > > True, agreed. I replied to this before saying it should be
> > > > > !rcu_dynticks_curr_cpu_in_eqs() in the "if" ;) But it seems I could also be
> > > > > missing the check for TICK_DEP_MASK_RCU in my tree so I think we need this as
> > > > > well which is below as diff. Testing it more now!
> > > > > 
> > > > > And, with this I do get many more ticks during the test. But there are
> > > > > intervals where the tick is not seen. Still it is much better than before:
> > > > > 
> > > > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > > > index be9707f68024..e697c7a2ce67 100644
> > > > > --- a/kernel/time/tick-sched.c
> > > > > +++ b/kernel/time/tick-sched.c
> > > > > @@ -198,6 +198,10 @@ static bool check_tick_dependency(atomic_t *dep)
> > > > >  		return true;
> > > > >  	}
> > > > >  
> > > > > +	if (val & TICK_DEP_MASK_CLOCK_RCU) {
> > > > > +		return true;
> > > > > +	}
> > > > > +
> > > > >  	return false;
> > > > >  }
> > > > 
> > > > That one is not in my tree, either.  Frederic, should I add this to
> > > > your patch?  For that matter, may I add your Signed-off-by as well?
> > > > Your original is in my -rcu tree at:
> > > > 
> > > > 0cb41806c799 ("EXP nohz: Add TICK_DEP_BIT_RCU")
> > > > 
> > > > I am testing Joel's suggested addition now.
> > > 
> > > Actually there's more addition needed! I found another thing missing:
> > > 
> > > There's a per-cpu &tick_dep_mask and a per-cpu ts->tick_dep_mask. It seems
> > > RCU is setting the latter.
> > > 
> > > So I added a check for both, below is the diff:
> > > 
> > > However, I see in some cases that the tick_dep_mask is just 0, but I will
> > > have to debug tomorrow whether that is an issue on the RCU side of things.
> > > For now, below should be the completed Frederic patch, which you could
> > > squash into his if he's OK with it:
> > > 
> > > ---8<-----------------------
> > > 
> > > diff --git a/include/linux/tick.h b/include/linux/tick.h
> > > index f92a10b5e112..3f476e2a4bf7 100644
> > > --- a/include/linux/tick.h
> > > +++ b/include/linux/tick.h
> > > @@ -108,7 +108,8 @@ enum tick_dep_bits {
> > >  	TICK_DEP_BIT_POSIX_TIMER	= 0,
> > >  	TICK_DEP_BIT_PERF_EVENTS	= 1,
> > >  	TICK_DEP_BIT_SCHED		= 2,
> > > -	TICK_DEP_BIT_CLOCK_UNSTABLE	= 3
> > > +	TICK_DEP_BIT_CLOCK_UNSTABLE	= 3,
> > > +	TICK_DEP_BIT_RCU		= 4
> > >  };
> > >  
> > >  #define TICK_DEP_MASK_NONE		0
> > > @@ -116,6 +117,7 @@ enum tick_dep_bits {
> > >  #define TICK_DEP_MASK_PERF_EVENTS	(1 << TICK_DEP_BIT_PERF_EVENTS)
> > >  #define TICK_DEP_MASK_SCHED		(1 << TICK_DEP_BIT_SCHED)
> > >  #define TICK_DEP_MASK_CLOCK_UNSTABLE	(1 << TICK_DEP_BIT_CLOCK_UNSTABLE)
> > > +#define TICK_DEP_MASK_RCU		(1 << TICK_DEP_BIT_RCU)
> > >  
> > >  #ifdef CONFIG_NO_HZ_COMMON
> > >  extern bool tick_nohz_enabled;
> > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > index be9707f68024..a613916cc3f0 100644
> > > --- a/kernel/time/tick-sched.c
> > > +++ b/kernel/time/tick-sched.c
> > > @@ -198,6 +198,11 @@ static bool check_tick_dependency(atomic_t *dep)
> > >  		return true;
> > >  	}
> > >  
> > > +	if (val & TICK_DEP_MASK_RCU) {
> > > +		trace_tick_stop(0, TICK_DEP_MASK_RCU);
> > > +		return true;
> > > +	}
> > > +
> > >  	return false;
> > >  }
> > >  
> > > @@ -208,8 +213,13 @@ static bool can_stop_full_tick(int cpu, struct tick_sched *ts)
> > >  	if (unlikely(!cpu_online(cpu)))
> > >  		return false;
> > >  
> > > -	if (check_tick_dependency(&tick_dep_mask))
> > > +	if (check_tick_dependency(&ts->tick_dep_mask)) {
> > >  		return false;
> > > +	}
> > > +
> > > +	if (check_tick_dependency(&tick_dep_mask)) {
> > > +		return false;
> > > +	}
> > >  
> > >  	if (check_tick_dependency(&ts->tick_dep_mask))
> > >  		return false;
> > 
> > Ah, I was being silly... this is already taken care off here. So you could
> > just drop this hunk and keep the other hunks.
> 
> Sorry for the noise. To truly prevent the tick from getting turned off, I had
> to do something like the following. It is a complete hack, but it worked well
> for me. I will debug this more and try to come up with a better solution
> tomorrow:

Hmmm...  In the cases where the tick is not being turned off, is it
possible that the CPU in question has already supplied its quiescent
state for the current grace period?  Or is ->rcu_urgent_qs being cleared
before RCU's dyntick-idle code can see it?

Ah, the latter, I bet.  Another argument for making the dyntick-idle
irq-entry code turn on the tick!  ;-)

I pushed a commit making that change to -rcu branch dev.

							Thanx, Paul

> ---8<-----------------------
> 
> From: "Joel Fernandes (Google)" <joel@joelfernandes.org>
> Subject: [PATCH] HACK: Force tick to not turn off if RCU is in urgent need of QS report
> 
> Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> ---
>  kernel/rcu/tree.c        | 12 ++++++++++++
>  kernel/time/tick-sched.c | 15 +++++++++++----
>  2 files changed, 23 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> index a86c3c705e4d..60f81e151538 100644
> --- a/kernel/rcu/tree.c
> +++ b/kernel/rcu/tree.c
> @@ -215,6 +215,18 @@ static long rcu_get_n_cbs_cpu(int cpu)
>  	return rcu_get_n_cbs_nocb_cpu(rdp); /* Works for offline, too. */
>  }
>  
> +int rdp_nhq(void) {
> +	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> +
> +	return !!rdp->rcu_need_heavy_qs;
> +}
> +
> +int  rdp_uq(void) {
> +	struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
> +
> +	return !!rdp->rcu_urgent_qs;
> +}
> +
>  void rcu_softirq_qs(void)
>  {
>  	rcu_qs();
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index b14fc72c3b31..40df90222e34 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -851,6 +851,9 @@ static void tick_nohz_restart_sched_tick(struct tick_sched *ts, ktime_t now)
>  	tick_nohz_restart(ts, now);
>  }
>  
> +int rdp_nhq(void);
> +int  rdp_uq(void);
> +
>  static void tick_nohz_full_update_tick(struct tick_sched *ts)
>  {
>  #ifdef CONFIG_NO_HZ_FULL
> @@ -862,14 +865,18 @@ static void tick_nohz_full_update_tick(struct tick_sched *ts)
>  	if (!ts->tick_stopped && ts->nohz_mode == NOHZ_MODE_INACTIVE)
>  		return;
>  
> -	if (can_stop_full_tick(cpu, ts)) {
> -		trace_printk("stopping sched-tick: need_heavy_qs=%d urgent_qs=%d\n", rdp_nhq(), rdp_uq());
> -		trace_printk("stopping sched-tick: tick_dep_rcu=%d\n",
> -				(atomic_read(&ts->tick_dep_mask) | TICK_DEP_MASK_RCU));
> +	if (can_stop_full_tick(cpu, ts) && !rdp_nhq() && !rdp_uq()) {
> +#if 0
> +		trace_printk("stopping sched-tick: need_heavy_qs=%d urgent_qs=%d\n", );
> +		trace_printk("stopping sched-tick: tick_dep_rcu=%d , ts %lu\n",
> +				(atomic_read(&ts->tick_dep_mask) & TICK_DEP_MASK_RCU), (unsigned long)(&ts->tick_dep_mask));
> +#endif
>  		tick_nohz_stop_sched_tick(ts, cpu);
>  	}
>  	else if (ts->tick_stopped) {
> +#if 0
>  		trace_printk("restarting sched-tick\n");
> +#endif
>  		tick_nohz_restart_sched_tick(ts, ktime_get());
>  	}
>  #endif
> -- 
> 2.23.0.rc1.153.gdeed80330f-goog
> 


* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-15 21:45                     ` Paul E. McKenney
@ 2019-08-16  0:02                       ` Joel Fernandes
  2019-08-19 12:34                         ` Frederic Weisbecker
  0 siblings, 1 reply; 31+ messages in thread
From: Joel Fernandes @ 2019-08-16  0:02 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: rcu, frederic

On Thu, Aug 15, 2019 at 02:45:42PM -0700, Paul E. McKenney wrote:
> On Thu, Aug 15, 2019 at 05:22:16PM -0400, Joel Fernandes wrote:
> > On Thu, Aug 15, 2019 at 01:31:07PM -0700, Paul E. McKenney wrote:
> > > On Thu, Aug 15, 2019 at 04:04:32PM -0400, Joel Fernandes wrote:
> > > > On Thu, Aug 15, 2019 at 10:17:14AM -0700, Paul E. McKenney wrote:
> > > > > On Mon, Aug 12, 2019 at 09:02:49PM -0400, Joel Fernandes wrote:
> > > > > > On Mon, Aug 12, 2019 at 04:01:38PM -0700, Paul E. McKenney wrote:
> > > > > 
> > > > > [ . . . ]
> > > > > 
> > > > > > > diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
> > > > > > > index 8c494a692728..ad906d6a74fb 100644
> > > > > > > --- a/kernel/rcu/tree.c
> > > > > > > +++ b/kernel/rcu/tree.c
> > > > > > > @@ -651,6 +651,12 @@ static __always_inline void rcu_nmi_exit_common(bool irq)
> > > > > > >  	 */
> > > > > > >  	if (rdp->dynticks_nmi_nesting != 1) {
> > > > > > >  		trace_rcu_dyntick(TPS("--="), rdp->dynticks_nmi_nesting, rdp->dynticks_nmi_nesting - 2, rdp->dynticks);
> > > > > > > +		if (tick_nohz_full_cpu(rdp->cpu) &&
> > > > > > > +		    rdp->dynticks_nmi_nesting == 2 &&
> > > > > > > +		    rdp->rcu_urgent_qs && !rdp->rcu_forced_tick) {
> > > > > > > +			rdp->rcu_forced_tick = true;
> > > > > > > +			tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU);
> > > > > > > +		}
> > > > > > 
> > > > > > 
> > > > > > Instead of checking dynticks_nmi_nesting == 2 in rcu_nmi_exit_common(), can
> > > > > > we do the tick_dep_set_cpu(rdp->cpu, TICK_DEP_MASK_RCU)  from
> > > > > > rcu_nmi_enter_common() ? We could add this code there, under the "if
> > > > > > (rcu_dynticks_curr_cpu_in_eqs())".
> > > > > 
> > > > > This would need to go in an "else" clause, correct?  But there would still
> > > > > want to be a check for interrupt from base level (which would admittedly
> > > > > be an equality comparison with zero) and we would also still need to check
> > > > > for rdp->rcu_urgent_qs && !rdp->rcu_forced_tick.
> > > > 
> > > > True, agreed. I replied to this before saying it should be
> > > > !rcu_dynticks_curr_cpu_in_eqs() in the "if" ;) But it seems I could also be
> > > > missing the check for TICK_DEP_MASK_RCU in my tree so I think we need this as
> > > > well which is below as diff. Testing it more now!
> > > > 
> > > > And, with this I do get many more ticks during the test. But there are
> > > > intervals where the tick is not seen. Still it is much better than before:
> > > > 
> > > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > > index be9707f68024..e697c7a2ce67 100644
> > > > --- a/kernel/time/tick-sched.c
> > > > +++ b/kernel/time/tick-sched.c
> > > > @@ -198,6 +198,10 @@ static bool check_tick_dependency(atomic_t *dep)
> > > >  		return true;
> > > >  	}
> > > >  
> > > > +	if (val & TICK_DEP_MASK_CLOCK_RCU) {
> > > > +		return true;
> > > > +	}
> > > > +
> > > >  	return false;
> > > >  }
> > > 
> > > That one is not in my tree, either.  Frederic, should I add this to
> > > your patch?  For that matter, may I add your Signed-off-by as well?
> > > Your original is in my -rcu tree at:
> > > 
> > > 0cb41806c799 ("EXP nohz: Add TICK_DEP_BIT_RCU")
> > > 
> > > I am testing Joel's suggested addition now.
> > 
> > Actually there's more addition needed! I found another thing missing:
> > 
> > There's a per-cpu &tick_dep_mask and a per-cpu ts->tick_dep_mask. It seems
> > RCU is setting the latter.
> 
> As I understand it, tick_dep_mask forces the tick on globally,
> ts->tick_dep_mask forces it on for a specific CPU (which RCU uses when it
> needs a quiescent state from that CPU), current->tick_dep_mask forces
> it on for a specific task (which RCU uses for callback invocation
> and certain rcutorture kthreads), and I don't pretend to understand
> current->signal->tick_dep_mask (the comment says something about POSIX
> CPU timers).

Right. I am pretty new to all of these, so I may have gotten something wrong
in my hurry. But thanks for the explanation of your understanding of these.

Yes this commit talks about timers as well for the signal->tick_dep_mask:
d027d45d8a17 ("nohz: New tick dependency mask")

> But it looks to me that can_stop_full_tick() and check_tick_dependency()
> already cover all of these.  What am I missing?

As you mentioned, all of these are covered.

> > So I added a check for both, below is the diff:
> > 
> > However, I see in some cases that the tick_dep_mask is just 0, but I will
> > have to debug tomorrow whether that is an issue on the RCU side of things.
> > For now, below should be the completed Frederic patch, which you could
> > squash into his if he's OK with it:
> > 
> > ---8<-----------------------
> > 
> > diff --git a/include/linux/tick.h b/include/linux/tick.h
> > index f92a10b5e112..3f476e2a4bf7 100644
> > --- a/include/linux/tick.h
> > +++ b/include/linux/tick.h
> > @@ -108,7 +108,8 @@ enum tick_dep_bits {
> >  	TICK_DEP_BIT_POSIX_TIMER	= 0,
> >  	TICK_DEP_BIT_PERF_EVENTS	= 1,
> >  	TICK_DEP_BIT_SCHED		= 2,
> > -	TICK_DEP_BIT_CLOCK_UNSTABLE	= 3
> > +	TICK_DEP_BIT_CLOCK_UNSTABLE	= 3,
> > +	TICK_DEP_BIT_RCU		= 4
> >  };
> >  
> >  #define TICK_DEP_MASK_NONE		0
> > @@ -116,6 +117,7 @@ enum tick_dep_bits {
> >  #define TICK_DEP_MASK_PERF_EVENTS	(1 << TICK_DEP_BIT_PERF_EVENTS)
> >  #define TICK_DEP_MASK_SCHED		(1 << TICK_DEP_BIT_SCHED)
> >  #define TICK_DEP_MASK_CLOCK_UNSTABLE	(1 << TICK_DEP_BIT_CLOCK_UNSTABLE)
> > +#define TICK_DEP_MASK_RCU		(1 << TICK_DEP_BIT_RCU)
> >  
> >  #ifdef CONFIG_NO_HZ_COMMON
> >  extern bool tick_nohz_enabled;
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index be9707f68024..a613916cc3f0 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -198,6 +198,11 @@ static bool check_tick_dependency(atomic_t *dep)
> >  		return true;
> >  	}
> >  
> > +	if (val & TICK_DEP_MASK_RCU) {
> > +		trace_tick_stop(0, TICK_DEP_MASK_RCU);
> > +		return true;
> > +	}
> > +
> >  	return false;
> >  }
> >  
> > @@ -208,8 +213,13 @@ static bool can_stop_full_tick(int cpu, struct tick_sched *ts)
> >  	if (unlikely(!cpu_online(cpu)))
> >  		return false;
> >  
> > -	if (check_tick_dependency(&tick_dep_mask))
> > +	if (check_tick_dependency(&ts->tick_dep_mask)) {
> >  		return false;
> > +	}
> > +
> > +	if (check_tick_dependency(&tick_dep_mask)) {
> > +		return false;
> > +	}
> >  
> >  	if (check_tick_dependency(&ts->tick_dep_mask))
> >  		return false;
> 
> You lost me on this one.  How does it help to check ts->tick_dep_mask
> twice?  And why is it important to check it before checking tick_dep_mask,
> especially given that the common case of all-zero masks will cause
> all to be checked anyway?

You are right. In a later reply I had mentioned to you to drop this hunk. It
is not needed.

I will pull your -rcu dev branch now to get the latest and will test the RCU
dyntick code to see if I can make that work with the tick-sched code.

Keep you posted!

thanks,

 - Joel



* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-15 20:31                 ` Paul E. McKenney
  2019-08-15 21:22                   ` Joel Fernandes
@ 2019-08-19 12:09                   ` Frederic Weisbecker
  2019-08-19 16:57                   ` Frederic Weisbecker
  2 siblings, 0 replies; 31+ messages in thread
From: Frederic Weisbecker @ 2019-08-19 12:09 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Joel Fernandes, rcu

On Thu, Aug 15, 2019 at 01:31:07PM -0700, Paul E. McKenney wrote:
> On Thu, Aug 15, 2019 at 04:04:32PM -0400, Joel Fernandes wrote:
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index be9707f68024..e697c7a2ce67 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -198,6 +198,10 @@ static bool check_tick_dependency(atomic_t *dep)
> >  		return true;
> >  	}
> >  
> > +	if (val & TICK_DEP_MASK_CLOCK_RCU) {
> > +		return true;
> > +	}
> > +
> >  	return false;
> >  }
> 
> That one is not in my tree, either.  Frederic, should I add this to
> your patch?  For that matter, may I add your Signed-off-by as well?
> Your original is in my -rcu tree at:
> 
> 0cb41806c799 ("EXP nohz: Add TICK_DEP_BIT_RCU")

Of course, thanks!

> 
> I am testing Joel's suggested addition now.
> 
> 							Thanx, Paul


* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-16  0:02                       ` Joel Fernandes
@ 2019-08-19 12:34                         ` Frederic Weisbecker
  0 siblings, 0 replies; 31+ messages in thread
From: Frederic Weisbecker @ 2019-08-19 12:34 UTC (permalink / raw)
  To: Joel Fernandes; +Cc: Paul E. McKenney, rcu

On Thu, Aug 15, 2019 at 08:02:01PM -0400, Joel Fernandes wrote:
> On Thu, Aug 15, 2019 at 02:45:42PM -0700, Paul E. McKenney wrote:
> > As I understand it, tick_dep_mask forces the tick on globally,
> > ts->tick_dep_mask forces it on for a specific CPU (which RCU uses when it
> > needs a quiescent state from that CPU), current->tick_dep_mask forces
> > it on for a specific task (which RCU uses for callback invocation
> > and certain rcutorture kthreads), and I don't pretend to understand
> > current->signal->tick_dep_mask (the comment says something about POSIX
> > CPU timers).
> 
> Right. I am pretty new to all of these so I could have something incorrect in
> a hurry. But thanks for the explanation of your understanding of these.
> 
> Yes this commit talks about timers as well for the signal->tick_dep_mask:
> d027d45d8a17 ("nohz: New tick dependency mask")

FWIW, the signal->tick_dep_mask is a thread-group-wide tick dependency. I'm
not proud of that requirement, but we need to keep it for POSIX CPU timers,
which can elapse a timer on any thread of the same process, and the tick is
the cheapest way to account for that.



* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-15 20:31                 ` Paul E. McKenney
  2019-08-15 21:22                   ` Joel Fernandes
  2019-08-19 12:09                   ` Frederic Weisbecker
@ 2019-08-19 16:57                   ` Frederic Weisbecker
  2019-08-19 22:31                     ` Paul E. McKenney
  2 siblings, 1 reply; 31+ messages in thread
From: Frederic Weisbecker @ 2019-08-19 16:57 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Joel Fernandes, rcu

On Thu, Aug 15, 2019 at 01:31:07PM -0700, Paul E. McKenney wrote:
> On Thu, Aug 15, 2019 at 04:04:32PM -0400, Joel Fernandes wrote:
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index be9707f68024..e697c7a2ce67 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -198,6 +198,10 @@ static bool check_tick_dependency(atomic_t *dep)
> >  		return true;
> >  	}
> >  
> > +	if (val & TICK_DEP_MASK_CLOCK_RCU) {
> > +		return true;
> > +	}
> > +
> >  	return false;
> >  }
> 
> That one is not in my tree, either.  Frederic, should I add this to
> your patch?  For that matter, may I add your Signed-off-by as well?
> Your original is in my -rcu tree at:
> 
> 0cb41806c799 ("EXP nohz: Add TICK_DEP_BIT_RCU")
> 
> I am testing Joel's suggested addition now.

Also this:

diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
index b7a904825e7d..295517f109d7 100644
--- a/include/trace/events/timer.h
+++ b/include/trace/events/timer.h
@@ -367,7 +367,8 @@ TRACE_EVENT(itimer_expire,
 		tick_dep_name(POSIX_TIMER)		\
 		tick_dep_name(PERF_EVENTS)		\
 		tick_dep_name(SCHED)			\
-		tick_dep_name_end(CLOCK_UNSTABLE)
+		tick_dep_name(CLOCK_UNSTABLE)		\
+		tick_dep_name_end(RCU)
 
 #undef tick_dep_name
 #undef tick_dep_mask_name



* Re: need_heavy_qs flag for PREEMPT=y kernels
  2019-08-19 16:57                   ` Frederic Weisbecker
@ 2019-08-19 22:31                     ` Paul E. McKenney
  0 siblings, 0 replies; 31+ messages in thread
From: Paul E. McKenney @ 2019-08-19 22:31 UTC (permalink / raw)
  To: Frederic Weisbecker; +Cc: Joel Fernandes, rcu

On Mon, Aug 19, 2019 at 06:57:10PM +0200, Frederic Weisbecker wrote:
> On Thu, Aug 15, 2019 at 01:31:07PM -0700, Paul E. McKenney wrote:
> > On Thu, Aug 15, 2019 at 04:04:32PM -0400, Joel Fernandes wrote:
> > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > index be9707f68024..e697c7a2ce67 100644
> > > --- a/kernel/time/tick-sched.c
> > > +++ b/kernel/time/tick-sched.c
> > > @@ -198,6 +198,10 @@ static bool check_tick_dependency(atomic_t *dep)
> > >  		return true;
> > >  	}
> > >  
> > > +	if (val & TICK_DEP_MASK_CLOCK_RCU) {
> > > +		return true;
> > > +	}
> > > +
> > >  	return false;
> > >  }
> > 
> > That one is not in my tree, either.  Frederic, should I add this to
> > your patch?  For that matter, may I add your Signed-off-by as well?
> > Your original is in my -rcu tree at:
> > 
> > 0cb41806c799 ("EXP nohz: Add TICK_DEP_BIT_RCU")
> > 
> > I am testing Joel's suggested addition now.
> 
> Also this:

Good point, added!

							Thanx, Paul

> diff --git a/include/trace/events/timer.h b/include/trace/events/timer.h
> index b7a904825e7d..295517f109d7 100644
> --- a/include/trace/events/timer.h
> +++ b/include/trace/events/timer.h
> @@ -367,7 +367,8 @@ TRACE_EVENT(itimer_expire,
>  		tick_dep_name(POSIX_TIMER)		\
>  		tick_dep_name(PERF_EVENTS)		\
>  		tick_dep_name(SCHED)			\
> -		tick_dep_name_end(CLOCK_UNSTABLE)
> +		tick_dep_name(CLOCK_UNSTABLE)		\
> +		tick_dep_name_end(RCU)
>  
>  #undef tick_dep_name
>  #undef tick_dep_mask_name
> 


end of thread

Thread overview: 31+ messages
2019-08-11 18:08 need_heavy_qs flag for PREEMPT=y kernels Joel Fernandes
2019-08-11 18:34 ` Joel Fernandes
2019-08-11 21:16   ` Paul E. McKenney
2019-08-11 21:25     ` Joel Fernandes
2019-08-11 23:30       ` Paul E. McKenney
2019-08-12  1:24         ` Joel Fernandes
2019-08-12  1:40           ` Joel Fernandes
2019-08-12  3:57             ` Paul E. McKenney
2019-08-11 21:13 ` Paul E. McKenney
2019-08-12  3:21   ` Joel Fernandes
2019-08-12  3:53     ` Paul E. McKenney
2019-08-12 21:20       ` Joel Fernandes
2019-08-12 23:01         ` Paul E. McKenney
2019-08-13  1:02           ` Joel Fernandes
2019-08-13  1:05             ` Joel Fernandes
2019-08-13  2:28               ` Paul E. McKenney
2019-08-13  2:27             ` Paul E. McKenney
2019-08-13  2:50               ` Paul E. McKenney
2019-08-15 17:17             ` Paul E. McKenney
2019-08-15 20:04               ` Joel Fernandes
2019-08-15 20:31                 ` Paul E. McKenney
2019-08-15 21:22                   ` Joel Fernandes
2019-08-15 21:27                     ` Joel Fernandes
2019-08-15 21:34                       ` Joel Fernandes
2019-08-15 21:57                         ` Paul E. McKenney
2019-08-15 21:45                     ` Paul E. McKenney
2019-08-16  0:02                       ` Joel Fernandes
2019-08-19 12:34                         ` Frederic Weisbecker
2019-08-19 12:09                   ` Frederic Weisbecker
2019-08-19 16:57                   ` Frederic Weisbecker
2019-08-19 22:31                     ` Paul E. McKenney
