* RCU vs NOHZ
@ 2022-09-15  8:39 Peter Zijlstra
  2022-09-15 15:37 ` Joel Fernandes
  2022-09-15 16:06 ` Paul E. McKenney
  0 siblings, 2 replies; 34+ messages in thread
From: Peter Zijlstra @ 2022-09-15  8:39 UTC (permalink / raw)
  To: Joel Fernandes, Paul McKenney, Frederic Weisbecker, Thomas Gleixner
  Cc: linux-kernel, Boqun Feng, Rafael J. Wysocki

Hi,

After watching Joel's talk about RCU and idle ticks I was wondering
why RCU doesn't have NOHZ hooks -- that is, regular NOHZ, not the
NOHZ_FULL stuff.

These deep idle states are only feasible during NOHZ idle, and the NOHZ
path is already relatively expensive (which is offset by then mostly
staying idle for a long while).

Specifically my thinking was that when a CPU goes NOHZ it can splice
its callback list onto a global list (cmpxchg), and then the
jiffy-updater CPU can look at and consume this global list (xchg).
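
Very roughly, and only as a sketch -- the names below are made up, the
cmpxchg/xchg pattern is the only point:

struct cb_node {
	struct cb_node *next;
	/* callback payload would live here */
};

/* Global list of spliced callbacks; NULL when empty. */
static struct cb_node *nohz_global_cbs;

/* Idle-entry path: push this CPU's already-linked callback chain. */
static void nohz_splice_cbs(struct cb_node *head, struct cb_node *tail)
{
	struct cb_node *old;

	do {
		old = READ_ONCE(nohz_global_cbs);
		tail->next = old;
	} while (cmpxchg(&nohz_global_cbs, old, head) != old);
}

/* Jiffy-updater CPU: take the whole list in one shot and process it. */
static struct cb_node *nohz_consume_cbs(void)
{
	return xchg(&nohz_global_cbs, NULL);
}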

Before you say... but globals suck (they do), NOHZ already has a fair
amount of global state, and as said before, it's offset by the CPU then
staying idle for a fair while. If there is heavy contention on the NOHZ
data, the idle governor is doing a bad job by selecting deep idle states
whilst we're not actually idle for long.

The above would remove the reason for RCU to inhibit NOHZ.


Additionally, when the very last CPU goes idle (I think we know this
somewhere, but I can't readily remember where) we can insta-advance the
QS machinery and run the callbacks before going (NOHZ) idle.
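
In completely made-up code (none of these helpers exist, this is just
the shape of the idea):

/* Hypothetical idle-entry hook; nohz_all_cpus_idle() and
 * rcu_advance_and_invoke_cbs() are invented names. */
static void nohz_last_cpu_idle(void)
{
	if (!nohz_all_cpus_idle())
		return;

	/*
	 * Every other CPU is already idle, so no readers can be running;
	 * push the grace-period machinery forward and invoke whatever
	 * callbacks become ready before entering the deep idle state.
	 */
	rcu_advance_and_invoke_cbs();
}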


Is there a reason this couldn't work? To me this seems like a much
simpler solution than the whole rcu-cb thing.

* Re: RCU vs NOHZ
@ 2022-09-23 19:47 Joel Fernandes
  0 siblings, 0 replies; 34+ messages in thread
From: Joel Fernandes @ 2022-09-23 19:47 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Boqun Feng, Frederic Weisbecker, LKML, Peter Zijlstra,
	Rafael J. Wysocki, Thomas Gleixner

Hi Paul,

On Fri, Sep 23, 2022 at 2:15 PM Paul E. McKenney <paulmck@kernel.org> wrote:
> 
> On Fri, Sep 23, 2022 at 01:47:40PM -0400, Joel Fernandes wrote:
>> On Fri, Sep 23, 2022 at 12:01 PM Paul E. McKenney <paulmck@kernel.org> wrote:
>> [...]
>>>>> And here is an untested patch that in theory might allow much of the
>>>>> reduction in power with minimal complexity/overhead for kernels without
>>>>> rcu_nocbs CPUs.  On the off-chance you know of someone who would be
>>>>> willing to do a realistic evaluation of it.
>>>>>                                                    Thanx, Paul
>>>>> ------------------------------------------------------------------------
>>>>> commit 80fc02e80a2dfb6c7468217cff2d4494a1c4b58d
>>>>> Author: Paul E. McKenney <paulmck@kernel.org>
>>>>> Date:   Wed Sep 21 13:30:24 2022 -0700
>>>>>    rcu: Let non-offloaded idle CPUs with callbacks defer tick
>>>>>    When a CPU goes idle, rcu_needs_cpu() is invoked to determine whether or
>>>>>    not RCU needs the scheduler-clock tick to keep interrupting.  Right now,
>>>>>    RCU keeps the tick on for a given idle CPU if there are any non-offloaded
>>>>>    callbacks queued on that CPU.
>>>>>    But if all of these callbacks are waiting for a grace period to finish,
>>>>>    there is no point in scheduling a tick before that grace period has any
>>>>>    reasonable chance of completing.  This commit therefore delays the tick
>>>>>    in the case where all the callbacks are waiting for a specific grace
>>>>>    period to elapse.  In theory, this should result in a 50-70% reduction in
>>>>>    RCU-induced scheduling-clock ticks on mostly-idle CPUs.  In practice, TBD.
>>>>>    Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
>>>>>    Cc: Peter Zijlstra <peterz@infradead.org>
>>>>> diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
>>>>> index 9bc025aa79a3..84e930c11065 100644
>>>>> --- a/include/linux/rcutiny.h
>>>>> +++ b/include/linux/rcutiny.h
>>>>> @@ -133,7 +133,7 @@ static inline void rcu_softirq_qs(void)
>>>>>            rcu_tasks_qs(current, (preempt)); \
>>>>>    } while (0)
>>>>> -static inline int rcu_needs_cpu(void)
>>>>> +static inline int rcu_needs_cpu(u64 basemono, u64 *nextevt)
>>>>> {
>>>>>    return 0;
>>>>> }
>>>>> diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
>>>>> index 70795386b9ff..3066e0975022 100644
>>>>> --- a/include/linux/rcutree.h
>>>>> +++ b/include/linux/rcutree.h
>>>>> @@ -19,7 +19,7 @@
>>>>> void rcu_softirq_qs(void);
>>>>> void rcu_note_context_switch(bool preempt);
>>>>> -int rcu_needs_cpu(void);
>>>>> +int rcu_needs_cpu(u64 basemono, u64 *nextevt);
>>>>> void rcu_cpu_stall_reset(void);
>>>>> /*
>>>>> diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
>>>>> index 5ec97e3f7468..47cd3b0d2a07 100644
>>>>> --- a/kernel/rcu/tree.c
>>>>> +++ b/kernel/rcu/tree.c
>>>>> @@ -676,12 +676,33 @@ void __rcu_irq_enter_check_tick(void)
>>>>>  * scheduler-clock interrupt.
>>>>>  *
>>>>>  * Just check whether or not this CPU has non-offloaded RCU callbacks
>>>>> - * queued.
>>>>> + * queued that need immediate attention.
>>>>>  */
>>>>> -int rcu_needs_cpu(void)
>>>>> +int rcu_needs_cpu(u64 basemono, u64 *nextevt)
>>>>> {
>>>>> -   return !rcu_segcblist_empty(&this_cpu_ptr(&rcu_data)->cblist) &&
>>>>> -           !rcu_rdp_is_offloaded(this_cpu_ptr(&rcu_data));
>>>>> +   struct rcu_data *rdp = this_cpu_ptr(&rcu_data);
>>>>> +   struct rcu_segcblist *rsclp = &rdp->cblist;
>>>>> +
>>>>> +   // Disabled, empty, or offloaded means nothing to do.
>>>>> +   if (!rcu_segcblist_is_enabled(rsclp) ||
>>>>> +       rcu_segcblist_empty(rsclp) || rcu_rdp_is_offloaded(rdp)) {
>>>>> +           *nextevt = KTIME_MAX;
>>>>> +           return 0;
>>>>> +   }
>>>>> +
>>>>> +   // Callbacks ready to invoke or that have not already been
>>>>> +   // assigned a grace period need immediate attention.
>>>>> +   if (!rcu_segcblist_segempty(rsclp, RCU_DONE_TAIL) ||
>>>>> +       !rcu_segcblist_segempty(rsclp, RCU_NEXT_TAIL))
>>>>> +           return 1;
>>>>> +
>>>>> +   // There are callbacks waiting for some later grace period.
>>>>> +   // Wait for about a grace period or two for the next tick, at which
>>>>> +   // point there is high probability that this CPU will need to do some
>>>>> +   // work for RCU.
>>>>> +   *nextevt = basemono + TICK_NSEC * (READ_ONCE(jiffies_till_first_fqs) +
>>>>> +                                      READ_ONCE(jiffies_till_next_fqs) + 1);
>>>> Looks like a nice idea. Could this race with the main GP thread on another CPU
>>>> completing the grace period, so that this CPU actually has some work to do
>>>> even though rcu_needs_cpu() returned 0?
>>>> I think it is plausible but not common, in which case the extra delay is
>>>> probably OK.
>>> Glad you like it!
>>> Yes, that race can happen, but it can also happen today.
>>> A scheduling-clock interrupt might arrive at a CPU just as a grace
>>> period finishes.  Yes, the delay is longer with this patch.  If this
>>> proves to be a problem, then the delay heuristic might be expanded to
>>> include the age of the current grace period.
>>> But keeping it simple to start with.
>> Sure, sounds good, and yes I agree about the existing issue,
>> but the error is just one jiffy there, as you pointed out.
> 
> One jiffy currently, but it would typically be about seven jiffies with
> the patch

Yes exactly, that’s what I meant.

> .  Systems with smaller values of HZ would have fewer jiffies,
> and systems with more than 128 CPUs would have more jiffies.  Systems
> booted with explicit values for the rcutree.jiffies_till_first_fqs and
> rcutree.jiffies_till_next_fqs kernel boot parameters could have whatever
> the administrator wanted.  ;-)


Makes sense, thanks for clarifying.
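
Just to check my arithmetic (assuming HZ=1000 and the auto-computed
default of 3 for both fqs values on a smallish machine):

	*nextevt = basemono + TICK_NSEC * (3 /* first_fqs */ + 3 /* next_fqs */ + 1);
	/* i.e. the tick is pushed out by 7 jiffies, 7 ms at HZ=1000 */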

> But the key point is that the grace period itself can be extended by
> that value just due to timing and distribution of idle CPUs.
> 
>>>> Also, if the RCU readers take a long time, then we'd still wake up the system
>>>> periodically, although with the above change much less often, which is a good
>>>> thing.
>>> And the delay heuristic could also be expanded to include a digitally
>>> filtered estimate of grace-period duration.  But again, keeping it simple
>>> to start with.  ;-)
>>> My guess is that offloading gets you more power savings, but I don't
>>> have a good way of testing this guess.
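
For the filtered estimate I imagine even the classic one-eighth EWMA
would do -- invented names below, just to show the shape:

static unsigned long gp_len_est = 1;	/* estimated GP length, in jiffies */

static void update_gp_len_est(unsigned long this_gp_jiffies)
{
	long delta = (long)(this_gp_jiffies - gp_len_est);

	/* est += (sample - est) / 8: a cheap low-pass filter */
	gp_len_est += delta >> 3;
}
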
>> I could try to run turbostat on Monday on our Intel SoCs, and see how
>> it reacts, but I was thinking of tracing this first to see the
>> behavior. Another thing I was thinking of was updating (the future)
>> rcutop to see how many 'idle ticks' are RCU related, vs others; and
>> then see how this patch affects that.
> 
> Such testing would be very welcome, thank you!
> 
> This patch might also need to keep track of the last tick on a given
> CPU in order to prevent frequent short idle periods from indefinitely
> delaying the tick.

I know what you mean! I had the exact same issue with the lazy timer initially; now the behavior is that any lazy enqueue piggybacks onto the timer that is already running.
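
For the "don't defer forever" guard you mention, I was picturing
something like the below -- the per-CPU timestamp and helper names are
made up, it only sketches the idea:

static DEFINE_PER_CPU(unsigned long, last_rcu_tick);

/* Called from the scheduler-clock tick path. */
static void note_rcu_tick(void)
{
	__this_cpu_write(last_rcu_tick, jiffies);
}

/* Frequent short idle periods must not push the tick out indefinitely:
 * once this CPU has gone too long without a tick while it still has
 * callbacks queued, stop deferring and request one now. */
static bool rcu_tick_overdue(unsigned long max_defer)
{
	return time_after(jiffies, __this_cpu_read(last_rcu_tick) + max_defer);
}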

- Joel


> 
> 
>                                                        Thanx, Paul
> 
>> thanks,
>> - Joel
>>>>>    unsigned long basejiff;
>>>>>    unsigned int seq;
>>>>> @@ -807,7 +807,7 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
>>>>>     * minimal delta which brings us back to this place
>>>>>     * immediately. Lather, rinse and repeat...
>>>>>     */
>>>>> -   if (rcu_needs_cpu() || arch_needs_cpu() ||
>>>>> +   if (rcu_needs_cpu(basemono, &next_rcu) || arch_needs_cpu() ||
>>>>>        irq_work_needs_cpu() || local_timer_softirq_pending()) {
>>>>>            next_tick = basemono + TICK_NSEC;
>>>>>    } else {
>>>>> @@ -818,8 +818,10 @@ static ktime_t tick_nohz_next_event(struct tick_sched *ts, int cpu)
>>>>>             * disabled this also looks at the next expiring
>>>>>             * hrtimer.
>>>>>             */
>>>>> -           next_tick = get_next_timer_interrupt(basejiff, basemono);
>>>>> -           ts->next_timer = next_tick;
>>>>> +           next_tmr = get_next_timer_interrupt(basejiff, basemono);
>>>>> +           ts->next_timer = next_tmr;
>>>>> +           /* Take the next rcu event into account */
>>>>> +           next_tick = next_rcu < next_tmr ? next_rcu : next_tmr;
>>>>>    }
>>>>>    /*


Thread overview: 34+ messages
2022-09-15  8:39 RCU vs NOHZ Peter Zijlstra
2022-09-15 15:37 ` Joel Fernandes
2022-09-15 16:06 ` Paul E. McKenney
2022-09-15 18:50   ` Peter Zijlstra
2022-09-15 19:14     ` Paul E. McKenney
2022-09-15 22:30       ` Peter Zijlstra
2022-09-16  3:40         ` Joel Fernandes
2022-09-16  7:58         ` Paul E. McKenney
2022-09-16  9:20           ` Peter Zijlstra
2022-09-16 18:11             ` Joel Fernandes
2022-09-17 13:35               ` Peter Zijlstra
2022-09-17 13:52                 ` Joel Fernandes
2022-09-17 14:28                   ` Paul E. McKenney
2022-09-17 14:25             ` Paul E. McKenney
2022-09-21 21:36               ` Paul E. McKenney
2022-09-23 15:12                 ` Joel Fernandes
2022-09-23 16:01                   ` Paul E. McKenney
2022-09-23 17:47                     ` Joel Fernandes
2022-09-23 18:15                       ` Paul E. McKenney
2022-09-29 11:06                 ` Peter Zijlstra
2022-09-29 10:55               ` Peter Zijlstra
2022-09-29 15:20                 ` Paul E. McKenney
2022-09-29 15:46                   ` Paul E. McKenney
2022-09-29 16:05                     ` Joel Fernandes
2022-09-29 16:23                     ` Peter Zijlstra
2022-09-29 16:42                       ` Paul E. McKenney
2022-09-29 19:01                         ` Peter Zijlstra
2022-09-29 19:52                           ` Paul E. McKenney
2022-09-29 16:18                   ` Peter Zijlstra
2022-09-29 16:36                     ` Paul E. McKenney
2022-09-29 19:08                       ` Peter Zijlstra
2022-09-29 19:56                         ` Paul E. McKenney
2022-09-30 14:48                           ` Paul E. McKenney
2022-09-23 19:47 Joel Fernandes
