From: srinivas pandruvada <srinivas.pandruvada@linux.intel.com>
To: Doug Smythies <dsmythies@telus.net>,
	"Rafael J. Wysocki" <rafael@kernel.org>
Cc: Feng Tang <feng.tang@intel.com>,
	"Zhang, Rui" <rui.zhang@intel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	"paulmck@kernel.org" <paulmck@kernel.org>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>,
	"x86@kernel.org" <x86@kernel.org>,
	"linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>
Subject: Re: CPU excessively long times between frequency scaling driver calls - bisected
Date: Tue, 22 Feb 2022 16:32:29 -0800	[thread overview]
Message-ID: <24f7d485dc60ba3ed5938230f477bf22a220d596.camel@linux.intel.com> (raw)
In-Reply-To: <CAAYoRsX-iw+88R9ZizMwJw2qc99XJZ8Fe0M5ETOy4=RUNsxWhQ@mail.gmail.com>

Hi Doug,

On Tue, 2022-02-22 at 16:07 -0800, Doug Smythies wrote:
> Hi All,
> 
> I am about 1/2 way through testing Feng's "hacky debug patch",
> let me know if I am wasting my time, and I'll abort. So far, it
> works fine.
This just proves that if you add some callback during long idle, you
will reach a less aggressive P-state. I think you already proved that
with your results below showing 1 W less average power ("Kernel 5.17-rc3
+ Feng patch (6 samples at 300 sec per)").

Rafael replied with one possible option. Alternatively, when planning to
enter deep idle, set the P-state to the minimum with a callback, like we
do in the offline callback.
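
A minimal sketch of what that could look like (the hook point and the
helper below are assumptions for illustration, not existing interfaces;
the idea is just "do what the offline callback already does, but on the
way into a deep C-state"):

/*
 * Hypothetical hook, invoked from the cpuidle path just before entering
 * a deep C-state. intel_pstate_set_min_on_this_cpu() is a made-up name
 * standing in for "program the minimum P-state into MSR_IA32_PERF_CTL
 * on the local CPU", similar to what the offline callback does today.
 */
static void cpufreq_prepare_deep_idle(unsigned int exit_latency_us)
{
	/* Only bother for genuinely deep states. */
	if (exit_latency_us < 100)
		return;

	intel_pstate_set_min_on_this_cpu();	/* hypothetical helper */
}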

So we need to think about a proper solution for this.

Thanks,
Srinivas

> 
> The compiler complains:
> 
> kernel/sched/idle.c: In function ‘do_idle’:
> ./include/linux/typecheck.h:12:18: warning: comparison of distinct pointer types lacks a cast
>    12 |  (void)(&__dummy == &__dummy2); \
>       |                  ^~
> ./include/linux/compiler.h:78:42: note: in definition of macro ‘unlikely’
>    78 | # define unlikely(x) __builtin_expect(!!(x), 0)
>       |                                          ^
> ./include/linux/jiffies.h:106:3: note: in expansion of macro ‘typecheck’
>   106 |   typecheck(unsigned long, b) && \
>       |   ^~~~~~~~~
> ./include/linux/jiffies.h:154:35: note: in expansion of macro ‘time_after’
>   154 | #define time_is_before_jiffies(a) time_after(jiffies, a)
>       |                                   ^~~~~~~~~~
> kernel/sched/idle.c:274:15: note: in expansion of macro ‘time_is_before_jiffies’
>   274 |  if (unlikely(time_is_before_jiffies(expire))) {
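
The warning looks like it comes from the typecheck(unsigned long, ...)
inside time_is_before_jiffies(): 'expire' is declared u64 while jiffies
is an unsigned long. Something along these lines (untested, just a
sketch against the debug patch quoted below) should keep the jiffies
helpers happy:

DEFINE_PER_CPU(unsigned long, last_util_update_time);	/* in jiffies */

	unsigned long expire;

	expire = __this_cpu_read(last_util_update_time) + HZ * 3;
	if (unlikely(time_is_before_jiffies(expire))) {
		idle_update_util();
		/* store plain jiffies so the type matches the per-CPU variable */
		__this_cpu_write(last_util_update_time, jiffies);
	}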
> 
> Test 1: a 347 hertz work/sleep frequency on one CPU while the others
> carry virtually no load, at around 0.02 hertz aggregate.
> Processor package power (Watts):
> 
> Kernel 5.17-rc3 + Feng patch (6 samples at 300 sec per):
> Average: 3.2
> Min: 3.1
> Max: 3.3
> 
> Kernel 5.17-rc3 (Stock) re-stated:
> > Stock: 5 samples @ 300 seconds per sample:
> > average: 4.2 watts +31%
> > minimum: 3.5 watts
> > maximum: 4.9 watts
> 
> Kernel 5.17-rc3 with b50db7095fe0 reverted (Revert), re-stated:
> > Revert: 5 samples @ 300 seconds per sample:
> > average: 3.2 watts
> > minimum: 3.1 watts
> > maximum: 3.2 watts
> 
> I also ran intel_pstate_tracer.py for 5 hours 33 minutes
> (20,000 seconds) on an idle system looking for long durations.
> (just being thorough.) There were 12 occurrences of durations
> longer than 6.1 seconds.
> Max: 6.8 seconds. (O.K.)
> I had expected about 3 seconds max, based on my
> understanding of the patch code.
> 
> Old results restated, but corrected for a stupid mistake:
> 
> Stock:
> 1712 occurrences of durations longer than 6.1 seconds
> Max: 303.6 seconds. (Bad)
> 
> Revert:
> 3 occurrences of durations longer than 6.1 seconds
> Max: 6.5 seconds (O.K.)
> 
> On Tue, Feb 22, 2022 at 10:04 AM Rafael J. Wysocki
> <rafael@kernel.org> wrote:
> > 
> > On Tue, Feb 22, 2022 at 8:41 AM Feng Tang <feng.tang@intel.com>
> > wrote:
> > > 
> > > On Mon, Feb 14, 2022 at 07:17:24AM -0800, srinivas pandruvada
> > > wrote:
> > > > Hi Doug,
> > > > 
> > > > I think you use CONFIG_NO_HZ_FULL.
> > > > Here we are getting the callback from the scheduler. Can we check
> > > > whether the scheduler woke up on those CPUs?
> > > > We can run "trace-cmd -e sched" and check in KernelShark whether
> > > > there are similar gaps in activity.
> > > 
> > > Srinivas analyzed the scheduler trace data from trace-cmd, and
> > > thought it is related to the cpufreq callback not being called
> > > timely from scheduling events:
> > > 
> > > "
> > > I mean we ignore the callback when the target CPU is not the local
> > > CPU, as we have to do an IPI to adjust MSRs.
> > > This will happen many times when sched_wake wakes up a new CPU for
> > > the thread (we will get a callback for the target), but once the
> > > remote thread starts executing ("sched_switch"), we will get a
> > > callback on the local CPU, so we will adjust frequencies (provided
> > > a 10 ms interval since the last call).
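
For reference, the local-CPU check referred to here looks roughly like
this in drivers/cpufreq/intel_pstate.c (paraphrased from memory; the
exact surrounding code differs between kernel versions):

static void intel_pstate_update_util(struct update_util_data *data, u64 time,
				     unsigned int flags)
{
	struct cpudata *cpu = container_of(data, struct cpudata, update_util);

	/*
	 * Updates targeting a remote CPU are ignored: adjusting another
	 * CPU's P-state MSR would require an IPI, so only act when the
	 * callback runs on the CPU it targets.
	 */
	if (smp_processor_id() != cpu->cpu)
		return;

	/* ... rate limiting and P-state selection follow ... */
}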
> > > 
> > > > From the trace file I see the scenario where it took 72 seconds
> > > > between two updates:
> > > CPU 2
> > > 34412.597161    busy=78         freq=3232653
> > > 34484.450725    busy=63         freq=2606793
> > > 
> > > There is periodic activity in between, related to active load
> > > balancing in the scheduler (since the last frequency was higher,
> > > these small bits of work will also run at the higher frequency).
> > > But those threads are not in the CFS class, so the scheduler
> > > callback will not be called for them.
> > > 
> > > So removing the patch removed a trigger which would have caused a
> > > sched_switch to a CFS task and call a cpufreq/intel_pstate
> > > callback.
> > 
> > And so this behavior needs to be restored for the time being, which
> > means reverting the problematic commit for 5.17 if possible.
> > 
> > It is unlikely that we'll get a proper fix before -rc7, and we still
> > need to test it properly.
> > 
> > > But calling it for every class would be too many callbacks, and I
> > > am not sure we can even call it for the "stop" class, which these
> > > migration threads are using.
> > > "
> > 
> > Calling it for RT/deadline may not be a bad idea.
> > 
> > schedutil takes these classes into account when computing the
> > utilization now (see effective_cpu_util()), so doing callbacks only
> > for CFS seems insufficient.
> > 
> > Another way to avoid the issue at hand may be to prevent entering
> > deep idle via PM QoS if the CPUs are running at high frequencies.
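
For the PM QoS idea, a rough sketch only (assuming the global
cpu_latency_qos_* API from include/linux/pm_qos.h; the 20 us limit is an
arbitrary example value, and a per-CPU limit would need the device PM
QoS interface instead):

#include <linux/pm_qos.h>

static struct pm_qos_request hifreq_idle_req;

/*
 * Tighten the CPU latency limit while frequencies are high and relax it
 * again afterwards, so cpuidle avoids the deepest C-states only when it
 * matters.
 */
static void update_idle_limit_for_freq(bool running_fast)
{
	if (!cpu_latency_qos_request_active(&hifreq_idle_req))
		cpu_latency_qos_add_request(&hifreq_idle_req,
					    PM_QOS_DEFAULT_VALUE);

	cpu_latency_qos_update_request(&hifreq_idle_req,
				       running_fast ? 20 : PM_QOS_DEFAULT_VALUE);
}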
> > 
> > > Following this direction, I made a hacky debug patch which should
> > > help restore the previous behavior.
> > > 
> > > Doug, could you help try it? Thanks.
> > > 
> > > It basically tries to make sure the cpufreq update_util callback is
> > > called timely even on a silent system with very few interrupts
> > > (even from the tick).
> > > 
> > > Thanks,
> > > Feng
> > > 
> > > From 6be5f5da66a847860b0b9924fbb09f93b2e2d6e6 Mon Sep 17 00:00:00 2001
> > > From: Feng Tang <feng.tang@intel.com>
> > > Date: Tue, 22 Feb 2022 22:59:00 +0800
> > > Subject: [PATCH] idle/intel-pstate: hacky debug patch to make sure the
> > >  cpufreq_update_util callback being called timely in silent system
> > > 
> > > ---
> > >  kernel/sched/idle.c  | 10 ++++++++++
> > >  kernel/sched/sched.h | 13 +++++++++++++
> > >  2 files changed, 23 insertions(+)
> > > 
> > > diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
> > > index d17b0a5ce6ac..cc538acb3f1a 100644
> > > --- a/kernel/sched/idle.c
> > > +++ b/kernel/sched/idle.c
> > > @@ -258,15 +258,25 @@ static void cpuidle_idle_call(void)
> > >   *
> > >   * Called with polling cleared.
> > >   */
> > > +DEFINE_PER_CPU(u64, last_util_update_time);    /* in jiffies */
> > >  static void do_idle(void)
> > >  {
> > >         int cpu = smp_processor_id();
> > > +       u64 expire;
> > > 
> > >         /*
> > >          * Check if we need to update blocked load
> > >          */
> > >         nohz_run_idle_balance(cpu);
> > > 
> > > +#ifdef CONFIG_X86_INTEL_PSTATE
> > 
> > Why?  Doesn't this affect the other cpufreq governors?
> 
> Yes, I have the same question.
> 
> > 
> > > +       expire = __this_cpu_read(last_util_update_time) + HZ * 3;
> > > +       if (unlikely(time_is_before_jiffies(expire))) {
> > > +               idle_update_util();
> > > +               __this_cpu_write(last_util_update_time, get_jiffies_64());
> > > +       }
> > > +#endif
> > > +
> > >         /*
> > >          * If the arch has a polling bit, we maintain an
> > > invariant:
> > >          *
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index 0e66749486e7..2a8d87988d1f 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -2809,6 +2809,19 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags)
> > >         if (data)
> > >                 data->func(data, rq_clock(rq), flags);
> > >  }
> > > +
> > > +static inline void idle_update_util(void)
> > > +{
> > > +       struct update_util_data *data;
> > > +       struct rq *rq = cpu_rq(raw_smp_processor_id());
> > > +
> > > +       data = rcu_dereference_sched(*per_cpu_ptr(&cpufreq_update_util_data,
> > > +                                                 cpu_of(rq)));
> > > +       if (data)
> > > +               data->func(data, rq_clock(rq), 0);
> > > +}
> > > +
> > > +
> > >  #else
> > >  static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> > >  #endif /* CONFIG_CPU_FREQ */
> > > --


Thread overview: 47+ messages
2022-02-08  1:40 CPU excessively long times between frequency scaling driver calls - bisected Doug Smythies
2022-02-08  2:39 ` Feng Tang
2022-02-08  7:13   ` Doug Smythies
2022-02-08  9:15     ` Feng Tang
2022-02-09  6:23       ` Doug Smythies
2022-02-10  7:45         ` Zhang, Rui
2022-02-13 18:54           ` Doug Smythies
2022-02-14 15:17             ` srinivas pandruvada
2022-02-15 21:35               ` Doug Smythies
2022-02-22  7:34               ` Feng Tang
2022-02-22 18:04                 ` Rafael J. Wysocki
2022-02-23  0:07                   ` Doug Smythies
2022-02-23  0:32                     ` srinivas pandruvada [this message]
2022-02-23  0:40                       ` Feng Tang
2022-02-23 14:23                         ` Rafael J. Wysocki
2022-02-24  8:08                           ` Feng Tang
2022-02-24 14:44                             ` Paul E. McKenney
2022-02-24 16:29                               ` Doug Smythies
2022-02-24 16:58                                 ` Paul E. McKenney
2022-02-25  0:29                               ` Feng Tang
2022-02-25  1:06                                 ` Paul E. McKenney
2022-02-25 17:45                             ` Rafael J. Wysocki
2022-02-26  0:36                               ` Doug Smythies
2022-02-28  4:12                                 ` Feng Tang
2022-02-28 19:36                                   ` Rafael J. Wysocki
2022-03-01  5:52                                     ` Feng Tang
2022-03-01 11:58                                       ` Rafael J. Wysocki
2022-03-01 17:18                                         ` Doug Smythies
2022-03-01 17:34                                           ` Rafael J. Wysocki
2022-03-02  4:06                                             ` Doug Smythies
2022-03-02 19:00                                               ` Rafael J. Wysocki
2022-03-03 23:00                                                 ` Doug Smythies
2022-03-04  6:59                                                   ` Doug Smythies
2022-03-16 15:54                                                     ` Doug Smythies
2022-03-17 12:30                                                       ` Rafael J. Wysocki
2022-03-17 13:58                                                         ` Doug Smythies
2022-03-24 14:04                                                           ` Doug Smythies
2022-03-24 18:17                                                             ` Rafael J. Wysocki
2022-03-25  0:03                                                               ` Doug Smythies
2022-03-03  5:27                                               ` Feng Tang
2022-03-03 12:02                                                 ` Rafael J. Wysocki
2022-03-04  5:13                                                   ` Feng Tang
2022-03-04 16:23                                                     ` Paul E. McKenney
2022-02-23  2:49                   ` Feng Tang
2022-02-23 14:11                     ` Rafael J. Wysocki
2022-02-23  9:40                   ` Thomas Gleixner
2022-02-23 14:23                     ` Rafael J. Wysocki
