[v2,2/2] cpufreq: intel_pstate: Conditional frequency invariant accounting
diff mbox series

Message ID 20191002122926.385-3-ggherdovich@suse.cz
State New, archived
Headers show
Series
  • Add support for frequency invariance for (some) x86
Related show

Commit Message

Giovanni Gherdovich Oct. 2, 2019, 12:29 p.m. UTC
From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

intel_pstate has two operating modes: active and passive. In "active"
mode, the in-built scaling governor is used and in "passive" mode,
the driver can be used with any governor like "schedutil". In "active"
mode the utilization values from schedutil is not used and there is
a requirement from high performance computing use cases, not to read
any APERF/MPERF MSRs. In this case no need to use CPU cycles for
frequency invariant accounting by reading APERF/MPERF MSRs.
With this change frequency invariant account is only enabled in
"passive" mode.

Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Giovanni Gherdovich <ggherdovich@suse.cz>
---
 drivers/cpufreq/intel_pstate.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Rafael J. Wysocki Oct. 3, 2019, 6:05 p.m. UTC | #1
On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich wrote:
> From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> 
> intel_pstate has two operating modes: active and passive. In "active"
> mode, the in-built scaling governor is used and in "passive" mode,
> the driver can be used with any governor like "schedutil". In "active"
> mode the utilization values from schedutil is not used and there is
> a requirement from high performance computing use cases, not to read
> any APERF/MPERF MSRs.

Well, this isn't quite convincing.

In particular, I don't see why the "don't read APERF/MPERF MSRs" argument
applies *only* to intel_pstate in the "active" mode.  What about intel_pstate
in the "passive" mode combined with the "performance" governor?  Or any other
governor different from "schedutil" for that matter?

And what about acpi_cpufreq combined with any governor different from
"schedutil"?

Scale invariance is not really needed in all of those cases right now AFAICS,
or is it?

So is the real concern that intel_pstate in the "active" mode reads the MPERF
and APERF MSRs by itself and that kind of duplicates what the scale invariance
code does and is redundant etc?
Srinivas Pandruvada Oct. 4, 2019, 3:31 a.m. UTC | #2
On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote:
> On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich
> wrote:
> > From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > 
> > intel_pstate has two operating modes: active and passive. In
> > "active"
> > mode, the in-built scaling governor is used and in "passive" mode,
> > the driver can be used with any governor like "schedutil". In
> > "active"
> > mode the utilization values from schedutil is not used and there is
> > a requirement from high performance computing use cases, not to
> > readas well
> > any APERF/MPERF MSRs.
> 
> Well, this isn't quite convincing.
> 
> In particular, I don't see why the "don't read APERF/MPERF MSRs"
> argument
> applies *only* to intel_pstate in the "active" mode.  What about
> intel_pstate
> in the "passive" mode combined with the "performance" governor?  Or
> any other
> governor different from "schedutil" for that matter?
> 
> And what about acpi_cpufreq combined with any governor different from
> "schedutil"?
> 
> Scale invariance is not really needed in all of those cases right now
> AFAICS,
> or is it?
Correct. This is just part of the patch to disable in active mode
(particularly in HWP and performance mode). 

But this patch is 2 years old. The folks who wanted this, disable
intel-pstate and use userspace governor with acpi-cpufreq. So may be
better to address those cases too.

> 
> So is the real concern that intel_pstate in the "active" mode reads
> the MPERF
> and APERF MSRs by itself and that kind of duplicates what the scale
> invariance
> code does and is redundant etc?
It is redundant in non-HWP mode. In HWP and performance (active mode)
we don't use atleast at this time.

Thanks
Srinivas
Rafael J. Wysocki Oct. 4, 2019, 8:08 a.m. UTC | #3
On Fri, Oct 4, 2019 at 5:31 AM Srinivas Pandruvada
<srinivas.pandruvada@linux.intel.com> wrote:
>
> On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote:
> > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich
> > wrote:
> > > From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > >
> > > intel_pstate has two operating modes: active and passive. In
> > > "active"
> > > mode, the in-built scaling governor is used and in "passive" mode,
> > > the driver can be used with any governor like "schedutil". In
> > > "active"
> > > mode the utilization values from schedutil is not used and there is
> > > a requirement from high performance computing use cases, not to
> > > readas well
> > > any APERF/MPERF MSRs.
> >
> > Well, this isn't quite convincing.
> >
> > In particular, I don't see why the "don't read APERF/MPERF MSRs"
> > argument
> > applies *only* to intel_pstate in the "active" mode.  What about
> > intel_pstate
> > in the "passive" mode combined with the "performance" governor?  Or
> > any other
> > governor different from "schedutil" for that matter?
> >
> > And what about acpi_cpufreq combined with any governor different from
> > "schedutil"?
> >
> > Scale invariance is not really needed in all of those cases right now
> > AFAICS,
> > or is it?
>
> Correct. This is just part of the patch to disable in active mode
> (particularly in HWP and performance mode).
>
> But this patch is 2 years old. The folks who wanted this, disable
> intel-pstate and use userspace governor with acpi-cpufreq. So may be
> better to address those cases too.

Well, that's my point. :-)

It looks like the scale invariance is only needed when the schedutil
governor is used, regardless of the driver, and it may lead to
performance degradation in the other cases, at least in principle (I
wonder, though, if any hard data supporting that claim are available).
That can be addressed in two ways IMO, either by reducing the possible
negative impact of the scale invariance code (eg. by running it less
frequently), so that it can be always enabled (as long as it is
supported by the processor), or by avoiding to run it in all cases
when it is not needed (but that basically would require the governor
->init and ->exit to enable and disable the scale invariance,
respectively).

> >
> > So is the real concern that intel_pstate in the "active" mode reads
> > the MPERF
> > and APERF MSRs by itself and that kind of duplicates what the scale
> > invariance
> > code does and is redundant etc?
> It is redundant in non-HWP mode. In HWP and performance (active mode)
> we don't use atleast at this time.

Right.
Vincent Guittot Oct. 4, 2019, 8:28 a.m. UTC | #4
On Fri, 4 Oct 2019 at 10:24, Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
>
> On Thu, 2019-10-03 at 20:31 -0700, Srinivas Pandruvada wrote:
> > On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote:
> > > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich
> > > wrote:
> > > > From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > > >
> > > > intel_pstate has two operating modes: active and passive. In "active"
> > > > mode, the in-built scaling governor is used and in "passive" mode, the
> > > > driver can be used with any governor like "schedutil". In "active" mode
> > > > the utilization values from schedutil is not used and there is a
> > > > requirement from high performance computing use cases, not to readas
> > > > well any APERF/MPERF MSRs.
> > >
> > > Well, this isn't quite convincing.
> > >
> > > In particular, I don't see why the "don't read APERF/MPERF MSRs" argument
> > > applies *only* to intel_pstate in the "active" mode.  What about
> > > intel_pstate in the "passive" mode combined with the "performance"
> > > governor?  Or any other governor different from "schedutil" for that
> > > matter?
> > >
> > > And what about acpi_cpufreq combined with any governor different from
> > > "schedutil"?
> > >
> > > Scale invariance is not really needed in all of those cases right now
> > > AFAICS, or is it?
> >
> > Correct. This is just part of the patch to disable in active mode
> > (particularly in HWP and performance mode).
> >
> > But this patch is 2 years old. The folks who wanted this, disable
> > intel-pstate and use userspace governor with acpi-cpufreq. So may be
> > better to address those cases too.
>
> I disagree with "scale invariance is needed only by the schedutil governor";
> the two other users are the CPU's estimated utilization in the wakeup path,
> via cpu_util_without(), as well as the load-balance path, via cpu_util() which
> is used by update_sg_lb_stats().
>
> Also remember that scale invariance is applied to both PELT signals util_avg
> and load_avg; schedutil uses the former but not the latter.

You have been quicker than me to reply. I was about to say the exact
same things.
scale invariance also helps the scheduler in task placement by
stabilizing the metrics whatever the running frequency so a task will
not be seen as a big task just because of a CPU running at lower
frequency

>
> I understand Srinivas patch to disable MSR accesses during the tick as a
> band-aid solution to address a specific use case he cares about, but I don't
> think that extending this approach to any non-schedutil governor is a good
> idea -- you'd be killing load balancing in the process.
>
>
> Giovanni
Giovanni Gherdovich Oct. 4, 2019, 8:29 a.m. UTC | #5
On Thu, 2019-10-03 at 20:31 -0700, Srinivas Pandruvada wrote:
> On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote:
> > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich
> > wrote:
> > > From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > > 
> > > intel_pstate has two operating modes: active and passive. In "active"
> > > mode, the in-built scaling governor is used and in "passive" mode, the
> > > driver can be used with any governor like "schedutil". In "active" mode
> > > the utilization values from schedutil is not used and there is a
> > > requirement from high performance computing use cases, not to readas
> > > well any APERF/MPERF MSRs.
> > 
> > Well, this isn't quite convincing.
> > 
> > In particular, I don't see why the "don't read APERF/MPERF MSRs" argument
> > applies *only* to intel_pstate in the "active" mode.  What about
> > intel_pstate in the "passive" mode combined with the "performance"
> > governor?  Or any other governor different from "schedutil" for that
> > matter?
> > 
> > And what about acpi_cpufreq combined with any governor different from
> > "schedutil"?
> > 
> > Scale invariance is not really needed in all of those cases right now
> > AFAICS, or is it?
> 
> Correct. This is just part of the patch to disable in active mode
> (particularly in HWP and performance mode). 
> 
> But this patch is 2 years old. The folks who wanted this, disable
> intel-pstate and use userspace governor with acpi-cpufreq. So may be
> better to address those cases too.

I disagree with "scale invariance is needed only by the schedutil governor";
the two other users are the CPU's estimated utilization in the wakeup path,
via cpu_util_without(), as well as the load-balance path, via cpu_util() which
is used by update_sg_lb_stats().

Also remember that scale invariance is applied to both PELT signals util_avg
and load_avg; schedutil uses the former but not the latter.

I understand Srinivas patch to disable MSR accesses during the tick as a
band-aid solution to address a specific use case he cares about, but I don't
think that extending this approach to any non-schedutil governor is a good
idea -- you'd be killing load balancing in the process.


Giovanni
Rafael J. Wysocki Oct. 4, 2019, 8:29 a.m. UTC | #6
On Fri, Oct 4, 2019 at 10:24 AM Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
>
> On Thu, 2019-10-03 at 20:31 -0700, Srinivas Pandruvada wrote:
> > On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote:
> > > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich
> > > wrote:
> > > > From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > > >
> > > > intel_pstate has two operating modes: active and passive. In "active"
> > > > mode, the in-built scaling governor is used and in "passive" mode, the
> > > > driver can be used with any governor like "schedutil". In "active" mode
> > > > the utilization values from schedutil is not used and there is a
> > > > requirement from high performance computing use cases, not to readas
> > > > well any APERF/MPERF MSRs.
> > >
> > > Well, this isn't quite convincing.
> > >
> > > In particular, I don't see why the "don't read APERF/MPERF MSRs" argument
> > > applies *only* to intel_pstate in the "active" mode.  What about
> > > intel_pstate in the "passive" mode combined with the "performance"
> > > governor?  Or any other governor different from "schedutil" for that
> > > matter?
> > >
> > > And what about acpi_cpufreq combined with any governor different from
> > > "schedutil"?
> > >
> > > Scale invariance is not really needed in all of those cases right now
> > > AFAICS, or is it?
> >
> > Correct. This is just part of the patch to disable in active mode
> > (particularly in HWP and performance mode).
> >
> > But this patch is 2 years old. The folks who wanted this, disable
> > intel-pstate and use userspace governor with acpi-cpufreq. So may be
> > better to address those cases too.
>
> I disagree with "scale invariance is needed only by the schedutil governor";
> the two other users are the CPU's estimated utilization in the wakeup path,
> via cpu_util_without(), as well as the load-balance path, via cpu_util() which
> is used by update_sg_lb_stats().

OK, so there are reasons to run the scale invariance code which are
not related to the cpufreq governor in use.

I wonder then why those reasons are not relevant for intel_pstate in
the "active" mode.

> Also remember that scale invariance is applied to both PELT signals util_avg
> and load_avg; schedutil uses the former but not the latter.
>
> I understand Srinivas patch to disable MSR accesses during the tick as a
> band-aid solution to address a specific use case he cares about, but I don't
> think that extending this approach to any non-schedutil governor is a good
> idea -- you'd be killing load balancing in the process.

But that is also the case for intel_pstate in the "active" mode, isn't it?
Rafael J. Wysocki Oct. 4, 2019, 8:33 a.m. UTC | #7
On Fri, Oct 4, 2019 at 10:28 AM Vincent Guittot
<vincent.guittot@linaro.org> wrote:
>
> On Fri, 4 Oct 2019 at 10:24, Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
> >
> > On Thu, 2019-10-03 at 20:31 -0700, Srinivas Pandruvada wrote:
> > > On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote:
> > > > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich
> > > > wrote:
> > > > > From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > > > >
> > > > > intel_pstate has two operating modes: active and passive. In "active"
> > > > > mode, the in-built scaling governor is used and in "passive" mode, the
> > > > > driver can be used with any governor like "schedutil". In "active" mode
> > > > > the utilization values from schedutil is not used and there is a
> > > > > requirement from high performance computing use cases, not to readas
> > > > > well any APERF/MPERF MSRs.
> > > >
> > > > Well, this isn't quite convincing.
> > > >
> > > > In particular, I don't see why the "don't read APERF/MPERF MSRs" argument
> > > > applies *only* to intel_pstate in the "active" mode.  What about
> > > > intel_pstate in the "passive" mode combined with the "performance"
> > > > governor?  Or any other governor different from "schedutil" for that
> > > > matter?
> > > >
> > > > And what about acpi_cpufreq combined with any governor different from
> > > > "schedutil"?
> > > >
> > > > Scale invariance is not really needed in all of those cases right now
> > > > AFAICS, or is it?
> > >
> > > Correct. This is just part of the patch to disable in active mode
> > > (particularly in HWP and performance mode).
> > >
> > > But this patch is 2 years old. The folks who wanted this, disable
> > > intel-pstate and use userspace governor with acpi-cpufreq. So may be
> > > better to address those cases too.
> >
> > I disagree with "scale invariance is needed only by the schedutil governor";
> > the two other users are the CPU's estimated utilization in the wakeup path,
> > via cpu_util_without(), as well as the load-balance path, via cpu_util() which
> > is used by update_sg_lb_stats().
> >
> > Also remember that scale invariance is applied to both PELT signals util_avg
> > and load_avg; schedutil uses the former but not the latter.
>
> You have been quicker than me to reply. I was about to say the exact
> same things.
> scale invariance also helps the scheduler in task placement by
> stabilizing the metrics whatever the running frequency so a task will
> not be seen as a big task just because of a CPU running at lower
> frequency

So avoiding it just in one specific driver/governor configuration
would be inconsistent at best.

I guess that leaves us with the impact reduction option, realistically.
Giovanni Gherdovich Oct. 4, 2019, 8:57 a.m. UTC | #8
On Fri, 2019-10-04 at 10:29 +0200, Rafael J. Wysocki wrote:
> On Fri, Oct 4, 2019 at 10:24 AM Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
> > 
> > On Thu, 2019-10-03 at 20:31 -0700, Srinivas Pandruvada wrote:
> > > On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote:
> > > > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich
> > > > wrote:
> > > > > From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > > > > 
> > > > > intel_pstate has two operating modes: active and passive. In "active"
> > > > > mode, the in-built scaling governor is used and in "passive" mode, the
> > > > > driver can be used with any governor like "schedutil". In "active" mode
> > > > > the utilization values from schedutil is not used and there is a
> > > > > requirement from high performance computing use cases, not to readas
> > > > > well any APERF/MPERF MSRs.
> > > > 
> > > > Well, this isn't quite convincing.
> > > > 
> > > > In particular, I don't see why the "don't read APERF/MPERF MSRs" argument
> > > > applies *only* to intel_pstate in the "active" mode.  What about
> > > > intel_pstate in the "passive" mode combined with the "performance"
> > > > governor?  Or any other governor different from "schedutil" for that
> > > > matter?
> > > > 
> > > > And what about acpi_cpufreq combined with any governor different from
> > > > "schedutil"?
> > > > 
> > > > Scale invariance is not really needed in all of those cases right now
> > > > AFAICS, or is it?
> > > 
> > > Correct. This is just part of the patch to disable in active mode
> > > (particularly in HWP and performance mode).
> > > 
> > > But this patch is 2 years old. The folks who wanted this, disable
> > > intel-pstate and use userspace governor with acpi-cpufreq. So may be
> > > better to address those cases too.
> > 
> > I disagree with "scale invariance is needed only by the schedutil governor";
> > the two other users are the CPU's estimated utilization in the wakeup path,
> > via cpu_util_without(), as well as the load-balance path, via cpu_util() which
> > is used by update_sg_lb_stats().
> 
> OK, so there are reasons to run the scale invariance code which are
> not related to the cpufreq governor in use.
> 
> I wonder then why those reasons are not relevant for intel_pstate in
> the "active" mode.
> 
> > Also remember that scale invariance is applied to both PELT signals util_avg
> > and load_avg; schedutil uses the former but not the latter.
> > 
> > I understand Srinivas patch to disable MSR accesses during the tick as a
> > band-aid solution to address a specific use case he cares about, but I don't
> > think that extending this approach to any non-schedutil governor is a good
> > idea -- you'd be killing load balancing in the process.
> 
> But that is also the case for intel_pstate in the "active" mode, isn't it?

Sure it is.

Now, what's the performance impact of loosing scale-invariance in PELT signals?
And what's the performance impact of accessing two MSRs at the scheduler tick
on each CPU?

I am sporting Srinivas' patch because he expressed the concern that the losses
don't justify the gains for a specific class of users (supercomputing),
although I don't fully like the idea (and arguably that should be measured).


Giovanni
Rafael J. Wysocki Oct. 4, 2019, 9:17 a.m. UTC | #9
On Fri, Oct 4, 2019 at 10:52 AM Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
>
> On Fri, 2019-10-04 at 10:29 +0200, Rafael J. Wysocki wrote:
> > On Fri, Oct 4, 2019 at 10:24 AM Giovanni Gherdovich <ggherdovich@suse.cz> wrote:
> > >
> > > On Thu, 2019-10-03 at 20:31 -0700, Srinivas Pandruvada wrote:
> > > > On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote:
> > > > > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich
> > > > > wrote:
> > > > > > From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> > > > > >
> > > > > > intel_pstate has two operating modes: active and passive. In "active"
> > > > > > mode, the in-built scaling governor is used and in "passive" mode, the
> > > > > > driver can be used with any governor like "schedutil". In "active" mode
> > > > > > the utilization values from schedutil is not used and there is a
> > > > > > requirement from high performance computing use cases, not to readas
> > > > > > well any APERF/MPERF MSRs.
> > > > >
> > > > > Well, this isn't quite convincing.
> > > > >
> > > > > In particular, I don't see why the "don't read APERF/MPERF MSRs" argument
> > > > > applies *only* to intel_pstate in the "active" mode.  What about
> > > > > intel_pstate in the "passive" mode combined with the "performance"
> > > > > governor?  Or any other governor different from "schedutil" for that
> > > > > matter?
> > > > >
> > > > > And what about acpi_cpufreq combined with any governor different from
> > > > > "schedutil"?
> > > > >
> > > > > Scale invariance is not really needed in all of those cases right now
> > > > > AFAICS, or is it?
> > > >
> > > > Correct. This is just part of the patch to disable in active mode
> > > > (particularly in HWP and performance mode).
> > > >
> > > > But this patch is 2 years old. The folks who wanted this, disable
> > > > intel-pstate and use userspace governor with acpi-cpufreq. So may be
> > > > better to address those cases too.
> > >
> > > I disagree with "scale invariance is needed only by the schedutil governor";
> > > the two other users are the CPU's estimated utilization in the wakeup path,
> > > via cpu_util_without(), as well as the load-balance path, via cpu_util() which
> > > is used by update_sg_lb_stats().
> >
> > OK, so there are reasons to run the scale invariance code which are
> > not related to the cpufreq governor in use.
> >
> > I wonder then why those reasons are not relevant for intel_pstate in
> > the "active" mode.
> >
> > > Also remember that scale invariance is applied to both PELT signals util_avg
> > > and load_avg; schedutil uses the former but not the latter.
> > >
> > > I understand Srinivas patch to disable MSR accesses during the tick as a
> > > band-aid solution to address a specific use case he cares about, but I don't
> > > think that extending this approach to any non-schedutil governor is a good
> > > idea -- you'd be killing load balancing in the process.
> >
> > But that is also the case for intel_pstate in the "active" mode, isn't it?
>
> Sure it is.
>
> Now, what's the performance impact of loosing scale-invariance in PELT signals?

That needs to be measured.

> And what's the performance impact of accessing two MSRs at the scheduler tick
> on each CPU?

That would be the MSR access latency times two and I don't remember
the exact numbers from the top of my head.  It would also depend on
how much time it takes to run the tick without those two MSR accesses,
on average.

The question I have, however, is whether or not it really is necessary
to update arch_cpu_freq on every tick.  Maybe it would be sufficient
to do that every 10 ms, say (in case the tick is more frequent than
that), or similar?

> I am sporting Srinivas' patch because he expressed the concern that the losses
> don't justify the gains for a specific class of users (supercomputing),
> although I don't fully like the idea (and arguably that should be measured).

My point is that this patch doesn't even cover the entire case in
question, because the HPC people may very well be using cpufreq
driver/governor configurations different from intel_pstate in the
"active" mode.  Moreover, given the lack of data, it is even hard to
say what the potential impact is, if any.

I guess it should be a fairly straightforward exercise to compare the
results of the various benchmarks using intel_pstate in the "passive"
mode and the "performance" governor with and without patch [1/2] from
this series. If you see any perf regressions after applying the patch,
that's what the others will probably see as well.
Srinivas Pandruvada Oct. 4, 2019, 3:17 p.m. UTC | #10
On Fri, 2019-10-04 at 10:57 +0200, Giovanni Gherdovich wrote:
> On Fri, 2019-10-04 at 10:29 +0200, Rafael J. Wysocki wrote:
> > On Fri, Oct 4, 2019 at 10:24 AM Giovanni Gherdovich <
> > ggherdovich@suse.cz> wrote:
> > > 
> > > On Thu, 2019-10-03 at 20:31 -0700, Srinivas Pandruvada wrote:
> > > > On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote:
> > > > > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni
> > > > > Gherdovich
> > > > > wrote:
> > > > > > From: Srinivas Pandruvada <
> > > > > > srinivas.pandruvada@linux.intel.com>
> > > > > > 
> > > > > > intel_pstate has two operating modes: active and passive.
> > > > > > In "active"
> > > > > > mode, the in-built scaling governor is used and in
> > > > > > "passive" mode, the
> > > > > > driver can be used with any governor like "schedutil". In
> > > > > > "active" mode
> > > > > > the utilization values from schedutil is not used and there
> > > > > > is a
> > > > > > requirement from high performance computing use cases, not
> > > > > > to readas
> > > > > > well any APERF/MPERF MSRs.
> > > > > 
> > > > > Well, this isn't quite convincing.
> > > > > 
> > > > > In particular, I don't see why the "don't read APERF/MPERF
> > > > > MSRs" argument
> > > > > applies *only* to intel_pstate in the "active" mode.  What
> > > > > about
> > > > > intel_pstate in the "passive" mode combined with the
> > > > > "performance"
> > > > > governor?  Or any other governor different from "schedutil"
> > > > > for that
> > > > > matter?
> > > > > 
> > > > > And what about acpi_cpufreq combined with any governor
> > > > > different from
> > > > > "schedutil"?
> > > > > 
> > > > > Scale invariance is not really needed in all of those cases
> > > > > right now
> > > > > AFAICS, or is it?
> > > > 
> > > > Correct. This is just part of the patch to disable in active
> > > > mode
> > > > (particularly in HWP and performance mode).
> > > > 
> > > > But this patch is 2 years old. The folks who wanted this,
> > > > disable
> > > > intel-pstate and use userspace governor with acpi-cpufreq. So
> > > > may be
> > > > better to address those cases too.
> > > 
> > > I disagree with "scale invariance is needed only by the schedutil
> > > governor";
> > > the two other users are the CPU's estimated utilization in the
> > > wakeup path,
> > > via cpu_util_without(), as well as the load-balance path, via
> > > cpu_util() which
> > > is used by update_sg_lb_stats().
> > 
> > OK, so there are reasons to run the scale invariance code which are
> > not related to the cpufreq governor in use.
> > 
> > I wonder then why those reasons are not relevant for intel_pstate
> > in
> > the "active" mode.
> > 
> > > Also remember that scale invariance is applied to both PELT
> > > signals util_avg
> > > and load_avg; schedutil uses the former but not the latter.
> > > 
> > > I understand Srinivas patch to disable MSR accesses during the
> > > tick as a
> > > band-aid solution to address a specific use case he cares about,
> > > but I don't
> > > think that extending this approach to any non-schedutil governor
> > > is a good
> > > idea -- you'd be killing load balancing in the process.
> > 
> > But that is also the case for intel_pstate in the "active" mode,
> > isn't it?
> 
> Sure it is.
> 
> Now, what's the performance impact of loosing scale-invariance in
> PELT signals?
> And what's the performance impact of accessing two MSRs at the
> scheduler tick
> on each CPU?
> 
> I am sporting Srinivas' patch because he expressed the concern that
> the losses
> don't justify the gains for a specific class of users
> (supercomputing),
> although I don't fully like the idea (and arguably that should be
> measured).
> 
I understand there are other impact of the scale invariance like in
deadline code, which I didn't see when I submitted this patch.
You can drop this patch at this time if you like. I can poke HPC folks
to test a released kernel.

Thanks,
Srinivas
Giovanni Gherdovich Oct. 7, 2019, 8:33 a.m. UTC | #11
On Fri, 2019-10-04 at 08:17 -0700, Srinivas Pandruvada wrote:
> On Fri, 2019-10-04 at 10:57 +0200, Giovanni Gherdovich wrote:
> > On Fri, 2019-10-04 at 10:29 +0200, Rafael J. Wysocki wrote:
> > > On Fri, Oct 4, 2019 at 10:24 AM Giovanni Gherdovich <
> > > ggherdovich@suse.cz> wrote:
> > > > 
> > > > On Thu, 2019-10-03 at 20:31 -0700, Srinivas Pandruvada wrote:
> > > > > On Thu, 2019-10-03 at 20:05 +0200, Rafael J. Wysocki wrote:
> > > > > > On Wednesday, October 2, 2019 2:29:26 PM CEST Giovanni Gherdovich
> > > > > > wrote:
> > > > > > > From: Srinivas Pandruvada < srinivas.pandruvada@linux.intel.com>
> > > > > > > 
> > > > > > > intel_pstate has two operating modes: active and passive.  In
> > > > > > > "active" mode, the in-built scaling governor is used and in
> > > > > > > "passive" mode, the driver can be used with any governor like
> > > > > > > "schedutil". In "active" mode the utilization values from
> > > > > > > schedutil is not used and there is a requirement from high
> > > > > > > performance computing use cases, not to readas well any
> > > > > > > APERF/MPERF MSRs.
> > > > > > 
> > > > > > Well, this isn't quite convincing.
> > > > > > 
> > > > > > In particular, I don't see why the "don't read APERF/MPERF MSRs"
> > > > > > argument applies *only* to intel_pstate in the "active" mode.
> > > > > > What about intel_pstate in the "passive" mode combined with the
> > > > > > "performance" governor?  Or any other governor different from
> > > > > > "schedutil" for that matter?
> > > > > > 
> > > > > > And what about acpi_cpufreq combined with any governor different
> > > > > > from "schedutil"?
> > > > > > 
> > > > > > Scale invariance is not really needed in all of those cases right
> > > > > > now AFAICS, or is it?
> > > > > 
> > > > > Correct. This is just part of the patch to disable in active mode
> > > > > (particularly in HWP and performance mode).
> > > > > 
> > > > > But this patch is 2 years old. The folks who wanted this, disable
> > > > > intel-pstate and use userspace governor with acpi-cpufreq. So may be
> > > > > better to address those cases too.
> > > > 
> > > > I disagree with "scale invariance is needed only by the schedutil
> > > > governor"; the two other users are the CPU's estimated utilization in
> > > > the wakeup path, via cpu_util_without(), as well as the load-balance
> > > > path, via cpu_util() which is used by update_sg_lb_stats().
> > > 
> > > OK, so there are reasons to run the scale invariance code which are
> > > not related to the cpufreq governor in use.
> > > 
> > > I wonder then why those reasons are not relevant for intel_pstate in the
> > > "active" mode.
> > > 
> > > > Also remember that scale invariance is applied to both PELT signals
> > > > util_avg and load_avg; schedutil uses the former but not the latter.
> > > > 
> > > > I understand Srinivas patch to disable MSR accesses during the tick as
> > > > a band-aid solution to address a specific use case he cares about, but
> > > > I don't think that extending this approach to any non-schedutil
> > > > governor is a good idea -- you'd be killing load balancing in the
> > > > process.
> > > 
> > > But that is also the case for intel_pstate in the "active" mode,
> > > isn't it?
> > 
> > Sure it is.
> > 
> > Now, what's the performance impact of loosing scale-invariance in PELT
> > signals?  And what's the performance impact of accessing two MSRs at the
> > scheduler tick on each CPU?
> > 
> > I am sporting Srinivas' patch because he expressed the concern that the
> > losses don't justify the gains for a specific class of users
> > (supercomputing), although I don't fully like the idea (and arguably that
> > should be measured).
> > 
> 
> I understand there are other impact of the scale invariance like in
> deadline code, which I didn't see when I submitted this patch.
> You can drop this patch at this time if you like. I can poke HPC folks
> to test a released kernel.

Thanks Srinivas, in v3 I'll drop the tick_disable mechanism for now.


Giovanni

Patch
diff mbox series

diff --git a/drivers/cpufreq/intel_pstate.c b/drivers/cpufreq/intel_pstate.c
index 9f02de9a1b47..c7d9149e99ee 100644
--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -2493,6 +2493,8 @@  static int intel_pstate_register_driver(struct cpufreq_driver *driver)
 {
 	int ret;
 
+	x86_arch_scale_freq_tick_disable();
+
 	memset(&global, 0, sizeof(global));
 	global.max_perf_pct = 100;
 
@@ -2505,6 +2507,9 @@  static int intel_pstate_register_driver(struct cpufreq_driver *driver)
 
 	global.min_perf_pct = min_perf_pct_min();
 
+	if (driver == &intel_cpufreq)
+		x86_arch_scale_freq_tick_enable();
+
 	return 0;
 }