Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
To: Eero Tamminen <eero.t.tamminen@intel.com>,
	Francisco Jerez <currojerez@riseup.net>,
	linux-pm@vger.kernel.org, intel-gfx@lists.freedesktop.org
Cc: Peter Zijlstra <peterz@infradead.org>,
	"Rafael J. Wysocki" <rjw@rjwysocki.net>
Subject: Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
Date: Mon, 16 Apr 2018 10:27:50 -0700
Message-ID: <1523899670.5918.2.camel@linux.intel.com>
In-Reply-To: <02b7ceeb-f9d2-04fe-d9b5-d6749c5411fe@intel.com>

On Mon, 2018-04-16 at 17:04 +0300, Eero Tamminen wrote:
> Hi,
> 
> On 14.04.2018 07:01, Srinivas Pandruvada wrote:
> > Hi Francisco,
> > 
> > [...]
> > 
> > > Are you no longer interested in improving those aspects of the non-HWP
> > > governor?  Is it that you're planning to delete it and move back to a
> > > generic cpufreq governor for non-HWP platforms in the near future?
> > 
> > Yes, that is the plan for Atom platforms, which are the only non-HWP
> > platforms so far. You have to show a good gain in performance and
> > performance/watt to carry and maintain such a big change. So we have to
> > see your performance and power numbers.
> 
> For the active cases, you can look at the links at the beginning /
> bottom of this mail thread.  Francisco provided performance results
> for >100 benchmarks.
Looks like you didn't test the idle cases, which are more important.
Systems will tend to be more idle (increased +50% by the patches). Once
you fix the idle case, you have to retest, and then the results will be
interesting.

Once you fix this, then it is a pure algorithm; whether it is done in
intel-pstate or in the sched-util governor is not a big difference. It is
better to do it in sched-util, as this will benefit all architectures and
will get better test coverage and maintenance.

Thanks,
Srinivas

> 
> 
> On this side of the Atlantic, we've been testing different versions of
> the patchset in the past few months for >50 Linux 3D benchmarks on 6
> different platforms.
> 
> On Geminilake and a few BXT configurations (where 3D benchmarks are TDP
> limited), many tests' performance improves by 5-15%, also complex ones.
> And more importantly, there were no regressions.
> 
> (You can see details + links to more info in Jira ticket VIZ-12078.)
> 
> *On (fully) TDP limited cases, power usage (obviously) stays the same,
> so performance/watt improvements can be derived from the measured
> performance improvements.*
> 
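The derivation in the starred paragraph above can be made concrete with a small calculation. This is a minimal sketch, assuming package power sits at the TDP limit both before and after the patches; the function name and the sample numbers are illustrative and are not measured data from this thread.

```python
def perf_per_watt_gain(perf_gain: float,
                       power_before_w: float,
                       power_after_w: float) -> float:
    """Relative change in performance/watt.

    perf_gain is the relative performance change (0.10 means +10%).
    """
    ratio = (1.0 + perf_gain) * power_before_w / power_after_w
    return ratio - 1.0


# Hypothetical fully TDP-limited case: power is identical before and after,
# so the perf/watt gain equals the performance gain.
tdp_w = 10.0  # assumed package TDP, not a value from the thread
gain = perf_per_watt_gain(0.10, tdp_w, tdp_w)
print(f"perf/watt gain: {gain:.1%}")
```

If power did change (a workload that stops being TDP limited), the same function shows how the two figures diverge.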
> 
> We have data also for earlier platforms from slightly older versions of
> the patchset, but on those it didn't have any significant impact on
> performance.
> 
> I think the main reason for this is that the BYT & BSW NUCs that we have
> have space only for a single memory module.  Without a dual-channel
> memory configuration, benchmarks are too memory-bottlenecked to utilize
> the GPU enough to make things TDP limited on those platforms.
> 
> However, now that I look at the old BYT & BSW data (for the few
> benchmarks which improved most on BXT & GLK), I see that there's a
> reduction in the CPU power utilization according to RAPL, at least on BSW.
> 
> 
> 	- Eero
> 
> 
> > > > This will benefit all architectures including x86 + non i915.
> > > > 
> > > 
> > > > The current design encourages re-use of the IO utilization statistic
> > > > (see PATCH 1) by other governors as a mechanism driving the trade-off
> > > > between energy efficiency and responsiveness based on whether the
> > > > system is close to CPU-bound, in whatever way is applicable to each
> > > > governor (e.g. it would make sense for it to be hooked up to the EPP
> > > > preference knob in the case of the intel_pstate HWP governor, which
> > > > would allow it to achieve better energy efficiency in IO-bound
> > > > situations just like this series does for non-HWP parts).  There's
> > > > nothing really x86- nor i915-specific about it.
> > > 
> > > > > BTW intel-pstate can be driven by sched-util governor (passive mode),
> > > > > so if you prove benefits to Broxton, this can be a default.
> > > > > As before:
> > > > > - No regression to idle power at all. This is more important than
> > > > >   benchmarks
> > > > > - Not just score, performance/watt is important
> > > > 
> > > 
> > > > Is schedutil actually on par with the intel_pstate non-HWP governor as
> > > > of today, according to these metrics and the overall benchmark numbers?
> > 
> > > Yes, except for a few cases. I have not tested recently, so it may be
> > > better.
> > 
> > Thanks,
> > Srinivas
> > 
> > 
> > > > Thanks,
> > > > Srinivas
> > > > 
> > > > 
> > > > > > > controller does, even though the frequent IO waits may actually be
> > > > > > > an indication that the system is IO-bound (which means that the
> > > > > > > large energy usage increase may not be translated in any
> > > > > > > performance benefit in practice, not to speak of performance being
> > > > > > > impacted negatively in TDP-bound scenarios like GPU rendering).
> > > > > > 
> > > > > > > Regarding run-time complexity, I haven't observed this governor to
> > > > > > > be measurably more computationally intensive than the present one.
> > > > > > > It's a bunch more instructions indeed, but still within the same
> > > > > > > ballpark as the current governor.  The average increase in CPU
> > > > > > > utilization on my BXT with this series is less than 0.03% (sampled
> > > > > > > via ftrace for v1, I can repeat the measurement for the v2 I have
> > > > > > > in the works, though I don't expect the result to be substantially
> > > > > > > different).  If this is a problem for you there are several
> > > > > > > optimization opportunities that would cut down the number of CPU
> > > > > > > cycles get_target_pstate_lp() takes to execute by a large percent
> > > > > > > (most of the optimization ideas I can think of right now though
> > > > > > > would come at some accuracy/maintainability/debuggability cost,
> > > > > > > but may still be worth pursuing), but the computational overhead
> > > > > > > is low enough at this point that the impact on any benchmark or
> > > > > > > real workload would be orders of magnitude lower than its
> > > > > > > variance, which makes it kind of difficult to keep the discussion
> > > > > > > data-driven [as possibly any performance optimization discussion
> > > > > > > should ever be ;)].
> > > > > > 
> > > > > > > 
> > > > > > > Thanks,
> > > > > > > Srinivas
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > > 
> > > > > > > > > > [Absolute benchmark results are unfortunately omitted from this
> > > > > > > > > > letter due to company policies, but the percent change and
> > > > > > > > > > Student's T p-value are included above and in the referenced
> > > > > > > > > > benchmark results]
> > > > > > > > > 
> > > > > > > > > > The most obvious impact of this series will likely be the overall
> > > > > > > > > > improvement in graphics performance on systems with an IGP
> > > > > > > > > > integrated into the processor package (though for the moment this
> > > > > > > > > > is only enabled on BXT+), because the TDP budget shared among CPU
> > > > > > > > > > and GPU can frequently become a limiting factor in low-power
> > > > > > > > > > devices.  On heavily TDP-bound devices this series improves
> > > > > > > > > > performance of virtually any non-trivial graphics rendering by a
> > > > > > > > > > significant amount (of the order of the energy efficiency
> > > > > > > > > > improvement for that workload assuming the optimization didn't
> > > > > > > > > > cause it to become non-TDP-bound).
> > > > > > > > > 
> > > > > > > > > > See [1]-[5] for detailed numbers including various graphics
> > > > > > > > > > benchmarks and a sample of the Phoronix daily-system-tracker.
> > > > > > > > > > Some popular graphics benchmarks like GfxBench gl_manhattan31 and
> > > > > > > > > > gl_4 improve between 5% and 11% on our systems.  The exact
> > > > > > > > > > improvement can vary substantially between systems (compare the
> > > > > > > > > > benchmark results from the two different J3455 systems [1] and
> > > > > > > > > > [3]) due to a number of factors, including the ratio between CPU
> > > > > > > > > > and GPU processing power, the behavior of the userspace graphics
> > > > > > > > > > driver, the windowing system and resolution, the BIOS (which has
> > > > > > > > > > an influence on the package TDP), the thermal characteristics of
> > > > > > > > > > the system, etc.
> > > > > > > > > 
> > > > > > > > > > Unigine Valley and Heaven improve by a similar factor on some
> > > > > > > > > > systems (see the J3455 results [1]), but on others the
> > > > > > > > > > improvement is lower because the benchmark fails to fully utilize
> > > > > > > > > > the GPU, which causes the heuristic to remain in low-latency
> > > > > > > > > > state for longer, which leaves a reduced TDP budget available to
> > > > > > > > > > the GPU, which prevents performance from increasing further.
> > > > > > > > > > This can be avoided by using the alternative heuristic parameters
> > > > > > > > > > suggested in the commit message of PATCH 8, which provide a lower
> > > > > > > > > > IO utilization threshold and hysteresis for the controller to
> > > > > > > > > > attempt to save energy.  I'm not proposing those for upstream
> > > > > > > > > > (yet) because they would also increase the risk for
> > > > > > > > > > latency-sensitive IO-heavy workloads to regress (like SynMark2
> > > > > > > > > > OglTerrainFly* and some arguably poorly designed IPC-bound X11
> > > > > > > > > > benchmarks).
> > > > > > > > > 
> > > > > > > > > > Discrete graphics aren't likely to experience that much of a
> > > > > > > > > > visible improvement from this, even though many non-IGP workloads
> > > > > > > > > > *could* benefit by reducing the system's energy usage while the
> > > > > > > > > > discrete GPU (or really, any other IO device) becomes a
> > > > > > > > > > bottleneck, but this is not attempted in this series, since that
> > > > > > > > > > would involve making an energy efficiency/latency trade-off that
> > > > > > > > > > only the maintainers of the respective drivers are in a position
> > > > > > > > > > to make.  The cpufreq interface introduced in PATCH 1 to achieve
> > > > > > > > > > this is left as an opt-in for that reason, only the i915 DRM
> > > > > > > > > > driver is hooked up since it will get the most direct pay-off due
> > > > > > > > > > to the increased energy budget available to the GPU, but other
> > > > > > > > > > power-hungry third-party gadgets built into the same package
> > > > > > > > > > (*cough* AMD *cough* Mali *cough* PowerVR *cough*) may be able to
> > > > > > > > > > benefit from this interface eventually by instrumenting the
> > > > > > > > > > driver in a similar way.
> > > > > > > > > 
> > > > > > > > > > The cpufreq interface is not exclusively tied to the intel_pstate
> > > > > > > > > > driver, because other governors can make use of the statistic
> > > > > > > > > > calculated as result to avoid over-optimizing for latency in
> > > > > > > > > > scenarios where a lower frequency would be able to achieve
> > > > > > > > > > similar throughput while using less energy.  The interpretation
> > > > > > > > > > of this statistic relies on the observation that for as long as
> > > > > > > > > > the system is CPU-bound, any IO load occurring as a result of the
> > > > > > > > > > execution of a program will scale roughly linearly with the clock
> > > > > > > > > > frequency the program is run at, so (assuming that the CPU has
> > > > > > > > > > enough processing power) a point will be reached at which the
> > > > > > > > > > program won't be able to execute faster with increasing CPU
> > > > > > > > > > frequency because the throughput limits of some device will have
> > > > > > > > > > been attained.  Increasing frequencies past that point only
> > > > > > > > > > pessimizes energy usage for no real benefit -- The optimal
> > > > > > > > > > behavior is for the CPU to lock to the minimum frequency that is
> > > > > > > > > > able to keep the IO devices involved fully utilized (assuming we
> > > > > > > > > > are past the maximum-efficiency inflection point of the CPU's
> > > > > > > > > > power-to-frequency curve), which is roughly the goal of this
> > > > > > > > > > series.
> > > > > > > > > 
> > > > > > > > > > PELT could be a useful extension for this model since its largely
> > > > > > > > > > heuristic assumptions would become more accurate if the IO and
> > > > > > > > > > CPU load could be tracked separately for each scheduling entity,
> > > > > > > > > > but this is not attempted in this series because the additional
> > > > > > > > > > complexity and computational cost of such an approach is hard to
> > > > > > > > > > justify at this stage, particularly since the current governor
> > > > > > > > > > has similar limitations.
> > > > > > > > > 
> > > > > > > > > > Various frequency and step-function response graphs are available
> > > > > > > > > > in [6]-[9] for comparison (obtained empirically on a BXT J3455
> > > > > > > > > > system).  The response curves for the low-latency and low-power
> > > > > > > > > > states of the heuristic are shown separately -- As you can see
> > > > > > > > > > they roughly bracket the frequency response curve of the current
> > > > > > > > > > governor.  The step response of the aggressive heuristic is
> > > > > > > > > > within a single update period (even though it's not quite obvious
> > > > > > > > > > from the graph with the levels of zoom provided).  I'll attach
> > > > > > > > > > benchmark results from a slower but non-TDP-limited machine
> > > > > > > > > > (which means there will be no TDP budget increase that could
> > > > > > > > > > possibly mask a performance regression of other kind) as soon as
> > > > > > > > > > they come out.
> > > > > > > > > 
> > > > > > > > > > Thanks to Eero and Valtteri for testing a number of intermediate
> > > > > > > > > > revisions of this series (and there were quite a few of them) in
> > > > > > > > > > more than half a dozen systems, they helped spot quite a few
> > > > > > > > > > issues of earlier versions of this heuristic.
> > > > > > > > > 
> > > > > > > > > > [PATCH 1/9] cpufreq: Implement infrastructure keeping track of aggregated IO active time.
> > > > > > > > > > [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with core_funcs"
> > > > > > > > > > [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long names"
> > > > > > > > > > [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()"
> > > > > > > > > > [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs"
> > > > > > > > > > [PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass filtering controller for small core.
> > > > > > > > > > [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller based on ACPI FADT profile.
> > > > > > > > > > [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller parameters via debugfs.
> > > > > > > > > > [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.
> > > > > > > > > 
> > > > > > > > > > [1] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455.log
> > > > > > > > > > [2] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-per-watt-comparison-J3455.log
> > > > > > > > > > [3] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455-1.log
> > > > > > > > > > [4] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J4205.log
> > > > > > > > > > [5] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J5005.log
> > > > > > > > > > [6] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-magnitude-comparison.svg
> > > > > > > > > > [7] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-phase-comparison.svg
> > > > > > > > > > [8] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-1.svg
> > > > > > > > > > [9] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-2.svg
> > > > > > 
> > > > > > _______________________________________________
> > > > > > Intel-gfx mailing list
> > > > > > Intel-gfx@lists.freedesktop.org
> > > > > > https://lists.freedesktop.org/mailman/listinfo/intel-gfx
> 
> 