Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.

From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
To: Francisco Jerez <currojerez@riseup.net>,
	linux-pm@vger.kernel.org, intel-gfx@lists.freedesktop.org
Cc: Peter Zijlstra <peterz@infradead.org>,
	Eero Tamminen <eero.t.tamminen@intel.com>,
	"Rafael J. Wysocki" <rjw@rjwysocki.net>
Subject: Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
Date: Wed, 11 Apr 2018 23:17:05 -0700
Message-ID: <1523513825.9016.1.camel@linux.intel.com>
In-Reply-To: <87muy97lr0.fsf@riseup.net>

On Wed, 2018-04-11 at 09:26 -0700, Francisco Jerez wrote:
> 
> "just like" here is possibly somewhat unfair to the schedutil governor,
> admittedly its progressive IOWAIT boosting behavior seems somewhat less
> wasteful than the intel_pstate non-HWP governor's IOWAIT boosting
> behavior, but it's still largely unhelpful on IO-bound conditions.
> 

OK, if you think so, then improve it in the schedutil governor or other
mechanisms (as Juri suggested) instead of intel_pstate. This will
benefit all architectures, including x86 and non-i915 systems.

BTW, intel_pstate can be driven by the schedutil governor (passive mode),
so if you prove benefits on Broxton, this can be a default.
As before:
- No regression to idle power at all. This is more important than
  benchmarks.
- Not just the score; performance/watt is important.
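
The two criteria above can be written down as a simple acceptance check.
This is only a minimal sketch: the field names, the zero idle tolerance
and the sample numbers are hypothetical and are not measurements from
this thread.

```python
def perf_per_watt(score, avg_power_w):
    """Benchmark score divided by average power drawn during the run."""
    return score / avg_power_w


def acceptable(base, new, idle_tolerance_w=0.0):
    """Check a candidate governor against a baseline.

    base/new are dicts with hypothetical keys 'score', 'load_power_w'
    and 'idle_power_w'.  Idle power is checked first because it is
    treated as more important than the benchmark result.
    """
    if new["idle_power_w"] > base["idle_power_w"] + idle_tolerance_w:
        return False  # any idle power regression is a rejection
    return (perf_per_watt(new["score"], new["load_power_w"])
            >= perf_per_watt(base["score"], base["load_power_w"]))


# Made-up numbers, for illustration only.
baseline = {"score": 100.0, "load_power_w": 10.0, "idle_power_w": 2.0}
candidate = {"score": 105.0, "load_power_w": 9.0, "idle_power_w": 2.0}
regressed = {"score": 120.0, "load_power_w": 9.0, "idle_power_w": 2.1}
```

With these assumed inputs the candidate passes, while the variant with
higher idle power is rejected even though its score is better.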

Thanks,
Srinivas

> > controller does, even though the frequent IO waits may actually be an
> > indication that the system is IO-bound (which means that the large
> > energy usage increase may not be translated in any performance benefit
> > in practice, not to speak of performance being impacted negatively in
> > TDP-bound scenarios like GPU rendering).
> > 
> > Regarding run-time complexity, I haven't observed this governor to be
> > measurably more computationally intensive than the present one.  It's a
> > bunch more instructions indeed, but still within the same ballpark as
> > the current governor.  The average increase in CPU utilization on my BXT
> > with this series is less than 0.03% (sampled via ftrace for v1, I can
> > repeat the measurement for the v2 I have in the works, though I don't
> > expect the result to be substantially different).  If this is a problem
> > for you there are several optimization opportunities that would cut down
> > the number of CPU cycles get_target_pstate_lp() takes to execute by a
> > large percent (most of the optimization ideas I can think of right now
> > though would come at some accuracy/maintainability/debuggability cost,
> > but may still be worth pursuing), but the computational overhead is low
> > enough at this point that the impact on any benchmark or real workload
> > would be orders of magnitude lower than its variance, which makes it
> > kind of difficult to keep the discussion data-driven [as possibly any
> > performance optimization discussion should ever be ;)].
> > 
> > > 
> > > Thanks,
> > > Srinivas
> > > 
> > > 
> > > 
> > > > 
> > > > > > [Absolute benchmark results are unfortunately omitted from this
> > > > > > letter due to company policies, but the percent change and
> > > > > > Student's T p-value are included above and in the referenced
> > > > > > benchmark results]
> > > > > 
> > > > > > The most obvious impact of this series will likely be the overall
> > > > > > improvement in graphics performance on systems with an IGP
> > > > > > integrated into the processor package (though for the moment this
> > > > > > is only enabled on BXT+), because the TDP budget shared among CPU
> > > > > > and GPU can frequently become a limiting factor in low-power
> > > > > > devices.  On heavily TDP-bound devices this series improves
> > > > > > performance of virtually any non-trivial graphics rendering by a
> > > > > > significant amount (of the order of the energy efficiency
> > > > > > improvement for that workload assuming the optimization didn't
> > > > > > cause it to become non-TDP-bound).
> > > > > 
> > > > > > See [1]-[5] for detailed numbers including various graphics
> > > > > > benchmarks and a sample of the Phoronix daily-system-tracker.
> > > > > > Some popular graphics benchmarks like GfxBench gl_manhattan31 and
> > > > > > gl_4 improve between 5% and 11% on our systems.  The exact
> > > > > > improvement can vary substantially between systems (compare the
> > > > > > benchmark results from the two different J3455 systems [1] and
> > > > > > [3]) due to a number of factors, including the ratio between CPU
> > > > > > and GPU processing power, the behavior of the userspace graphics
> > > > > > driver, the windowing system and resolution, the BIOS (which has
> > > > > > an influence on the package TDP), the thermal characteristics of
> > > > > > the system, etc.
> > > > > 
> > > > > > Unigine Valley and Heaven improve by a similar factor on some
> > > > > > systems (see the J3455 results [1]), but on others the
> > > > > > improvement is lower because the benchmark fails to fully utilize
> > > > > > the GPU, which causes the heuristic to remain in low-latency
> > > > > > state for longer, which leaves a reduced TDP budget available to
> > > > > > the GPU, which prevents performance from increasing further.
> > > > > > This can be avoided by using the alternative heuristic parameters
> > > > > > suggested in the commit message of PATCH 8, which provide a lower
> > > > > > IO utilization threshold and hysteresis for the controller to
> > > > > > attempt to save energy.  I'm not proposing those for upstream
> > > > > > (yet) because they would also increase the risk for
> > > > > > latency-sensitive IO-heavy workloads to regress (like SynMark2
> > > > > > OglTerrainFly* and some arguably poorly designed IPC-bound X11
> > > > > > benchmarks).
> > > > > 
> > > > > > Discrete graphics aren't likely to experience that much of a
> > > > > > visible improvement from this, even though many non-IGP workloads
> > > > > > *could* benefit by reducing the system's energy usage while the
> > > > > > discrete GPU (or really, any other IO device) becomes a
> > > > > > bottleneck, but this is not attempted in this series, since that
> > > > > > would involve making an energy efficiency/latency trade-off that
> > > > > > only the maintainers of the respective drivers are in a position
> > > > > > to make.  The cpufreq interface introduced in PATCH 1 to achieve
> > > > > > this is left as an opt-in for that reason, only the i915 DRM
> > > > > > driver is hooked up since it will get the most direct pay-off due
> > > > > > to the increased energy budget available to the GPU, but other
> > > > > > power-hungry third-party gadgets built into the same package
> > > > > > (*cough* AMD *cough* Mali *cough* PowerVR *cough*) may be able to
> > > > > > benefit from this interface eventually by instrumenting the
> > > > > > driver in a similar way.
> > > > > 
> > > > > > The cpufreq interface is not exclusively tied to the intel_pstate
> > > > > > driver, because other governors can make use of the statistic
> > > > > > calculated as result to avoid over-optimizing for latency in
> > > > > > scenarios where a lower frequency would be able to achieve
> > > > > > similar throughput while using less energy.  The interpretation
> > > > > > of this statistic relies on the observation that for as long as
> > > > > > the system is CPU-bound, any IO load occurring as a result of the
> > > > > > execution of a program will scale roughly linearly with the clock
> > > > > > frequency the program is run at, so (assuming that the CPU has
> > > > > > enough processing power) a point will be reached at which the
> > > > > > program won't be able to execute faster with increasing CPU
> > > > > > frequency because the throughput limits of some device will have
> > > > > > been attained.  Increasing frequencies past that point only
> > > > > > pessimizes energy usage for no real benefit -- The optimal
> > > > > > behavior is for the CPU to lock to the minimum frequency that is
> > > > > > able to keep the IO devices involved fully utilized (assuming we
> > > > > > are past the maximum-efficiency inflection point of the CPU's
> > > > > > power-to-frequency curve), which is roughly the goal of this
> > > > > > series.
> > > > > 
> > > > > > PELT could be a useful extension for this model since its largely
> > > > > > heuristic assumptions would become more accurate if the IO and
> > > > > > CPU load could be tracked separately for each scheduling entity,
> > > > > > but this is not attempted in this series because the additional
> > > > > > complexity and computational cost of such an approach is hard to
> > > > > > justify at this stage, particularly since the current governor
> > > > > > has similar limitations.
> > > > > 
> > > > > > Various frequency and step-function response graphs are available
> > > > > > in [6]-[9] for comparison (obtained empirically on a BXT J3455
> > > > > > system).  The response curves for the low-latency and low-power
> > > > > > states of the heuristic are shown separately -- As you can see
> > > > > > they roughly bracket the frequency response curve of the current
> > > > > > governor.  The step response of the aggressive heuristic is
> > > > > > within a single update period (even though it's not quite obvious
> > > > > > from the graph with the levels of zoom provided).  I'll attach
> > > > > > benchmark results from a slower but non-TDP-limited machine
> > > > > > (which means there will be no TDP budget increase that could
> > > > > > possibly mask a performance regression of other kind) as soon as
> > > > > > they come out.
> > > > > 
> > > > > > Thanks to Eero and Valtteri for testing a number of intermediate
> > > > > > revisions of this series (and there were quite a few of them) in
> > > > > > more than half a dozen systems, they helped spot quite a few
> > > > > > issues of earlier versions of this heuristic.
> > > > > 
> > > > > > [PATCH 1/9] cpufreq: Implement infrastructure keeping track of aggregated IO active time.
> > > > > > [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with core_funcs"
> > > > > > [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long names"
> > > > > > [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()"
> > > > > > [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs"
> > > > > > [PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass filtering controller for small core.
> > > > > > [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller based on ACPI FADT profile.
> > > > > > [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller parameters via debugfs.
> > > > > > [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.
> > > > > 
> > > > > > [1] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455.log
> > > > > > [2] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-per-watt-comparison-J3455.log
> > > > > > [3] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455-1.log
> > > > > > [4] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J4205.log
> > > > > > [5] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J5005.log
> > > > > > [6] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-magnitude-comparison.svg
> > > > > > [7] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-phase-comparison.svg
> > > > > > [8] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-1.svg
> > > > > > [9] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-2.svg
> > 
> > _______________________________________________
> > Intel-gfx mailing list
> > Intel-gfx@lists.freedesktop.org
> > https://lists.freedesktop.org/mailman/listinfo/intel-gfx