From: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
To: Francisco Jerez <currojerez@riseup.net>,
	linux-pm@vger.kernel.org, intel-gfx@lists.freedesktop.org
Cc: Peter Zijlstra <peterz@infradead.org>,
	Eero Tamminen <eero.t.tamminen@intel.com>,
	"Rafael J. Wysocki" <rjw@rjwysocki.net>
Subject: Re: [PATCH 0/9] GPU-bound energy efficiency improvements for the intel_pstate driver.
Date: Wed, 11 Apr 2018 23:17:05 -0700
Message-ID: <1523513825.9016.1.camel@linux.intel.com>
In-Reply-To: <87muy97lr0.fsf@riseup.net>

On Wed, 2018-04-11 at 09:26 -0700, Francisco Jerez wrote:
> 
> "just like" here is possibly somewhat unfair to the schedutil governor,
> admittedly its progressive IOWAIT boosting behavior seems somewhat less
> wasteful than the intel_pstate non-HWP governor's IOWAIT boosting
> behavior, but it's still largely unhelpful on IO-bound conditions.
> 

OK, if you think so, then improve it for the schedutil governor or other
mechanisms (as Juri suggested) instead of intel_pstate. This will
benefit all architectures, including x86 systems that don't use i915.

BTW, intel_pstate can be driven by the schedutil governor (passive mode),
so if you prove the benefits on Broxton, this can become the default.
As before:
- No regression to idle power at all. This is more important than
  benchmarks.
- Not just the score; performance/watt is important.
Thanks,
Srinivas


> > controller does, even though the frequent IO waits may actually be an
> > indication that the system is IO-bound (which means that the large
> > energy usage increase may not be translated in any performance benefit
> > in practice, not to speak of performance being impacted negatively in
> > TDP-bound scenarios like GPU rendering).
> > 
> > Regarding run-time complexity, I haven't observed this governor to be
> > measurably more computationally intensive than the present one.  It's a
> > bunch more instructions indeed, but still within the same ballpark as
> > the current governor.  The average increase in CPU utilization on my BXT
> > with this series is less than 0.03% (sampled via ftrace for v1, I can
> > repeat the measurement for the v2 I have in the works, though I don't
> > expect the result to be substantially different).  If this is a problem
> > for you there are several optimization opportunities that would cut down
> > the number of CPU cycles get_target_pstate_lp() takes to execute by a
> > large percent (most of the optimization ideas I can think of right now
> > though would come at some accuracy/maintainability/debuggability cost,
> > but may still be worth pursuing), but the computational overhead is low
> > enough at this point that the impact on any benchmark or real workload
> > would be orders of magnitude lower than its variance, which makes it
> > kind of difficult to keep the discussion data-driven [as possibly any
> > performance optimization discussion should ever be ;)].
> > 
> > > 
> > > Thanks,
> > > Srinivas
> > > 
> > > 
> > > 
> > > > 
> > > > > > [Absolute benchmark results are unfortunately omitted from this letter
> > > > > > due to company policies, but the percent change and Student's T
> > > > > > p-value are included above and in the referenced benchmark results]
> > > > > 
> > > > > > The most obvious impact of this series will likely be the overall
> > > > > > improvement in graphics performance on systems with an IGP integrated
> > > > > > into the processor package (though for the moment this is only enabled
> > > > > > on BXT+), because the TDP budget shared among CPU and GPU can
> > > > > > frequently become a limiting factor in low-power devices.  On heavily
> > > > > > TDP-bound devices this series improves performance of virtually any
> > > > > > non-trivial graphics rendering by a significant amount (of the order
> > > > > > of the energy efficiency improvement for that workload assuming the
> > > > > > optimization didn't cause it to become non-TDP-bound).
> > > > > 
> > > > > > See [1]-[5] for detailed numbers including various graphics benchmarks
> > > > > > and a sample of the Phoronix daily-system-tracker.  Some popular
> > > > > > graphics benchmarks like GfxBench gl_manhattan31 and gl_4 improve
> > > > > > between 5% and 11% on our systems.  The exact improvement can vary
> > > > > > substantially between systems (compare the benchmark results from the
> > > > > > two different J3455 systems [1] and [3]) due to a number of factors,
> > > > > > including the ratio between CPU and GPU processing power, the behavior
> > > > > > of the userspace graphics driver, the windowing system and resolution,
> > > > > > the BIOS (which has an influence on the package TDP), the thermal
> > > > > > characteristics of the system, etc.
> > > > > 
> > > > > > Unigine Valley and Heaven improve by a similar factor on some systems
> > > > > > (see the J3455 results [1]), but on others the improvement is lower
> > > > > > because the benchmark fails to fully utilize the GPU, which causes the
> > > > > > heuristic to remain in low-latency state for longer, which leaves a
> > > > > > reduced TDP budget available to the GPU, which prevents performance
> > > > > > from increasing further.  This can be avoided by using the alternative
> > > > > > heuristic parameters suggested in the commit message of PATCH 8, which
> > > > > > provide a lower IO utilization threshold and hysteresis for the
> > > > > > controller to attempt to save energy.  I'm not proposing those for
> > > > > > upstream (yet) because they would also increase the risk for
> > > > > > latency-sensitive IO-heavy workloads to regress (like SynMark2
> > > > > > OglTerrainFly* and some arguably poorly designed IPC-bound X11
> > > > > > benchmarks).
> > > > > 
> > > > > > Discrete graphics aren't likely to experience that much of a visible
> > > > > > improvement from this, even though many non-IGP workloads *could*
> > > > > > benefit by reducing the system's energy usage while the discrete GPU
> > > > > > (or really, any other IO device) becomes a bottleneck, but this is not
> > > > > > attempted in this series, since that would involve making an energy
> > > > > > efficiency/latency trade-off that only the maintainers of the
> > > > > > respective drivers are in a position to make.  The cpufreq interface
> > > > > > introduced in PATCH 1 to achieve this is left as an opt-in for that
> > > > > > reason, only the i915 DRM driver is hooked up since it will get the
> > > > > > most direct pay-off due to the increased energy budget available to
> > > > > > the GPU, but other power-hungry third-party gadgets built into the
> > > > > > same package (*cough* AMD *cough* Mali *cough* PowerVR *cough*) may be
> > > > > > able to benefit from this interface eventually by instrumenting the
> > > > > > driver in a similar way.
> > > > > 
> > > > > > The cpufreq interface is not exclusively tied to the intel_pstate
> > > > > > driver, because other governors can make use of the statistic
> > > > > > calculated as result to avoid over-optimizing for latency in scenarios
> > > > > > where a lower frequency would be able to achieve similar throughput
> > > > > > while using less energy.  The interpretation of this statistic relies
> > > > > > on the observation that for as long as the system is CPU-bound, any IO
> > > > > > load occurring as a result of the execution of a program will scale
> > > > > > roughly linearly with the clock frequency the program is run at, so
> > > > > > (assuming that the CPU has enough processing power) a point will be
> > > > > > reached at which the program won't be able to execute faster with
> > > > > > increasing CPU frequency because the throughput limits of some device
> > > > > > will have been attained.  Increasing frequencies past that point only
> > > > > > pessimizes energy usage for no real benefit -- The optimal behavior is
> > > > > > for the CPU to lock to the minimum frequency that is able to keep the
> > > > > > IO devices involved fully utilized (assuming we are past the
> > > > > > maximum-efficiency inflection point of the CPU's power-to-frequency
> > > > > > curve), which is roughly the goal of this series.
> > > > > 
> > > > > > PELT could be a useful extension for this model since its largely
> > > > > > heuristic assumptions would become more accurate if the IO and CPU
> > > > > > load could be tracked separately for each scheduling entity, but this
> > > > > > is not attempted in this series because the additional complexity and
> > > > > > computational cost of such an approach is hard to justify at this
> > > > > > stage, particularly since the current governor has similar
> > > > > > limitations.
> > > > > 
> > > > > > Various frequency and step-function response graphs are available in
> > > > > > [6]-[9] for comparison (obtained empirically on a BXT J3455 system).
> > > > > > The response curves for the low-latency and low-power states of the
> > > > > > heuristic are shown separately -- As you can see they roughly bracket
> > > > > > the frequency response curve of the current governor.  The step
> > > > > > response of the aggressive heuristic is within a single update period
> > > > > > (even though it's not quite obvious from the graph with the levels of
> > > > > > zoom provided).  I'll attach benchmark results from a slower but
> > > > > > non-TDP-limited machine (which means there will be no TDP budget
> > > > > > increase that could possibly mask a performance regression of other
> > > > > > kind) as soon as they come out.
> > > > > 
> > > > > > Thanks to Eero and Valtteri for testing a number of intermediate
> > > > > > revisions of this series (and there were quite a few of them) in more
> > > > > > than half a dozen systems, they helped spot quite a few issues of
> > > > > > earlier versions of this heuristic.
> > > > > 
> > > > > > [PATCH 1/9] cpufreq: Implement infrastructure keeping track of aggregated IO active time.
> > > > > > [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with core_funcs"
> > > > > > [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long names"
> > > > > > [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()"
> > > > > > [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs"
> > > > > > [PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass filtering controller for small core.
> > > > > > [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller based on ACPI FADT profile.
> > > > > > [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller parameters via debugfs.
> > > > > > [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.
> > > > > 
> > > > > > [1] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455.log
> > > > > > [2] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-per-watt-comparison-J3455.log
> > > > > > [3] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455-1.log
> > > > > > [4] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J4205.log
> > > > > > [5] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J5005.log
> > > > > > [6] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-magnitude-comparison.svg
> > > > > > [7] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-phase-comparison.svg
> > > > > > [8] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-1.svg
> > > > > > [9] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-2.svg
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
