Francisco Jerez writes:

> Hi Srinivas,
>
> Srinivas Pandruvada writes:
>
>> On Tue, 2018-04-10 at 15:28 -0700, Francisco Jerez wrote:
>>> Francisco Jerez writes:
>>>
>> [...]
>>
>>> For the case anyone is wondering what's going on, Srinivas pointed me
>>> at a larger idle power usage increase off-list, ultimately caused by
>>> the low-latency heuristic as discussed in the paragraph above.  I have
>>> a v2 of PATCH 6 that gives the controller a third response curve
>>> roughly intermediate between the low-latency and low-power states of
>>> this revision, which avoids the energy usage increase expected for v1
>>> while C0 residency is low (e.g. during idle).  The low-latency
>>> behavior of this revision is still going to be available based on a
>>> heuristic (in particular when a realtime-priority task is scheduled).
>>> We're carrying out some additional testing; I'll post the code here
>>> eventually.
>>
>> Please try the schedutil governor also.  There is a frequency-invariance
>> patch, which I can send you (this will eventually be pushed by Peter).
>> We want to avoid adding complexity to intel_pstate for non-HWP,
>> power-sensitive platforms as far as possible.
>>
>
> Unfortunately the schedutil governor (whether frequency-invariant or
> not) has the exact same energy efficiency issues as the present
> intel_pstate non-HWP governor.  Its response is severely underdamped,
> leading to energy-inefficient behavior for any oscillating,
> non-CPU-bound workload.  To exacerbate that problem, the frequency is
> maxed out on frequent IO waiting, just like the current intel_pstate
> cpu-load

"just like" here is possibly somewhat unfair to the schedutil governor:
admittedly its progressive IOWAIT boosting behavior seems somewhat less
wasteful than the intel_pstate non-HWP governor's IOWAIT boosting
behavior, but it's still largely unhelpful in IO-bound conditions.

> controller does, even though the frequent IO waits may actually be an
> indication that the system is IO-bound (which means that the large
> energy usage increase may not translate into any performance benefit in
> practice, not to speak of performance being impacted negatively in
> TDP-bound scenarios like GPU rendering).
>
> Regarding run-time complexity, I haven't observed this governor to be
> measurably more computationally intensive than the present one.  It's a
> bunch more instructions indeed, but still within the same ballpark as
> the current governor.  The average increase in CPU utilization on my
> BXT with this series is less than 0.03% (sampled via ftrace for v1; I
> can repeat the measurement for the v2 I have in the works, though I
> don't expect the result to be substantially different).  If this is a
> problem for you, there are several optimization opportunities that
> would cut down the number of CPU cycles get_target_pstate_lp() takes
> to execute by a large percentage (most of the optimization ideas I can
> think of right now would come at some
> accuracy/maintainability/debuggability cost, but may still be worth
> pursuing), but the computational overhead is low enough at this point
> that the impact on any benchmark or real workload would be orders of
> magnitude lower than its variance, which makes it kind of difficult to
> keep the discussion data-driven [as possibly any performance
> optimization discussion should ever be ;)].
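(As a side note, for anyone trying to picture the difference between the
low-latency, low-power and intermediate response curves mentioned above:
the gist of a variably low-pass filtering controller can be illustrated
with a toy first-order IIR filter whose gain determines how quickly the
smoothed utilization estimate tracks the raw signal.  This is only a
sketch of the general idea, not the PATCH 6 code; the structure,
identifiers and gain values below are made up for illustration.)

#include <stdio.h>

struct lp_state {
	double filtered_util;	/* smoothed utilization estimate, 0..1 */
	double gain;		/* filter gain per sample, 0 < gain <= 1 */
};

/* Feed one raw utilization sample into a first-order IIR low-pass filter. */
static double lp_update(struct lp_state *s, double util_sample)
{
	s->filtered_util += s->gain * (util_sample - s->filtered_util);
	return s->filtered_util;
}

int main(void)
{
	/*
	 * Three hypothetical response curves: a gain of 1 reacts within a
	 * single sample (underdamped, low-latency), a small gain reacts
	 * slowly (heavily damped, low-power), and an intermediate gain
	 * sits between the two.
	 */
	struct lp_state curves[] = {
		{ .filtered_util = 0.0, .gain = 1.0 },	/* low-latency */
		{ .filtered_util = 0.0, .gain = 0.4 },	/* intermediate */
		{ .filtered_util = 0.0, .gain = 0.1 },	/* low-power */
	};
	static const char *names[] = { "low-latency", "intermediate", "low-power" };
	int t, i;

	for (t = 0; t < 10; t++) {
		/* Utilization steps from 0 to 1 at t = 2. */
		double sample = (t >= 2) ? 1.0 : 0.0;

		printf("t=%d", t);
		for (i = 0; i < 3; i++)
			printf("  %s=%.2f", names[i],
			       lp_update(&curves[i], sample));
		printf("\n");
	}
	return 0;
}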
>
>>
>> Thanks,
>> Srinivas
>>
>>>
>>> > [Absolute benchmark results are unfortunately omitted from this
>>> > letter due to company policies, but the percent change and Student's
>>> > T p-value are included above and in the referenced benchmark results]
>>> >
>>> > The most obvious impact of this series will likely be the overall
>>> > improvement in graphics performance on systems with an IGP integrated
>>> > into the processor package (though for the moment this is only
>>> > enabled on BXT+), because the TDP budget shared among CPU and GPU can
>>> > frequently become a limiting factor in low-power devices.  On heavily
>>> > TDP-bound devices this series improves performance of virtually any
>>> > non-trivial graphics rendering by a significant amount (of the order
>>> > of the energy efficiency improvement for that workload, assuming the
>>> > optimization didn't cause it to become non-TDP-bound).
>>> >
>>> > See [1]-[5] for detailed numbers including various graphics
>>> > benchmarks and a sample of the Phoronix daily-system-tracker.  Some
>>> > popular graphics benchmarks like GfxBench gl_manhattan31 and gl_4
>>> > improve between 5% and 11% on our systems.  The exact improvement can
>>> > vary substantially between systems (compare the benchmark results
>>> > from the two different J3455 systems [1] and [3]) due to a number of
>>> > factors, including the ratio between CPU and GPU processing power,
>>> > the behavior of the userspace graphics driver, the windowing system
>>> > and resolution, the BIOS (which has an influence on the package TDP),
>>> > the thermal characteristics of the system, etc.
>>> >
>>> > Unigine Valley and Heaven improve by a similar factor on some systems
>>> > (see the J3455 results [1]), but on others the improvement is lower
>>> > because the benchmark fails to fully utilize the GPU, which causes
>>> > the heuristic to remain in the low-latency state for longer, which
>>> > leaves a reduced TDP budget available to the GPU, which prevents
>>> > performance from increasing further.  This can be avoided by using
>>> > the alternative heuristic parameters suggested in the commit message
>>> > of PATCH 8, which provide a lower IO utilization threshold and
>>> > hysteresis for the controller to attempt to save energy.  I'm not
>>> > proposing those for upstream (yet) because they would also increase
>>> > the risk of latency-sensitive, IO-heavy workloads regressing (like
>>> > SynMark2 OglTerrainFly* and some arguably poorly designed IPC-bound
>>> > X11 benchmarks).
>>> >
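(For anyone unfamiliar with the threshold-plus-hysteresis parameters
referred to above, the general mechanism is that the controller only
switches into its energy-saving response once IO utilization has crossed
an upper threshold, and only switches back once it has dropped below a
lower one, so it doesn't flap when the signal hovers around a single
cut-off.  The toy below illustrates only that mechanism; it is not the
PATCH 8 code, and all identifiers and numbers in it are made up.)

#include <stdbool.h>
#include <stdio.h>

#define IO_UTIL_HIGH	80	/* enter energy-saving state above this (%) */
#define IO_UTIL_LOW	60	/* leave energy-saving state below this (%) */

static bool low_power;

static void update_state(int io_util_pct)
{
	if (!low_power && io_util_pct > IO_UTIL_HIGH)
		low_power = true;
	else if (low_power && io_util_pct < IO_UTIL_LOW)
		low_power = false;
}

int main(void)
{
	/* A utilization trace hovering around the upper threshold. */
	int trace[] = { 40, 75, 85, 78, 82, 70, 55, 65, 90 };
	unsigned int i;

	for (i = 0; i < sizeof(trace) / sizeof(trace[0]); i++) {
		update_state(trace[i]);
		printf("io_util=%d%% -> %s\n", trace[i],
		       low_power ? "low-power" : "low-latency");
	}
	return 0;
}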
>>> > Discrete graphics aren't likely to experience that much of a visible
>>> > improvement from this, even though many non-IGP workloads *could*
>>> > benefit by reducing the system's energy usage while the discrete GPU
>>> > (or really, any other IO device) becomes a bottleneck, but this is
>>> > not attempted in this series, since that would involve making an
>>> > energy efficiency/latency trade-off that only the maintainers of the
>>> > respective drivers are in a position to make.  The cpufreq interface
>>> > introduced in PATCH 1 to achieve this is left as an opt-in for that
>>> > reason; only the i915 DRM driver is hooked up, since it will get the
>>> > most direct pay-off due to the increased energy budget available to
>>> > the GPU, but other power-hungry third-party gadgets built into the
>>> > same package (*cough* AMD *cough* Mali *cough* PowerVR *cough*) may
>>> > be able to benefit from this interface eventually by instrumenting
>>> > the driver in a similar way.
>>> >
>>> > The cpufreq interface is not exclusively tied to the intel_pstate
>>> > driver, because other governors can make use of the statistic
>>> > calculated as a result to avoid over-optimizing for latency in
>>> > scenarios where a lower frequency would be able to achieve similar
>>> > throughput while using less energy.  The interpretation of this
>>> > statistic relies on the observation that, for as long as the system
>>> > is CPU-bound, any IO load occurring as a result of the execution of
>>> > a program will scale roughly linearly with the clock frequency the
>>> > program is run at, so (assuming that the CPU has enough processing
>>> > power) a point will be reached at which the program won't be able to
>>> > execute faster with increasing CPU frequency because the throughput
>>> > limits of some device will have been attained.  Increasing
>>> > frequencies past that point only pessimizes energy usage for no real
>>> > benefit -- the optimal behavior is for the CPU to lock to the minimum
>>> > frequency that is able to keep the IO devices involved fully utilized
>>> > (assuming we are past the maximum-efficiency inflection point of the
>>> > CPU's power-to-frequency curve), which is roughly the goal of this
>>> > series.
>>> >
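(To make the reasoning above slightly more concrete, here is a rough
sketch of how an IO-activity statistic of this kind could be turned into
a frequency ceiling.  This is not the interface added by PATCH 1 nor the
LP controller's actual math; the function, thresholds and numbers below
are made up purely for illustration.)

#include <stdio.h>

#define IO_UTIL_THRESHOLD	0.95	/* treat >95% IO-busy as IO-bound */
#define HEADROOM		1.10	/* 10% safety margin */

/*
 * Given the fraction of the sampling interval during which some IO
 * device was busy and the CPU busy fraction at the current frequency,
 * estimate the lowest frequency that should still keep the device
 * saturated.  While the device stays saturated, the CPU work per
 * interval is roughly fixed, so the CPU could run about cpu_busy times
 * slower and still finish in time; anything above that only costs
 * energy.
 */
static double freq_ceiling_khz(double cur_freq_khz, double cpu_busy,
			       double io_busy)
{
	if (io_busy < IO_UTIL_THRESHOLD)
		return -1.0;	/* CPU-bound or idle: no clamp */

	return cur_freq_khz * cpu_busy * HEADROOM;
}

int main(void)
{
	/* Example: 2.3 GHz, CPU busy 40% of the time, GPU saturated. */
	double ceil = freq_ceiling_khz(2300000.0, 0.40, 0.99);

	printf("suggested frequency ceiling: %.0f kHz\n", ceil);
	return 0;
}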
>>> > PELT could be a useful extension for this model since its largely
>>> > heuristic assumptions would become more accurate if the IO and CPU
>>> > load could be tracked separately for each scheduling entity, but
>>> > this is not attempted in this series because the additional
>>> > complexity and computational cost of such an approach is hard to
>>> > justify at this stage, particularly since the current governor has
>>> > similar limitations.
>>> >
>>> > Various frequency and step-function response graphs are available in
>>> > [6]-[9] for comparison (obtained empirically on a BXT J3455 system).
>>> > The response curves for the low-latency and low-power states of the
>>> > heuristic are shown separately -- as you can see, they roughly
>>> > bracket the frequency response curve of the current governor.  The
>>> > step response of the aggressive heuristic is within a single update
>>> > period (even though it's not quite obvious from the graph with the
>>> > levels of zoom provided).  I'll attach benchmark results from a
>>> > slower but non-TDP-limited machine (which means there will be no TDP
>>> > budget increase that could possibly mask a performance regression of
>>> > another kind) as soon as they come out.
>>> >
>>> > Thanks to Eero and Valtteri for testing a number of intermediate
>>> > revisions of this series (and there were quite a few of them) on
>>> > more than half a dozen systems; they helped spot quite a few issues
>>> > of earlier versions of this heuristic.
>>> >
>>> > [PATCH 1/9] cpufreq: Implement infrastructure keeping track of aggregated IO active time.
>>> > [PATCH 2/9] Revert "cpufreq: intel_pstate: Replace bxt_funcs with core_funcs"
>>> > [PATCH 3/9] Revert "cpufreq: intel_pstate: Shorten a couple of long names"
>>> > [PATCH 4/9] Revert "cpufreq: intel_pstate: Simplify intel_pstate_adjust_pstate()"
>>> > [PATCH 5/9] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs"
>>> > [PATCH 6/9] cpufreq/intel_pstate: Implement variably low-pass filtering controller for small core.
>>> > [PATCH 7/9] SQUASH: cpufreq/intel_pstate: Enable LP controller based on ACPI FADT profile.
>>> > [PATCH 8/9] OPTIONAL: cpufreq/intel_pstate: Expose LP controller parameters via debugfs.
>>> > [PATCH 9/9] drm/i915/execlists: Report GPU rendering as IO activity to cpufreq.
>>> >
>>> > [1] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455.log
>>> > [2] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-per-watt-comparison-J3455.log
>>> > [3] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J3455-1.log
>>> > [4] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J4205.log
>>> > [5] http://people.freedesktop.org/~currojerez/intel_pstate-lp/benchmark-perf-comparison-J5005.log
>>> > [6] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-magnitude-comparison.svg
>>> > [7] http://people.freedesktop.org/~currojerez/intel_pstate-lp/frequency-response-phase-comparison.svg
>>> > [8] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-1.svg
>>> > [9] http://people.freedesktop.org/~currojerez/intel_pstate-lp/step-response-comparison-2.svg