Hi Chris,

Chris Wilson writes:

> Quoting Francisco Jerez (2018-03-28 07:38:45)
>> This allows cpufreq governors to realize when the system becomes
>> non-CPU-bound due to GPU rendering activity, which will cause the
>> intel_pstate LP controller to behave more conservatively: CPU energy
>> usage will be reduced when there isn't a good chance for system
>> performance to scale with CPU frequency.  This leaves additional TDP
>> budget available for the GPU to reach higher frequencies, which is
>> translated into an improvement in graphics performance to the extent
>> that the workload remains TDP-limited (Most non-trivial graphics
>> benchmarks out there improve significantly in TDP-constrained
>> platforms, see the cover letter for some numbers).  If the workload
>> isn't (anymore) TDP-limited performance should stay roughly constant,
>> but energy usage will be divided by a similar factor.
>
> And that's what I thought IPS was already meant to be achieving;
> intelligent distribution between different units...
>

I'm not aware of IPS ever being available on any small-core system.

>> The intel_pstate LP controller is only enabled on BXT+ small-core
>> platforms at this point, so this shouldn't have any effect on other
>> systems.
>
> Although that's probably only a feature for big core :)
>

Actually, I wouldn't rule out it being useful on big core up front.
I've been playing around with the idea of hooking up this same
interface to the EPP preference used by HWP, which would allow the HWP
controller to reduce energy usage in GPU-bound conditions and
potentially leave additional TDP headroom available for the GPU -- it's
not uncommon for graphics workloads to be TDP-limited even on big core.
And for workloads that aren't TDP-limited (where neither IPS nor IBC
will ever help you), it would still allow the CPU to optimize its
energy usage whenever the workload is consistently GPU-bound, which
would mean longer battery life while gaming on a big-core laptop.

>> Signed-off-by: Francisco Jerez
>> ---
>>  drivers/gpu/drm/i915/intel_lrc.c | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
>> index 3a69b367e565..721f915115bd 100644
>> --- a/drivers/gpu/drm/i915/intel_lrc.c
>> +++ b/drivers/gpu/drm/i915/intel_lrc.c
>> @@ -132,6 +132,7 @@
>>   *
>>   */
>>  #include <linux/interrupt.h>
>> +#include <linux/cpufreq.h>
>>
>>  #include <drm/drmP.h>
>>  #include <drm/i915_drm.h>
>> @@ -379,11 +380,13 @@ execlists_context_schedule_in(struct i915_request *rq)
>>  {
>>          execlists_context_status_change(rq, INTEL_CONTEXT_SCHEDULE_IN);
>>          intel_engine_context_in(rq->engine);
>> +        cpufreq_io_active_begin();
>
> Since you only care about a binary value for GPU activity, we don't need
> to do this on each context, just between submitting the first request
> and the final completion, i.e. couple this to EXECLISTS_ACTIVE_USER.
>
> Haven't yet gone back to check how heavy io_active_begin/end are, but I
> trust you appreciate that this code is particularly latency sensitive.

The cpufreq_io_active_begin/end() interfaces are probably as
lightweight as they can be.  There's no locking involved.  In cases
where there already is an overlapping request, cpufreq_io_active_begin()
and cpufreq_io_active_end() should return without doing much more than
an atomic increment and an atomic cmpxchg respectively.  That said, it
wouldn't hurt to call each of them only once per sequence of
overlapping requests.
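To give an idea of the order of magnitude, the fast path boils down to
something like the sketch below.  This is a simplified illustration with
made-up names rather than the actual cpufreq code, but it shows why the
overlapping case is cheap: no locks, just an atomic counter, with the
heavier work confined to the 0 <-> 1 transitions.

#include <linux/atomic.h>

/* Illustrative only -- assumes balanced begin/end calls. */
static atomic_t sketch_io_count = ATOMIC_INIT(0);

static void sketch_io_active_begin(void)
{
	/*
	 * Overlapping request: nothing beyond the atomic increment.
	 * Only the 0 -> 1 transition needs to notify the governor.
	 */
	if (atomic_inc_return(&sketch_io_count) == 1) {
		/* IO became active: timestamp/notify the governor here. */
	}
}

static void sketch_io_active_end(void)
{
	int old = atomic_read(&sketch_io_count);

	/* Overlapping request: one successful cmpxchg and we're done. */
	while (old > 1) {
		int prev = atomic_cmpxchg(&sketch_io_count, old, old - 1);

		if (prev == old)
			return;
		old = prev;
	}

	/* Possible 1 -> 0 transition: IO went idle. */
	if (atomic_dec_and_test(&sketch_io_count)) {
		/* report accumulated busy time to the governor here */
	}
}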
Do you want me to call them from execlists_set/clear_active() themselves
when bit == EXECLISTS_ACTIVE_USER, or at each callsite of
execlists_set/clear_active()?  [possibly protected with a check that
execlists_is_active(EXECLISTS_ACTIVE_USER) wasn't already the expected
value]  A rough sketch of the first option is below, just to make sure
we're talking about the same thing.

> -Chris
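Untested sketch of the first option, against the current helpers in
intel_ringbuffer.h; it assumes that header also pulls in
linux/cpufreq.h, and the non-atomic __test_and_set/clear_bit() calls
stand in for the "wasn't already the expected value" check:

static inline void
execlists_set_active(struct intel_engine_execlists *execlists,
		     unsigned int bit)
{
	/* Bit was already set: overlapping activity, nothing to report. */
	if (__test_and_set_bit(bit, (unsigned long *)&execlists->active))
		return;

	/* 0 -> 1 transition of EXECLISTS_ACTIVE_USER: GPU went busy. */
	if (bit == EXECLISTS_ACTIVE_USER)
		cpufreq_io_active_begin();
}

static inline void
execlists_clear_active(struct intel_engine_execlists *execlists,
		       unsigned int bit)
{
	/* Bit was already clear: nothing to report. */
	if (!__test_and_clear_bit(bit, (unsigned long *)&execlists->active))
		return;

	/* 1 -> 0 transition of EXECLISTS_ACTIVE_USER: GPU went idle. */
	if (bit == EXECLISTS_ACTIVE_USER)
		cpufreq_io_active_end();
}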