All of lore.kernel.org
 help / color / mirror / Atom feed
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Francisco Jerez <currojerez@riseup.net>,
	Chris Wilson <chris@chris-wilson.co.uk>,
	intel-gfx@lists.freedesktop.org, linux-pm@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>,
	"Rafael J. Wysocki" <rjw@rjwysocki.net>,
	"Pandruvada, Srinivas" <srinivas.pandruvada@intel.com>
Subject: Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
Date: Thu, 12 Mar 2020 11:52:37 +0000	[thread overview]
Message-ID: <36854a07-2dd5-694c-4b9d-a3eddf8e33f9@linux.intel.com> (raw)
In-Reply-To: <878sk6acqs.fsf@riseup.net>


On 11/03/2020 19:54, Francisco Jerez wrote:
> Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> writes:
> 
>> On 10/03/2020 22:26, Chris Wilson wrote:
>>> Quoting Francisco Jerez (2020-03-10 21:41:55)
>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>> index b9b3f78f1324..a5d7a80b826d 100644
>>>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>> @@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
>>>>           /* we need to manually load the submit queue */
>>>>           if (execlists->ctrl_reg)
>>>>                   writel(EL_CTRL_LOAD, execlists->ctrl_reg);
>>>> +
>>>> +       if (execlists_num_ports(execlists) > 1 &&
>>> pending[1] is always defined, the minimum submission is one slot, with
>>> pending[1] as the sentinel NULL.
>>>
>>>> +           execlists->pending[1] &&
>>>> +           !atomic_xchg(&execlists->overload, 1))
>>>> +               intel_gt_pm_active_begin(&engine->i915->gt);
>>>
>>> engine->gt
>>>
>>>>    }
>>>>    
>>>>    static bool ctx_single_port_submission(const struct intel_context *ce)
>>>> @@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
>>>>           clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
>>>>    
>>>>           WRITE_ONCE(execlists->active, execlists->inflight);
>>>> +
>>>> +       if (atomic_xchg(&execlists->overload, 0)) {
>>>> +               struct intel_engine_cs *engine =
>>>> +                       container_of(execlists, typeof(*engine), execlists);
>>>> +               intel_gt_pm_active_end(&engine->i915->gt);
>>>> +       }
>>>>    }
>>>>    
>>>>    static inline void
>>>> @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
>>>>                           /* port0 completed, advanced to port1 */
>>>>                           trace_ports(execlists, "completed", execlists->active);
>>>>    
>>>> +                       if (atomic_xchg(&execlists->overload, 0))
>>>> +                               intel_gt_pm_active_end(&engine->i915->gt);
>>>
>>> So this looses track if we preempt a dual-ELSP submission with a
>>> single-ELSP submission (and never go back to dual).
>>>
>>> If you move this to the end of the loop and check
>>>
>>> if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
>>> 	intel_gt_pm_active_end(engine->gt);
>>>
>>> so that it covers both preemption/promotion and completion.
>>>
>>> However, that will fluctuate quite rapidly. (And runs the risk of
>>> exceeding the sentinel.)
>>>
>>> An alternative approach would be to couple along
>>> schedule_in/schedule_out
>>>
>>> atomic_set(overload, -1);
>>>
>>> __execlists_schedule_in:
>>> 	if (!atomic_fetch_inc(overload)
>>> 		intel_gt_pm_active_begin(engine->gt);
>>> __execlists_schedule_out:
>>> 	if (!atomic_dec_return(overload)
>>> 		intel_gt_pm_active_end(engine->gt);
>>>
>>> which would mean we are overloaded as soon as we try to submit an
>>> overlapping ELSP.
>>
>> Putting it this low-level into submission code also would not work well
>> with GuC.
>>
> 
> I wrote a patch at some point that added calls to
> intel_gt_pm_active_begin() and intel_gt_pm_active_end() to the GuC
> submission code in order to obtain a similar effect.  However people
> requested me to leave GuC submission alone for the moment in order to
> avoid interference with SLPC.  At some point it might make sense to hook
> this up in combination with SLPC, because SLPC doesn't provide much of a
> CPU energy efficiency advantage in comparison to this series.
> 
>> How about we try to keep some accounting one level higher, as the i915
>> scheduler is passing requests on to the backend for execution?
>>
>> Or number of runnable contexts, if the distinction between contexts and
>> requests is better for this purpose.
>>
>> Problematic bit in going one level higher though is that the exit point
>> is less precisely coupled to the actual state. Or maybe with aggressive
>> engine retire we have nowadays it wouldn't be a problem.
>>
> 
> The main advantage of instrumenting the execlists submission code at a
> low level is that it gives us visibility over the number of ELSP ports
> pending execution, which can cause the performance of the workload to be
> substantially more or less latency-sensitive.  GuC submission shouldn't
> care about this variable, so it kind of makes sense for its behavior to
> be slightly different.
> 
> Anyway if we're willing to give up the accuracy of keeping track of this
> at a low level (and give GuC submission exactly the same treatment) it
> should be possible to move the tracking one level up.

The results you got are certainly extremely attractive and the approach 
and code looks tidy and mature - just so you don't get me wrong that I 
am not objecting to the idea.

What I'd like to see is an easier to read breakdown of results, at 
minimum with separate perf and perf-per-Watt results. A graph with 
sorted results and error bars would also be nice.

Secondly in in the commit message of this particular patch I'd like to 
read some more thought about why ELSP[1] occupancy is thought to be the 
desired signal. Why for instance a deep ELSP[0] shouldn't benefit from 
more TDP budget towards the GPU and similar.

Also a description of the control processing "rf_qos" function do with 
this signal. What and why.

Some time ago we entertained the idea of GPU "load average", where that 
was defined as a count of runnable requests (so batch buffers). How 
that, more generic metric, would behave here if used as an input signal 
really intrigues me. Sadly I don't have a patch ready to give to you and 
ask to please test it.

Or maybe the key is count of runnable contexts as opposed to requests, 
which would more match the ELSP[1] idea.

But this is secondary, I primarily think we need to see a better 
presentation of the result and the theory of operation explained better 
in the commit message.

Regards,

Tvrtko

WARNING: multiple messages have this Message-ID (diff)
From: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
To: Francisco Jerez <currojerez@riseup.net>,
	Chris Wilson <chris@chris-wilson.co.uk>,
	intel-gfx@lists.freedesktop.org, linux-pm@vger.kernel.org
Cc: Peter Zijlstra <peterz@infradead.org>,
	"Rafael J. Wysocki" <rjw@rjwysocki.net>,
	"Pandruvada, Srinivas" <srinivas.pandruvada@intel.com>
Subject: Re: [Intel-gfx] [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load.
Date: Thu, 12 Mar 2020 11:52:37 +0000	[thread overview]
Message-ID: <36854a07-2dd5-694c-4b9d-a3eddf8e33f9@linux.intel.com> (raw)
In-Reply-To: <878sk6acqs.fsf@riseup.net>


On 11/03/2020 19:54, Francisco Jerez wrote:
> Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> writes:
> 
>> On 10/03/2020 22:26, Chris Wilson wrote:
>>> Quoting Francisco Jerez (2020-03-10 21:41:55)
>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>> index b9b3f78f1324..a5d7a80b826d 100644
>>>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>> @@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
>>>>           /* we need to manually load the submit queue */
>>>>           if (execlists->ctrl_reg)
>>>>                   writel(EL_CTRL_LOAD, execlists->ctrl_reg);
>>>> +
>>>> +       if (execlists_num_ports(execlists) > 1 &&
>>> pending[1] is always defined, the minimum submission is one slot, with
>>> pending[1] as the sentinel NULL.
>>>
>>>> +           execlists->pending[1] &&
>>>> +           !atomic_xchg(&execlists->overload, 1))
>>>> +               intel_gt_pm_active_begin(&engine->i915->gt);
>>>
>>> engine->gt
>>>
>>>>    }
>>>>    
>>>>    static bool ctx_single_port_submission(const struct intel_context *ce)
>>>> @@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
>>>>           clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
>>>>    
>>>>           WRITE_ONCE(execlists->active, execlists->inflight);
>>>> +
>>>> +       if (atomic_xchg(&execlists->overload, 0)) {
>>>> +               struct intel_engine_cs *engine =
>>>> +                       container_of(execlists, typeof(*engine), execlists);
>>>> +               intel_gt_pm_active_end(&engine->i915->gt);
>>>> +       }
>>>>    }
>>>>    
>>>>    static inline void
>>>> @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
>>>>                           /* port0 completed, advanced to port1 */
>>>>                           trace_ports(execlists, "completed", execlists->active);
>>>>    
>>>> +                       if (atomic_xchg(&execlists->overload, 0))
>>>> +                               intel_gt_pm_active_end(&engine->i915->gt);
>>>
>>> So this looses track if we preempt a dual-ELSP submission with a
>>> single-ELSP submission (and never go back to dual).
>>>
>>> If you move this to the end of the loop and check
>>>
>>> if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
>>> 	intel_gt_pm_active_end(engine->gt);
>>>
>>> so that it covers both preemption/promotion and completion.
>>>
>>> However, that will fluctuate quite rapidly. (And runs the risk of
>>> exceeding the sentinel.)
>>>
>>> An alternative approach would be to couple along
>>> schedule_in/schedule_out
>>>
>>> atomic_set(overload, -1);
>>>
>>> __execlists_schedule_in:
>>> 	if (!atomic_fetch_inc(overload)
>>> 		intel_gt_pm_active_begin(engine->gt);
>>> __execlists_schedule_out:
>>> 	if (!atomic_dec_return(overload)
>>> 		intel_gt_pm_active_end(engine->gt);
>>>
>>> which would mean we are overloaded as soon as we try to submit an
>>> overlapping ELSP.
>>
>> Putting it this low-level into submission code also would not work well
>> with GuC.
>>
> 
> I wrote a patch at some point that added calls to
> intel_gt_pm_active_begin() and intel_gt_pm_active_end() to the GuC
> submission code in order to obtain a similar effect.  However people
> requested me to leave GuC submission alone for the moment in order to
> avoid interference with SLPC.  At some point it might make sense to hook
> this up in combination with SLPC, because SLPC doesn't provide much of a
> CPU energy efficiency advantage in comparison to this series.
> 
>> How about we try to keep some accounting one level higher, as the i915
>> scheduler is passing requests on to the backend for execution?
>>
>> Or number of runnable contexts, if the distinction between contexts and
>> requests is better for this purpose.
>>
>> Problematic bit in going one level higher though is that the exit point
>> is less precisely coupled to the actual state. Or maybe with aggressive
>> engine retire we have nowadays it wouldn't be a problem.
>>
> 
> The main advantage of instrumenting the execlists submission code at a
> low level is that it gives us visibility over the number of ELSP ports
> pending execution, which can cause the performance of the workload to be
> substantially more or less latency-sensitive.  GuC submission shouldn't
> care about this variable, so it kind of makes sense for its behavior to
> be slightly different.
> 
> Anyway if we're willing to give up the accuracy of keeping track of this
> at a low level (and give GuC submission exactly the same treatment) it
> should be possible to move the tracking one level up.

The results you got are certainly extremely attractive and the approach 
and code looks tidy and mature - just so you don't get me wrong that I 
am not objecting to the idea.

What I'd like to see is an easier to read breakdown of results, at 
minimum with separate perf and perf-per-Watt results. A graph with 
sorted results and error bars would also be nice.

Secondly in in the commit message of this particular patch I'd like to 
read some more thought about why ELSP[1] occupancy is thought to be the 
desired signal. Why for instance a deep ELSP[0] shouldn't benefit from 
more TDP budget towards the GPU and similar.

Also a description of the control processing "rf_qos" function do with 
this signal. What and why.

Some time ago we entertained the idea of GPU "load average", where that 
was defined as a count of runnable requests (so batch buffers). How 
that, more generic metric, would behave here if used as an input signal 
really intrigues me. Sadly I don't have a patch ready to give to you and 
ask to please test it.

Or maybe the key is count of runnable contexts as opposed to requests, 
which would more match the ELSP[1] idea.

But this is secondary, I primarily think we need to see a better 
presentation of the result and the theory of operation explained better 
in the commit message.

Regards,

Tvrtko
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

  reply	other threads:[~2020-03-12 11:52 UTC|newest]

Thread overview: 85+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-03-10 21:41 [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Francisco Jerez
2020-03-10 21:41 ` [Intel-gfx] " Francisco Jerez
2020-03-10 21:41 ` [PATCH 01/10] PM: QoS: Add CPU_RESPONSE_FREQUENCY global PM QoS limit Francisco Jerez
2020-03-10 21:41   ` [Intel-gfx] " Francisco Jerez
2020-03-11 12:42   ` Peter Zijlstra
2020-03-11 12:42     ` [Intel-gfx] " Peter Zijlstra
2020-03-11 19:23     ` Francisco Jerez
2020-03-11 19:23       ` [Intel-gfx] " Francisco Jerez
2020-03-11 19:23       ` [PATCHv2 " Francisco Jerez
2020-03-11 19:23         ` [Intel-gfx] " Francisco Jerez
2020-03-19 10:25         ` Rafael J. Wysocki
2020-03-19 10:25           ` [Intel-gfx] " Rafael J. Wysocki
2020-03-10 21:41 ` [PATCH 02/10] drm/i915: Adjust PM QoS response frequency based on GPU load Francisco Jerez
2020-03-10 21:41   ` [Intel-gfx] " Francisco Jerez
2020-03-10 22:26   ` Chris Wilson
2020-03-10 22:26     ` Chris Wilson
2020-03-11  0:34     ` Francisco Jerez
2020-03-11  0:34       ` Francisco Jerez
2020-03-18 19:42       ` Francisco Jerez
2020-03-18 19:42         ` Francisco Jerez
2020-03-20  2:46         ` Francisco Jerez
2020-03-20  2:46           ` Francisco Jerez
2020-03-20 10:06           ` Chris Wilson
2020-03-20 10:06             ` Chris Wilson
2020-03-11 10:00     ` Tvrtko Ursulin
2020-03-11 10:00       ` Tvrtko Ursulin
2020-03-11 10:21       ` Chris Wilson
2020-03-11 10:21         ` Chris Wilson
2020-03-11 19:54       ` Francisco Jerez
2020-03-11 19:54         ` Francisco Jerez
2020-03-12 11:52         ` Tvrtko Ursulin [this message]
2020-03-12 11:52           ` Tvrtko Ursulin
2020-03-13  7:39           ` Francisco Jerez
2020-03-13  7:39             ` Francisco Jerez
2020-03-16 20:54             ` Francisco Jerez
2020-03-16 20:54               ` Francisco Jerez
2020-03-10 21:41 ` [PATCH 03/10] OPTIONAL: drm/i915: Expose PM QoS control parameters via debugfs Francisco Jerez
2020-03-10 21:41   ` [Intel-gfx] " Francisco Jerez
2020-03-10 21:41 ` [PATCH 04/10] Revert "cpufreq: intel_pstate: Drop ->update_util from pstate_funcs" Francisco Jerez
2020-03-10 21:41   ` [Intel-gfx] " Francisco Jerez
2020-03-19 10:45   ` Rafael J. Wysocki
2020-03-19 10:45     ` [Intel-gfx] " Rafael J. Wysocki
2020-03-10 21:41 ` [PATCH 05/10] cpufreq: intel_pstate: Implement VLP controller statistics and status calculation Francisco Jerez
2020-03-10 21:41   ` [Intel-gfx] " Francisco Jerez
2020-03-19 11:06   ` Rafael J. Wysocki
2020-03-19 11:06     ` [Intel-gfx] " Rafael J. Wysocki
2020-03-10 21:41 ` [PATCH 06/10] cpufreq: intel_pstate: Implement VLP controller target P-state range estimation Francisco Jerez
2020-03-10 21:41   ` [Intel-gfx] " Francisco Jerez
2020-03-19 11:12   ` Rafael J. Wysocki
2020-03-19 11:12     ` [Intel-gfx] " Rafael J. Wysocki
2020-03-10 21:42 ` [PATCH 07/10] cpufreq: intel_pstate: Implement VLP controller for HWP parts Francisco Jerez
2020-03-10 21:42   ` [Intel-gfx] " Francisco Jerez
2020-03-17 23:59   ` Pandruvada, Srinivas
2020-03-17 23:59     ` [Intel-gfx] " Pandruvada, Srinivas
2020-03-18 19:51     ` Francisco Jerez
2020-03-18 19:51       ` [Intel-gfx] " Francisco Jerez
2020-03-18 20:10       ` Pandruvada, Srinivas
2020-03-18 20:10         ` [Intel-gfx] " Pandruvada, Srinivas
2020-03-18 20:22         ` Francisco Jerez
2020-03-18 20:22           ` [Intel-gfx] " Francisco Jerez
2020-03-23 20:13           ` Pandruvada, Srinivas
2020-03-23 20:13             ` [Intel-gfx] " Pandruvada, Srinivas
2020-03-10 21:42 ` [PATCH 08/10] cpufreq: intel_pstate: Enable VLP controller based on ACPI FADT profile and CPUID Francisco Jerez
2020-03-10 21:42   ` [Intel-gfx] " Francisco Jerez
2020-03-19 11:20   ` Rafael J. Wysocki
2020-03-19 11:20     ` [Intel-gfx] " Rafael J. Wysocki
2020-03-10 21:42 ` [PATCH 09/10] OPTIONAL: cpufreq: intel_pstate: Add tracing of VLP controller status Francisco Jerez
2020-03-10 21:42   ` [Intel-gfx] " Francisco Jerez
2020-03-10 21:42 ` [PATCH 10/10] OPTIONAL: cpufreq: intel_pstate: Expose VLP controller parameters via debugfs Francisco Jerez
2020-03-10 21:42   ` [Intel-gfx] " Francisco Jerez
2020-03-11  2:35 ` [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Pandruvada, Srinivas
2020-03-11  2:35   ` [Intel-gfx] " Pandruvada, Srinivas
2020-03-11  3:55   ` Francisco Jerez
2020-03-11  3:55     ` [Intel-gfx] " Francisco Jerez
2020-03-11  4:25 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for " Patchwork
2020-03-12  2:31 ` [Intel-gfx] ✗ Fi.CI.BUILD: failure for GPU-bound energy efficiency improvements for the intel_pstate driver (v2). (rev2) Patchwork
2020-03-12  2:32 ` Patchwork
2020-03-23 23:29 ` [RFC] GPU-bound energy efficiency improvements for the intel_pstate driver (v2) Pandruvada, Srinivas
2020-03-23 23:29   ` [Intel-gfx] " Pandruvada, Srinivas
2020-03-24  0:23   ` Francisco Jerez
2020-03-24  0:23     ` [Intel-gfx] " Francisco Jerez
2020-03-24 19:16     ` Francisco Jerez
2020-03-24 19:16       ` [Intel-gfx] " Francisco Jerez
2020-03-24 20:03       ` Pandruvada, Srinivas
2020-03-24 20:03         ` [Intel-gfx] " Pandruvada, Srinivas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=36854a07-2dd5-694c-4b9d-a3eddf8e33f9@linux.intel.com \
    --to=tvrtko.ursulin@linux.intel.com \
    --cc=chris@chris-wilson.co.uk \
    --cc=currojerez@riseup.net \
    --cc=intel-gfx@lists.freedesktop.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=rjw@rjwysocki.net \
    --cc=srinivas.pandruvada@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.