Francisco Jerez writes:

> Francisco Jerez writes:
>
>> Chris Wilson writes:
>>
>>> Quoting Francisco Jerez (2020-03-10 21:41:55)
>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>> index b9b3f78f1324..a5d7a80b826d 100644
>>>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
>>>> @@ -1577,6 +1577,11 @@ static void execlists_submit_ports(struct intel_engine_cs *engine)
>>>>  	/* we need to manually load the submit queue */
>>>>  	if (execlists->ctrl_reg)
>>>>  		writel(EL_CTRL_LOAD, execlists->ctrl_reg);
>>>> +
>>>> +	if (execlists_num_ports(execlists) > 1 &&
>>> pending[1] is always defined, the minimum submission is one slot, with
>>> pending[1] as the sentinel NULL.
>>>
>>>> +	    execlists->pending[1] &&
>>>> +	    !atomic_xchg(&execlists->overload, 1))
>>>> +		intel_gt_pm_active_begin(&engine->i915->gt);
>>>
>>> engine->gt
>>>
>>
>> Applied your suggestions above locally, will probably wait to have a few
>> more changes batched up before sending a v2.
>>
>>>>  }
>>>>
>>>>  static bool ctx_single_port_submission(const struct intel_context *ce)
>>>> @@ -2213,6 +2218,12 @@ cancel_port_requests(struct intel_engine_execlists * const execlists)
>>>>  	clear_ports(execlists->inflight, ARRAY_SIZE(execlists->inflight));
>>>>
>>>>  	WRITE_ONCE(execlists->active, execlists->inflight);
>>>> +
>>>> +	if (atomic_xchg(&execlists->overload, 0)) {
>>>> +		struct intel_engine_cs *engine =
>>>> +			container_of(execlists, typeof(*engine), execlists);
>>>> +		intel_gt_pm_active_end(&engine->i915->gt);
>>>> +	}
>>>>  }
>>>>
>>>>  static inline void
>>>> @@ -2386,6 +2397,9 @@ static void process_csb(struct intel_engine_cs *engine)
>>>>  		/* port0 completed, advanced to port1 */
>>>>  		trace_ports(execlists, "completed", execlists->active);
>>>>
>>>> +		if (atomic_xchg(&execlists->overload, 0))
>>>> +			intel_gt_pm_active_end(&engine->i915->gt);
>>>
>>> So this loses track if we preempt a dual-ELSP submission with a
>>> single-ELSP submission (and never go back to dual).
>>>
>>
>> Yes, good point.  You're right that if a dual-ELSP submission gets
>> preempted by a single-ELSP submission "overload" will remain signaled
>> until the first completion interrupt arrives (e.g. from the preempting
>> submission).
>>
>>> If you move this to the end of the loop and check
>>>
>>> 	if (!execlists->active[1] && atomic_xchg(&execlists->overload, 0))
>>> 		intel_gt_pm_active_end(engine->gt);
>>>
>>> so that it covers both preemption/promotion and completion.
>>>
>>
>> That sounds reasonable.
>>
>>> However, that will fluctuate quite rapidly.  (And runs the risk of
>>> exceeding the sentinel.)
>>>
>>> An alternative approach would be to couple along
>>> schedule_in/schedule_out
>>>
>>> atomic_set(overload, -1);
>>>
>>> __execlists_schedule_in:
>>> 	if (!atomic_fetch_inc(overload))
>>> 		intel_gt_pm_active_begin(engine->gt);
>>>
>>> __execlists_schedule_out:
>>> 	if (!atomic_dec_return(overload))
>>> 		intel_gt_pm_active_end(engine->gt);
>>>
>>> which would mean we are overloaded as soon as we try to submit an
>>> overlapping ELSP.
>>>
>>
>> That sounds good to me too, and AFAICT would have roughly the same
>> behavior as this metric except for the preemption corner case you
>> mention above.  I'll try this and verify that I get approximately the
>> same performance numbers.
>>
>
> This suggestion seems to lead to some minor regressions, I'm
> investigating the issue.  Will send a v2 as soon as I have something
> along the lines of what you suggested running with equivalent
> performance to v1.
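For reference, here is roughly what the schedule_in/schedule_out
variant I've been benchmarking looks like.  Only a sketch: it assumes
the atomic_t "overload" field is initialised to -1 at engine setup, and
the helper names are hypothetical, called once per port entry from
__execlists_schedule_in() and __execlists_schedule_out() respectively:

| /* At engine setup: start at -1 so the counter only becomes
|  * non-negative once a second submission is in flight, i.e. as soon
|  * as we submit an overlapping ELSP.
|  */
| atomic_set(&engine->execlists.overload, -1);
|
| static void execlists_overload_in(struct intel_engine_cs *engine)
| {
| 	/* Old value 0 -> a second submission just went in flight. */
| 	if (!atomic_fetch_inc(&engine->execlists.overload))
| 		intel_gt_pm_active_begin(engine->gt);
| }
|
| static void execlists_overload_out(struct intel_engine_cs *engine)
| {
| 	/* New value 0 -> we dropped back to a single submission. */
| 	if (!atomic_dec_return(&engine->execlists.overload))
| 		intel_gt_pm_active_end(engine->gt);
| }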
I think I've figured out why both of the alternatives we were talking
about above lead to a couple percent regressions in latency-sensitive
workloads: In some scenarios it's possible for execlists_dequeue() to
execute after the GPU has gone idle, but before we've processed the
corresponding CSB entries, particularly when called from the
submit_queue() path.  In that case __execlists_schedule_in() will think
that the next request overlaps the previous one, and will tell CPU power
management to relax, even though the GPU is actually starving
intermittently.

How about we do the same:

| if (atomic_xchg(&execlists->overload, 0))
| 	intel_gt_pm_active_end(engine->gt);

as in this patch, from process_csb(), in response to each completion CSB
entry, which ensures that the system is considered non-GPU-bound as soon
as the first context completes.  Subsequently, if another CSB entry
signals a dual-ELSP active-to-idle transition or a dual-ELSP preemption,
we call intel_gt_pm_active_begin() directly from process_csb().  If we
hit a single-ELSP preemption CSB entry we call intel_gt_pm_active_end()
instead, in order to avoid the problem you pointed out in your previous
email.  (Rough sketch of what I mean below the sign-off.)

How does that sound to you?  [Still need to verify that it has
comparable performance to this patch overall.]

Thanks!
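P.S.: For concreteness, the process_csb() handling I have in mind is
roughly the following.  Untested sketch, not the actual v2: it reuses
the promotion/completion decoding the CSB loop already does, and
assumes it runs after execlists->active has been updated for the entry:

| if (promote) {
| 	if (execlists->active[1]) {
| 		/* Dual-ELSP preemption/promotion starts (or keeps)
| 		 * the overload period.
| 		 */
| 		if (!atomic_xchg(&execlists->overload, 1))
| 			intel_gt_pm_active_begin(engine->gt);
| 	} else {
| 		/* Single-ELSP preemption ends the overload period
| 		 * immediately, so we don't lose track when a dual-ELSP
| 		 * submission is preempted by a single-ELSP one.
| 		 */
| 		if (atomic_xchg(&execlists->overload, 0))
| 			intel_gt_pm_active_end(engine->gt);
| 	}
| } else {
| 	/* Completion: consider the system non-GPU-bound as soon as
| 	 * the first context completes, even if another one is still
| 	 * running.
| 	 */
| 	if (atomic_xchg(&execlists->overload, 0))
| 		intel_gt_pm_active_end(engine->gt);
| }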