From: Matthew Brost <matthew.brost@intel.com>
To: Chris Wilson <chris@chris-wilson.co.uk>
Cc: intel-gfx@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH 36/56] drm/i915: Fair low-latency scheduling
Date: Thu, 7 Jan 2021 09:10:35 -0800	[thread overview]
Message-ID: <20210107171035.GA9023@sdutt-i7> (raw)
In-Reply-To: <161003793126.28368.13767958122371874524@build.alporthouse.com>

On Thu, Jan 07, 2021 at 04:45:31PM +0000, Chris Wilson wrote:
> Quoting Matthew Brost (2021-01-07 16:05:07)
> > On Tue, Dec 29, 2020 at 12:01:25PM +0000, Chris Wilson wrote:
> > > The first "scheduler" was a topological sorting of requests into
> > > priority order. The execution order was deterministic: the earliest
> > > submitted, highest priority request would be executed first. Priority
> > > inheritance ensured that inversions were kept at bay, and allowed us to
> > > dynamically boost priorities (e.g. for interactive pageflips).
> > > 
> > > The minimalistic timeslicing scheme was an attempt to introduce fairness
> > > between long-running requests, by evicting the active request at the end
> > > of a timeslice and moving it to the back of its priority queue (while
> > > ensuring that dependencies were kept in order). For short-running
> > > requests from many clients of equal priority, the scheme is still very
> > > much FIFO submission ordering, and as unfair as before.
> > > 
> > > To impose fairness, we need an external metric that ensures that clients
> > > are interspersed, so we don't execute one long chain from client A before
> > > executing any of client B. This could be imposed by the clients
> > > themselves by using fences based on an external clock, that is, they only
> > > submit work for a "frame" at frame intervals, instead of submitting as
> > > much work as they are able to. The standard SwapBuffers approach is akin
> > > to double buffering: while one frame is being executed, the next is
> > > being submitted, such that there are at most two frames per client in
> > > the pipeline, which ideally maintains consistent input-output latency.
> > > Even this scheme exhibits unfairness under load, as a single client will
> > > execute two frames back-to-back before the next, and with enough
> > > clients, deadlines will be missed.
> > > 
> > > The idea introduced by BFS/MuQSS is that fairness is introduced by
> > > metering with an external clock. Every request, when it becomes ready to
> > > execute, is assigned a virtual deadline, and execution order is then
> > > determined by earliest deadline. Priority is used as a hint, rather than
> > > strict ordering, where high priority requests have earlier deadlines,
> > > but not necessarily earlier than outstanding work. Thus work is executed
> > > in order of 'readiness', with timeslicing to demote long-running work.
> > > 
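To make the deadline rule concrete, here is a standalone toy in plain C;
every name below is made up for illustration, and it only sketches the
ordering, not the code in the patch:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct request {
	uint64_t deadline;	/* virtual deadline, in arbitrary ticks */
	int prio;		/* priority hint */
	const char *name;
};

/* Higher priority earns an earlier deadline, but only as a hint. */
static uint64_t virtual_deadline(uint64_t now, int prio, uint64_t slice)
{
	return now + (prio >= 0 ? slice >> prio : slice << -prio);
}

/* Execution order is simply earliest deadline first. */
static int by_deadline(const void *a, const void *b)
{
	const struct request *ra = a, *rb = b;

	return (ra->deadline > rb->deadline) - (ra->deadline < rb->deadline);
}

int main(void)
{
	uint64_t now = 1000, slice = 8;
	struct request q[] = {
		{ virtual_deadline(now, 0, slice), 0, "client A, long chain" },
		{ virtual_deadline(now, 1, slice), 1, "pageflip boost" },
		{ virtual_deadline(now, 0, slice), 0, "client B" },
	};

	qsort(q, sizeof(q) / sizeof(q[0]), sizeof(q[0]), by_deadline);
	for (size_t i = 0; i < sizeof(q) / sizeof(q[0]); i++)
		printf("%s (deadline %llu)\n", q[i].name,
		       (unsigned long long)q[i].deadline);
	return 0;
}

The series itself keys its existing topological sort by these deadlines
rather than sorting a flat array, but the ordering rule is the same.
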
> > > The Achilles' heel of this scheduler is its strong preference for
> > > low-latency and favouring of new queues. Whereas it was easy to dominate
> > > the old scheduler by flooding it with many requests over a short period
> > > of time, the new scheduler can be dominated by a 'synchronous' client
> > > that waits for each of its requests to complete before submitting the
> > > next. As such a client has no history, it is always considered
> > > ready-to-run and receives an earlier deadline than the long-running
> > > requests. This is compensated for by refreshing the current execution's
> > > deadline and by disallowing preemption for timeslice shuffling.
> > > 
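The compensation in that last sentence can be pictured the same way, again
with made-up names: refreshing the incumbent's deadline on timeslice expiry
keeps it level with a freshly-ready synchronous client, and the tie then
goes to the incumbent because preemption merely to shuffle timeslices is
disallowed:

#include <stdint.h>
#include <stdio.h>

struct request {
	uint64_t deadline;
};

/* On timeslice expiry, push the incumbent back by one quantum. */
static void timeslice_expired(struct request *active, uint64_t now,
			      uint64_t slice)
{
	active->deadline = now + slice;
}

int main(void)
{
	uint64_t now = 104, slice = 8;
	struct request active = { .deadline = 100 };	  /* quantum ran out */
	struct request fresh = { .deadline = now + slice };	/* no history */

	timeslice_expired(&active, now, slice);

	/* ties keep the incumbent: no preemption for timeslice shuffling */
	printf("%s runs next\n",
	       active.deadline <= fresh.deadline ? "active" : "fresh");
	return 0;
}
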
> > > To check the impact on throughput (often the downfall of
> > > latency-sensitive schedulers), we used gem_wsim to simulate various
> > > transcode workloads with different load balancers, varying the number
> > > of competing [heterogeneous] clients.
> > > 
> > > +delta%------------------------------------------------------------------+
> > > |                                a                                       |
> > > |                                a                                       |
> > > |                                aa                                      |
> > > |                                aa                                      |
> > > |                                aa                                      |
> > > |                                aa                                      |
> > > |                               aaa                                      |
> > > |                              aaaa                                      |
> > > |                           a  aaaaa                                     |
> > > |                           a aaaaaa                                     |
> > > |a              aa   a      aaaaaaaaaa aa               a               a|
> > > |                                A_|                                     |
> > > +------------------------------------------------------------------------+
> > >    N          Min           Max        Median           Avg        Stddev
> > >  108   -23.982194     28.421527  -0.077474828  -0.072650418    0.16179718
> > > 
> > > The impact was on average ~0.1% under contention, due to the change in
> > > context execution order and number of context switches. The biggest
> > > swings are due to the execution ordering favouring one client or
> > > another, and there may be room for improvement.
> > > 
> > 
> > I haven't dug into this series deeply, but it does seem plausible for
> > execlists submission. However, this new scheduler seems completely
> > broken for GuC submission.
> 
> Underneath it's the same topological sort, just with a different key.
> At worst, that key can be the plain priority again. But that's pretty
> much immaterial in terms of deciding the order in which requests are
> transferred from one queue to another. Execution order wants to be based
> on resource prioritisation (beyond request prioritisation); ideally we
> would keep our fine-grained scheduling control.
> 

So maybe just a hook for the key?
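
Roughly something like this, as a sketch only; these names are invented
for illustration and are not the real i915 API:

#include <stdint.h>
#include <stdio.h>

struct request {
	uint64_t deadline;
	int prio;
};

/* execlists flavour: order by virtual deadline */
static uint64_t key_deadline(const struct request *rq)
{
	return rq->deadline;
}

/* GuC flavour: plain priority, negated so a smaller key still runs first */
static uint64_t key_priority(const struct request *rq)
{
	return (uint64_t)(INT32_MAX - rq->prio);
}

int main(void)
{
	/* the "hook": each backend installs its own key function */
	uint64_t (*key)(const struct request *) = key_priority;
	struct request a = { .deadline = 100, .prio = 2 };
	struct request b = { .deadline = 50, .prio = 0 };

	printf("run %c first\n", key(&a) < key(&b) ? 'a' : 'b');

	key = key_deadline;
	printf("run %c first\n", key(&a) < key(&b) ? 'a' : 'b');
	return 0;
}

Then execlists keys on the virtual deadline while GuC falls back to plain
priority, without the two forking into separate schedulers.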

> I'm not convinced by the state of the fw's scheduler...
>

Can't really argue here, but I think it would be really helpful if you
raised your concerns with the fw team so they can be addressed.
 
> > The scheduler really isn't used that much with GuC
> > submission, but when it is, the old one works perfectly. The vfunc for
> > the scheduler is deleted in patch #25. I don't think that is a good
> > idea, as it appears we are trending towards 2 separate schedulers.
> 
> You still use the i915_request_set_priority() entry point as the
> caller into the backend bypass.
>

I thought we wanted to avoid backend-specific branches, hence the vfuncs.
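
For the shape I mean, a minimal standalone sketch; none of these names are
the actual i915 structures:

#include <stdio.h>

struct request;

struct sched_ops {
	void (*set_priority)(struct request *rq, int prio);
};

struct request {
	const struct sched_ops *ops;
	int prio;
};

static void execlists_set_priority(struct request *rq, int prio)
{
	rq->prio = prio;	/* ...and rebuild the deadline ordering */
	printf("execlists: prio %d\n", prio);
}

static void guc_set_priority(struct request *rq, int prio)
{
	rq->prio = prio;	/* ...and tell the firmware */
	printf("guc: prio %d\n", prio);
}

/* the common entry point dispatches through the vfunc */
static void request_set_priority(struct request *rq, int prio)
{
	rq->ops->set_priority(rq, prio);	/* no if (is_guc) branches */
}

int main(void)
{
	static const struct sched_ops el = { execlists_set_priority };
	static const struct sched_ops guc = { guc_set_priority };
	struct request a = { &el, 0 }, b = { &guc, 0 };

	request_set_priority(&a, 2);
	request_set_priority(&b, 2);
	return 0;
}

The common entry point stays free of backend branches and each backend
installs its own hook.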

> Isn't the current guc.fw a second closed driver for the same HW anyway?
>

Not sure I follow this.
 
> > Lastly, this series seems to be on the wrong list; I believe it should
> > be posted elsewhere first.
> 
> It's the right place for upstreaming bug fixes on stable platforms. And
> after a year of being here, that's a moot point.

Wow, I feel your pain if this has been out here for a year. But it is
still the wrong place. I can and will help with the reviews if it is
posted in the correct spot.

Matt

> -Chris

Thread overview: 65+ messages
2020-12-29 12:00 [Intel-gfx] [PATCH 01/56] drm/i915/gt: Restore ce->signal flush before releasing virtual engine Chris Wilson
2020-12-29 12:00 ` [Intel-gfx] [PATCH 02/56] drm/i915/gt: Only retire on the last breadcrumb if the last request Chris Wilson
2020-12-29 12:00 ` [Intel-gfx] [PATCH 03/56] drm/i915/gt: Cancel submitted requests upon context reset Chris Wilson
2020-12-30 21:07   ` Mika Kuoppala
2020-12-29 12:00 ` [Intel-gfx] [PATCH 04/56] drm/i915/gt: Pull context closure check from request submit to schedule-in Chris Wilson
2020-12-29 12:00 ` [Intel-gfx] [PATCH 05/56] drm/i915/gem: Peek at the inflight context Chris Wilson
2020-12-29 12:00 ` [Intel-gfx] [PATCH 06/56] drm/i915: Mark up protected uses of 'i915_request_completed' Chris Wilson
2020-12-29 12:00 ` [Intel-gfx] [PATCH 07/56] drm/i915: Drop i915_request.lock serialisation around await_start Chris Wilson
2020-12-29 12:00 ` [Intel-gfx] [PATCH 08/56] drm/i915: Drop i915_request.lock requirement for intel_rps_boost() Chris Wilson
2020-12-29 12:00 ` [Intel-gfx] [PATCH 09/56] drm/i915/gem: Reduce ctx->engine_mutex for reading the clone source Chris Wilson
2020-12-29 12:00 ` [Intel-gfx] [PATCH 10/56] drm/i915/gem: Reduce ctx->engines_mutex for get_engines() Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 11/56] drm/i915: Reduce test_and_set_bit to set_bit in i915_request_submit() Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 12/56] drm/i915/gt: Drop atomic for engine->fw_active tracking Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 13/56] drm/i915/gt: Extract busy-stats for ring-scheduler Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 14/56] drm/i915/gt: Convert stats.active to plain unsigned int Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 15/56] drm/i915/gt: Do not suspend bonded requests if one hangs Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 16/56] drm/i915/gt: Remove timeslice suppression Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 17/56] drm/i915/gt: Skip over completed active execlists, again Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 18/56] drm/i915: Strip out internal priorities Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 19/56] drm/i915: Remove I915_USER_PRIORITY_SHIFT Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 20/56] drm/i915: Replace engine->schedule() with a known request operation Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 21/56] drm/i915: Teach the i915_dependency to use a double-lock Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 22/56] drm/i915: Restructure priority inheritance Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 23/56] drm/i915/selftests: Measure set-priority duration Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 24/56] drm/i915/selftests: Exercise priority inheritance around an engine loop Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 25/56] drm/i915: Improve DFS for priority inheritance Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 26/56] drm/i915: Extract request submission from execlists Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 27/56] drm/i915: Extract request rewinding " Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 28/56] drm/i915: Extract request suspension from the execlists backend Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 29/56] drm/i915: Extract the ability to defer and rerun a request later Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 30/56] drm/i915: Fix the iterative dfs for defering requests Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 31/56] drm/i915: Move common active lists from engine to i915_scheduler Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 32/56] drm/i915: Move scheduler queue Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 33/56] drm/i915: Move tasklet from execlists to sched Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 34/56] drm/i915: Replace priolist rbtree with a skiplist Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 35/56] drm/i915: Wrap cmpxchg64 with try_cmpxchg64() helper Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 36/56] drm/i915: Fair low-latency scheduling Chris Wilson
2021-01-07 16:05   ` Matthew Brost
2021-01-07 16:45     ` Chris Wilson
2021-01-07 17:10       ` Matthew Brost [this message]
2020-12-29 12:01 ` [Intel-gfx] [PATCH 37/56] drm/i915/gt: Specify a deadline for the heartbeat Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 38/56] drm/i915: Extend the priority boosting for the display with a deadline Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 39/56] drm/i915/gt: Support virtual engine queues Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 40/56] drm/i915: Move saturated workload detection back to the context Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 41/56] drm/i915: Bump default timeslicing quantum to 5ms Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 42/56] drm/i915/gt: Wrap intel_timeline.has_initial_breadcrumb Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 43/56] drm/i915/gt: Track timeline GGTT offset separately from subpage offset Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 44/56] drm/i915/gt: Add timeline "mode" Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 45/56] drm/i915/gt: Use indices for writing into relative timelines Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 46/56] drm/i915/selftests: Exercise relative timeline modes Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 47/56] drm/i915/gt: Use ppHWSP for unshared non-semaphore related timelines Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 48/56] Restore "drm/i915: drop engine_pin/unpin_breadcrumbs_irq" Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 49/56] drm/i915/gt: Couple tasklet scheduling for all CS interrupts Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 50/56] drm/i915/gt: Support creation of 'internal' rings Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 51/56] drm/i915/gt: Use client timeline address for seqno writes Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 52/56] drm/i915/gt: Infrastructure for ring scheduling Chris Wilson
2020-12-29 18:34   ` kernel test robot
2020-12-29 18:34     ` kernel test robot
2020-12-29 12:01 ` [Intel-gfx] [PATCH 53/56] drm/i915/gt: Enable busy-stats for ring-scheduler Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 54/56] drm/i915/gt: Implement ring scheduler for gen6/7 Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 55/56] drm/i915/gt: Enable ring scheduling " Chris Wilson
2020-12-29 12:01 ` [Intel-gfx] [PATCH 56/56] drm/i915/gt: Limit C-states while waiting for requests Chris Wilson
2020-12-29 12:20 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [01/56] drm/i915/gt: Restore ce->signal flush before releasing virtual engine Patchwork
2020-12-29 12:22 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2020-12-29 12:49 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
