From: Chris Wilson <chris@chris-wilson.co.uk>
To: Thomas Hellström (Intel) <thomas_os@shipmail.org>,
	intel-gfx@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH 26/28] drm/i915: Fair low-latency scheduling
Date: Tue, 16 Jun 2020 13:44:29 +0100	[thread overview]
Message-ID: <159231146971.18853.4423648105222638065@build.alporthouse.com> (raw)
In-Reply-To: <623f4f89-1377-d18d-f611-7126a08ec85c@shipmail.org>

Quoting Thomas Hellström (Intel) (2020-06-16 13:11:25)
> Hi,
> 
> On 6/16/20 12:12 PM, Chris Wilson wrote:
> > Quoting Thomas Hellström (Intel) (2020-06-16 10:07:28)
> >> Hi, Chris,
> >>
> >> Some comments and questions:
> >>
> >> On 6/8/20 12:21 AM, Chris Wilson wrote:
> >>> The first "scheduler" was a topographical sorting of requests into
> >>> priority order. The execution order was deterministic, the earliest
> >>> submitted, highest priority request would be executed first. Priority
> >>> inherited ensured that inversions were kept at bay, and allowed us to
> >>> dynamically boost priorities (e.g. for interactive pageflips).
> >>>
> >>> The minimalistic timeslicing scheme was an attempt to introduce fairness
> >>> between long running requests, by evicting the active request at the end
> >>> of a timeslice and moving it to the back of its priority queue (while
> >>> ensuring that dependencies were kept in order). For short running
> >>> requests from many clients of equal priority, the scheme is still very
> >>> much FIFO submission ordering, and as unfair as before.
> >>>
> >>> To impose fairness, we need an external metric that ensures that clients
> >>> are interspersed, so that we don't execute one long chain from client A
> >>> before executing any of client B. This could be imposed by the clients
> >>> themselves by using fences based on an external clock, that is, they only
> >>> submit work for a "frame" at the frame interval, instead of submitting as
> >>> much work as they are able to. The standard SwapBuffers approach is akin
> >>> to double buffering, where, as one frame is being executed, the next is
> >>> being submitted, such that there is always a maximum of two frames per
> >>> client in the pipeline. Even this scheme exhibits unfairness under load,
> >>> as a single client will execute two frames back to back before the next,
> >>> and with enough clients, deadlines will be missed.
> >>>
> >>> The idea introduced by BFS/MuQSS is that fairness is achieved by
> >>> metering with an external clock. Every request, when it becomes ready to
> >>> execute, is assigned a virtual deadline, and execution order is then
> >>> determined by earliest deadline. Priority is used as a hint, rather than
> >>> for strict ordering, where high priority requests have earlier deadlines,
> >>> but not necessarily earlier than outstanding work. Thus work is executed
> >>> in order of 'readiness', with timeslicing to demote long running work.
> >>>
> >>> The Achilles' heel of this scheduler is its strong preference for
> >>> low-latency and favouring of new queues. Whereas it was easy to dominate
> >>> the old scheduler by flooding it with many requests over a short period
> >>> of time, the new scheduler can be dominated by a 'synchronous' client
> >>> that waits for each of its requests to complete before submitting the
> >>> next. As such a client has no history, it is always considered
> >>> ready-to-run and receives an earlier deadline than the long running
> >>> requests.
> >>>
> >>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>> ---
> >>>    drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  12 +-
> >>>    .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
> >>>    drivers/gpu/drm/i915/gt/intel_engine_pm.c     |   4 +-
> >>>    drivers/gpu/drm/i915/gt/intel_engine_types.h  |  24 --
> >>>    drivers/gpu/drm/i915/gt/intel_lrc.c           | 328 +++++++-----------
> >>>    drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   5 +-
> >>>    drivers/gpu/drm/i915/gt/selftest_lrc.c        |  43 ++-
> >>>    .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   6 +-
> >>>    drivers/gpu/drm/i915/i915_priolist_types.h    |   7 +-
> >>>    drivers/gpu/drm/i915/i915_request.h           |   4 +-
> >>>    drivers/gpu/drm/i915/i915_scheduler.c         | 322 ++++++++++++-----
> >>>    drivers/gpu/drm/i915/i915_scheduler.h         |  22 +-
> >>>    drivers/gpu/drm/i915/i915_scheduler_types.h   |  17 +
> >>>    .../drm/i915/selftests/i915_mock_selftests.h  |   1 +
> >>>    drivers/gpu/drm/i915/selftests/i915_request.c |   1 +
> >>>    .../gpu/drm/i915/selftests/i915_scheduler.c   |  49 +++
> >>>    16 files changed, 484 insertions(+), 362 deletions(-)
> >>>    create mode 100644 drivers/gpu/drm/i915/selftests/i915_scheduler.c
> >> Do we have timings to back this change up? Would it make sense to have a
> >> configurable scheduler choice?
> > Yes, there's igt/benchmarks/gem_wsim to show the impact on scheduling
> > decisions for various workloads. (You can guess what the impact of
> > choosing a different execution order and forcing more context switches
> > will be... About -1% to throughput with multiple clients.) And
> > igt/tests/gem_exec_schedule to test basic properties, with a bunch of new
> > fairness tests to try and decide if this is the right thing. Under
> > saturated conditions, there is no contest: a fair scheduler produces
> > consistent results, and the vdeadlines allow for realtime response under
> > load.
> 
> Yeah, it's not really to convince me, but to provide a reference for the 
> future.
> 
> Perhaps add the gem_wsim timings to the commit message?

I don't like posting such benchmarks without saying how they can be
reproduced or providing absolute values to verify future runs. Our rules
are terrible.

This trimmed down set takes about a day to run, and we've yet to
convince people that this is a fundamental requirement for CI. So it's
really frustrating; the best we can try and do is distill the essential
requirements into a pass/fail test to be run on debug kernels. At least
we cover the pathological exploits, except where they are so bad that CI
complains about them taking too long.

> >> There are multiple introductions of ktime_get() in the patch. Perhaps
> >> use a monotonic clock source like ktime_get_raw()? Also immediately
> >> convert to ns.
> > ktime_get() is monotonic. The only difference is that tkr_mono has a
> > wall offset that tkr_raw does not. [I'm sure there's a good reason.] The
> > choice is really whether ktime_get_(mono|raw)_fast_ns() is sufficient for
> > our needs.
> 
> Hmm. Yes, you're right. I was thinking about the NTP adjustments. But
> given the requirement below they might be unimportant.

I never know which is the right one to use :|

Just follow the guideline that the shortest function name is the
intended one to use, unless you understand why you should use one of the
others.
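
For the deadline bookkeeping all we really need is a monotonic
nanosecond counter, e.g. a minimal sketch (illustrative only, not the
patch's code; the helper name is made up):

	#include <linux/ktime.h>

	/* Stamp a request's virtual deadline in monotonic nanoseconds. */
	static u64 example_deadline_ns(u64 slice_ns)
	{
		return ktime_get_ns() + slice_ns;
	}

with ktime_get_raw_ns() or the _fast_ns() variants as drop-in
alternatives should the NTP adjustments ever matter.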

> > I do like the idea of having the deadline be some recognisable
> > timestamp, as it makes it easier to play with mixing in real, albeit
> > soft, deadlines.
> 
> 
> >
> >>> @@ -2837,10 +2788,7 @@ static void __execlists_unhold(struct i915_request *rq)
> >>>                GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
> >>>    
> >>>                i915_request_clear_hold(rq);
> >>> -             list_move_tail(&rq->sched.link,
> >>> -                            i915_sched_lookup_priolist(rq->engine,
> >>> -                                                       rq_prio(rq)));
> >>> -             set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> >>> +             submit |= intel_engine_queue_request(rq->engine, rq);
> >> Being new to this codebase, I immediately wonder whether that bitwise OR
> >> is intentional and whether you got the short-circuiting right. It looks
> >> correct to me.
> > bool submit, not many bits :)
> 
> Yes, the code is correct. My question was about whether it is accepted
> practice, considering that a future reader may think it was a mistake
> and change it anyway...

submit |= vs if () submit = true

I feel I have used both variants in this patch.
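
For reference, the two spellings side by side (equivalent for a bool
accumulator; |= is not short-circuiting, so the queue call is always
evaluated either way, which is what we want here):

	bool submit = false;

	/* variant 1: accumulate the result */
	submit |= intel_engine_queue_request(rq->engine, rq);

	/* variant 2: explicit branch */
	if (intel_engine_queue_request(rq->engine, rq))
		submit = true;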

> >>>    {
> >>>        /*
> >>>         * When in use during submission, we are protected by a guarantee that
> >>> diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
> >>> index 4c189b81cc62..30bcb6f9d99f 100644
> >>> --- a/drivers/gpu/drm/i915/i915_scheduler.c
> >>> +++ b/drivers/gpu/drm/i915/i915_scheduler.c
> >>> @@ -20,6 +20,11 @@ static struct i915_global_scheduler {
> >>>    static DEFINE_SPINLOCK(ipi_lock);
> >>>    static LIST_HEAD(ipi_list);
> >>>    
> >>> +static inline u64 rq_deadline(const struct i915_request *rq)
> >>> +{
> >>> +     return READ_ONCE(rq->sched.deadline);
> >>> +}
> >>> +
> >> Does this need a release barrier paired with an acquire barrier in
> >> __i915_request_set_deadline below?
> > No, the state can be inconsistent. If it changes while we are processing
> > the previous value, there will be another reschedule. Within
> > set_deadline, rq->sched.deadline is under the engine->active.lock; it is
> > just that rq_deadline() is used to peek before we take the lock, as well
> > as serving as shorthand within the critical section.
> >   
> 
> OK, understood.
> 
> 
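To spell out the peek-then-recheck pattern above (a rough sketch only,
not the actual __i915_request_set_deadline(); the helper name is
invented):

	static void example_set_deadline(struct i915_request *rq, u64 deadline)
	{
		/* Lockless peek: may be stale, used only as an early-out hint. */
		if (rq_deadline(rq) == deadline)
			return;

		spin_lock_irq(&rq->engine->active.lock);
		/* rq->sched.deadline is stable under engine->active.lock. */
		WRITE_ONCE(rq->sched.deadline, deadline);
		spin_unlock_irq(&rq->engine->active.lock);
	}

Any update racing with the unlocked peek just results in another
reschedule, as described above, so no acquire/release pairing is needed.
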
> >>> +static u64 prio_slice(int prio)
> >>>    {
> >>> -     const struct i915_request *inflight;
> >>> +     u64 slice;
> >>> +     int sf;
> >>>    
> >>>        /*
> >>> -      * We only need to kick the tasklet once for the high priority
> >>> -      * new context we add into the queue.
> >>> +      * With a 1ms scheduling quantum:
> >>> +      *
> >>> +      *   MAX USER:  ~32us deadline
> >>> +      *   0:         ~16ms deadline
> >>> +      *   MIN_USER: 1000ms deadline
> >>>         */
> >>> -     if (prio <= engine->execlists.queue_priority_hint)
> >>> -             return;
> >>>    
> >>> -     rcu_read_lock();
> >>> +     if (prio >= __I915_PRIORITY_KERNEL__)
> >>> +             return INT_MAX - prio;
> >>>    
> >>> -     /* Nothing currently active? We're overdue for a submission! */
> >>> -     inflight = execlists_active(&engine->execlists);
> >>> -     if (!inflight)
> >>> -             goto unlock;
> >>> +     slice = __I915_PRIORITY_KERNEL__ - prio;
> >>> +     if (prio >= 0)
> >>> +             sf = 20 - 6;
> >>> +     else
> >>> +             sf = 20 - 1;
> >>> +
> >>> +     return slice << sf;
> >>> +}
> >>> +
> >> Is this the same deadline calculation as used in the BFS? Could you
> >> perhaps add a pointer to some documentation?
> > It is a heuristic. The scale factor in BFS is designed for a smaller
> > range and is not effective for passing our existing priority ordering
> > tests.
> >
> > The challenge is to pick something fair that roughly matches usage. It
> > basically says that if client A submits 3 requests, then clients B and C
> > will be able to run before the later requests of client A, so long as
> > they are submitted within 16ms. Currently we get AAABC; the vdeadlines
> > turn that into ABCAA. So we would ideally like the quota for each client
> > to reflect their needs, so if client A needed all 3 requests within 16ms,
> > it would have a vdeadline closer to 5ms (and so it would compete for the
> > GPU against other clients). Now with this less strict priority system we
> > can let normal userspace bump their priorities, or we can use the average
> > context runtime to try and adjust priorities on the fly (i.e. not use an
> > unbiased quota). But I suspect removing any fairness will skew the
> > scheduler once more.
> 
> OK, also for future reference, it would be good to have at least a 
> subset of this documented somewhere!

Yeah. I think prio_slice is the best spot to try and explain why
different priorities have different quotas, and the impact.
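
Something along these lines, perhaps (a rough sketch of the wording, the
numbers lifted from the patch's own table rather than new measurements):

	/*
	 * prio_slice() maps a priority to a virtual-deadline quota.  With a
	 * nominal 1ms scheduling quantum, priority 0 corresponds to roughly
	 * a 16ms deadline, the maximum user priority to ~32us and the
	 * minimum to ~1000ms.  Under load this interleaves clients of equal
	 * priority: if clients B and C each submit a request within client
	 * A's window, execution order becomes ABCAA rather than AAABC.
	 * Priority therefore biases deadlines (and hence ordering), but no
	 * longer imposes a strict order.
	 */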
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
