From: Chris Wilson <chris@chris-wilson.co.uk>
To: Thomas Hellström (Intel) <thomas_os@shipmail.org>,
	intel-gfx@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH 26/28] drm/i915: Fair low-latency scheduling
Date: Tue, 16 Jun 2020 13:44:29 +0100	[thread overview]
Message-ID: <159231146971.18853.4423648105222638065@build.alporthouse.com> (raw)
In-Reply-To: <623f4f89-1377-d18d-f611-7126a08ec85c@shipmail.org>

Quoting Thomas Hellström (Intel) (2020-06-16 13:11:25)
> Hi,
> 
> On 6/16/20 12:12 PM, Chris Wilson wrote:
> > Quoting Thomas Hellström (Intel) (2020-06-16 10:07:28)
> >> Hi, Chris,
> >>
> >> Some comments and questions:
> >>
> >> On 6/8/20 12:21 AM, Chris Wilson wrote:
> >>> The first "scheduler" was a topographical sorting of requests into
> >>> priority order. The execution order was deterministic, the earliest
> >>> submitted, highest priority request would be executed first. Priority
> >>> inherited ensured that inversions were kept at bay, and allowed us to
> >>> dynamically boost priorities (e.g. for interactive pageflips).
> >>>
> >>> The minimalistic timeslicing scheme was an attempt to introduce fairness
> >>> between long running requests, by evicting the active request at the end
> >>> of a timeslice and moving it to the back of its priority queue (while
> >>> ensuring that dependencies were kept in order). For short running
> >>> requests from many clients of equal priority, the scheme is still very
> >>> much FIFO submission ordering, and as unfair as before.
> >>>
> >>> To impose fairness, we need an external metric that ensures that clients
> >>> are interspersed, so that we don't execute one long chain from client A
> >>> before executing any of client B. This could be imposed by the clients
> >>> themselves by using fences based on an external clock, that is, they only
> >>> submit work for a "frame" at the frame interval, instead of submitting as
> >>> much work as they are able to. The standard SwapBuffers approach is akin
> >>> to double buffering, where, as one frame is being executed, the next is
> >>> being submitted, such that there is always a maximum of two frames per
> >>> client in the pipeline. Even this scheme exhibits unfairness under load,
> >>> as a single client will execute two frames back to back before the next,
> >>> and with enough clients, deadlines will be missed.
> >>>
> >>> The idea introduced by BFS/MuQSS is that fairness is achieved by
> >>> metering with an external clock. Every request, when it becomes ready to
> >>> execute, is assigned a virtual deadline, and execution order is then
> >>> determined by earliest deadline. Priority is used as a hint, rather than
> >>> for strict ordering, where high priority requests have earlier deadlines,
> >>> but not necessarily earlier than outstanding work. Thus work is executed
> >>> in order of 'readiness', with timeslicing to demote long running work.
> >>>
> >>> The Achilles' heel of this scheduler is its strong preference for
> >>> low-latency and favouring of new queues. Whereas it was easy to dominate
> >>> the old scheduler by flooding it with many requests over a short period
> >>> of time, the new scheduler can be dominated by a 'synchronous' client
> >>> that waits for each of its requests to complete before submitting the
> >>> next. As such a client has no history, it is always considered
> >>> ready-to-run and receives an earlier deadline than the long running
> >>> requests.
> >>>
> >>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>> ---
> >>>    drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  12 +-
> >>>    .../gpu/drm/i915/gt/intel_engine_heartbeat.c  |   1 +
> >>>    drivers/gpu/drm/i915/gt/intel_engine_pm.c     |   4 +-
> >>>    drivers/gpu/drm/i915/gt/intel_engine_types.h  |  24 --
> >>>    drivers/gpu/drm/i915/gt/intel_lrc.c           | 328 +++++++-----------
> >>>    drivers/gpu/drm/i915/gt/selftest_hangcheck.c  |   5 +-
> >>>    drivers/gpu/drm/i915/gt/selftest_lrc.c        |  43 ++-
> >>>    .../gpu/drm/i915/gt/uc/intel_guc_submission.c |   6 +-
> >>>    drivers/gpu/drm/i915/i915_priolist_types.h    |   7 +-
> >>>    drivers/gpu/drm/i915/i915_request.h           |   4 +-
> >>>    drivers/gpu/drm/i915/i915_scheduler.c         | 322 ++++++++++++-----
> >>>    drivers/gpu/drm/i915/i915_scheduler.h         |  22 +-
> >>>    drivers/gpu/drm/i915/i915_scheduler_types.h   |  17 +
> >>>    .../drm/i915/selftests/i915_mock_selftests.h  |   1 +
> >>>    drivers/gpu/drm/i915/selftests/i915_request.c |   1 +
> >>>    .../gpu/drm/i915/selftests/i915_scheduler.c   |  49 +++
> >>>    16 files changed, 484 insertions(+), 362 deletions(-)
> >>>    create mode 100644 drivers/gpu/drm/i915/selftests/i915_scheduler.c
> >> Do we have timings to back this change up? Would it make sense to have a
> >> configurable scheduler choice?
> > Yes, there's igt/benchmarks/gem_wsim to show the impact on scheduling
> > decisions for various workloads. (You can guess what the impact of
> > choosing a different execution order and forcing more context switches
> > will be... About -1% to throughput with multiple clients.) And
> > igt/tests/gem_exec_schedule to test basic properties, with a bunch of new
> > fairness tests to try and decide if this is the right thing. Under
> > saturated conditions, there is no contest: a fair scheduler produces
> > consistent results, and the vdeadlines allow for realtime response under
> > load.
> 
> Yeah, it's not really to convince me, but to provide a reference for the 
> future.
> 
> Perhaps add the gem_wsim timings to the commit message?

I don't like posting such benchmarks without saying how they can be
reproduced or providing absolute values to verify future runs. Our rules
are terrible.

This trimmed down set takes about a day to run, and we've yet to
convince people that this is a fundamental requirement for CI. So it's
really frustrating; the best we can try and do is distill the essential
requirements into a pass/fail test to be run on debug kernels. At least
we cover the pathological exploits, except where they are so bad that CI
complains about them taking too long.

> >> There are multiple introductions of ktime_get() in the patch. Perhaps
> >> use a monotonic clock source like ktime_get_raw()? Also immediately
> >> convert to ns.
> > ktime_get() is monotonic. The only difference is that tkr_mono has a
> > wall offset that tkr_raw does not. [I'm sure there's a good reason.] The
> > choice is really whether ktime_get_(mono|raw)_fast_ns() is sufficient for
> > our needs.
> 
> Hmm. Yes, you're right. I was thinking about the NTP adjustments. But
> given the requirement below they might be unimportant.

I never know which is the right one to use :|

Just follow the guideline that the shortest function name is the
intended one to use, unless you understand why you should use one of the
others.
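
For the deadline bookkeeping all we really need is a monotonic
nanosecond counter, e.g. a minimal sketch (illustrative only, not the
patch's code; the helper name is made up):

	#include <linux/ktime.h>

	/* Stamp a request's virtual deadline in monotonic nanoseconds. */
	static u64 example_deadline_ns(u64 slice_ns)
	{
		return ktime_get_ns() + slice_ns;
	}

with ktime_get_raw_ns() or the _fast_ns() variants as drop-in
alternatives should the NTP adjustments ever matter.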

> > I do like the idea of having the deadline be some recognisable
> > timestamp, as it makes it easier to play with mixing in real, albeit
> > soft, deadlines.
> 
> 
> >
> >>> @@ -2837,10 +2788,7 @@ static void __execlists_unhold(struct i915_request *rq)
> >>>                GEM_BUG_ON(!i915_sw_fence_signaled(&rq->submit));
> >>>    
> >>>                i915_request_clear_hold(rq);
> >>> -             list_move_tail(&rq->sched.link,
> >>> -                            i915_sched_lookup_priolist(rq->engine,
> >>> -                                                       rq_prio(rq)));
> >>> -             set_bit(I915_FENCE_FLAG_PQUEUE, &rq->fence.flags);
> >>> +             submit |= intel_engine_queue_request(rq->engine, rq);
> >> Being new to this codebase, I immediately wonder whether that bitwise OR
> >> is intentional and whether you got the short-circuiting right. It looks
> >> correct to me.
> > bool submit, not many bits :)
> 
> Yes, the code is correct. My question was about whether it is accepted
> practice, considering that a future reader may think it was a mistake
> and change it anyway...

submit |= vs if () submit = true

I feel I have used both variants in this patch.
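
For reference, the two spellings side by side (equivalent for a bool
accumulator; |= is not short-circuiting, so the queue call is always
evaluated either way, which is what we want here):

	bool submit = false;

	/* variant 1: accumulate the result */
	submit |= intel_engine_queue_request(rq->engine, rq);

	/* variant 2: explicit branch */
	if (intel_engine_queue_request(rq->engine, rq))
		submit = true;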

> >>>    {
> >>>        /*
> >>>         * When in use during submission, we are protected by a guarantee that
> >>> diff --git a/drivers/gpu/drm/i915/i915_scheduler.c b/drivers/gpu/drm/i915/i915_scheduler.c
> >>> index 4c189b81cc62..30bcb6f9d99f 100644
> >>> --- a/drivers/gpu/drm/i915/i915_scheduler.c
> >>> +++ b/drivers/gpu/drm/i915/i915_scheduler.c
> >>> @@ -20,6 +20,11 @@ static struct i915_global_scheduler {
> >>>    static DEFINE_SPINLOCK(ipi_lock);
> >>>    static LIST_HEAD(ipi_list);
> >>>    
> >>> +static inline u64 rq_deadline(const struct i915_request *rq)
> >>> +{
> >>> +     return READ_ONCE(rq->sched.deadline);
> >>> +}
> >>> +
> >> Does this need a release barrier paired with an acquire barrier in
> >> __i915_request_set_deadline below?
> > No, the state can be inconsistent. If it changes while we are processing
> > the previous value, there will be another reschedule. Within
> > set_deadline, rq->sched.deadline is under the engine->active.lock; it is
> > just that rq_deadline() is used to peek before we take the lock, as well
> > as serving as shorthand within the critical section.
> >   
> 
> OK, understood.
> 
> 
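To spell out the peek-then-recheck pattern above (a rough sketch only,
not the actual __i915_request_set_deadline(); the helper name is
invented):

	static void example_set_deadline(struct i915_request *rq, u64 deadline)
	{
		/* Lockless peek: may be stale, used only as an early-out hint. */
		if (rq_deadline(rq) == deadline)
			return;

		spin_lock_irq(&rq->engine->active.lock);
		/* rq->sched.deadline is stable under engine->active.lock. */
		WRITE_ONCE(rq->sched.deadline, deadline);
		spin_unlock_irq(&rq->engine->active.lock);
	}

Any update racing with the unlocked peek just results in another
reschedule, as described above, so no acquire/release pairing is needed.
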
> >>> +static u64 prio_slice(int prio)
> >>>    {
> >>> -     const struct i915_request *inflight;
> >>> +     u64 slice;
> >>> +     int sf;
> >>>    
> >>>        /*
> >>> -      * We only need to kick the tasklet once for the high priority
> >>> -      * new context we add into the queue.
> >>> +      * With a 1ms scheduling quantum:
> >>> +      *
> >>> +      *   MAX USER:  ~32us deadline
> >>> +      *   0:         ~16ms deadline
> >>> +      *   MIN_USER: 1000ms deadline
> >>>         */
> >>> -     if (prio <= engine->execlists.queue_priority_hint)
> >>> -             return;
> >>>    
> >>> -     rcu_read_lock();
> >>> +     if (prio >= __I915_PRIORITY_KERNEL__)
> >>> +             return INT_MAX - prio;
> >>>    
> >>> -     /* Nothing currently active? We're overdue for a submission! */
> >>> -     inflight = execlists_active(&engine->execlists);
> >>> -     if (!inflight)
> >>> -             goto unlock;
> >>> +     slice = __I915_PRIORITY_KERNEL__ - prio;
> >>> +     if (prio >= 0)
> >>> +             sf = 20 - 6;
> >>> +     else
> >>> +             sf = 20 - 1;
> >>> +
> >>> +     return slice << sf;
> >>> +}
> >>> +
> >> Is this the same deadline calculation as used in the BFS? Could you
> >> perhaps add a pointer to some documentation?
> > It is a heuristic. The scale factor in BFS is designed for a smaller
> > range and is not effective for passing our existing priority ordering
> > tests.
> >
> > The challenge is to pick something fair that roughly matches usage. It
> > basically says that if client A submits 3 requests, then clients B and C
> > will be able to run before the later requests of client A, so long as
> > they are submitted within 16ms. Currently we get AAABC; the vdeadlines
> > turn that into ABCAA. So we would ideally like the quota for each client
> > to reflect their needs, so if client A needed all 3 requests within 16ms,
> > it would have a vdeadline closer to 5ms (and so it would compete for the
> > GPU against other clients). Now with this less strict priority system we
> > can let normal userspace bump their priorities, or we can use the average
> > context runtime to try and adjust priorities on the fly (i.e. not use an
> > unbiased quota). But I suspect removing any fairness will skew the
> > scheduler once more.
> 
> OK, also for future reference, it would be good to have at least a 
> subset of this documented somewhere!

Yeah. I think prio_slice is the best spot to try and explain why
different priorities have different quotas, and the impact.
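
Something along these lines, perhaps (a rough sketch of the wording, the
numbers lifted from the patch's own table rather than new measurements):

	/*
	 * prio_slice() maps a priority to a virtual-deadline quota.  With a
	 * nominal 1ms scheduling quantum, priority 0 corresponds to roughly
	 * a 16ms deadline, the maximum user priority to ~32us and the
	 * minimum to ~1000ms.  Under load this interleaves clients of equal
	 * priority: if clients B and C each submit a request within client
	 * A's window, execution order becomes ABCAA rather than AAABC.
	 * Priority therefore biases deadlines (and hence ordering), but no
	 * longer imposes a strict order.
	 */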
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
