From: Chris Wilson <chris@chris-wilson.co.uk>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>,
	intel-gfx@lists.freedesktop.org
Cc: thomas.hellstrom@intel.com
Subject: Re: [Intel-gfx] [PATCH 22/41] drm/i915: Fair low-latency scheduling
Date: Thu, 28 Jan 2021 12:32:50 +0000
Message-ID: <161183717090.2943.2300525814758303137@build.alporthouse.com>
In-Reply-To: <3624beac-15d5-ed2f-ab0f-2444feab7131@linux.intel.com>

Quoting Tvrtko Ursulin (2021-01-28 11:35:59)
> 
> On 25/01/2021 14:01, Chris Wilson wrote:
> > The first "scheduler" was a topographical sorting of requests into
> > priority order. The execution order was deterministic, the earliest
> > submitted, highest priority request would be executed first. Priority
> > inheritance ensured that inversions were kept at bay, and allowed us to
> > dynamically boost priorities (e.g. for interactive pageflips).
> > 
> > The minimalistic timeslicing scheme was an attempt to introduce fairness
> > between long running requests, by evicting the active request at the end
> > of a timeslice and moving it to the back of its priority queue (while
> > ensuring that dependencies were kept in order). For short running
> > requests from many clients of equal priority, the scheme is still very
> > much FIFO submission ordering, and as unfair as before.
> > 
> > To impose fairness, we need an external metric that ensures that clients
> > are interspersed, so we don't execute one long chain from client A before
> > executing any of client B. This could be imposed by the clients
> > themselves by using fences based on an external clock, that is they only
> > submit work for a "frame" at frame-intervals, instead of submitting as
> > much work as they are able to. The standard SwapBuffers approach is akin
> > to double buffering, where, as one frame is being executed, the next is
> > being submitted, such that there is always a maximum of two frames per
> > client in the pipeline and so ideally maintains consistent input-output
> > latency. Even this scheme exhibits unfairness under load as a single
> > client will execute two frames back to back before the next, and with
> > enough clients, deadlines will be missed.
> > 
> > The idea introduced by BFS/MuQSS is that fairness is introduced by
> > metering with an external clock. Every request, when it becomes ready to
> > execute is assigned a virtual deadline, and execution order is then
> > determined by earliest deadline. Priority is used as a hint, rather than
> > strict ordering, where high priority requests have earlier deadlines,
> > but not necessarily earlier than outstanding work. Thus work is executed
> > in order of 'readiness', with timeslicing to demote long running work.
> > 
> > The Achilles' heel of this scheduler is its strong preference for
> > low-latency and favouring of new queues. Whereas it was easy to dominate
> > the old scheduler by flooding it with many requests over a short period
> > of time, the new scheduler can be dominated by a 'synchronous' client
> > that waits for each of its requests to complete before submitting the
> > next. As such a client has no history, it is always considered
> > ready-to-run and receives an earlier deadline than the long running
> > requests. This is compensated for by refreshing the current execution's
> > deadline and by disallowing preemption for timeslice shuffling.
> > 
> > In contrast, one key advantage of disconnecting the sort key from the
> > priority value is that we can freely adjust the deadline to compensate
> > for other factors. This is used in conjunction with submitting requests
> > ahead-of-schedule that then busywait on the GPU using semaphores. Since
> > we don't want to spend a timeslice busywaiting instead of doing real
> > work when available, we deprioritise work by giving the semaphore waits
> > a later virtual deadline. The priority deboost is applied to semaphore
> > workloads after they miss a semaphore wait and a new context is pending.
> > The request is then restored to its normal priority once the semaphores
> > are signaled so that it is not unfairly penalised under contention by
> > remaining at a far future deadline. This is a much improved and cleaner
> > version of commit f9e9e9de58c7 ("drm/i915: Prioritise non-busywait
> > semaphore workloads").
> > 
> > To check the impact on throughput (often the downfall of latency
> > sensitive schedulers), we used gem_wsim to simulate various transcode
> > workloads with different load balancers, and varying the number of
> > competing [heterogeneous] clients. On Kabylake gt3e running at fixed
> > clocks,
> > 
> > +delta%------------------------------------------------------------------+
> > |       a                                                                |
> > |       a                                                                |
> > |       a                                                                |
> > |       a                                                                |
> > |       aa                                                               |
> > |      aaa                                                               |
> > |      aaaa                                                              |
> > |     aaaaaa                                                             |
> > |     aaaaaa                                                             |
> > |     aaaaaa   a                a                                        |
> > | aa  aaaaaa a a      a  a   aa a       a         a       a             a|
> > ||______M__A__________|                                                  |
> > +------------------------------------------------------------------------+
> >      N           Min           Max        Median          Avg       Stddev
> >    108    -4.6326643     47.797855 -0.00069639128     2.116185   7.6764049
> 
> Is the +47% aggregate throughput, or 47% less variance between the worst
> and best clients in the group?

Each point is relative change in throughput, wsim work-per-second B/A.

That +47% is due to the improved semaphore deprioritisation.
If you look at earlier results, it used to be a range like -20%..+20%,
where sometimes we did better by avoiding the busywaits and sometimes worse.
The fix for the -20% was to apply the semaphore deprioritisation after a
miss rather than upfront (as we previously did).
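
To sketch the mechanism in scheduler-neutral terms (made-up names and an
arbitrary penalty value, not the i915 code): the deboost is only applied
once a busywait is actually observed while another context is pending, and
it is dropped again as soon as the semaphore signals, so the request is
never left stranded at a far-future deadline:

  #include <stdbool.h>
  #include <stdint.h>

  #define SEMA_DEBOOST_NS (10ull * 1000 * 1000) /* arbitrary example penalty */

  struct sketch_request {
          uint64_t deadline_ns;        /* EDF sort key (virtual deadline) */
          uint64_t ready_deadline_ns;  /* deadline assigned when it became ready */
          bool sema_deboosted;
  };

  /* Caught spinning on a semaphore while another context is pending:
   * push the virtual deadline out rather than adjusting priority. */
  static void sema_miss(struct sketch_request *rq, bool other_work_pending)
  {
          if (!other_work_pending || rq->sema_deboosted)
                  return;

          rq->deadline_ns += SEMA_DEBOOST_NS;
          rq->sema_deboosted = true;
  }

  /* Semaphore signaled: restore the normal deadline so the request is not
   * unfairly penalised under contention. */
  static void sema_signaled(struct sketch_request *rq)
  {
          if (rq->sema_deboosted) {
                  rq->deadline_ns = rq->ready_deadline_ns;
                  rq->sema_deboosted = false;
          }
  }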

> > @@ -549,9 +559,12 @@ static void __execlists_schedule_out(struct i915_request * const rq,
> >        * If we have just completed this context, the engine may now be
> >        * idle and we want to re-enter powersaving.
> >        */
> > -     if (intel_timeline_is_last(ce->timeline, rq) &&
> > -         __i915_request_is_complete(rq))
> > -             intel_engine_add_retire(engine, ce->timeline);
> > +     if (__i915_request_is_complete(rq)) {
> > +             if (!intel_timeline_is_last(ce->timeline, rq))
> > +                     i915_request_update_deadline(list_next_entry(rq, link));
> 
> Comment here explaining why it is important to update the deadline for 
> the following request once previous completes?
> 
> And this is just for the last request of the coalesced bunch right?

Yes. It follows on from the consideration that a deadline is set when
the request becomes ready. As we submit work ahead of the completion
signals, we may unfairly postpone further submissions along an active
context as its accumulated deadline far exceeds that of a new client,
even though both pieces of work are ready to be executed.

From a bandwidth pov, this is still a reasonable hack as the executing
context finished early and did not consume all of its timeslice/budget.
So we award the next request in the context the remainder of the
budget, while a fresh client still gets its full budget.

Without this quirk, we always favour new clients over long running
work.
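
A sketch of that hand-off in generic terms (hypothetical helper; the real
change is the hunk quoted above) - completing a request only ever pulls
the next request's deadline in, never pushes it out:

  #include <stdint.h>

  /* On completion of the previous request, let the next request on the
   * same context re-derive its deadline from "now" plus its slice, so the
   * context inherits the unused remainder of its budget instead of
   * queueing behind every newer client. */
  static void hand_over_budget(uint64_t *next_deadline_ns,
                               uint64_t now_ns, uint64_t slice_ns)
  {
          uint64_t fresh = now_ns + slice_ns;

          if (fresh < *next_deadline_ns)
                  *next_deadline_ns = fresh; /* only ever pull it in */
  }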

> > @@ -892,10 +892,7 @@ release_queue(struct intel_engine_cs *engine,
> >       i915_request_get(rq);
> >       i915_request_add(rq);
> >   
> > -     local_bh_disable();
> > -     i915_request_set_priority(rq, prio);
> > -     local_bh_enable(); /* kick tasklet */
> > -
> > +     i915_request_set_deadline(rq, deadline);
> 
> I am thinking some underscores to this API could be beneficial to 
> emphasise how high level callers should not use it on their requests. 
> Thinking about things like tests and in-kernel clients - my 
> understanding is the API is not for them.

Ah, this is intended to be used just like changing priority, e.g., in
the display we set a deadline for the pageflip. So although the deadline
is soft, it is still a meaningful ktime_t.

That extra information will, of course, only be carried as far as it is
understood.

> >       switch (state) {
> >       case FENCE_COMPLETE:
> > +             i915_request_update_deadline(rq);
> 
> This will pull the deadline in or push out in practice?

In, or be ignored.

This signal corresponds to when the request would normally be submitted
as being ready. So we re-evaluate the request afresh.

As it is also after the semaphore, the new deadline is not only computed
relative to the current time, but also without the semaphore deboosting.

> > +static u64 prio_slice(int prio)
> > +{
> > +     u64 slice;
> > +     int sf;
> > +
> > +     /*
> > +      * This is the central heuristic to the virtual deadlines. By
> > +      * imposing that each task takes an equal amount of time, we
> > +      * let each client have an equal slice of the GPU time. By
> > +      * bringing the virtual deadline forward, that client will then
> > +      * have more GPU time, and vice versa a lower priority client will
> > +      * have a later deadline and receive less GPU time.
> > +      *
> > +      * In BFS/MuQSS, the prio_ratios[] are based on the task nice range of
> > +      * [-20, 20], with each lower priority having a ~10% longer deadline,
> > +      * with the note that the proportion of CPU time between two clients
> > +      * of different priority will be the square of the relative prio_slice.
> > +      *
> > +      * In contrast, this prio_slice() curve was chosen because it gave good
> > +      * results with igt/gem_exec_schedule. It may not be the best choice!
> > +      *
> > +      * With a 1ms scheduling quantum:
> > +      *
> > +      *   MAX USER:  ~32us deadline
> > +      *   0:         ~16ms deadline
> 
> Interesting centre/default point. Relates to 60Hz? If so how about 
> exporting some sysfs controls?

It's expected that we will definitely have input from cgroup here to
determine relative bandwidth budgets. The nice thing about the deadline
design is that it directly translates into bandwidth budgets :)

(But it will definitely take many tests to prove we get the right
factors for relative workload distribution.)

sysfs is a possibility, but for the difficulty in naming the controls.
So mostly kept as an ace up the sleeve until Joonas asks "can we...?"
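
For a feel of how such a prio_slice() curve turns priority into a relative
GPU share, here is a purely illustrative, runnable approximation - it is
not the curve from the patch, and the +/-1023 user priority range and the
exponential scaling are assumptions - that reproduces the two quoted
endpoints for a 1ms quantum:

  #include <stdint.h>
  #include <stdio.h>

  #define QUANTUM_NS    1000000ull  /* 1ms scheduling quantum */
  #define PRIO_MAX_USER 1023        /* assumed user priority range */

  /* Illustrative only: default priority gets 16 quanta of virtual budget,
   * shrinking exponentially to ~QUANTUM/32 at maximum user priority,
   * which matches the ~16ms and ~32us figures quoted above. */
  static uint64_t prio_slice_sketch(int prio)
  {
          int shift = 4 - prio * 9 / PRIO_MAX_USER; /* +4 at prio 0, -5 at max */

          return shift >= 0 ? QUANTUM_NS << shift : QUANTUM_NS >> -shift;
  }

  /* Earliest deadline first: a higher priority means a shorter slice and
   * hence an earlier virtual deadline, but never an absolute guarantee. */
  static uint64_t virtual_deadline(uint64_t now_ns, int prio)
  {
          return now_ns + prio_slice_sketch(prio);
  }

  int main(void)
  {
          printf("prio    0: %llu us\n",
                 (unsigned long long)(prio_slice_sketch(0) / 1000));    /* ~16000 */
          printf("prio 1023: %llu us\n",
                 (unsigned long long)(prio_slice_sketch(1023) / 1000)); /* ~31 */
          printf("deadline at t=0, prio 0: %llu ns\n",
                 (unsigned long long)virtual_deadline(0, 0));
          return 0;
  }

Under contention the slice effectively acts as the per-round budget, which
is why the deadline design maps so directly onto cgroup-style bandwidth
controls.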

> > @@ -545,21 +756,15 @@ static void __i915_request_set_priority(struct i915_request *rq, int prio)
> >                * any preemption required, be dealt with upon submission.
> >                * See engine->submit_request()
> >                */
> > -             if (!i915_request_is_ready(rq))
> > -                     continue;
> > -
> >               GEM_BUG_ON(rq->engine != engine);
> > -             if (i915_request_in_priority_queue(rq)) {
> > -                     struct list_head *prev = rq->sched.link.prev;
> > +             if (i915_request_is_ready(rq) &&
> > +                 set_earliest_deadline(rq, rq_deadline(rq)))
> 
> Inside here it walks the signalers list for rq, while this is inside the 
> loop which already walks the whole signalers tree for each rq. I wonder 
> if there is scope to somehow eliminate this extra sub-walk. But to be 
> honest it makes my head spin how to do it so probably best to leave it 
> for later, if even possible.

Yes. 'nuff said. :)

The inner DFS should be short as it should not have to descend into the
tree again. But there's some freedom in that each set-priority may pick a
different deadline, and so different subtrees may need re-traversing.

> >   int i915_scheduler_perf_selftests(struct drm_i915_private *i915)
> >   {
> >       static const struct i915_subtest tests[] = {
> > +             SUBTEST(single_deadline),
> > +             SUBTEST(wide_deadline),
> > +             SUBTEST(inv_deadline),
> > +             SUBTEST(sparse_deadline),
> > +
> >               SUBTEST(single_priority),
> >               SUBTEST(wide_priority),
> >               SUBTEST(inv_priority),
> > 
> 
> Numbers talk for themselves (anyone who hasn't played with intel_gpu_top 
> and client stats enough probably can't appreciate how badly the current 
> code can schedule), design looks elegant, code is tidy. I'd say go for it and 
> tweak/fix in situ if something pops up. So r-b in waiting effectively, 
> just want to finish the series.

Aye. And the wsim throughput/deadline modes proved invaluable.

I have not been able to measure any difference in game benchmarks (except
if you look at them in intel_gpu_top) as they are dominated by a single
client on a single engine, but the small sample of media transcode
benchmarks I have showed a very nice uptick.

Where this matters most will be in saturated multi-client systems,
especially when asked for more precise budgets. The interactive desktop
is a simple example, but since we have always had very aggressive priority
boosting for flips, I doubt anyone would notice [if we couldn't maintain
vrefresh in the first place, the system would always feel laggy].
-Chris

