From: Chris Wilson <chris@chris-wilson.co.uk>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>,
	intel-gfx@lists.freedesktop.org
Subject: Re: [PATCH v2] drm/i915/execlists: Cancel banned contexts on schedule-out
Date: Fri, 11 Oct 2019 15:10:42 +0100
Message-ID: <157080304275.31572.6006894956600550133@skylake-alporthouse-com>
In-Reply-To: <cdf1bb4b-b134-0ef7-f59a-9a7c5b679061@linux.intel.com>

Quoting Tvrtko Ursulin (2019-10-11 14:10:21)
> 
> On 11/10/2019 12:16, Chris Wilson wrote:
> > On schedule-out (CS completion) of a banned context, scrub the context
> > image so that we do not replay the active payload. The intent is that we
> > skip banned payloads on request submission so that the timeline
> > advancement continues on in the background. However, if we are returning
> > to a preempted request, i915_request_skip() is ineffective and instead we
> > need to patch up the context image so that it continues from the start
> > of the next request.
> > 
> > v2: Fixup cancellation so that we only scrub the payload of the active
> > request and do not short-circuit the breadcrumbs (which might cause
> > other contexts to execute out of order).
> > 
> > Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> > Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> > ---
> >   drivers/gpu/drm/i915/gt/intel_lrc.c    |  91 ++++++---
> >   drivers/gpu/drm/i915/gt/selftest_lrc.c | 273 +++++++++++++++++++++++++
> >   2 files changed, 341 insertions(+), 23 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > index 09fc5ecfdd09..809a5dd97c14 100644
> > --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> > +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> > @@ -234,6 +234,9 @@ static void execlists_init_reg_state(u32 *reg_state,
> >                                    const struct intel_engine_cs *engine,
> >                                    const struct intel_ring *ring,
> >                                    bool close);
> > +static void
> > +__execlists_update_reg_state(const struct intel_context *ce,
> > +                          const struct intel_engine_cs *engine);
> >   
> >   static void __context_pin_acquire(struct intel_context *ce)
> >   {
> > @@ -256,6 +259,29 @@ static void mark_eio(struct i915_request *rq)
> >       i915_request_mark_complete(rq);
> >   }
> >   
> > +static struct i915_request *active_request(struct i915_request *rq)
> > +{
> > +     const struct intel_context * const ce = rq->hw_context;
> > +     struct i915_request *active = NULL;
> > +     struct list_head *list;
> > +
> > +     if (!i915_request_is_active(rq)) /* unwound, but incomplete! */
> > +             return rq;
> > +
> > +     list = &i915_request_active_timeline(rq)->requests;
> > +     list_for_each_entry_from_reverse(rq, list, link) {
> > +             if (i915_request_completed(rq))
> > +                     break;
> > +
> > +             if (rq->hw_context != ce)
> > +                     break;
> 
> Would it be of any value here to also check the initial breadcrumb matches?

Not currently. I don't think it makes any difference whether or not we
are inside the payload on the cancel_active() path, as we know we have
an active context. More fun and games for the reset path, as we need to
minimise collateral damage.

> > +static void cancel_active(struct i915_request *rq,
> > +                       struct intel_engine_cs *engine)
> > +{
> > +     struct intel_context * const ce = rq->hw_context;
> > +     u32 *regs = ce->lrc_reg_state;
> > +
> > +     /*
> > +      * The executing context has been cancelled. Fixup the context so that
> > +      * it continues on from the breadcrumb after the batch and will be
> > +      * marked as incomplete [-EIO] upon signaling. We preserve the
> 
> Where does the -EIO marking happen now?

On the next __i915_request_submit()
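
Roughly, as a sketch from memory (the exact guard and helper names may
differ in the tree):

	/*
	 * Sketch, not verbatim kernel code: on (re)submission of a
	 * request from a banned context, the payload is skipped and
	 * the fence error is set, so the request signals with -EIO.
	 */
	void __i915_request_submit(struct i915_request *rq)
	{
		...
		if (i915_gem_context_is_banned(rq->gem_context))
			i915_request_skip(rq, -EIO); /* sets fence error */
		...
	}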

> > +      * breadcrumbs and semaphores of the subsequent requests so that
> > +      * inter-timeline dependencies remain correctly ordered.
> > +      */
> > +     GEM_TRACE("%s(%s): { rq=%llx:%lld }\n",
> > +               __func__, engine->name, rq->fence.context, rq->fence.seqno);
> > +
> > +     __context_pin_acquire(ce);
> > +
> > +     /* On resubmission of the active request, its payload will be scrubbed */
> > +     rq = active_request(rq);
> > +     if (rq)
> > +             ce->ring->head = intel_ring_wrap(ce->ring, rq->head);
> > +     else
> > +             ce->ring->head = ce->ring->tail;
> 
> I don't quite understand yet.
> 
> If a context was banned I'd expect all requests on the tl->requests to 
> be zapped and we only move to execute the last breadcrumb, no?

We do zap them all on __i915_request_submit(). What we are preserving
is the dependency chains as we don't want to emit the final breadcrumb
before its dependencies have been signaled. (Otherwise our optimisation
of only waiting for the end of the chain will be broken, as that context
will begin before its prerequisites have run.)
 
> So if you find the active_request and set ring head to 
> active_rq->head, how does that skip the payload?

We do memset(rq->infix, 0, rq->postfix-rq->infix) in
__i915_request_submit() if (context_is_banned)
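
To make that concrete, a standalone model of the scrub (not the kernel
code itself; it assumes infix/postfix are byte offsets into the ring
and ignores ring wrap):

	#include <string.h>
	#include <stdint.h>

	struct model_request {
		uint8_t *ring;      /* ring buffer backing store */
		uint32_t infix;     /* byte offset: start of user payload */
		uint32_t postfix;   /* byte offset: start of breadcrumb */
	};

	/* Zero [infix, postfix): MI_NOOP is 0, so the CS executes
	 * straight through the emptied payload to the breadcrumb,
	 * leaving the semaphores and breadcrumbs intact. */
	static void scrub_payload(struct model_request *rq)
	{
		memset(rq->ring + rq->infix, 0, rq->postfix - rq->infix);
	}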

> Furthermore, if I try to sketch the tl->requests timeline like this:
> 
>    R0 r1 r2 r[elsp] r4 r5
> 
> 'R' = completed; 'r' = incomplete
> 
> On schedule_out(r[elsp]) I'd expect you want to find r5 and set ring 
> head to its final breadcrumb, and mark r1-r5 as -EIO. Am I completely 
> on the wrong track?
> 
> (Bear with me with r4 and r5, assuming someone has set the context as 
> single submission for future proofing the code.)

If we only had to be concerned about this timeline, sure, we could just
skip to the end. But it's timeline C, waiting on timeline A via
timeline B, that we have to be concerned about when cancelling timeline B.
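
A sketch of the hazard:

	/*
	 *  timeline A:  a1 ------------.
	 *                               \ (B's semaphore waits on A)
	 *  timeline B:      b1 b2 ... breadcrumb
	 *                                   \ (C's semaphore waits on B)
	 *  timeline C:                       c1
	 *
	 * C only waits on the end of B's chain. If cancelling B emitted
	 * the final breadcrumb immediately, C's wait would be satisfied
	 * and c1 would begin before a1 had run, breaking the A -> B -> C
	 * ordering.
	 */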
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx
