From: "Teres Alexis, Alan Previn" <alan.previn.teres.alexis@intel.com>
To: "tvrtko.ursulin@linux.intel.com" <tvrtko.ursulin@linux.intel.com>,
	"Ceraolo Spurio, Daniele" <daniele.ceraolospurio@intel.com>,
	"intel-gfx@lists.freedesktop.org"
	<intel-gfx@lists.freedesktop.org>
Subject: Re: [Intel-gfx] [PATCH 1/1] drm/i915/guc: Delay disabling guc_id scheduling for better hysteresis
Date: Wed, 21 Sep 2022 07:59:49 +0000
Message-ID: <820df17d125d8a0cb090b1c05f59be361b31b429.camel@intel.com>
In-Reply-To: <c6e968c8-aaa0-949c-61c9-857ef5d9f617@intel.com>


On Fri, 2022-09-16 at 08:36 -0700, Ceraolo Spurio, Daniele wrote:
> 
> 
alan: [snip]
> > > > > +    /*
> > > > > +     * If the context gets closed while the execbuf is ongoing, the
> > > > > +     * context close code will race with the below code to cancel the
> > > > > +     * delayed work. If the context close wins the race and cancels
> > > > > +     * the work, it will immediately call the sched disable (see
> > > > > +     * guc_context_close), so there is a chance we can get past this
> > > > > +     * check while the sched_disable code is being executed. To make
> > > > > +     * sure that code completes before we check the status further
> > > > > +     * down, we wait for the close process to complete.
> > > > > +     */
> > > > > +    if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
> > > > > +        intel_context_sched_disable_unpin(ce);
> > > > > +    else if (intel_context_is_closed(ce))
> > > > > +        wait_for(context_close_done(ce), 1);
> > > > 
> > > > Comment makes it sound important to handle the race, although it
> > > > doesn't really explain the consequences. But most importantly, what if
> > > > close doesn't complete in 1ms?
> > > 
> > > Will add the consequence (I believe the consequence is that we could
> > > prematurely read the context flags bit indicating its guc_id is still
> > > registered and, after skipping re-registration, find that the context
> > > gets closed and the guc-id freed).
> > > 
> > > Yes, the 1 second is arbitrary and unnervingly short. Just spent some
> > > time trying to
> > 
> > One millisecond even. :)
> 
> Normally, 1ms is plenty of time for this. We can only hit the wait if 
> the context_close call has already called the cancel_delayed_work, so 
> the only thing left to do in that function is to send the H2G, which is 
> relatively fast. However, I've just realized that if the H2G buffer is 
> full the H2G will stall, so in that case it can take longer (maximum 
> stall time before a hang is declared is 1.5s).
> 
alan: I'll increase it to 1.5 secs.
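
For reference, a minimal sketch of that one-line bump on the quoted hunk. The 1500 is only the worst-case CT stall in ms discussed above, not a value taken from the posted patch:

	else if (intel_context_is_closed(ce))
		/* cover the worst-case H2G/CT stall before a hang is declared (~1.5s) */
		wait_for(context_close_done(ce), 1500);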

> > What would be the consequence of prematurely reading the still 
> > registered context flag? Execbuf fails? GuC hangs? Kernel crashes? 
> > Something else?
> 
> i915 thinks that a request is pending with the GuC, while the GuC thinks 
> we disabled it (because the completion of the delayed_disable happens 
> after the new request has been submitted). The heartbeat will still go 
> through, so no reset will be triggered; the request is just lost. A new 
> request submission on the same context should be able to recover it, but 
> we didn't test that.
> 
> 
> > And why can't this race with context close happen at any other point 
> > than this particular spot in guc_request_alloc? Immediately after the 
> > added checks? After atomic_add_unless?
> 
> The race is tied to the canceling of the delayed work. context_close 
> only does something if it cancels the delayed work, so if the 
> cancel_delayed_work_sync here does the cancelling then context_close is 
> a no-op.
> 
alan: Then I'll add a warning, especially for the case where we have waited
1.5 secs and the close still hasn't completed.
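
As a rough sketch of what that could look like on top of the quoted hunk. The warning text and the use of drm_warn()/ce->engine->i915 here are illustrative assumptions, not the final patch:

	if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay)) {
		intel_context_sched_disable_unpin(ce);
	} else if (intel_context_is_closed(ce) &&
		   wait_for(context_close_done(ce), 1500)) {
		/*
		 * The close path still hasn't finished after a worst-case
		 * H2G/CT stall; flag it, since the request we are about to
		 * submit may otherwise be lost by the GuC.
		 */
		drm_warn(&ce->engine->i915->drm,
			 "timed out waiting on context sched close before realloc\n");
	}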



Thread overview: 14+ messages
2022-09-15  2:12 [Intel-gfx] [PATCH 0/1] Delay disabling GuC scheduling of an idle context Alan Previn
2022-09-15  2:12 ` [Intel-gfx] [PATCH 1/1] drm/i915/guc: Delay disabling guc_id scheduling for better hysteresis Alan Previn
2022-09-15  8:48   ` Tvrtko Ursulin
2022-09-16  7:53     ` Teres Alexis, Alan Previn
2022-09-16  8:58       ` Tvrtko Ursulin
2022-09-16 15:36         ` Ceraolo Spurio, Daniele
2022-09-20  6:22           ` Teres Alexis, Alan Previn
2022-09-20 14:33           ` Tvrtko Ursulin
2022-09-21  7:59           ` Teres Alexis, Alan Previn [this message]
2022-09-21 22:30           ` Teres Alexis, Alan Previn
2022-09-21 22:30         ` Teres Alexis, Alan Previn
2022-09-15  3:22 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Delay disabling GuC scheduling of an idle context Patchwork
2022-09-15  3:41 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2022-09-15 20:41 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
