From: "Teres Alexis, Alan Previn" <alan.previn.teres.alexis@intel.com>
To: "tvrtko.ursulin@linux.intel.com" <tvrtko.ursulin@linux.intel.com>,
	"Ceraolo Spurio, Daniele" <daniele.ceraolospurio@intel.com>,
	"intel-gfx@lists.freedesktop.org"
	<intel-gfx@lists.freedesktop.org>
Subject: Re: [Intel-gfx] [PATCH 1/1] drm/i915/guc: Delay disabling guc_id scheduling for better hysteresis
Date: Wed, 21 Sep 2022 07:59:49 +0000
Message-ID: <820df17d125d8a0cb090b1c05f59be361b31b429.camel@intel.com>
In-Reply-To: <c6e968c8-aaa0-949c-61c9-857ef5d9f617@intel.com>


On Fri, 2022-09-16 at 08:36 -0700, Ceraolo Spurio, Daniele wrote:
> 
> 
alan: [snip]
> > > > > +    /*
> > > > > +     * If the context gets closed while the execbuf is ongoing, the
> > > > > +     * context close code will race with the below code to cancel the
> > > > > +     * delayed work. If the context close wins the race and cancels
> > > > > +     * the work, it will immediately call the sched disable (see
> > > > > +     * guc_context_close), so there is a chance we can get past this
> > > > > +     * check while the sched_disable code is being executed. To make
> > > > > +     * sure that code completes before we check the status further
> > > > > +     * down, we wait for the close process to complete.
> > > > > +     */
> > > > > +    if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay))
> > > > > +        intel_context_sched_disable_unpin(ce);
> > > > > +    else if (intel_context_is_closed(ce))
> > > > > +        wait_for(context_close_done(ce), 1);
> > > > 
> > > > Comment makes it sound important to handle the race, although it
> > > > doesn't really explain the consequences. But most importantly, what if
> > > > close doesn't complete in 1ms?
> > > 
> > > Will add the consequence (I believe the consequence is that we could
> > > prematurely read the context flags bit indicating its guc_id is still
> > > registered and, after skipping re-registration, find that the context
> > > gets closed and the guc-id freed).
> > > 
> > > Yes, the 1 second is arbitrary and unnervingly short. Just spent some
> > > time trying to
> > 
> > One millisecond even. :)
> 
> Normally, 1ms is plenty of time for this. We can only hit the wait if 
> the context_close call has already called the cancel_delayed_work, so 
> the only thing left to do in that function is to send the H2G, which is 
> relatively fast. However, I've just realized that if the H2G buffer is 
> full the H2G will stall, so in that case it can take longer (maximum 
> stall time before a hang is declared is 1.5s).
> 
alan: I'll increase it to 1.5 secs.
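
For reference, a minimal sketch of that one-line bump on the quoted hunk. The 1500 is only the worst-case CT stall in ms discussed above, not a value taken from the posted patch:

	else if (intel_context_is_closed(ce))
		/* cover the worst-case H2G/CT stall before a hang is declared (~1.5s) */
		wait_for(context_close_done(ce), 1500);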

> > What would be the consequence of prematurely reading the still 
> > registered context flag? Execbuf fails? GuC hangs? Kernel crashes? 
> > Something else?
> 
> i915 thinks that a request is pending with the GuC, while the GuC thinks 
> we disabled it (because the completion of the delayed_disable happens 
> after the new request has been submitted). The heartbeat will still go 
> through, so no reset will be triggered; the request is just lost. A new 
> request submission on the same context should be able to recover it, but 
> we didn't test that.
> 
> 
> > And why can't this race with context close happen at any other point 
> > than this particular spot in guc_request_alloc? Immediately after the 
> > added checks? After atomic_add_unless?
> 
> The race is tied to the canceling of the delayed work. context_close 
> only does something if it cancels the delayed work, so if the 
> cancel_delayed_work_sync here does the cancelling then context_close is 
> a no-op.
> 
alan: Then I'll add a warning, especially for the case where we have waited
1.5 secs and the close still hasn't completed.
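
As a rough sketch of what that could look like on top of the quoted hunk. The warning text and the use of drm_warn()/ce->engine->i915 here are illustrative assumptions, not the final patch:

	if (cancel_delayed_work_sync(&ce->guc_state.sched_disable_delay)) {
		intel_context_sched_disable_unpin(ce);
	} else if (intel_context_is_closed(ce) &&
		   wait_for(context_close_done(ce), 1500)) {
		/*
		 * The close path still hasn't finished after a worst-case
		 * H2G/CT stall; flag it, since the request we are about to
		 * submit may otherwise be lost by the GuC.
		 */
		drm_warn(&ce->engine->i915->drm,
			 "timed out waiting on context sched close before realloc\n");
	}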



Thread overview: 14+ messages
2022-09-15  2:12 [Intel-gfx] [PATCH 0/1] Delay disabling GuC scheduling of an idle context Alan Previn
2022-09-15  2:12 ` [Intel-gfx] [PATCH 1/1] drm/i915/guc: Delay disabling guc_id scheduling for better hysteresis Alan Previn
2022-09-15  8:48   ` Tvrtko Ursulin
2022-09-16  7:53     ` Teres Alexis, Alan Previn
2022-09-16  8:58       ` Tvrtko Ursulin
2022-09-16 15:36         ` Ceraolo Spurio, Daniele
2022-09-20  6:22           ` Teres Alexis, Alan Previn
2022-09-20 14:33           ` Tvrtko Ursulin
2022-09-21  7:59           ` Teres Alexis, Alan Previn [this message]
2022-09-21 22:30           ` Teres Alexis, Alan Previn
2022-09-21 22:30         ` Teres Alexis, Alan Previn
2022-09-15  3:22 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Delay disabling GuC scheduling of an idle context Patchwork
2022-09-15  3:41 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2022-09-15 20:41 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
