From: Chris Wilson <chris@chris-wilson.co.uk>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>,
	intel-gfx@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH 09/20] drm/i915/gt: Reset queue_priority_hint after wedging
Date: Fri, 28 Feb 2020 13:10:17 +0000
Message-ID: <158289541748.24106.14903286456113120245@skylake-alporthouse-com>
In-Reply-To: <c36f4167-f06a-1b59-b5f9-e1efee20d634@linux.intel.com>

Quoting Tvrtko Ursulin (2020-02-28 12:59:37)
> 
> On 28/02/2020 12:31, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2020-02-28 12:10:23)
> >>
> >> On 27/02/2020 08:57, Chris Wilson wrote:
> >>> An odd and highly unlikely path caught us out. On delayed submission
> >>> (due to an asynchronous reset handler), we poked the priority_hint and
> >>> kicked the tasklet. However, we had already marked the device as wedged
> >>> and swapped out the tasklet for a no-op. The result was that we never
> >>> cleared the priority hint and became upset when we later checked.
> >>>
> >>> <0> [574.303565] i915_sel-6278    2.... 481822445us : __i915_subtests: Running intel_execlists_live_selftests/live_error_interrupt
> >>> <0> [574.303565] i915_sel-6278    2.... 481822472us : __engine_unpark: 0000:00:02.0 rcs0:
> >>> <0> [574.303565] i915_sel-6278    2.... 481822491us : __gt_unpark: 0000:00:02.0
> >>> <0> [574.303565] i915_sel-6278    2.... 481823220us : execlists_context_reset: 0000:00:02.0 rcs0: context:f4ee reset
> >>> <0> [574.303565] i915_sel-6278    2.... 481824830us : __intel_context_active: 0000:00:02.0 rcs0: context:f51b active
> >>> <0> [574.303565] i915_sel-6278    2.... 481825258us : __intel_context_do_pin: 0000:00:02.0 rcs0: context:f51b pin ring:{start:00006000, head:0000, tail:0000}
> >>> <0> [574.303565] i915_sel-6278    2.... 481825311us : __i915_request_commit: 0000:00:02.0 rcs0: fence f51b:2, current 0
> >>> <0> [574.303565] i915_sel-6278    2d..1 481825347us : __i915_request_submit: 0000:00:02.0 rcs0: fence f51b:2, current 0
> >>> <0> [574.303565] i915_sel-6278    2d..1 481825363us : trace_ports: 0000:00:02.0 rcs0: submit { f51b:2, 0:0 }
> >>> <0> [574.303565] i915_sel-6278    2.... 481826809us : __intel_context_active: 0000:00:02.0 rcs0: context:f51c active
> >>> <0> [574.303565]   <idle>-0       7d.h2 481827326us : cs_irq_handler: 0000:00:02.0 rcs0: CS error: 1
> >>> <0> [574.303565]   <idle>-0       7..s1 481827377us : process_csb: 0000:00:02.0 rcs0: cs-irq head=3, tail=4
> >>> <0> [574.303565]   <idle>-0       7..s1 481827379us : process_csb: 0000:00:02.0 rcs0: csb[4]: status=0x10000001:0x00000000
> >>> <0> [574.305593]   <idle>-0       7..s1 481827385us : trace_ports: 0000:00:02.0 rcs0: promote { f51b:2*, 0:0 }
> >>> <0> [574.305611]   <idle>-0       7..s1 481828179us : execlists_reset: 0000:00:02.0 rcs0: reset for CS error
> >>> <0> [574.305611] i915_sel-6278    2.... 481828284us : __intel_context_do_pin: 0000:00:02.0 rcs0: context:f51c pin ring:{start:00007000, head:0000, tail:0000}
> >>> <0> [574.305611] i915_sel-6278    2.... 481828345us : __i915_request_commit: 0000:00:02.0 rcs0: fence f51c:2, current 0
> >>> <0> [574.305611]   <idle>-0       7dNs2 481847823us : __i915_request_unsubmit: 0000:00:02.0 rcs0: fence f51b:2, current 1
> >>> <0> [574.305611]   <idle>-0       7dNs2 481847857us : execlists_hold: 0000:00:02.0 rcs0: fence f51b:2, current 1 on hold
> >>> <0> [574.305611]   <idle>-0       7.Ns1 481847863us : intel_engine_reset: 0000:00:02.0 rcs0: flags=4
> >>> <0> [574.305611]   <idle>-0       7.Ns1 481847945us : execlists_reset_prepare: 0000:00:02.0 rcs0: depth<-1
> >>> <0> [574.305611]   <idle>-0       7.Ns1 481847946us : intel_engine_stop_cs: 0000:00:02.0 rcs0:
> >>> <0> [574.305611]   <idle>-0       7.Ns1 538584284us : intel_engine_stop_cs: 0000:00:02.0 rcs0: timed out on STOP_RING -> IDLE
> >>> <0> [574.305611]   <idle>-0       7.Ns1 538584347us : __intel_gt_reset: 0000:00:02.0 engine_mask=1
> >>> <0> [574.305611]   <idle>-0       7.Ns1 538584406us : execlists_reset_rewind: 0000:00:02.0 rcs0:
> >>> <0> [574.305611]   <idle>-0       7dNs2 538585050us : __i915_request_reset: 0000:00:02.0 rcs0: fence f51b:2, current 1 guilty? yes
> >>> <0> [574.305611]   <idle>-0       7dNs2 538585063us : __execlists_reset: 0000:00:02.0 rcs0: replay {head:0000, tail:0068}
> >>> <0> [574.306565]   <idle>-0       7.Ns1 538588457us : intel_engine_cancel_stop_cs: 0000:00:02.0 rcs0:
> >>> <0> [574.306565]   <idle>-0       7dNs2 538588462us : __i915_request_submit: 0000:00:02.0 rcs0: fence f51c:2, current 0
> >>> <0> [574.306565]   <idle>-0       7dNs2 538588471us : trace_ports: 0000:00:02.0 rcs0: submit { f51c:2, 0:0 }
> >>> <0> [574.306565]   <idle>-0       7.Ns1 538588474us : execlists_reset_finish: 0000:00:02.0 rcs0: depth->1
> >>> <0> [574.306565] kworker/-202     2.... 538588755us : i915_request_retire: 0000:00:02.0 rcs0: fence f51c:2, current 2
> >>> <0> [574.306565] ksoftirq-46      7..s. 538588773us : process_csb: 0000:00:02.0 rcs0: cs-irq head=11, tail=1
> >>> <0> [574.306565] ksoftirq-46      7..s. 538588774us : process_csb: 0000:00:02.0 rcs0: csb[0]: status=0x10000001:0x00000000
> >>> <0> [574.306565] ksoftirq-46      7..s. 538588776us : trace_ports: 0000:00:02.0 rcs0: promote { f51c:2!, 0:0 }
> >>> <0> [574.306565] ksoftirq-46      7..s. 538588778us : process_csb: 0000:00:02.0 rcs0: csb[1]: status=0x10000018:0x00000020
> >>> <0> [574.306565] ksoftirq-46      7..s. 538588779us : trace_ports: 0000:00:02.0 rcs0: completed { f51c:2!, 0:0 }
> >>> <0> [574.306565] kworker/-202     2.... 538588826us : intel_context_unpin: 0000:00:02.0 rcs0: context:f51c unpin
> >>> <0> [574.306565] i915_sel-6278    6.... 538589663us : __intel_gt_set_wedged.part.32: 0000:00:02.0 start
> >>> <0> [574.306565] i915_sel-6278    6.... 538589667us : execlists_reset_prepare: 0000:00:02.0 rcs0: depth<-0
> >>> <0> [574.306565] i915_sel-6278    6.... 538589710us : intel_engine_stop_cs: 0000:00:02.0 rcs0:
> >>> <0> [574.306565] i915_sel-6278    6.... 538589732us : execlists_reset_prepare: 0000:00:02.0 bcs0: depth<-0
> >>> <0> [574.307591] i915_sel-6278    6.... 538589733us : intel_engine_stop_cs: 0000:00:02.0 bcs0:
> >>> <0> [574.307591] i915_sel-6278    6.... 538589757us : execlists_reset_prepare: 0000:00:02.0 vcs0: depth<-0
> >>> <0> [574.307591] i915_sel-6278    6.... 538589758us : intel_engine_stop_cs: 0000:00:02.0 vcs0:
> >>> <0> [574.307591] i915_sel-6278    6.... 538589771us : execlists_reset_prepare: 0000:00:02.0 vcs1: depth<-0
> >>> <0> [574.307591] i915_sel-6278    6.... 538589772us : intel_engine_stop_cs: 0000:00:02.0 vcs1:
> >>> <0> [574.307591] i915_sel-6278    6.... 538589778us : execlists_reset_prepare: 0000:00:02.0 vecs0: depth<-0
> >>> <0> [574.307591] i915_sel-6278    6.... 538589780us : intel_engine_stop_cs: 0000:00:02.0 vecs0:
> >>> <0> [574.307591] i915_sel-6278    6.... 538589786us : __intel_gt_reset: 0000:00:02.0 engine_mask=ff
> >>> <0> [574.307591] i915_sel-6278    6.... 538591175us : execlists_reset_cancel: 0000:00:02.0 rcs0:
> >>> <0> [574.307591] i915_sel-6278    6.... 538591970us : execlists_reset_cancel: 0000:00:02.0 bcs0:
> >>> <0> [574.307591] i915_sel-6278    6.... 538591982us : execlists_reset_cancel: 0000:00:02.0 vcs0:
> >>> <0> [574.307591] i915_sel-6278    6.... 538591996us : execlists_reset_cancel: 0000:00:02.0 vcs1:
> >>> <0> [574.307591] i915_sel-6278    6.... 538592759us : execlists_reset_cancel: 0000:00:02.0 vecs0:
> >>> <0> [574.307591] i915_sel-6278    6.... 538592977us : execlists_reset_finish: 0000:00:02.0 rcs0: depth->0
> >>> <0> [574.307591] i915_sel-6278    6.N.. 538592996us : execlists_reset_finish: 0000:00:02.0 bcs0: depth->0
> >>> <0> [574.307591] i915_sel-6278    6.N.. 538593023us : execlists_reset_finish: 0000:00:02.0 vcs0: depth->0
> >>> <0> [574.307591] i915_sel-6278    6.N.. 538593037us : execlists_reset_finish: 0000:00:02.0 vcs1: depth->0
> >>> <0> [574.307591] i915_sel-6278    6.N.. 538593051us : execlists_reset_finish: 0000:00:02.0 vecs0: depth->0
> >>> <0> [574.307591] i915_sel-6278    6.... 538593407us : __intel_gt_set_wedged.part.32: 0000:00:02.0 end
> >>> <0> [574.307591] kworker/-210     7d..1 551958381us : execlists_unhold: 0000:00:02.0 rcs0: fence f51b:2, current 2 hold release
> >>> <0> [574.307591] i915_sel-6278    0.... 559490788us : i915_request_retire: 0000:00:02.0 rcs0: fence f51b:2, current 2
> >>> <0> [574.307591] i915_sel-6278    0.... 559490793us : intel_context_unpin: 0000:00:02.0 rcs0: context:f51b unpin
> >>> <0> [574.307591] i915_sel-6278    0.... 559490798us : __engine_park: 0000:00:02.0 rcs0: parked
> >>> <0> [574.307591] i915_sel-6278    0.... 559490982us : __intel_context_retire: 0000:00:02.0 rcs0: context:f51c retire runtime: { total:30004ns, avg:30004ns }
> >>> <0> [574.307591] i915_sel-6278    0.... 559491372us : __engine_park: __engine_park:261 GEM_BUG_ON(engine->execlists.queue_priority_hint != (-((int)(~0U >> 1)) - 1))
> >>>
> >>> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> >>> ---
> >>>    drivers/gpu/drm/i915/gt/intel_lrc.c | 3 +++
> >>>    1 file changed, 3 insertions(+)
> >>>
> >>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>> index 39b0125b7143..35c5cf786726 100644
> >>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>> @@ -3724,7 +3724,10 @@ static void execlists_reset_rewind(struct intel_engine_cs *engine, bool stalled)
> >>>    
> >>>    static void nop_submission_tasklet(unsigned long data)
> >>>    {
> >>> +     struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
> >>> +
> >>>        /* The driver is wedged; don't process any more events. */
> >>> +     WRITE_ONCE(engine->execlists.queue_priority_hint, INT_MIN);
> >>
> >> Why from the tasklet and not the place which clears the queue?
> > 
> > That would be the list_move within nop_submit_request()
> > [i915_request_submit]
> > 
> > I chose this tasklet as we do the reset in execlists_submission_tasklet()
> > on clearing the queue there, and so thought this was analogous.
> 
> It actually looks to me like it is unhold which is causing this, so it
> is not true that we never reset the hint; it was probably overwritten:
> 
> execlists_reset_cancel, at the end of it:
> 
>         /* Remaining _unready_ requests will be nop'ed when submitted */
> 
>         execlists->queue_priority_hint = INT_MIN;
> 
> The question is just who overwrote it: someone called unhold after
> execlists_reset_cancel finished.
> 
> Should unhold not restore the priority hint if the requests on the hold 
> list are -EIO?

It is the unhold callback; it does

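        /* bump the hint to this request's priority and kick the tasklet */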
        if (rq_prio(rq) > engine->execlists.queue_priority_hint) {
                engine->execlists.queue_priority_hint = rq_prio(rq);
                tasklet_hi_schedule(&engine->execlists.tasklet);
        }

and queues the [now] nop_submission_tasklet, which would be fine if it
behaved similarly.
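
Which is what the patch above makes it do. For reference, the patched
function in full (a sketch reconstructed from the hunk; per the hunk
context, the original body was only the comment):

        static void nop_submission_tasklet(unsigned long data)
        {
                struct intel_engine_cs * const engine =
                        (struct intel_engine_cs *)data;

                /* The driver is wedged; don't process any more events. */
                WRITE_ONCE(engine->execlists.queue_priority_hint, INT_MIN);
        }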
-Chris