From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
From: Chris Wilson
User-Agent: alot/0.6
To: Tvrtko Ursulin, intel-gfx@lists.freedesktop.org
References: <20200227085723.1961649-1-chris@chris-wilson.co.uk>
 <20200227085723.1961649-9-chris@chris-wilson.co.uk>
 <158289309991.24106.11560132712313715212@skylake-alporthouse-com>
 <158289541748.24106.14903286456113120245@skylake-alporthouse-com>
Message-ID: <158289686714.24106.5257888211971651218@skylake-alporthouse-com>
Date: Fri, 28 Feb 2020 13:34:27 +0000
Subject: Re: [Intel-gfx] [PATCH 09/20] drm/i915/gt: Reset queue_priority_hint after wedging
List-Id: Intel graphics driver community testing & development
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit

Quoting Tvrtko Ursulin (2020-02-28 13:20:24)
> 
> On 28/02/2020 13:10, Chris Wilson wrote:
> > Quoting Tvrtko Ursulin (2020-02-28 12:59:37)
> >>
> >> On 28/02/2020 12:31, Chris Wilson wrote:
> >>> Quoting Tvrtko Ursulin (2020-02-28 12:10:23)
> >>>>
> >>>> On 27/02/2020 08:57, Chris Wilson wrote:
> >>>>> An odd and highly unlikely path caught us out. On delayed submission
> >>>>> (due to an asynchronous reset handler), we poked the priority_hint and
> >>>>> kicked the tasklet. However, we had already marked the device as wedged
> >>>>> and swapped out the tasklet for a no-op. The result was that we never
> >>>>> cleared the priority hint and became upset when we later checked.
> >>>>>
> >>>>> <0> [574.303565] i915_sel-6278 2.... 481822445us : __i915_subtests: Running intel_execlists_live_selftests/live_error_interrupt
> >>>>> <0> [574.303565] i915_sel-6278 2.... 481822472us : __engine_unpark: 0000:00:02.0 rcs0:
> >>>>> <0> [574.303565] i915_sel-6278 2.... 481822491us : __gt_unpark: 0000:00:02.0
> >>>>> <0> [574.303565] i915_sel-6278 2.... 481823220us : execlists_context_reset: 0000:00:02.0 rcs0: context:f4ee reset
> >>>>> <0> [574.303565] i915_sel-6278 2.... 481824830us : __intel_context_active: 0000:00:02.0 rcs0: context:f51b active
> >>>>> <0> [574.303565] i915_sel-6278 2.... 481825258us : __intel_context_do_pin: 0000:00:02.0 rcs0: context:f51b pin ring:{start:00006000, head:0000, tail:0000}
> >>>>> <0> [574.303565] i915_sel-6278 2.... 481825311us : __i915_request_commit: 0000:00:02.0 rcs0: fence f51b:2, current 0
> >>>>> <0> [574.303565] i915_sel-6278 2d..1 481825347us : __i915_request_submit: 0000:00:02.0 rcs0: fence f51b:2, current 0
> >>>>> <0> [574.303565] i915_sel-6278 2d..1 481825363us : trace_ports: 0000:00:02.0 rcs0: submit { f51b:2, 0:0 }
> >>>>> <0> [574.303565] i915_sel-6278 2.... 481826809us : __intel_context_active: 0000:00:02.0 rcs0: context:f51c active
> >>>>> <0> [574.303565] <idle>-0 7d.h2 481827326us : cs_irq_handler: 0000:00:02.0 rcs0: CS error: 1
> >>>>> <0> [574.303565] <idle>-0 7..s1 481827377us : process_csb: 0000:00:02.0 rcs0: cs-irq head=3, tail=4
> >>>>> <0> [574.303565] <idle>-0 7..s1 481827379us : process_csb: 0000:00:02.0 rcs0: csb[4]: status=0x10000001:0x00000000
> >>>>> <0> [574.305593] <idle>-0 7..s1 481827385us : trace_ports: 0000:00:02.0 rcs0: promote { f51b:2*, 0:0 }
> >>>>> <0> [574.305611] <idle>-0 7..s1 481828179us : execlists_reset: 0000:00:02.0 rcs0: reset for CS error
> >>>>> <0> [574.305611] i915_sel-6278 2.... 481828284us : __intel_context_do_pin: 0000:00:02.0 rcs0: context:f51c pin ring:{start:00007000, head:0000, tail:0000}
> >>>>> <0> [574.305611] i915_sel-6278 2.... 481828345us : __i915_request_commit: 0000:00:02.0 rcs0: fence f51c:2, current 0
> >>>>> <0> [574.305611] <idle>-0 7dNs2 481847823us : __i915_request_unsubmit: 0000:00:02.0 rcs0: fence f51b:2, current 1
> >>>>> <0> [574.305611] <idle>-0 7dNs2 481847857us : execlists_hold: 0000:00:02.0 rcs0: fence f51b:2, current 1 on hold
> >>>>> <0> [574.305611] <idle>-0 7.Ns1 481847863us : intel_engine_reset: 0000:00:02.0 rcs0: flags=4
> >>>>> <0> [574.305611] <idle>-0 7.Ns1 481847945us : execlists_reset_prepare: 0000:00:02.0 rcs0: depth<-1
> >>>>> <0> [574.305611] <idle>-0 7.Ns1 481847946us : intel_engine_stop_cs: 0000:00:02.0 rcs0:
> >>>>> <0> [574.305611] <idle>-0 7.Ns1 538584284us : intel_engine_stop_cs: 0000:00:02.0 rcs0: timed out on STOP_RING -> IDLE
> >>>>> <0> [574.305611] <idle>-0 7.Ns1 538584347us : __intel_gt_reset: 0000:00:02.0 engine_mask=1
> >>>>> <0> [574.305611] <idle>-0 7.Ns1 538584406us : execlists_reset_rewind: 0000:00:02.0 rcs0:
> >>>>> <0> [574.305611] <idle>-0 7dNs2 538585050us : __i915_request_reset: 0000:00:02.0 rcs0: fence f51b:2, current 1 guilty? yes
> >>>>> <0> [574.305611] <idle>-0 7dNs2 538585063us : __execlists_reset: 0000:00:02.0 rcs0: replay {head:0000, tail:0068}
> >>>>> <0> [574.306565] <idle>-0 7.Ns1 538588457us : intel_engine_cancel_stop_cs: 0000:00:02.0 rcs0:
> >>>>> <0> [574.306565] <idle>-0 7dNs2 538588462us : __i915_request_submit: 0000:00:02.0 rcs0: fence f51c:2, current 0
> >>>>> <0> [574.306565] <idle>-0 7dNs2 538588471us : trace_ports: 0000:00:02.0 rcs0: submit { f51c:2, 0:0 }
> >>>>> <0> [574.306565] <idle>-0 7.Ns1 538588474us : execlists_reset_finish: 0000:00:02.0 rcs0: depth->1
> >>>>> <0> [574.306565] kworker/-202 2.... 538588755us : i915_request_retire: 0000:00:02.0 rcs0: fence f51c:2, current 2
> >>>>> <0> [574.306565] ksoftirq-46 7..s. 538588773us : process_csb: 0000:00:02.0 rcs0: cs-irq head=11, tail=1
> >>>>> <0> [574.306565] ksoftirq-46 7..s. 538588774us : process_csb: 0000:00:02.0 rcs0: csb[0]: status=0x10000001:0x00000000
> >>>>> <0> [574.306565] ksoftirq-46 7..s. 538588776us : trace_ports: 0000:00:02.0 rcs0: promote { f51c:2!, 0:0 }
> >>>>> <0> [574.306565] ksoftirq-46 7..s. 538588778us : process_csb: 0000:00:02.0 rcs0: csb[1]: status=0x10000018:0x00000020
> >>>>> <0> [574.306565] ksoftirq-46 7..s. 538588779us : trace_ports: 0000:00:02.0 rcs0: completed { f51c:2!, 0:0 }
> >>>>> <0> [574.306565] kworker/-202 2.... 538588826us : intel_context_unpin: 0000:00:02.0 rcs0: context:f51c unpin
> >>>>> <0> [574.306565] i915_sel-6278 6.... 538589663us : __intel_gt_set_wedged.part.32: 0000:00:02.0 start
> >>>>> <0> [574.306565] i915_sel-6278 6.... 538589667us : execlists_reset_prepare: 0000:00:02.0 rcs0: depth<-0
> >>>>> <0> [574.306565] i915_sel-6278 6.... 538589710us : intel_engine_stop_cs: 0000:00:02.0 rcs0:
> >>>>> <0> [574.306565] i915_sel-6278 6.... 538589732us : execlists_reset_prepare: 0000:00:02.0 bcs0: depth<-0
> >>>>> <0> [574.307591] i915_sel-6278 6.... 538589733us : intel_engine_stop_cs: 0000:00:02.0 bcs0:
> >>>>> <0> [574.307591] i915_sel-6278 6.... 538589757us : execlists_reset_prepare: 0000:00:02.0 vcs0: depth<-0
> >>>>> <0> [574.307591] i915_sel-6278 6.... 538589758us : intel_engine_stop_cs: 0000:00:02.0 vcs0:
> >>>>> <0> [574.307591] i915_sel-6278 6.... 538589771us : execlists_reset_prepare: 0000:00:02.0 vcs1: depth<-0
> >>>>> <0> [574.307591] i915_sel-6278 6.... 538589772us : intel_engine_stop_cs: 0000:00:02.0 vcs1:
> >>>>> <0> [574.307591] i915_sel-6278 6.... 538589778us : execlists_reset_prepare: 0000:00:02.0 vecs0: depth<-0
> >>>>> <0> [574.307591] i915_sel-6278 6.... 538589780us : intel_engine_stop_cs: 0000:00:02.0 vecs0:
> >>>>> <0> [574.307591] i915_sel-6278 6.... 538589786us : __intel_gt_reset: 0000:00:02.0 engine_mask=ff
> >>>>> <0> [574.307591] i915_sel-6278 6.... 538591175us : execlists_reset_cancel: 0000:00:02.0 rcs0:
> >>>>> <0> [574.307591] i915_sel-6278 6.... 538591970us : execlists_reset_cancel: 0000:00:02.0 bcs0:
> >>>>> <0> [574.307591] i915_sel-6278 6.... 538591982us : execlists_reset_cancel: 0000:00:02.0 vcs0:
> >>>>> <0> [574.307591] i915_sel-6278 6.... 538591996us : execlists_reset_cancel: 0000:00:02.0 vcs1:
> >>>>> <0> [574.307591] i915_sel-6278 6.... 538592759us : execlists_reset_cancel: 0000:00:02.0 vecs0:
> >>>>> <0> [574.307591] i915_sel-6278 6.... 538592977us : execlists_reset_finish: 0000:00:02.0 rcs0: depth->0
> >>>>> <0> [574.307591] i915_sel-6278 6.N.. 538592996us : execlists_reset_finish: 0000:00:02.0 bcs0: depth->0
> >>>>> <0> [574.307591] i915_sel-6278 6.N.. 538593023us : execlists_reset_finish: 0000:00:02.0 vcs0: depth->0
> >>>>> <0> [574.307591] i915_sel-6278 6.N.. 538593037us : execlists_reset_finish: 0000:00:02.0 vcs1: depth->0
> >>>>> <0> [574.307591] i915_sel-6278 6.N.. 538593051us : execlists_reset_finish: 0000:00:02.0 vecs0: depth->0
> >>>>> <0> [574.307591] i915_sel-6278 6.... 538593407us : __intel_gt_set_wedged.part.32: 0000:00:02.0 end
> >>>>> <0> [574.307591] kworker/-210 7d..1 551958381us : execlists_unhold: 0000:00:02.0 rcs0: fence f51b:2, current 2 hold release
> >>>>> <0> [574.307591] i915_sel-6278 0.... 559490788us : i915_request_retire: 0000:00:02.0 rcs0: fence f51b:2, current 2
> >>>>> <0> [574.307591] i915_sel-6278 0.... 559490793us : intel_context_unpin: 0000:00:02.0 rcs0: context:f51b unpin
> >>>>> <0> [574.307591] i915_sel-6278 0.... 559490798us : __engine_park: 0000:00:02.0 rcs0: parked
> >>>>> <0> [574.307591] i915_sel-6278 0.... 559490982us : __intel_context_retire: 0000:00:02.0 rcs0: context:f51c retire runtime: { total:30004ns, avg:30004ns }
> >>>>> <0> [574.307591] i915_sel-6278 0.... 559491372us : __engine_park: __engine_park:261 GEM_BUG_ON(engine->execlists.queue_priority_hint != (-((int)(~0U >> 1)) - 1))
> >>>>>
> >>>>> Signed-off-by: Chris Wilson
> >>>>> ---
> >>>>>  drivers/gpu/drm/i915/gt/intel_lrc.c | 3 +++
> >>>>>  1 file changed, 3 insertions(+)
> >>>>>
> >>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_lrc.c b/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>>>> index 39b0125b7143..35c5cf786726 100644
> >>>>> --- a/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>>>> +++ b/drivers/gpu/drm/i915/gt/intel_lrc.c
> >>>>> @@ -3724,7 +3724,10 @@ static void execlists_reset_rewind(struct intel_engine_cs *engine, bool stalled)
> >>>>>  
> >>>>>  static void nop_submission_tasklet(unsigned long data)
> >>>>>  {
> >>>>> +	struct intel_engine_cs * const engine = (struct intel_engine_cs *)data;
> >>>>> +
> >>>>>  	/* The driver is wedged; don't process any more events. */
> >>>>> +	WRITE_ONCE(engine->execlists.queue_priority_hint, INT_MIN);
> >>>>
> >>>> Why from the tasklet and not the place which clears the queue?
> >>>
> >>> That would be the list_move within nop_submit_request()
> >>> [i915_request_submit]
> >>>
> >>> I chose this tasklet as we do the reset in execlists_submission_tasklet()
> >>> on clearing the queue there, and so thought this was analogous.
> >>
> >> It actually looks to me it is unhold which is causing this, so it is not
> >> true we never reset the hint, it was probably overwritten:
> >>
> >> execlists_reset_cancel, at the end of it:
> >>
> >>   /* Remaining _unready_ requests will be nop'ed when submitted */
> >>   execlists->queue_priority_hint = INT_MIN;
> >>
> >> Just who overwrote it.. someone called unhold after
> >> execlists_reset_cancel finished.
> >>
> >> Should unhold not restore the priority hint if the requests on the hold
> >> list are -EIO?
> >
> > It is the unhold callback, it does
> >
> > 	if (rq_prio(rq) > engine->execlists.queue_priority_hint) {
> > 		engine->execlists.queue_priority_hint = rq_prio(rq);
> > 		tasklet_hi_schedule(&engine->execlists.tasklet);
> > 	}
> >
> > and queues the [now] nop_submission_tasklet. Which would be fine if it
> > behaved similarly.
> 
> That I found, but who calls it?

Currently, a reset worker. Intention is that any path that first called
execlists_hold(). (E.g. holding requests across suspend is one possible
idea I have. No concrete second use case yet, just more of a feeling
this might be useful again.)

> I was thinking that since execlists_reset_cancel bothers to mark held
> requests with -EIO then unhold could be smarter and not consider them
> when setting the priority hint.

Could do, trying to keep it simple :) My preference is to push the
complexity to the edge case of wedge handling; and I'm not keen on
having to worry about -EIO here. The intent is remove from queue, place
in holding area, put back onto queue. Trying to do that correctly is
hard enough [apparently] :)

> That way the hint is immediately true, not only after tasklet is
> flushed. But okay, no big deal either way.

On the other hand, if the assertion is more troublesome than it's worth,
we just set the hint = INT_MIN on resume.
-Chris
_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx