From: Chris Wilson
To: Mika Kuoppala, intel-gfx@lists.freedesktop.org
Subject: Re: [Intel-gfx] [PATCH 2/3] drm/i915/gt: Don't declare hangs if engine is stalled
Date: Thu, 28 May 2020 17:52:15 +0100
Message-ID: <159068473554.10651.2928642743441484729@build.alporthouse.com>
In-Reply-To: <159068465505.10651.12715126559491848988@build.alporthouse.com>
References: <20200528074324.5765-1-chris@chris-wilson.co.uk> <20200528074324.5765-2-chris@chris-wilson.co.uk> <871rn4jafd.fsf@gaia.fi.intel.com> <159068465505.10651.12715126559491848988@build.alporthouse.com>

Quoting Chris Wilson (2020-05-28 17:50:55)
> Quoting Mika Kuoppala (2020-05-28 17:23:18)
> > Chris Wilson writes:
> >
> > > If the ring submission is stalled on an external request, nothing can be
> > > submitted, not even the heartbeat in the kernel context. Since nothing
> > > is running, resetting the engine/device does not unblock the system and
> > > is pointless. We can see if the heartbeat is supposed to be running
> > > before declaring foul.
> > >
> > > Signed-off-by: Chris Wilson
> > > ---
> > >  .../gpu/drm/i915/gt/intel_engine_heartbeat.c | 19 ++++++++++++++++---
> > >  1 file changed, 16 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> > > index 5136c8bf112d..f67ad937eefb 100644
> > > --- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> > > +++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
> > > @@ -48,8 +48,10 @@ static void show_heartbeat(const struct i915_request *rq,
> > >  	struct drm_printer p = drm_debug_printer("heartbeat");
> > >
> > >  	intel_engine_dump(engine, &p,
> > > -			  "%s heartbeat {prio:%d} not ticking\n",
> > > +			  "%s heartbeat {seqno:%llx:%lld, prio:%d} not ticking\n",
> > >  			  engine->name,
> > > +			  rq->fence.context,
> > > +			  rq->fence.seqno,
> > >  			  rq->sched.attr.priority);
> > >  }
> > >
> > > @@ -76,8 +78,19 @@ static void heartbeat(struct work_struct *wrk)
> > >  		goto out;
> > >
> > >  	if (engine->heartbeat.systole) {
> > > -		if (engine->schedule &&
> > > -		    rq->sched.attr.priority < I915_PRIORITY_BARRIER) {
> > > +		if (!i915_sw_fence_signaled(&rq->submit)) {
> > > +			/*
> > > +			 * Not yet submitted, system is stalled.
> > > +			 *
> > > +			 * This more often happens for ring submission,
> > > +			 * where all contexts are funnelled into a common
> > > +			 * ringbuffer. If one context is blocked on an
> > > +			 * external fence, not only is it not submitted,
> > > +			 * but all other contexts, including the kernel
> > > +			 * context are stuck waiting for the signal.
> > > +			 */
> >
> > The solution how to save the system evades me.
> > But piling the heartbeat on top does not help with it in
> > any case.
>
> Last resort could be hangcheck again, but over a much much longer
> interval, say 2 minutes with work queued to the engine, but it remains
> idle, mark the device as wedged (and stop using it altogether). We have
> to be really confident that the cure is worth it.
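
For illustration only, the last-resort check quoted above (work queued, engine idle, for roughly 2 minutes, then wedge) could be sketched as a small standalone state machine. This is not i915 code: `stall_watch`, `stall_watch_tick()`, `STALL_LIMIT_MS` and the verdict names are all hypothetical, and the caller is assumed to supply the `queued` and `idle` observations itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; this is a sketch, not driver code. */
#define STALL_LIMIT_MS (2 * 60 * 1000) /* "say 2 minutes" from the mail */

enum stall_verdict { STALL_OK, STALL_WAITING, STALL_WEDGE };

struct stall_watch {
	uint64_t stalled_since_ms; /* when the current stall was first seen */
	bool stalled;              /* true while work is queued but engine idle */
};

/*
 * Call periodically with the current time.
 * queued: work is queued to the engine.
 * idle:   the engine is not executing anything.
 */
static enum stall_verdict
stall_watch_tick(struct stall_watch *w, uint64_t now_ms, bool queued, bool idle)
{
	assert(w);

	/* Any progress, or nothing to do, clears the stall. */
	if (!queued || !idle) {
		w->stalled = false;
		return STALL_OK;
	}

	/* First observation of the stall: start the clock. */
	if (!w->stalled) {
		w->stalled = true;
		w->stalled_since_ms = now_ms;
		return STALL_WAITING;
	}

	/* Still stalled: only give up after the long interval. */
	if (now_ms - w->stalled_since_ms >= STALL_LIMIT_MS)
		return STALL_WEDGE;

	return STALL_WAITING;
}
```

The deliberately long limit reflects the concern that wedging is drastic; a caller would only mark the device as wedged on `STALL_WEDGE`.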

To be effective we would also need to brute force complete the requests
waiting on external fences so that we could power down the device. Hmm,
that reminds me I need something similar to power down an active device
at suspend.
-Chris
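
As a rough sketch of the "brute force complete" idea, assuming a flat array of pending requests: everything still waiting on an external fence is completed with an error, after which nothing is left pending and the device could be powered down. The types and helpers here (`pending_request`, `force_complete_external()`, `can_power_down()`) are hypothetical, and the choice of `-EIO` as the error is an assumption, not something taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request model; not the i915 request structure. */
struct pending_request {
	bool waits_on_external; /* blocked on a fence we do not control */
	bool completed;
	int error;              /* 0 on success, negative errno if cancelled */
};

/* Cancel every request stuck on an external fence; returns the count. */
static size_t force_complete_external(struct pending_request *rq, size_t count)
{
	size_t cancelled = 0;

	assert(rq || count == 0);

	for (size_t i = 0; i < count; i++) {
		if (rq[i].completed || !rq[i].waits_on_external)
			continue;

		rq[i].error = -EIO; /* assumed error code for a cancelled request */
		rq[i].completed = true;
		cancelled++;
	}

	return cancelled;
}

/* The device may power down only once no request is left pending. */
static bool can_power_down(const struct pending_request *rq, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (!rq[i].completed)
			return false;
	}

	return true;
}
```

The same two steps would presumably apply at suspend: cancel what cannot make progress, then check that nothing remains before powering down.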