Intel-GFX Archive on lore.kernel.org
 help / color / Atom feed
* [Intel-gfx] [PATCH 1/2] drm/i915/gt: Harden the heartbeat against a stuck driver
@ 2020-07-01  8:39 Chris Wilson
  2020-07-01  8:39 ` [Intel-gfx] [PATCH 2/2] drm/i915/gt: Move the heartbeat into the highprio system wq Chris Wilson
                   ` (8 more replies)
  0 siblings, 9 replies; 10+ messages in thread
From: Chris Wilson @ 2020-07-01  8:39 UTC (permalink / raw)
  To: intel-gfx; +Cc: Chris Wilson

If the driver get stuck holding the kernel timeline, we cannot issue a
heartbeat and so fail to discover that the driver is indeed stuck and do
not issue a GPU reset (which would hopefully unstick the driver!).
Switch to using a trylock so that we can query if the heartbeat's
timelin mutex is locked elsewhere, and then use the timer to probe if it
remains stuck at the same spot for consecutive heartbeats, indicating
that the mutex has not been released and the engine has not progressed.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
---
 drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c | 14 ++++++++++++--
 drivers/gpu/drm/i915/gt/intel_engine_types.h     |  1 +
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
index 8db7e93abde5..1663ab5c68a5 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_heartbeat.c
@@ -65,6 +65,7 @@ static void heartbeat(struct work_struct *wrk)
 		container_of(wrk, typeof(*engine), heartbeat.work.work);
 	struct intel_context *ce = engine->kernel_context;
 	struct i915_request *rq;
+	unsigned long serial;
 
 	/* Just in case everything has gone horribly wrong, give it a kick */
 	intel_engine_flush_submission(engine);
@@ -122,10 +123,19 @@ static void heartbeat(struct work_struct *wrk)
 		goto out;
 	}
 
-	if (engine->wakeref_serial == engine->serial)
+	serial = READ_ONCE(engine->serial);
+	if (engine->wakeref_serial == serial)
 		goto out;
 
-	mutex_lock(&ce->timeline->mutex);
+	if (!mutex_trylock(&ce->timeline->mutex)) {
+		/* Unable to lock the kernel timeline, is the engine stuck? */
+		if (xchg(&engine->heartbeat.blocked, serial) == serial)
+			intel_gt_handle_error(engine->gt, engine->mask,
+					      I915_ERROR_CAPTURE,
+					      "stopped heartbeat on %s",
+					      engine->name);
+		goto out;
+	}
 
 	intel_context_enter(ce);
 	rq = __i915_request_create(ce, GFP_NOWAIT | __GFP_NOWARN);
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_types.h b/drivers/gpu/drm/i915/gt/intel_engine_types.h
index 073c3769e8cc..490af81bd6f3 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_types.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine_types.h
@@ -348,6 +348,7 @@ struct intel_engine_cs {
 	struct {
 		struct delayed_work work;
 		struct i915_request *systole;
+		unsigned long blocked;
 	} heartbeat;
 
 	unsigned long serial;
-- 
2.20.1

_______________________________________________
Intel-gfx mailing list
Intel-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/intel-gfx

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, back to index

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-07-01  8:39 [Intel-gfx] [PATCH 1/2] drm/i915/gt: Harden the heartbeat against a stuck driver Chris Wilson
2020-07-01  8:39 ` [Intel-gfx] [PATCH 2/2] drm/i915/gt: Move the heartbeat into the highprio system wq Chris Wilson
2020-07-01  8:45 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with [1/2] drm/i915/gt: Harden the heartbeat against a stuck driver Patchwork
2020-07-01  8:46 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2020-07-01  8:50 ` [Intel-gfx] [PATCH] " Chris Wilson
2020-07-01  9:06 ` [Intel-gfx] ✓ Fi.CI.BAT: success for series starting with [1/2] " Patchwork
2020-07-01 10:54 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for series starting with drm/i915/gt: Harden the heartbeat against a stuck driver (rev2) Patchwork
2020-07-01 10:56 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2020-07-01 11:16 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2020-07-01 16:46 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork

Intel-GFX Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/intel-gfx/0 intel-gfx/git/0.git
	git clone --mirror https://lore.kernel.org/intel-gfx/1 intel-gfx/git/1.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 intel-gfx intel-gfx/ https://lore.kernel.org/intel-gfx \
		intel-gfx@lists.freedesktop.org
	public-inbox-index intel-gfx

Example config snippet for mirrors

Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.freedesktop.lists.intel-gfx


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git