All of lore.kernel.org
* [PATCH 0/4] Allow error capture without a request / on reset failure
@ 2023-01-12  2:53 ` John.C.Harrison
  0 siblings, 0 replies; 31+ messages in thread
From: John.C.Harrison @ 2023-01-12  2:53 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

It is technically possible to get a hung context without a valid
request. In such a situation, try to provide as much information in
the error capture as possible rather than just aborting and capturing
nothing.

Similarly, in the case of an engine reset failure the GuC is not able
to report the guilty context. So try a manual search instead of
reporting nothing.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>


John Harrison (4):
  drm/i915: Allow error capture without a request
  drm/i915: Allow error capture of a pending request
  drm/i915/guc: Look for a guilty context when an engine reset fails
  drm/i915/guc: Add a debug print on GuC triggered reset

 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 21 ++++++-
 drivers/gpu/drm/i915/i915_gpu_error.c         | 59 +++++++++++++------
 2 files changed, 60 insertions(+), 20 deletions(-)

-- 
2.39.0


^ permalink raw reply	[flat|nested] 31+ messages in thread

* [PATCH 1/4] drm/i915: Allow error capture without a request
  2023-01-12  2:53 ` [Intel-gfx] " John.C.Harrison
@ 2023-01-12  2:53   ` John.C.Harrison
  -1 siblings, 0 replies; 31+ messages in thread
From: John.C.Harrison @ 2023-01-12  2:53 UTC (permalink / raw)
  To: Intel-GFX; +Cc: Umesh Nerlige Ramappa, John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

There was a report of error captures occurring without any hung
context being indicated despite the capture being initiated by a 'hung
context notification' from GuC. The problem was not reproducible.
However, it can happen if the context in question has no
active requests. For example, if the hang was in the context switch
itself then the breadcrumb write would have occurred and the KMD would
see an idle context.

In the interests of attempting to provide as much information as
possible about a hang, it seems wise to include the engine info
regardless of whether a request was found or not, as opposed to just
pretending there was no hang at all.

So update the error capture code to always record engine information
if an engine is given. That means updating record_context() to take a
context instead of a request (which it only ever used to find the
context anyway), and splitting the request-agnostic parts of
intel_engine_coredump_add_request() out into a separate function.

v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null
pointer.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 61 +++++++++++++++++++--------
 1 file changed, 43 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 9d5d5a397b64e..bd2cf7d235df0 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct intel_engine_coredump *ee)
 }
 
 static bool record_context(struct i915_gem_context_coredump *e,
-			   const struct i915_request *rq)
+			   struct intel_context *ce)
 {
 	struct i915_gem_context *ctx;
 	struct task_struct *task;
 	bool simulated;
 
 	rcu_read_lock();
-	ctx = rcu_dereference(rq->context->gem_context);
+	ctx = rcu_dereference(ce->gem_context);
 	if (ctx && !kref_get_unless_zero(&ctx->ref))
 		ctx = NULL;
 	rcu_read_unlock();
@@ -1396,8 +1396,8 @@ static bool record_context(struct i915_gem_context_coredump *e,
 	e->guilty = atomic_read(&ctx->guilty_count);
 	e->active = atomic_read(&ctx->active_count);
 
-	e->total_runtime = intel_context_get_total_runtime_ns(rq->context);
-	e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context);
+	e->total_runtime = intel_context_get_total_runtime_ns(ce);
+	e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
 
 	simulated = i915_gem_context_no_error_capture(ctx);
 
@@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp, u32 dump_
 	return ee;
 }
 
+static struct intel_engine_capture_vma *
+engine_coredump_add_context(struct intel_engine_coredump *ee,
+			    struct intel_context *ce,
+			    gfp_t gfp)
+{
+	struct intel_engine_capture_vma *vma = NULL;
+
+	ee->simulated |= record_context(&ee->context, ce);
+	if (ee->simulated)
+		return NULL;
+
+	/*
+	 * We need to copy these to an anonymous buffer
+	 * as the simplest method to avoid being overwritten
+	 * by userspace.
+	 */
+	vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
+	vma = capture_vma(vma, ce->state, "HW context", gfp);
+
+	return vma;
+}
+
 struct intel_engine_capture_vma *
 intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
 				  struct i915_request *rq,
 				  gfp_t gfp)
 {
-	struct intel_engine_capture_vma *vma = NULL;
+	struct intel_engine_capture_vma *vma;
 
-	ee->simulated |= record_context(&ee->context, rq);
-	if (ee->simulated)
+	vma = engine_coredump_add_context(ee, rq->context, gfp);
+	if (!vma)
 		return NULL;
 
 	/*
@@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
 	 */
 	vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch");
 	vma = capture_user(vma, rq, gfp);
-	vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
-	vma = capture_vma(vma, rq->context->state, "HW context", gfp);
 
 	ee->rq_head = rq->head;
 	ee->rq_post = rq->postfix;
@@ -1608,8 +1628,11 @@ capture_engine(struct intel_engine_cs *engine,
 	if (ce) {
 		intel_engine_clear_hung_context(engine);
 		rq = intel_context_find_active_request(ce);
-		if (!rq || !i915_request_started(rq))
-			goto no_request_capture;
+		if (rq && !i915_request_started(rq)) {
+			drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n",
+				 engine->name);
+			rq = NULL;
+		}
 	} else {
 		/*
 		 * Getting here with GuC enabled means it is a forced error capture
@@ -1622,22 +1645,24 @@ capture_engine(struct intel_engine_cs *engine,
 					       flags);
 		}
 	}
-	if (rq)
+	if (rq) {
 		rq = i915_request_get_rcu(rq);
+		capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
+	} else if (ce) {
+		capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
+	}
 
-	if (!rq)
-		goto no_request_capture;
-
-	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
 	if (!capture) {
-		i915_request_put(rq);
+		if (rq)
+			i915_request_put(rq);
 		goto no_request_capture;
 	}
 	if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
 		intel_guc_capture_get_matching_node(engine->gt, ee, ce);
 
 	intel_engine_coredump_add_vma(ee, capture, compress);
-	i915_request_put(rq);
+	if (rq)
+		i915_request_put(rq);
 
 	return ee;
 
-- 
2.39.0



* [PATCH 2/4] drm/i915: Allow error capture of a pending request
  2023-01-12  2:53 ` [Intel-gfx] " John.C.Harrison
@ 2023-01-12  2:53   ` John.C.Harrison
  -1 siblings, 0 replies; 31+ messages in thread
From: John.C.Harrison @ 2023-01-12  2:53 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

A hang situation has been observed where the only requests on the
context were either completed or not yet started according to the
breadcrumbs. However, the register state claimed a batch was (maybe)
in progress. So, allow capture of the pending request on the grounds
that this might be better than nothing.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index bd2cf7d235df0..2e338a9667a4b 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1628,11 +1628,9 @@ capture_engine(struct intel_engine_cs *engine,
 	if (ce) {
 		intel_engine_clear_hung_context(engine);
 		rq = intel_context_find_active_request(ce);
-		if (rq && !i915_request_started(rq)) {
-			drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n",
-				 engine->name);
-			rq = NULL;
-		}
+		if (rq && !i915_request_started(rq))
+			drm_info(&engine->gt->i915->drm, "Confused - active request not yet started: %lld:%lld, ce = 0x%04X/%s!\n",
+				 rq->fence.context, rq->fence.seqno, ce->guc_id.id, engine->name);
 	} else {
 		/*
 		 * Getting here with GuC enabled means it is a forced error capture
-- 
2.39.0



* [PATCH 3/4] drm/i915/guc: Look for a guilty context when an engine reset fails
  2023-01-12  2:53 ` [Intel-gfx] " John.C.Harrison
@ 2023-01-12  2:53   ` John.C.Harrison
  -1 siblings, 0 replies; 31+ messages in thread
From: John.C.Harrison @ 2023-01-12  2:53 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

Engine resets are supposed to never fail. But when one does (due to
unknown reasons that normally come down to a missing workaround), it
is useful to get as much information out of the system as possible.
Given that the GuC effectively dies in such a situation, it is not
possible to get a guilty context notification back. So do a manual
search instead. Because the GuC is dead, this is safe: the GuC will
not be changing the engine state asynchronously.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index b436dd7f12e42..99d09e3394597 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct work_struct *w)
 	guc->submission_state.reset_fail_mask = 0;
 	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 
-	if (likely(reset_fail_mask))
+	if (likely(reset_fail_mask)) {
+		struct intel_engine_cs *engine;
+		enum intel_engine_id id;
+
+		/*
+		 * GuC is toast at this point - it dead loops after sending the failed
+		 * reset notification. So need to manually determine the guilty context.
+		 * Note that it should be safe/reliable to do this here because the GuC
+		 * is toast and will not be scheduling behind the KMD's back.
+		 */
+		for_each_engine_masked(engine, gt, reset_fail_mask, id)
+			intel_guc_find_hung_context(engine);
+
 		intel_gt_handle_error(gt, reset_fail_mask,
 				      I915_ERROR_CAPTURE,
-				      "GuC failed to reset engine mask=0x%x\n",
+				      "GuC failed to reset engine mask=0x%x",
 				      reset_fail_mask);
+	}
 }
 
 int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
-- 
2.39.0



* [PATCH 4/4] drm/i915/guc: Add a debug print on GuC triggered reset
  2023-01-12  2:53 ` [Intel-gfx] " John.C.Harrison
@ 2023-01-12  2:53   ` John.C.Harrison
  -1 siblings, 0 replies; 31+ messages in thread
From: John.C.Harrison @ 2023-01-12  2:53 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

For understanding bug reports, it can be useful to have an explicit
dmesg print when a reset notification is received from GuC, as
opposed to simply inferring that it happened from other messages.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 99d09e3394597..0be7c27a436dd 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4665,6 +4665,10 @@ static void guc_handle_context_reset(struct intel_guc *guc,
 {
 	trace_intel_context_reset(ce);
 
+	drm_dbg(&guc_to_gt(guc)->i915->drm, "Got GuC reset of 0x%04X, exiting = %d, banned = %d\n",
+		ce->guc_id.id, test_bit(CONTEXT_EXITING, &ce->flags),
+		test_bit(CONTEXT_BANNED, &ce->flags));
+
 	if (likely(intel_context_is_schedulable(ce))) {
 		capture_error_state(guc, ce);
 		guc_context_replay(ce);
-- 
2.39.0



* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Allow error capture without a request / on reset failure (rev2)
  2023-01-12  2:53 ` [Intel-gfx] " John.C.Harrison
@ 2023-01-12  3:21 ` Patchwork
  -1 siblings, 0 replies; 31+ messages in thread
From: Patchwork @ 2023-01-12  3:21 UTC (permalink / raw)
  To: john.c.harrison; +Cc: intel-gfx

== Series Details ==

Series: Allow error capture without a request / on reset failure (rev2)
URL   : https://patchwork.freedesktop.org/series/111454/
State : warning

== Summary ==

Error: dim sparse failed
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.
+./arch/x86/include/asm/bitops.h:117:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:148:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:150:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:154:26: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:156:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:156:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:174:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:176:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:180:35: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:182:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:182:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:186:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:188:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:192:35: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:195:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:195:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:237:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:239:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:66:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:92:1: warning: unreplaced symbol 'return'
+./drivers/gpu/drm/i915/intel_uncore.h:346:1: warning: trying to copy expression type 31
+./include/asm-generic/bitops/generic-non-atomic.h:100:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:100:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:100:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:105:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:107:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:108:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:109:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:111:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:111:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:111:20: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:112:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:112:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:112:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:121:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:128:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:166:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:168:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:169:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:170:9: warning: unreplaced symbol 'val'
+./include/asm-generic/bitops/generic-non-atomic.h:172:19: warning: unreplaced symbol 'val'
+./include/asm-generic/bitops/generic-non-atomic.h:172:25: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:172:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:28:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:30:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:31:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:33:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:33:16: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:37:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:39:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:40:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:42:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:42:16: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:55:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:57:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:58:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:60:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:60:15: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:73:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:75:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:76:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:77:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:79:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:79:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:79:20: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:80:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:80:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:80:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:93:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:95:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:96:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:97:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:99:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:99:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:99:21: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/instrumented-non-atomic.h:100:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:112:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:115:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:127:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:130:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:139:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:142:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:26:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:42:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:58:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:97:1: warning: unreplaced symbol 'return'



^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Intel-gfx] ✓ Fi.CI.BAT: success for Allow error capture without a request / on reset failure (rev2)
  2023-01-12  2:53 ` [Intel-gfx] " John.C.Harrison
                   ` (5 preceding siblings ...)
  (?)
@ 2023-01-12  3:36 ` Patchwork
  -1 siblings, 0 replies; 31+ messages in thread
From: Patchwork @ 2023-01-12  3:36 UTC (permalink / raw)
  To: john.c.harrison; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 1835 bytes --]

== Series Details ==

Series: Allow error capture without a request / on reset failure (rev2)
URL   : https://patchwork.freedesktop.org/series/111454/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_12574 -> Patchwork_111454v2
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/index.html

Participating hosts (35 -> 33)
------------------------------

  Missing    (2): fi-bsw-kefka fi-snb-2520m 

Known issues
------------

  Here are the changes found in Patchwork_111454v2 that come from known issues:

### IGT changes ###

#### Possible fixes ####

  * igt@i915_pm_rpm@module-reload:
    - bat-dg1-5:          [SKIP][1] -> [PASS][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12574/bat-dg1-5/igt@i915_pm_rpm@module-reload.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/bat-dg1-5/igt@i915_pm_rpm@module-reload.html

  


Build changes
-------------

  * Linux: CI_DRM_12574 -> Patchwork_111454v2

  CI-20190529: 20190529
  CI_DRM_12574: bf7f7c53ac622a3f6d6738d062e59dd21ce28bd7 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_7116: 79eb8984acd309108be713a8831e60667db67e21 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_111454v2: bf7f7c53ac622a3f6d6738d062e59dd21ce28bd7 @ git://anongit.freedesktop.org/gfx-ci/linux


### Linux commits

7174352a1af9 drm/i915/guc: Add a debug print on GuC triggered reset
64b28fa8d1a2 drm/i915/guc: Look for a guilty context when an engine reset fails
c5e5df132fda drm/i915: Allow error capture of a pending request
876a26a4e1cb drm/i915: Allow error capture without a request

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/index.html

[-- Attachment #2: Type: text/html, Size: 2451 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* [Intel-gfx] ✓ Fi.CI.IGT: success for Allow error capture without a request / on reset failure (rev2)
  2023-01-12  2:53 ` [Intel-gfx] " John.C.Harrison
                   ` (6 preceding siblings ...)
  (?)
@ 2023-01-12  5:36 ` Patchwork
  -1 siblings, 0 replies; 31+ messages in thread
From: Patchwork @ 2023-01-12  5:36 UTC (permalink / raw)
  To: john.c.harrison; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 17761 bytes --]

== Series Details ==

Series: Allow error capture without a request / on reset failure (rev2)
URL   : https://patchwork.freedesktop.org/series/111454/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_12574_full -> Patchwork_111454v2_full
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/index.html

Participating hosts (13 -> 9)
------------------------------

  Missing    (4): shard-rkl0 pig-kbl-iris pig-glk-j5005 pig-skl-6260u 

Known issues
------------

  Here are the changes found in Patchwork_111454v2_full that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_exec_fair@basic-none@vcs0:
    - shard-glk:          [PASS][1] -> [FAIL][2] ([i915#2842])
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12574/shard-glk7/igt@gem_exec_fair@basic-none@vcs0.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-glk4/igt@gem_exec_fair@basic-none@vcs0.html

  * igt@gem_exec_fair@basic-pace-solo@rcs0:
    - shard-glk:          NOTRUN -> [FAIL][3] ([i915#2842])
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-glk5/igt@gem_exec_fair@basic-pace-solo@rcs0.html

  * igt@gem_lmem_swapping@heavy-multi:
    - shard-glk:          NOTRUN -> [SKIP][4] ([fdo#109271] / [i915#4613])
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-glk5/igt@gem_lmem_swapping@heavy-multi.html

  * igt@gem_pread@exhaustion:
    - shard-glk:          NOTRUN -> [WARN][5] ([i915#2658])
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-glk5/igt@gem_pread@exhaustion.html

  * igt@gen9_exec_parse@allowed-single:
    - shard-glk:          [PASS][6] -> [DMESG-WARN][7] ([i915#5566] / [i915#716])
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12574/shard-glk9/igt@gen9_exec_parse@allowed-single.html
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-glk5/igt@gen9_exec_parse@allowed-single.html

  * igt@kms_ccs@pipe-a-crc-primary-basic-y_tiled_gen12_rc_ccs_cc:
    - shard-glk:          NOTRUN -> [SKIP][8] ([fdo#109271] / [i915#3886]) +3 similar issues
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-glk5/igt@kms_ccs@pipe-a-crc-primary-basic-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_chamelium_color@ctm-blue-to-red:
    - shard-glk:          NOTRUN -> [SKIP][9] ([fdo#109271]) +33 similar issues
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-glk5/igt@kms_chamelium_color@ctm-blue-to-red.html

  * igt@runner@aborted:
    - shard-glk:          NOTRUN -> [FAIL][10] ([i915#4312])
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-glk5/igt@runner@aborted.html

  
#### Possible fixes ####

  * igt@drm_fdinfo@virtual-idle:
    - {shard-rkl}:        [FAIL][11] ([i915#7742]) -> [PASS][12] +1 similar issue
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12574/shard-rkl-3/igt@drm_fdinfo@virtual-idle.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-rkl-5/igt@drm_fdinfo@virtual-idle.html

  * igt@drm_read@short-buffer-nonblock:
    - {shard-rkl}:        [SKIP][13] ([i915#4098]) -> [PASS][14]
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12574/shard-rkl-3/igt@drm_read@short-buffer-nonblock.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-rkl-6/igt@drm_read@short-buffer-nonblock.html

  * igt@fbdev@unaligned-read:
    - {shard-rkl}:        [SKIP][15] ([i915#2582]) -> [PASS][16]
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12574/shard-rkl-3/igt@fbdev@unaligned-read.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-rkl-6/igt@fbdev@unaligned-read.html

  * igt@gem_ctx_persistence@hang:
    - {shard-rkl}:        [SKIP][17] ([i915#6252]) -> [PASS][18] +1 similar issue
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12574/shard-rkl-5/igt@gem_ctx_persistence@hang.html
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-rkl-4/igt@gem_ctx_persistence@hang.html

  * igt@gem_exec_fair@basic-pace@vcs0:
    - shard-glk:          [FAIL][19] ([i915#2842]) -> [PASS][20]
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12574/shard-glk3/igt@gem_exec_fair@basic-pace@vcs0.html
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-glk8/igt@gem_exec_fair@basic-pace@vcs0.html

  * igt@gem_exec_flush@basic-batch-kernel-default-cmd:
    - {shard-rkl}:        [SKIP][21] ([fdo#109313]) -> [PASS][22]
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12574/shard-rkl-3/igt@gem_exec_flush@basic-batch-kernel-default-cmd.html
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-rkl-5/igt@gem_exec_flush@basic-batch-kernel-default-cmd.html

  * igt@gem_exec_reloc@basic-write-read-noreloc:
    - {shard-rkl}:        [SKIP][23] ([i915#3281]) -> [PASS][24] +7 similar issues
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12574/shard-rkl-3/igt@gem_exec_reloc@basic-write-read-noreloc.html
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-rkl-5/igt@gem_exec_reloc@basic-write-read-noreloc.html

  * igt@gem_pread@uncached:
    - {shard-rkl}:        [SKIP][25] ([i915#3282]) -> [PASS][26] +1 similar issue
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12574/shard-rkl-4/igt@gem_pread@uncached.html
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-rkl-5/igt@gem_pread@uncached.html

  * igt@gen9_exec_parse@secure-batches:
    - {shard-rkl}:        [SKIP][27] ([i915#2527]) -> [PASS][28] +4 similar issues
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12574/shard-rkl-1/igt@gen9_exec_parse@secure-batches.html
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-rkl-5/igt@gen9_exec_parse@secure-batches.html

  * igt@i915_pm_dc@dc6-dpms:
    - {shard-rkl}:        [SKIP][29] ([i915#3361]) -> [PASS][30]
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12574/shard-rkl-5/igt@i915_pm_dc@dc6-dpms.html
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-rkl-2/igt@i915_pm_dc@dc6-dpms.html

  * igt@i915_pm_rc6_residency@rc6-idle@vcs0:
    - {shard-dg1}:        [FAIL][31] ([i915#3591]) -> [PASS][32]
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12574/shard-dg1-19/igt@i915_pm_rc6_residency@rc6-idle@vcs0.html
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-dg1-15/igt@i915_pm_rc6_residency@rc6-idle@vcs0.html

  * igt@kms_frontbuffer_tracking@fbc-badstride:
    - {shard-rkl}:        [SKIP][33] ([i915#1849] / [i915#4098]) -> [PASS][34] +9 similar issues
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12574/shard-rkl-4/igt@kms_frontbuffer_tracking@fbc-badstride.html
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-rkl-6/igt@kms_frontbuffer_tracking@fbc-badstride.html

  * igt@kms_psr@sprite_plane_move:
    - {shard-rkl}:        [SKIP][35] ([i915#1072]) -> [PASS][36] +1 similar issue
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12574/shard-rkl-3/igt@kms_psr@sprite_plane_move.html
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-rkl-6/igt@kms_psr@sprite_plane_move.html

  * igt@kms_vblank@pipe-b-query-forked:
    - {shard-rkl}:        [SKIP][37] ([i915#1845] / [i915#4098]) -> [PASS][38] +16 similar issues
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12574/shard-rkl-3/igt@kms_vblank@pipe-b-query-forked.html
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-rkl-6/igt@kms_vblank@pipe-b-query-forked.html

  * igt@syncobj_timeline@reset-during-wait-for-submit:
    - {shard-dg1}:        [DMESG-WARN][39] ([i915#1982]) -> [PASS][40]
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12574/shard-dg1-13/igt@syncobj_timeline@reset-during-wait-for-submit.html
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-dg1-17/igt@syncobj_timeline@reset-during-wait-for-submit.html

  * igt@sysfs_heartbeat_interval@precise@rcs0:
    - {shard-dg1}:        [FAIL][41] ([i915#1755]) -> [PASS][42]
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12574/shard-dg1-18/igt@sysfs_heartbeat_interval@precise@rcs0.html
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/shard-dg1-18/igt@sysfs_heartbeat_interval@precise@rcs0.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109274]: https://bugs.freedesktop.org/show_bug.cgi?id=109274
  [fdo#109279]: https://bugs.freedesktop.org/show_bug.cgi?id=109279
  [fdo#109280]: https://bugs.freedesktop.org/show_bug.cgi?id=109280
  [fdo#109283]: https://bugs.freedesktop.org/show_bug.cgi?id=109283
  [fdo#109289]: https://bugs.freedesktop.org/show_bug.cgi?id=109289
  [fdo#109295]: https://bugs.freedesktop.org/show_bug.cgi?id=109295
  [fdo#109303]: https://bugs.freedesktop.org/show_bug.cgi?id=109303
  [fdo#109307]: https://bugs.freedesktop.org/show_bug.cgi?id=109307
  [fdo#109313]: https://bugs.freedesktop.org/show_bug.cgi?id=109313
  [fdo#109315]: https://bugs.freedesktop.org/show_bug.cgi?id=109315
  [fdo#109642]: https://bugs.freedesktop.org/show_bug.cgi?id=109642
  [fdo#110189]: https://bugs.freedesktop.org/show_bug.cgi?id=110189
  [fdo#110723]: https://bugs.freedesktop.org/show_bug.cgi?id=110723
  [fdo#111068]: https://bugs.freedesktop.org/show_bug.cgi?id=111068
  [fdo#111614]: https://bugs.freedesktop.org/show_bug.cgi?id=111614
  [fdo#111615]: https://bugs.freedesktop.org/show_bug.cgi?id=111615
  [fdo#111644]: https://bugs.freedesktop.org/show_bug.cgi?id=111644
  [fdo#111656]: https://bugs.freedesktop.org/show_bug.cgi?id=111656
  [fdo#111825]: https://bugs.freedesktop.org/show_bug.cgi?id=111825
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [fdo#112054]: https://bugs.freedesktop.org/show_bug.cgi?id=112054
  [fdo#112283]: https://bugs.freedesktop.org/show_bug.cgi?id=112283
  [i915#1072]: https://gitlab.freedesktop.org/drm/intel/issues/1072
  [i915#132]: https://gitlab.freedesktop.org/drm/intel/issues/132
  [i915#1397]: https://gitlab.freedesktop.org/drm/intel/issues/1397
  [i915#1755]: https://gitlab.freedesktop.org/drm/intel/issues/1755
  [i915#1769]: https://gitlab.freedesktop.org/drm/intel/issues/1769
  [i915#1825]: https://gitlab.freedesktop.org/drm/intel/issues/1825
  [i915#1839]: https://gitlab.freedesktop.org/drm/intel/issues/1839
  [i915#1845]: https://gitlab.freedesktop.org/drm/intel/issues/1845
  [i915#1849]: https://gitlab.freedesktop.org/drm/intel/issues/1849
  [i915#1982]: https://gitlab.freedesktop.org/drm/intel/issues/1982
  [i915#2232]: https://gitlab.freedesktop.org/drm/intel/issues/2232
  [i915#2434]: https://gitlab.freedesktop.org/drm/intel/issues/2434
  [i915#2436]: https://gitlab.freedesktop.org/drm/intel/issues/2436
  [i915#2527]: https://gitlab.freedesktop.org/drm/intel/issues/2527
  [i915#2575]: https://gitlab.freedesktop.org/drm/intel/issues/2575
  [i915#2582]: https://gitlab.freedesktop.org/drm/intel/issues/2582
  [i915#2587]: https://gitlab.freedesktop.org/drm/intel/issues/2587
  [i915#2658]: https://gitlab.freedesktop.org/drm/intel/issues/2658
  [i915#2672]: https://gitlab.freedesktop.org/drm/intel/issues/2672
  [i915#2681]: https://gitlab.freedesktop.org/drm/intel/issues/2681
  [i915#2705]: https://gitlab.freedesktop.org/drm/intel/issues/2705
  [i915#280]: https://gitlab.freedesktop.org/drm/intel/issues/280
  [i915#284]: https://gitlab.freedesktop.org/drm/intel/issues/284
  [i915#2842]: https://gitlab.freedesktop.org/drm/intel/issues/2842
  [i915#2856]: https://gitlab.freedesktop.org/drm/intel/issues/2856
  [i915#2920]: https://gitlab.freedesktop.org/drm/intel/issues/2920
  [i915#2994]: https://gitlab.freedesktop.org/drm/intel/issues/2994
  [i915#3116]: https://gitlab.freedesktop.org/drm/intel/issues/3116
  [i915#315]: https://gitlab.freedesktop.org/drm/intel/issues/315
  [i915#3281]: https://gitlab.freedesktop.org/drm/intel/issues/3281
  [i915#3282]: https://gitlab.freedesktop.org/drm/intel/issues/3282
  [i915#3291]: https://gitlab.freedesktop.org/drm/intel/issues/3291
  [i915#3297]: https://gitlab.freedesktop.org/drm/intel/issues/3297
  [i915#3299]: https://gitlab.freedesktop.org/drm/intel/issues/3299
  [i915#3318]: https://gitlab.freedesktop.org/drm/intel/issues/3318
  [i915#3359]: https://gitlab.freedesktop.org/drm/intel/issues/3359
  [i915#3361]: https://gitlab.freedesktop.org/drm/intel/issues/3361
  [i915#3469]: https://gitlab.freedesktop.org/drm/intel/issues/3469
  [i915#3546]: https://gitlab.freedesktop.org/drm/intel/issues/3546
  [i915#3555]: https://gitlab.freedesktop.org/drm/intel/issues/3555
  [i915#3591]: https://gitlab.freedesktop.org/drm/intel/issues/3591
  [i915#3637]: https://gitlab.freedesktop.org/drm/intel/issues/3637
  [i915#3638]: https://gitlab.freedesktop.org/drm/intel/issues/3638
  [i915#3689]: https://gitlab.freedesktop.org/drm/intel/issues/3689
  [i915#3708]: https://gitlab.freedesktop.org/drm/intel/issues/3708
  [i915#3734]: https://gitlab.freedesktop.org/drm/intel/issues/3734
  [i915#3742]: https://gitlab.freedesktop.org/drm/intel/issues/3742
  [i915#3826]: https://gitlab.freedesktop.org/drm/intel/issues/3826
  [i915#3840]: https://gitlab.freedesktop.org/drm/intel/issues/3840
  [i915#3886]: https://gitlab.freedesktop.org/drm/intel/issues/3886
  [i915#4070]: https://gitlab.freedesktop.org/drm/intel/issues/4070
  [i915#4078]: https://gitlab.freedesktop.org/drm/intel/issues/4078
  [i915#4098]: https://gitlab.freedesktop.org/drm/intel/issues/4098
  [i915#4103]: https://gitlab.freedesktop.org/drm/intel/issues/4103
  [i915#426]: https://gitlab.freedesktop.org/drm/intel/issues/426
  [i915#4270]: https://gitlab.freedesktop.org/drm/intel/issues/4270
  [i915#4312]: https://gitlab.freedesktop.org/drm/intel/issues/4312
  [i915#4613]: https://gitlab.freedesktop.org/drm/intel/issues/4613
  [i915#4767]: https://gitlab.freedesktop.org/drm/intel/issues/4767
  [i915#5176]: https://gitlab.freedesktop.org/drm/intel/issues/5176
  [i915#5235]: https://gitlab.freedesktop.org/drm/intel/issues/5235
  [i915#5286]: https://gitlab.freedesktop.org/drm/intel/issues/5286
  [i915#5288]: https://gitlab.freedesktop.org/drm/intel/issues/5288
  [i915#5289]: https://gitlab.freedesktop.org/drm/intel/issues/5289
  [i915#5325]: https://gitlab.freedesktop.org/drm/intel/issues/5325
  [i915#5327]: https://gitlab.freedesktop.org/drm/intel/issues/5327
  [i915#533]: https://gitlab.freedesktop.org/drm/intel/issues/533
  [i915#5439]: https://gitlab.freedesktop.org/drm/intel/issues/5439
  [i915#5566]: https://gitlab.freedesktop.org/drm/intel/issues/5566
  [i915#5723]: https://gitlab.freedesktop.org/drm/intel/issues/5723
  [i915#6095]: https://gitlab.freedesktop.org/drm/intel/issues/6095
  [i915#6230]: https://gitlab.freedesktop.org/drm/intel/issues/6230
  [i915#6245]: https://gitlab.freedesktop.org/drm/intel/issues/6245
  [i915#6247]: https://gitlab.freedesktop.org/drm/intel/issues/6247
  [i915#6248]: https://gitlab.freedesktop.org/drm/intel/issues/6248
  [i915#6252]: https://gitlab.freedesktop.org/drm/intel/issues/6252
  [i915#6268]: https://gitlab.freedesktop.org/drm/intel/issues/6268
  [i915#6301]: https://gitlab.freedesktop.org/drm/intel/issues/6301
  [i915#6334]: https://gitlab.freedesktop.org/drm/intel/issues/6334
  [i915#6335]: https://gitlab.freedesktop.org/drm/intel/issues/6335
  [i915#6344]: https://gitlab.freedesktop.org/drm/intel/issues/6344
  [i915#6403]: https://gitlab.freedesktop.org/drm/intel/issues/6403
  [i915#6412]: https://gitlab.freedesktop.org/drm/intel/issues/6412
  [i915#6497]: https://gitlab.freedesktop.org/drm/intel/issues/6497
  [i915#6524]: https://gitlab.freedesktop.org/drm/intel/issues/6524
  [i915#658]: https://gitlab.freedesktop.org/drm/intel/issues/658
  [i915#6768]: https://gitlab.freedesktop.org/drm/intel/issues/6768
  [i915#6944]: https://gitlab.freedesktop.org/drm/intel/issues/6944
  [i915#6946]: https://gitlab.freedesktop.org/drm/intel/issues/6946
  [i915#7037]: https://gitlab.freedesktop.org/drm/intel/issues/7037
  [i915#7116]: https://gitlab.freedesktop.org/drm/intel/issues/7116
  [i915#7118]: https://gitlab.freedesktop.org/drm/intel/issues/7118
  [i915#716]: https://gitlab.freedesktop.org/drm/intel/issues/716
  [i915#7276]: https://gitlab.freedesktop.org/drm/intel/issues/7276
  [i915#7561]: https://gitlab.freedesktop.org/drm/intel/issues/7561
  [i915#7697]: https://gitlab.freedesktop.org/drm/intel/issues/7697
  [i915#7701]: https://gitlab.freedesktop.org/drm/intel/issues/7701
  [i915#7707]: https://gitlab.freedesktop.org/drm/intel/issues/7707
  [i915#7711]: https://gitlab.freedesktop.org/drm/intel/issues/7711
  [i915#7742]: https://gitlab.freedesktop.org/drm/intel/issues/7742
  [i915#7828]: https://gitlab.freedesktop.org/drm/intel/issues/7828


Build changes
-------------

  * Linux: CI_DRM_12574 -> Patchwork_111454v2
  * Piglit: piglit_4509 -> None

  CI-20190529: 20190529
  CI_DRM_12574: bf7f7c53ac622a3f6d6738d062e59dd21ce28bd7 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_7116: 79eb8984acd309108be713a8831e60667db67e21 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_111454v2: bf7f7c53ac622a3f6d6738d062e59dd21ce28bd7 @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v2/index.html

[-- Attachment #2: Type: text/html, Size: 12517 bytes --]

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [PATCH 1/4] drm/i915: Allow error capture without a request
  2023-01-12  2:53   ` [Intel-gfx] " John.C.Harrison
  (?)
@ 2023-01-12 10:01   ` Tvrtko Ursulin
  2023-01-12 20:40     ` John Harrison
  -1 siblings, 1 reply; 31+ messages in thread
From: Tvrtko Ursulin @ 2023-01-12 10:01 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel


On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> There was a report of error captures occurring without any hung
> context being indicated despite the capture being initiated by a 'hung
> context notification' from GuC. The problem was not reproducible.
> However, it is possible to happen if the context in question has no
> active requests. For example, if the hang was in the context switch
> itself then the breadcrumb write would have occurred and the KMD would
> see an idle context.
> 
> In the interests of attempting to provide as much information as
> possible about a hang, it seems wise to include the engine info
> regardless of whether a request was found or not. As opposed to just
> pretending there was no hang at all.
> 
> So update the error capture code to always record engine information
> if an engine is given. Which means updating record_context() to take a
> context instead of a request (which it only ever used to find the
> context anyway). And split the request agnostic parts of
> intel_engine_coredump_add_request() out into a separate function.
> 
> v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null
> pointer.
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gpu_error.c | 61 +++++++++++++++++++--------
>   1 file changed, 43 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 9d5d5a397b64e..bd2cf7d235df0 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct intel_engine_coredump *ee)
>   }
>   
>   static bool record_context(struct i915_gem_context_coredump *e,
> -			   const struct i915_request *rq)
> +			   struct intel_context *ce)
>   {
>   	struct i915_gem_context *ctx;
>   	struct task_struct *task;
>   	bool simulated;
>   
>   	rcu_read_lock();
> -	ctx = rcu_dereference(rq->context->gem_context);
> +	ctx = rcu_dereference(ce->gem_context);
>   	if (ctx && !kref_get_unless_zero(&ctx->ref))
>   		ctx = NULL;
>   	rcu_read_unlock();
> @@ -1396,8 +1396,8 @@ static bool record_context(struct i915_gem_context_coredump *e,
>   	e->guilty = atomic_read(&ctx->guilty_count);
>   	e->active = atomic_read(&ctx->active_count);
>   
> -	e->total_runtime = intel_context_get_total_runtime_ns(rq->context);
> -	e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context);
> +	e->total_runtime = intel_context_get_total_runtime_ns(ce);
> +	e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
>   
>   	simulated = i915_gem_context_no_error_capture(ctx);
>   
> @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp, u32 dump_
>   	return ee;
>   }
>   
> +static struct intel_engine_capture_vma *
> +engine_coredump_add_context(struct intel_engine_coredump *ee,
> +			    struct intel_context *ce,
> +			    gfp_t gfp)
> +{
> +	struct intel_engine_capture_vma *vma = NULL;
> +
> +	ee->simulated |= record_context(&ee->context, ce);
> +	if (ee->simulated)
> +		return NULL;
> +
> +	/*
> +	 * We need to copy these to an anonymous buffer
> +	 * as the simplest method to avoid being overwritten
> +	 * by userspace.
> +	 */
> +	vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
> +	vma = capture_vma(vma, ce->state, "HW context", gfp);
> +
> +	return vma;
> +}
> +
>   struct intel_engine_capture_vma *
>   intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
>   				  struct i915_request *rq,
>   				  gfp_t gfp)
>   {
> -	struct intel_engine_capture_vma *vma = NULL;
> +	struct intel_engine_capture_vma *vma;
>   
> -	ee->simulated |= record_context(&ee->context, rq);
> -	if (ee->simulated)
> +	vma = engine_coredump_add_context(ee, rq->context, gfp);
> +	if (!vma)
>   		return NULL;
>   
>   	/*
> @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
>   	 */
>   	vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch");
>   	vma = capture_user(vma, rq, gfp);
> -	vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
> -	vma = capture_vma(vma, rq->context->state, "HW context", gfp);
>   
>   	ee->rq_head = rq->head;
>   	ee->rq_post = rq->postfix;
> @@ -1608,8 +1628,11 @@ capture_engine(struct intel_engine_cs *engine,
>   	if (ce) {
>   		intel_engine_clear_hung_context(engine);
>   		rq = intel_context_find_active_request(ce);
> -		if (!rq || !i915_request_started(rq))
> -			goto no_request_capture;
> +		if (rq && !i915_request_started(rq)) {
> +			drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n",

Suggest s/active/started/ since we have both i915_request_active and 
i915_request_started, to align the terminology.

> +				 engine->name);
> +			rq = NULL;
> +		}
>   	} else {
>   		/*
>   		 * Getting here with GuC enabled means it is a forced error capture
> @@ -1622,22 +1645,24 @@ capture_engine(struct intel_engine_cs *engine,
>   					       flags);
>   		}
>   	}
> -	if (rq)
> +	if (rq) {
>   		rq = i915_request_get_rcu(rq);
> +		capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
> +	} else if (ce) {
> +		capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
> +	}
>   
> -	if (!rq)
> -		goto no_request_capture;
> -
> -	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
>   	if (!capture) {
> -		i915_request_put(rq);
> +		if (rq)
> +			i915_request_put(rq);
>   		goto no_request_capture;
>   	}
>   	if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>   		intel_guc_capture_get_matching_node(engine->gt, ee, ce);

This step requires non-NULL ce, so if you move it under the "else if 
(ce)" above then I *think* the exit from the function can be consolidated 
to just:

if (capture) {
	intel_engine_coredump_add_vma(ee, capture, compress);
	if (rq)
		i915_request_put(rq);
} else {
	kfree(ee);
	ee = NULL;
}

return ee;

No "if (rq) i915_request_put()" twice, and goto label can be completely 
removed.

Regards,

Tvrtko

>   
>   	intel_engine_coredump_add_vma(ee, capture, compress);
> -	i915_request_put(rq);
> +	if (rq)
> +		i915_request_put(rq);
>   
>   	return ee;
>   

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [PATCH 2/4] drm/i915: Allow error capture of a pending request
  2023-01-12  2:53   ` [Intel-gfx] " John.C.Harrison
  (?)
@ 2023-01-12 10:06   ` Tvrtko Ursulin
  2023-01-12 20:46     ` John Harrison
  -1 siblings, 1 reply; 31+ messages in thread
From: Tvrtko Ursulin @ 2023-01-12 10:06 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel


On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> A hang situation has been observed where the only requests on the
> context were either completed or not yet started according to the
> breadcrumbs. However, the register state claimed a batch was (maybe)
> in progress. So, allow capture of the pending request on the grounds
> that this might be better than nothing.
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gpu_error.c | 8 +++-----
>   1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index bd2cf7d235df0..2e338a9667a4b 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1628,11 +1628,9 @@ capture_engine(struct intel_engine_cs *engine,
>   	if (ce) {
>   		intel_engine_clear_hung_context(engine);
>   		rq = intel_context_find_active_request(ce);
> -		if (rq && !i915_request_started(rq)) {
> -			drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n",
> -				 engine->name);
> -			rq = NULL;
> -		}
> +		if (rq && !i915_request_started(rq))
> +			drm_info(&engine->gt->i915->drm, "Confused - active request not yet started: %lld:%lld, ce = 0x%04X/%s!\n",
> +				 rq->fence.context, rq->fence.seqno, ce->guc_id.id, engine->name);

Ah you change active to started in this patch! :)

I suggest no "ce" in user visible messages and maybe stick with the 
convention grep suggests is already established:

"Hung context with active request %lld:%lld [0x%04X] not started!"

Regards,

Tvrtko

>   	} else {
>   		/*
>   		 * Getting here with GuC enabled means it is a forced error capture

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [PATCH 4/4] drm/i915/guc: Add a debug print on GuC triggered reset
  2023-01-12  2:53   ` [Intel-gfx] " John.C.Harrison
  (?)
@ 2023-01-12 10:11   ` Tvrtko Ursulin
  -1 siblings, 0 replies; 31+ messages in thread
From: Tvrtko Ursulin @ 2023-01-12 10:11 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel


On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> For understanding bug reports, it can be useful to have an explicit
> dmesg print when a reset notification is received from GuC. As opposed
> to simply inferring that this happened from other messages.
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++
>   1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 99d09e3394597..0be7c27a436dd 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -4665,6 +4665,10 @@ static void guc_handle_context_reset(struct intel_guc *guc,
>   {
>   	trace_intel_context_reset(ce);
>   
> +	drm_dbg(&guc_to_gt(guc)->i915->drm, "Got GuC reset of 0x%04X, exiting = %d, banned = %d\n",
> +		ce->guc_id.id, test_bit(CONTEXT_EXITING, &ce->flags),
> +		test_bit(CONTEXT_BANNED, &ce->flags));
> +
>   	if (likely(intel_context_is_schedulable(ce))) {
>   		capture_error_state(guc, ce);
>   		guc_context_replay(ce);

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko


* Re: [Intel-gfx] [PATCH 3/4] drm/i915/guc: Look for a guilty context when an engine reset fails
  2023-01-12  2:53   ` [Intel-gfx] " John.C.Harrison
  (?)
@ 2023-01-12 10:15   ` Tvrtko Ursulin
  2023-01-12 20:59     ` John Harrison
  -1 siblings, 1 reply; 31+ messages in thread
From: Tvrtko Ursulin @ 2023-01-12 10:15 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel


On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> Engine resets are supposed to never fail. But in the case when one
> does (due to unknown reasons that normally come down to a missing
> w/a), it is useful to get as much information out of the system as
> possible. Given that the GuC effectively dies in such a situation, it
> is not possible to get a guilty context notification back. So do a
> manual search instead. Given that GuC is dead, this is safe because
> GuC won't be changing the engine state asynchronously.
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 +++++++++++++++--
>   1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index b436dd7f12e42..99d09e3394597 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct work_struct *w)
>   	guc->submission_state.reset_fail_mask = 0;
>   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   
> -	if (likely(reset_fail_mask))
> +	if (likely(reset_fail_mask)) {
> +		struct intel_engine_cs *engine;
> +		enum intel_engine_id id;
> +
> +		/*
> +		 * GuC is toast at this point - it dead loops after sending the failed
> +		 * reset notification. So need to manually determine the guilty context.
> +		 * Note that it should be safe/reliable to do this here because the GuC
> +		 * is toast and will not be scheduling behind the KMD's back.
> +		 */
> +		for_each_engine_masked(engine, gt, reset_fail_mask, id)
> +			intel_guc_find_hung_context(engine);
> +
>   		intel_gt_handle_error(gt, reset_fail_mask,
>   				      I915_ERROR_CAPTURE,
> -				      "GuC failed to reset engine mask=0x%x\n",
> +				      "GuC failed to reset engine mask=0x%x",
>   				      reset_fail_mask);
> +	}
>   }
>   
>   int intel_guc_engine_failure_process_msg(struct intel_guc *guc,

This one I don't feel "at home" enough to r-b. Just a question - can we 
be sure at this point that GuC is 100% stuck and there isn't a chance it 
somehow comes alive and starts running in parallel (being driven in 
parallel by a different "thread" in i915), interfering with the 
assumption made in the comment?

Regards,

Tvrtko


* Re: [Intel-gfx] [PATCH 1/4] drm/i915: Allow error capture without a request
  2023-01-12 10:01   ` Tvrtko Ursulin
@ 2023-01-12 20:40     ` John Harrison
  2023-01-13  9:51       ` Tvrtko Ursulin
  0 siblings, 1 reply; 31+ messages in thread
From: John Harrison @ 2023-01-12 20:40 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-GFX; +Cc: DRI-Devel

On 1/12/2023 02:01, Tvrtko Ursulin wrote:
> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> There was a report of error captures occurring without any hung
>> context being indicated despite the capture being initiated by a 'hung
>> context notification' from GuC. The problem was not reproducible.
>> However, it is possible to happen if the context in question has no
>> active requests. For example, if the hang was in the context switch
>> itself then the breadcrumb write would have occurred and the KMD would
>> see an idle context.
>>
>> In the interests of attempting to provide as much information as
>> possible about a hang, it seems wise to include the engine info
>> regardless of whether a request was found or not. As opposed to just
>> pretending there was no hang at all.
>>
>> So update the error capture code to always record engine information
>> if an engine is given. Which means updating record_context() to take a
>> context instead of a request (which it only ever used to find the
>> context anyway). And split the request agnostic parts of
>> intel_engine_coredump_add_request() out into a separate function.
>>
>> v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null
>> pointer.
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_gpu_error.c | 61 +++++++++++++++++++--------
>>   1 file changed, 43 insertions(+), 18 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>> index 9d5d5a397b64e..bd2cf7d235df0 100644
>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>> @@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct 
>> intel_engine_coredump *ee)
>>   }
>>     static bool record_context(struct i915_gem_context_coredump *e,
>> -               const struct i915_request *rq)
>> +               struct intel_context *ce)
>>   {
>>       struct i915_gem_context *ctx;
>>       struct task_struct *task;
>>       bool simulated;
>>         rcu_read_lock();
>> -    ctx = rcu_dereference(rq->context->gem_context);
>> +    ctx = rcu_dereference(ce->gem_context);
>>       if (ctx && !kref_get_unless_zero(&ctx->ref))
>>           ctx = NULL;
>>       rcu_read_unlock();
>> @@ -1396,8 +1396,8 @@ static bool record_context(struct 
>> i915_gem_context_coredump *e,
>>       e->guilty = atomic_read(&ctx->guilty_count);
>>       e->active = atomic_read(&ctx->active_count);
>>   -    e->total_runtime = 
>> intel_context_get_total_runtime_ns(rq->context);
>> -    e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context);
>> +    e->total_runtime = intel_context_get_total_runtime_ns(ce);
>> +    e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
>>         simulated = i915_gem_context_no_error_capture(ctx);
>>   @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct 
>> intel_engine_cs *engine, gfp_t gfp, u32 dump_
>>       return ee;
>>   }
>>   +static struct intel_engine_capture_vma *
>> +engine_coredump_add_context(struct intel_engine_coredump *ee,
>> +                struct intel_context *ce,
>> +                gfp_t gfp)
>> +{
>> +    struct intel_engine_capture_vma *vma = NULL;
>> +
>> +    ee->simulated |= record_context(&ee->context, ce);
>> +    if (ee->simulated)
>> +        return NULL;
>> +
>> +    /*
>> +     * We need to copy these to an anonymous buffer
>> +     * as the simplest method to avoid being overwritten
>> +     * by userspace.
>> +     */
>> +    vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
>> +    vma = capture_vma(vma, ce->state, "HW context", gfp);
>> +
>> +    return vma;
>> +}
>> +
>>   struct intel_engine_capture_vma *
>>   intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
>>                     struct i915_request *rq,
>>                     gfp_t gfp)
>>   {
>> -    struct intel_engine_capture_vma *vma = NULL;
>> +    struct intel_engine_capture_vma *vma;
>>   -    ee->simulated |= record_context(&ee->context, rq);
>> -    if (ee->simulated)
>> +    vma = engine_coredump_add_context(ee, rq->context, gfp);
>> +    if (!vma)
>>           return NULL;
>>         /*
>> @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct 
>> intel_engine_coredump *ee,
>>        */
>>       vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch");
>>       vma = capture_user(vma, rq, gfp);
>> -    vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
>> -    vma = capture_vma(vma, rq->context->state, "HW context", gfp);
>>         ee->rq_head = rq->head;
>>       ee->rq_post = rq->postfix;
>> @@ -1608,8 +1628,11 @@ capture_engine(struct intel_engine_cs *engine,
>>       if (ce) {
>>           intel_engine_clear_hung_context(engine);
>>           rq = intel_context_find_active_request(ce);
>> -        if (!rq || !i915_request_started(rq))
>> -            goto no_request_capture;
>> +        if (rq && !i915_request_started(rq)) {
>> +            drm_info(&engine->gt->i915->drm, "Got hung context on %s 
>> with no active request!\n",
>
> Suggest s/active/started/ since we have both i915_request_active and 
> i915_request_started, so to align the terminology.
The message text was based on the intent of the activity, not the naming 
of some internal helper function. Can change it if you really want, but 
"with no started request" just reads like bad English to me. Plus it 
gets removed in the next patch anyway...


>
>> +                 engine->name);
>> +            rq = NULL;
>> +        }
>>       } else {
>>           /*
>>            * Getting here with GuC enabled means it is a forced error 
>> capture
>> @@ -1622,22 +1645,24 @@ capture_engine(struct intel_engine_cs *engine,
>>                              flags);
>>           }
>>       }
>> -    if (rq)
>> +    if (rq) {
>>           rq = i915_request_get_rcu(rq);
>> +        capture = intel_engine_coredump_add_request(ee, rq, 
>> ATOMIC_MAYFAIL);
>> +    } else if (ce) {
>> +        capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
>> +    }
>>   -    if (!rq)
>> -        goto no_request_capture;
>> -
>> -    capture = intel_engine_coredump_add_request(ee, rq, 
>> ATOMIC_MAYFAIL);
>>       if (!capture) {
>> -        i915_request_put(rq);
>> +        if (rq)
>> +            i915_request_put(rq);
>>           goto no_request_capture;
>>       }
>>       if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>>           intel_guc_capture_get_matching_node(engine->gt, ee, ce);
>
> This step requires non-NULL ce, so if you move it under the "else if 
> (ce)" above then I *think* exit from the function can be consolidated 
> to just:
>
> if (capture) {
>     intel_engine_coredump_add_vma(ee, capture, compress);
>     if (rq)
>         i915_request_put(rq);
Is there any reason the rq ref needs to be held during the add_vma call? 
Can it now just be moved earlier to be:
     if (rq) {
         rq = i915_request_get_rcu(rq);
         capture = intel_engine_coredump_add_request(ee, rq, 
ATOMIC_MAYFAIL);
         i915_request_put(rq);
     }

The internals of the request object are only touched in the above 
_add_request() code. The later _add_vma() call fiddles around with vmas 
that were pulled from the request, but the capture_vma code inside 
_add_request() has already copied everything, hasn't it? Or rather, it 
has grabbed its own private vma resource locks. So there is no 
requirement to keep the request itself around still?

John.


> } else {
>     kfree(ee);
>     ee = NULL;
> }
>
> return ee;
>
> No "if (rq) i915_request_put()" twice, and goto label can be 
> completely removed.
>
> Regards,
>
> Tvrtko
>
>>         intel_engine_coredump_add_vma(ee, capture, compress);
>> -    i915_request_put(rq);
>> +    if (rq)
>> +        i915_request_put(rq);
>>         return ee;



* Re: [Intel-gfx] [PATCH 2/4] drm/i915: Allow error capture of a pending request
  2023-01-12 10:06   ` Tvrtko Ursulin
@ 2023-01-12 20:46     ` John Harrison
  2023-01-13  9:10       ` Tvrtko Ursulin
  0 siblings, 1 reply; 31+ messages in thread
From: John Harrison @ 2023-01-12 20:46 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-GFX; +Cc: DRI-Devel

On 1/12/2023 02:06, Tvrtko Ursulin wrote:
> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> A hang situation has been observed where the only requests on the
>> context were either completed or not yet started according to the
>> breadcrumbs. However, the register state claimed a batch was (maybe)
>> in progress. So, allow capture of the pending request on the grounds
>> that this might be better than nothing.
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_gpu_error.c | 8 +++-----
>>   1 file changed, 3 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>> index bd2cf7d235df0..2e338a9667a4b 100644
>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>> @@ -1628,11 +1628,9 @@ capture_engine(struct intel_engine_cs *engine,
>>       if (ce) {
>>           intel_engine_clear_hung_context(engine);
>>           rq = intel_context_find_active_request(ce);
>> -        if (rq && !i915_request_started(rq)) {
>> -            drm_info(&engine->gt->i915->drm, "Got hung context on %s 
>> with no active request!\n",
>> -                 engine->name);
>> -            rq = NULL;
>> -        }
>> +        if (rq && !i915_request_started(rq))
>> +            drm_info(&engine->gt->i915->drm, "Confused - active 
>> request not yet started: %lld:%lld, ce = 0x%04X/%s!\n",
>> +                 rq->fence.context, rq->fence.seqno, ce->guc_id.id, 
>> engine->name);
>
> Ah you change active to started in this patch! :)
Yeah, I'm wanting to keep these two patches separate. This one is a more 
questionable change in actual behaviour. The previous patch just allows 
capturing the context when the request has been rejected. Whereas this 
one changes the request acceptance criteria. With the potential to start 
blaming innocent requests. It seems plausible to me, especially with the 
warning message. We know the context owning the request is guilty so why 
wouldn't we blame that request just because the tracking is off (maybe 
due to some driver bug). But I could see someone objecting on grounds of 
being super strict about who/what gets blamed for a hang and either 
nacks or maybe wants this change reverted some time later.

>
> I suggest no "ce" in user visible messages and maybe stick with the 
> convention grep suggests is already established:
>
> "Hung context with active request %lld:%lld [0x%04X] not started!"
>
Are you also meaning to drop the engine name? I think it is important to 
keep the '%s' in there somewhere.

John.


> Regards,
>
> Tvrtko
>
>>       } else {
>>           /*
>>            * Getting here with GuC enabled means it is a forced error 
>> capture



* Re: [Intel-gfx] [PATCH 3/4] drm/i915/guc: Look for a guilty context when an engine reset fails
  2023-01-12 10:15   ` Tvrtko Ursulin
@ 2023-01-12 20:59     ` John Harrison
  2023-01-13  9:22       ` Tvrtko Ursulin
  0 siblings, 1 reply; 31+ messages in thread
From: John Harrison @ 2023-01-12 20:59 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-GFX; +Cc: DRI-Devel

On 1/12/2023 02:15, Tvrtko Ursulin wrote:
> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> Engine resets are supposed to never fail. But in the case when one
>> does (due to unknown reasons that normally come down to a missing
>> w/a), it is useful to get as much information out of the system as
>> possible. Given that the GuC effectively dies in such a situation, it
>> is not possible to get a guilty context notification back. So do a
>> manual search instead. Given that GuC is dead, this is safe because
>> GuC won't be changing the engine state asynchronously.
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> ---
>>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 +++++++++++++++--
>>   1 file changed, 15 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index b436dd7f12e42..99d09e3394597 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct 
>> work_struct *w)
>>       guc->submission_state.reset_fail_mask = 0;
>>       spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>   -    if (likely(reset_fail_mask))
>> +    if (likely(reset_fail_mask)) {
>> +        struct intel_engine_cs *engine;
>> +        enum intel_engine_id id;
>> +
>> +        /*
>> +         * GuC is toast at this point - it dead loops after sending 
>> the failed
>> +         * reset notification. So need to manually determine the 
>> guilty context.
>> +         * Note that it should be safe/reliable to do this here 
>> because the GuC
>> +         * is toast and will not be scheduling behind the KMD's back.
>> +         */
>> +        for_each_engine_masked(engine, gt, reset_fail_mask, id)
>> +            intel_guc_find_hung_context(engine);
>> +
>>           intel_gt_handle_error(gt, reset_fail_mask,
>>                         I915_ERROR_CAPTURE,
>> -                      "GuC failed to reset engine mask=0x%x\n",
>> +                      "GuC failed to reset engine mask=0x%x",
>>                         reset_fail_mask);
>> +    }
>>   }
>>     int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>
> This one I don't feel "at home" enough to r-b. Just a question - can 
> we be sure at this point that GuC is 100% stuck and there isn't a 
> chance it somehow comes alive and starts running in parallel (being 
> driven in parallel by a different "thread" in i915), interfering with 
> the assumption made in the comment?
The GuC API definition for the engine reset failure notification is that 
GuC will dead loop itself after sending - to quote "This is a 
catastrophic failure that requires a full GT reset, or FLR to recover.". 
So yes, GuC is 100% stuck and is not going to self recover. Guaranteed. 
If that changes in the future then that would be a backwards breaking 
API change and would require a corresponding driver update to go with 
supporting the new GuC firmware version.

There is the potential for a GT reset to maybe occur in parallel and 
resurrect the GuC that way. Not sure how that could happen though. The 
heartbeat timeout is significantly longer than the GuC's pre-emption 
timeout + engine reset timeout. That just leaves manual resets from the 
user or maybe from a selftest. If the user is manually poking reset 
debugfs files then it is already known that all bets are off in terms of 
getting an accurate error capture. And if a selftest is triggering GT 
resets in parallel with engine resets then either it is a broken test or 
it is attempting to test an evil corner case in which it is expected 
that error capture results will be unreliable. Having said all that, 
given that the submission_state lock is held here, such a GT reset would 
not get very far in bringing the GuC back up anyway. Certainly, it would 
not be able to get as far as submitting new work and thus potentially 
changing the engine state.

So yes, if multiple impossible events occur back to back then the error 
capture may be wonky. Where wonky means a potentially innocent 
context/request gets blamed for breaking the hardware. Oh dear. I can 
live with that.

John.


>
> Regards,
>
> Tvrtko



* Re: [Intel-gfx] [PATCH 2/4] drm/i915: Allow error capture of a pending request
  2023-01-12 20:46     ` John Harrison
@ 2023-01-13  9:10       ` Tvrtko Ursulin
  0 siblings, 0 replies; 31+ messages in thread
From: Tvrtko Ursulin @ 2023-01-13  9:10 UTC (permalink / raw)
  To: John Harrison, Intel-GFX; +Cc: DRI-Devel


On 12/01/2023 20:46, John Harrison wrote:
> On 1/12/2023 02:06, Tvrtko Ursulin wrote:
>> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>
>>> A hang situation has been observed where the only requests on the
>>> context were either completed or not yet started according to the
>>> breadcrumbs. However, the register state claimed a batch was (maybe)
>>> in progress. So, allow capture of the pending request on the grounds
>>> that this might be better than nothing.
>>>
>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/i915_gpu_error.c | 8 +++-----
>>>   1 file changed, 3 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
>>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>>> index bd2cf7d235df0..2e338a9667a4b 100644
>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>>> @@ -1628,11 +1628,9 @@ capture_engine(struct intel_engine_cs *engine,
>>>       if (ce) {
>>>           intel_engine_clear_hung_context(engine);
>>>           rq = intel_context_find_active_request(ce);
>>> -        if (rq && !i915_request_started(rq)) {
>>> -            drm_info(&engine->gt->i915->drm, "Got hung context on %s 
>>> with no active request!\n",
>>> -                 engine->name);
>>> -            rq = NULL;
>>> -        }
>>> +        if (rq && !i915_request_started(rq))
>>> +            drm_info(&engine->gt->i915->drm, "Confused - active 
>>> request not yet started: %lld:%lld, ce = 0x%04X/%s!\n",
>>> +                 rq->fence.context, rq->fence.seqno, ce->guc_id.id, 
>>> engine->name);
>>
>> Ah you change active to started in this patch! :)
> Yeah, I'm wanting to keep these two patches separate. This one is a more 
> questionable change in actual behaviour. The previous patch just allows 
> capturing the context when the request has been rejected. Whereas this 
> one changes the request acceptance criteria. With the potential to start 
> blaming innocent requests. It seems plausible to me, especially with the 
> warning message. We know the context owning the request is guilty so why 
> wouldn't we blame that request just because the tracking is off (maybe 
> due to some driver bug). But I could see someone objecting on grounds of 
> being super strict about who/what gets blamed for a hang and either 
> nacks or maybe wants this change reverted some time later.
> 
>>
>> I suggest no "ce" in user visible messages and maybe stick with the 
>> convention grep suggests is already established:
>>
>> "Hung context with active request %lld:%lld [0x%04X] not started!"
>>
> Are you also meaning to drop the engine name? I think it is important to 
> keep the '%s' in there somewhere.

No sorry, just an oversight.

"Hung context on %s with active request %lld:%lld [0x%04X] not started!"

Doesn't have to be exactly that, only trying to illustrate what style 
looks better to me when user facing - not mentioning "confused", and using 
fewer special characters.

Regards,

Tvrtko

> 
> John.
> 
> 
>> Regards,
>>
>> Tvrtko
>>
>>>       } else {
>>>           /*
>>>            * Getting here with GuC enabled means it is a forced error 
>>> capture
> 


* Re: [Intel-gfx] [PATCH 3/4] drm/i915/guc: Look for a guilty context when an engine reset fails
  2023-01-12 20:59     ` John Harrison
@ 2023-01-13  9:22       ` Tvrtko Ursulin
  2023-01-14  1:27         ` John Harrison
  0 siblings, 1 reply; 31+ messages in thread
From: Tvrtko Ursulin @ 2023-01-13  9:22 UTC (permalink / raw)
  To: John Harrison, Intel-GFX; +Cc: DRI-Devel


On 12/01/2023 20:59, John Harrison wrote:
> On 1/12/2023 02:15, Tvrtko Ursulin wrote:
>> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>
>>> Engine resets are supposed to never fail. But in the case when one
>>> does (due to unknown reasons that normally come down to a missing
>>> w/a), it is useful to get as much information out of the system as
>>> possible. Given that the GuC effectively dies in such a situation, it
>>> is not possible to get a guilty context notification back. So do a
>>> manual search instead. Given that GuC is dead, this is safe because
>>> GuC won't be changing the engine state asynchronously.
>>>
>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>> ---
>>>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 +++++++++++++++--
>>>   1 file changed, 15 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index b436dd7f12e42..99d09e3394597 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct 
>>> work_struct *w)
>>>       guc->submission_state.reset_fail_mask = 0;
>>>       spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>   -    if (likely(reset_fail_mask))
>>> +    if (likely(reset_fail_mask)) {
>>> +        struct intel_engine_cs *engine;
>>> +        enum intel_engine_id id;
>>> +
>>> +        /*
>>> +         * GuC is toast at this point - it dead loops after sending 
>>> the failed
>>> +         * reset notification. So need to manually determine the 
>>> guilty context.
>>> +         * Note that it should be safe/reliable to do this here 
>>> because the GuC
>>> +         * is toast and will not be scheduling behind the KMD's back.
>>> +         */
>>> +        for_each_engine_masked(engine, gt, reset_fail_mask, id)
>>> +            intel_guc_find_hung_context(engine);
>>> +
>>>           intel_gt_handle_error(gt, reset_fail_mask,
>>>                         I915_ERROR_CAPTURE,
>>> -                      "GuC failed to reset engine mask=0x%x\n",
>>> +                      "GuC failed to reset engine mask=0x%x",
>>>                         reset_fail_mask);
>>> +    }
>>>   }
>>>     int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>>
>> This one I don't feel "at home" enough to r-b. Just a question - can 
>> we be sure at this point that GuC is 100% stuck and there isn't a 
>> chance it somehow comes alive and starts running in parallel (being 
>> driven in parallel by a different "thread" in i915), interfering with 
>> the assumption made in the comment?
> The GuC API definition for the engine reset failure notification is that 
> GuC will dead loop itself after sending - to quote "This is a 
> catastrophic failure that requires a full GT reset, or FLR to recover.". 
> So yes, GuC is 100% stuck and is not going to self recover. Guaranteed. 
> If that changes in the future then that would be a backwards breaking 
> API change and would require a corresponding driver update to go with 
> supporting the new GuC firmware version.
> 
> There is the potential for a GT reset to maybe occur in parallel and 
> resurrect the GuC that way. Not sure how that could happen though. The 
> heartbeat timeout is significantly longer than the GuC's pre-emption 
> timeout + engine reset timeout. That just leaves manual resets from the 
> user or maybe from a selftest. If the user is manually poking reset 
> debugfs files then it is already known that all bets are off in terms of 
> getting an accurate error capture. And if a selftest is triggering GT 
> resets in parallel with engine resets then either it is a broken test or 
> it is attempting to test an evil corner case in which it is expected 
> that error capture results will be unreliable. Having said all that, 
> given that the submission_state lock is held here, such a GT reset would 
> not get very far in bringing the GuC back up anyway. Certainly, it would 
> not be able to get as far as submitting new work and thus potentially 
> changing the engine state.
> 
> So yes, if multiple impossible events occur back to back then the error 
> capture may be wonky. Where wonky means a potentially innocent 
> context/request gets blamed for breaking the hardware. Oh dear. I can 
> live with that.

Okay, so I was triggered by the "safe/reliable" qualification from the 
comment. I agree "reliable" does not have to be guaranteed and was mostly 
worried about the "safe" part.

 From what you explain, I wonder whether a short heartbeat, or a manual 
reset invocation, could actually mess up any of the data structures which 
the added intel_guc_find_hung_context walks, and so crash the kernel.

Looking inside, there is some lock dropping going on (and undocumented 
irqsave games), and walking the list while unlocked. So whether or not 
that can go bang if a full reset happens in parallel and re-activates 
the normal driver flows.

Regards,

Tvrtko


* Re: [Intel-gfx] [PATCH 1/4] drm/i915: Allow error capture without a request
  2023-01-12 20:40     ` John Harrison
@ 2023-01-13  9:51       ` Tvrtko Ursulin
  2023-01-13 17:46         ` Hellstrom, Thomas
  0 siblings, 1 reply; 31+ messages in thread
From: Tvrtko Ursulin @ 2023-01-13  9:51 UTC (permalink / raw)
  To: John Harrison, Intel-GFX; +Cc: Thomas Hellstrom, Matthew Auld, DRI-Devel


On 12/01/2023 20:40, John Harrison wrote:
> On 1/12/2023 02:01, Tvrtko Ursulin wrote:
>> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>
>>> There was a report of error captures occurring without any hung
>>> context being indicated despite the capture being initiated by a 'hung
>>> context notification' from GuC. The problem was not reproducible.
>>> However, it is possible to happen if the context in question has no
>>> active requests. For example, if the hang was in the context switch
>>> itself then the breadcrumb write would have occurred and the KMD would
>>> see an idle context.
>>>
>>> In the interests of attempting to provide as much information as
>>> possible about a hang, it seems wise to include the engine info
>>> regardless of whether a request was found or not. As opposed to just
>>> prentending there was no hang at all.
>>>
>>> So update the error capture code to always record engine information
>>> if an engine is given. Which means updating record_context() to take a
>>> context instead of a request (which it only ever used to find the
>>> context anyway). And split the request agnostic parts of
>>> intel_engine_coredump_add_request() out into a seaprate function.
>>>
>>> v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null
>>> pointer.
>>>
>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/i915_gpu_error.c | 61 +++++++++++++++++++--------
>>>   1 file changed, 43 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
>>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>>> index 9d5d5a397b64e..bd2cf7d235df0 100644
>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>>> @@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct 
>>> intel_engine_coredump *ee)
>>>   }
>>>     static bool record_context(struct i915_gem_context_coredump *e,
>>> -               const struct i915_request *rq)
>>> +               struct intel_context *ce)
>>>   {
>>>       struct i915_gem_context *ctx;
>>>       struct task_struct *task;
>>>       bool simulated;
>>>         rcu_read_lock();
>>> -    ctx = rcu_dereference(rq->context->gem_context);
>>> +    ctx = rcu_dereference(ce->gem_context);
>>>       if (ctx && !kref_get_unless_zero(&ctx->ref))
>>>           ctx = NULL;
>>>       rcu_read_unlock();
>>> @@ -1396,8 +1396,8 @@ static bool record_context(struct 
>>> i915_gem_context_coredump *e,
>>>       e->guilty = atomic_read(&ctx->guilty_count);
>>>       e->active = atomic_read(&ctx->active_count);
>>>   -    e->total_runtime = 
>>> intel_context_get_total_runtime_ns(rq->context);
>>> -    e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context);
>>> +    e->total_runtime = intel_context_get_total_runtime_ns(ce);
>>> +    e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
>>>         simulated = i915_gem_context_no_error_capture(ctx);
>>>   @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct 
>>> intel_engine_cs *engine, gfp_t gfp, u32 dump_
>>>       return ee;
>>>   }
>>>   +static struct intel_engine_capture_vma *
>>> +engine_coredump_add_context(struct intel_engine_coredump *ee,
>>> +                struct intel_context *ce,
>>> +                gfp_t gfp)
>>> +{
>>> +    struct intel_engine_capture_vma *vma = NULL;
>>> +
>>> +    ee->simulated |= record_context(&ee->context, ce);
>>> +    if (ee->simulated)
>>> +        return NULL;
>>> +
>>> +    /*
>>> +     * We need to copy these to an anonymous buffer
>>> +     * as the simplest method to avoid being overwritten
>>> +     * by userspace.
>>> +     */
>>> +    vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
>>> +    vma = capture_vma(vma, ce->state, "HW context", gfp);
>>> +
>>> +    return vma;
>>> +}
>>> +
>>>   struct intel_engine_capture_vma *
>>>   intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
>>>                     struct i915_request *rq,
>>>                     gfp_t gfp)
>>>   {
>>> -    struct intel_engine_capture_vma *vma = NULL;
>>> +    struct intel_engine_capture_vma *vma;
>>>   -    ee->simulated |= record_context(&ee->context, rq);
>>> -    if (ee->simulated)
>>> +    vma = engine_coredump_add_context(ee, rq->context, gfp);
>>> +    if (!vma)
>>>           return NULL;
>>>         /*
>>> @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct 
>>> intel_engine_coredump *ee,
>>>        */
>>>       vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch");
>>>       vma = capture_user(vma, rq, gfp);
>>> -    vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
>>> -    vma = capture_vma(vma, rq->context->state, "HW context", gfp);
>>>         ee->rq_head = rq->head;
>>>       ee->rq_post = rq->postfix;
>>> @@ -1608,8 +1628,11 @@ capture_engine(struct intel_engine_cs *engine,
>>>       if (ce) {
>>>           intel_engine_clear_hung_context(engine);
>>>           rq = intel_context_find_active_request(ce);
>>> -        if (!rq || !i915_request_started(rq))
>>> -            goto no_request_capture;
>>> +        if (rq && !i915_request_started(rq)) {
>>> +            drm_info(&engine->gt->i915->drm, "Got hung context on %s 
>>> with no active request!\n",
>>
>> Suggest s/active/started/ since we have both i915_request_active and 
>> i915_request_started, so to align the terminology.
> The message text was based on the intent of the activity not the naming 
> of some internal helper function. Can change it if you really want but 
> "with no started request" just reads like bad English to me. Plus it 
> gets removed in the next patch anyway...
> 
> 
>>
>>> +                 engine->name);
>>> +            rq = NULL;
>>> +        }
>>>       } else {
>>>           /*
>>>            * Getting here with GuC enabled means it is a forced error 
>>> capture
>>> @@ -1622,22 +1645,24 @@ capture_engine(struct intel_engine_cs *engine,
>>>                              flags);
>>>           }
>>>       }
>>> -    if (rq)
>>> +    if (rq) {
>>>           rq = i915_request_get_rcu(rq);
>>> +        capture = intel_engine_coredump_add_request(ee, rq, 
>>> ATOMIC_MAYFAIL);
>>> +    } else if (ce) {
>>> +        capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
>>> +    }
>>>   -    if (!rq)
>>> -        goto no_request_capture;
>>> -
>>> -    capture = intel_engine_coredump_add_request(ee, rq, 
>>> ATOMIC_MAYFAIL);
>>>       if (!capture) {
>>> -        i915_request_put(rq);
>>> +        if (rq)
>>> +            i915_request_put(rq);
>>>           goto no_request_capture;
>>>       }
>>>       if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>>>           intel_guc_capture_get_matching_node(engine->gt, ee, ce);
>>
>> This step requires non-NULL ce, so if you move it under the "else if 
>> (ce)" above then I *think* exit from the function can be consolidated 
>> to just:
>>
>> if (capture) {
>>     intel_engine_coredump_add_vma(ee, capture, compress);
>>     if (rq)
>>         i915_request_put(rq);
> Is there any reason the rq ref needs to be held during the add_vma call? 
> Can it now just be moved earlier to be:
>      if (rq) {
>          rq = i915_request_get_rcu(rq);
>          capture = intel_engine_coredump_add_request(ee, rq, 
> ATOMIC_MAYFAIL);
>          i915_request_put(rq);
>      }
> 
> The internals of the request object are only touched in the above 
> _add_request() code. The later _add_vma() call fiddles around with vmas 
> that pulled from the request but the capture_vma code inside 
> _add_request() has already copied everything, hasn't it? Or rather, it 
> has grabbed its own private vma resource locks. So there is no 
> requirement to keep the request itself around still?

Don't know.. it is a question if changes from 60dc43d1190d ("drm/i915: 
Use struct vma_resource instead of struct vma_snapshot") removed the 
need for holding the rq reference that "long" I guess? Adding Thomas and 
Matt to perhaps comment.

Regards,

Tvrtko


> John.
> 
> 
>> } else {
>>     kfree(ee);
>>     ee = NULL;
>> }
>>
>> return ee;
>>
>> No "if (rq) i915_request_put()" twice, and goto label can be 
>> completely removed.
>>
>> Regards,
>>
>> Tvrtko
>>
>>>         intel_engine_coredump_add_vma(ee, capture, compress);
>>> -    i915_request_put(rq);
>>> +    if (rq)
>>> +        i915_request_put(rq);
>>>         return ee;
> 


* Re: [Intel-gfx] [PATCH 1/4] drm/i915: Allow error capture without a request
  2023-01-13  9:51       ` Tvrtko Ursulin
@ 2023-01-13 17:46         ` Hellstrom, Thomas
  2023-01-13 21:29           ` John Harrison
  2023-01-16 12:13           ` Tvrtko Ursulin
  0 siblings, 2 replies; 31+ messages in thread
From: Hellstrom, Thomas @ 2023-01-13 17:46 UTC (permalink / raw)
  To: Harrison, John C, tvrtko.ursulin, Intel-GFX; +Cc: Auld, Matthew, DRI-Devel

On Fri, 2023-01-13 at 09:51 +0000, Tvrtko Ursulin wrote:
> 
> On 12/01/2023 20:40, John Harrison wrote:
> > On 1/12/2023 02:01, Tvrtko Ursulin wrote:
> > > On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
> > > > From: John Harrison <John.C.Harrison@Intel.com>
> > > > 
> > > > There was a report of error captures occurring without any hung
> > > > context being indicated despite the capture being initiated by
> > > > a 'hung
> > > > context notification' from GuC. The problem was not
> > > > reproducible.
> > > > However, it is possible to happen if the context in question
> > > > has no
> > > > active requests. For example, if the hang was in the context
> > > > switch
> > > > itself then the breadcrumb write would have occurred and the
> > > > KMD would
> > > > see an idle context.
> > > > 
> > > > In the interests of attempting to provide as much information
> > > > as
> > > > possible about a hang, it seems wise to include the engine info
> > > > regardless of whether a request was found or not. As opposed to
> > > > just
> > > > pretending there was no hang at all.
> > > > 
> > > > So update the error capture code to always record engine
> > > > information
> > > > if an engine is given. Which means updating record_context() to
> > > > take a
> > > > context instead of a request (which it only ever used to find
> > > > the
> > > > context anyway). And split the request agnostic parts of
> > > > intel_engine_coredump_add_request() out into a separate
> > > > function.
> > > > 
> > > > v2: Remove a duplicate 'if' statement (Umesh) and fix a put of
> > > > a null
> > > > pointer.
> > > > 
> > > > Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> > > > Reviewed-by: Umesh Nerlige Ramappa
> > > > <umesh.nerlige.ramappa@intel.com>
> > > > ---
> > > >   drivers/gpu/drm/i915/i915_gpu_error.c | 61
> > > > +++++++++++++++++++--------
> > > >   1 file changed, 43 insertions(+), 18 deletions(-)
> > > > 
> > > > diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
> > > > b/drivers/gpu/drm/i915/i915_gpu_error.c
> > > > index 9d5d5a397b64e..bd2cf7d235df0 100644
> > > > --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> > > > +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> > > > @@ -1370,14 +1370,14 @@ static void
> > > > engine_record_execlists(struct 
> > > > intel_engine_coredump *ee)
> > > >   }
> > > >     static bool record_context(struct i915_gem_context_coredump
> > > > *e,
> > > > -               const struct i915_request *rq)
> > > > +               struct intel_context *ce)
> > > >   {
> > > >       struct i915_gem_context *ctx;
> > > >       struct task_struct *task;
> > > >       bool simulated;
> > > >         rcu_read_lock();
> > > > -    ctx = rcu_dereference(rq->context->gem_context);
> > > > +    ctx = rcu_dereference(ce->gem_context);
> > > >       if (ctx && !kref_get_unless_zero(&ctx->ref))
> > > >           ctx = NULL;
> > > >       rcu_read_unlock();
> > > > @@ -1396,8 +1396,8 @@ static bool record_context(struct 
> > > > i915_gem_context_coredump *e,
> > > >       e->guilty = atomic_read(&ctx->guilty_count);
> > > >       e->active = atomic_read(&ctx->active_count);
> > > >   -    e->total_runtime = 
> > > > intel_context_get_total_runtime_ns(rq->context);
> > > > -    e->avg_runtime = intel_context_get_avg_runtime_ns(rq-
> > > > >context);
> > > > +    e->total_runtime = intel_context_get_total_runtime_ns(ce);
> > > > +    e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
> > > >         simulated = i915_gem_context_no_error_capture(ctx);
> > > >   @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct 
> > > > intel_engine_cs *engine, gfp_t gfp, u32 dump_
> > > >       return ee;
> > > >   }
> > > >   +static struct intel_engine_capture_vma *
> > > > +engine_coredump_add_context(struct intel_engine_coredump *ee,
> > > > +                struct intel_context *ce,
> > > > +                gfp_t gfp)
> > > > +{
> > > > +    struct intel_engine_capture_vma *vma = NULL;
> > > > +
> > > > +    ee->simulated |= record_context(&ee->context, ce);
> > > > +    if (ee->simulated)
> > > > +        return NULL;
> > > > +
> > > > +    /*
> > > > +     * We need to copy these to an anonymous buffer
> > > > +     * as the simplest method to avoid being overwritten
> > > > +     * by userspace.
> > > > +     */
> > > > +    vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
> > > > +    vma = capture_vma(vma, ce->state, "HW context", gfp);
> > > > +
> > > > +    return vma;
> > > > +}
> > > > +
> > > >   struct intel_engine_capture_vma *
> > > >   intel_engine_coredump_add_request(struct
> > > > intel_engine_coredump *ee,
> > > >                     struct i915_request *rq,
> > > >                     gfp_t gfp)
> > > >   {
> > > > -    struct intel_engine_capture_vma *vma = NULL;
> > > > +    struct intel_engine_capture_vma *vma;
> > > >   -    ee->simulated |= record_context(&ee->context, rq);
> > > > -    if (ee->simulated)
> > > > +    vma = engine_coredump_add_context(ee, rq->context, gfp);
> > > > +    if (!vma)
> > > >           return NULL;
> > > >         /*
> > > > @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct 
> > > > intel_engine_coredump *ee,
> > > >        */
> > > >       vma = capture_vma_snapshot(vma, rq->batch_res, gfp,
> > > > "batch");
> > > >       vma = capture_user(vma, rq, gfp);
> > > > -    vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
> > > > -    vma = capture_vma(vma, rq->context->state, "HW context",
> > > > gfp);
> > > >         ee->rq_head = rq->head;
> > > >       ee->rq_post = rq->postfix;
> > > > @@ -1608,8 +1628,11 @@ capture_engine(struct intel_engine_cs
> > > > *engine,
> > > >       if (ce) {
> > > >           intel_engine_clear_hung_context(engine);
> > > >           rq = intel_context_find_active_request(ce);
> > > > -        if (!rq || !i915_request_started(rq))
> > > > -            goto no_request_capture;
> > > > +        if (rq && !i915_request_started(rq)) {
> > > > +            drm_info(&engine->gt->i915->drm, "Got hung context
> > > > on %s 
> > > > with no active request!\n",
> > > 
> > > Suggest s/active/started/ since we have both i915_request_active
> > > and 
> > > i915_request_started, so to align the terminology.
> > The message text was based on the intent of the activity not the
> > naming 
> > of some internal helper function. Can change it if you really want
> > but 
> > "with no started request" just reads like bad English to me. Plus
> > it 
> > gets removed in the next patch anyway...
> > 
> > 
> > > 
> > > > +                 engine->name);
> > > > +            rq = NULL;
> > > > +        }
> > > >       } else {
> > > >           /*
> > > >            * Getting here with GuC enabled means it is a forced
> > > > error 
> > > > capture
> > > > @@ -1622,22 +1645,24 @@ capture_engine(struct intel_engine_cs
> > > > *engine,
> > > >                              flags);
> > > >           }
> > > >       }
> > > > -    if (rq)
> > > > +    if (rq) {
> > > >           rq = i915_request_get_rcu(rq);
> > > > +        capture = intel_engine_coredump_add_request(ee, rq, 
> > > > ATOMIC_MAYFAIL);
> > > > +    } else if (ce) {
> > > > +        capture = engine_coredump_add_context(ee, ce,
> > > > ATOMIC_MAYFAIL);
> > > > +    }
> > > >   -    if (!rq)
> > > > -        goto no_request_capture;
> > > > -
> > > > -    capture = intel_engine_coredump_add_request(ee, rq, 
> > > > ATOMIC_MAYFAIL);
> > > >       if (!capture) {
> > > > -        i915_request_put(rq);
> > > > +        if (rq)
> > > > +            i915_request_put(rq);
> > > >           goto no_request_capture;
> > > >       }
> > > >       if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
> > > >           intel_guc_capture_get_matching_node(engine->gt, ee,
> > > > ce);
> > > 
> > > This step requires non-NULL ce, so if you move it under the "else
> > > if 
> > > (ce)" above then I *think* exit from the function can be
> > > consolidated 
> > > to just:
> > > 
> > > if (capture) {
> > >     intel_engine_coredump_add_vma(ee, capture, compress);
> > >     if (rq)
> > >         i915_request_put(rq);
> > Is there any reason the rq ref needs to be held during the add_vma
> > call? 
> > Can it now just be moved earlier to be:
> >      if (rq) {
> >          rq = i915_request_get_rcu(rq);
> >          capture = intel_engine_coredump_add_request(ee, rq, 
> > ATOMIC_MAYFAIL);
> >          i915_request_put(rq);
> >      }
> > 
> > The internals of the request object are only touched in the above 
> > _add_request() code. The later _add_vma() call fiddles around with
> > vmas 
> > that were pulled from the request but the capture_vma code inside 
> > _add_request() has already copied everything, hasn't it? Or rather,
> > it 
> > has grabbed its own private vma resource locks. So there is no 
> > requirement to keep the request itself around still?

That sounds correct. It was some time ago since I worked with this code
but when I started IIRC KASAN told me the request along with the whole
capture list could disappear under us due to a parallel capture.

So the request reference added then might cover a bit too much now that
we also hold references on vma resources, which it looks like we do in
intel_engine_coredump_add_vma().

Another thing which is crappy with the current error capture code is
that the request capture list needs to be freed with the request and
not when the request signals (We can't block request signalling in the
capture code to keep the capture list around). There might be many
signaled requests hanging around in non-pruned dma_resv objects and
thus many unused capture lists with many unused vma resources. :/

/Thomas


> 
> Don't know.. it is a question if changes from 60dc43d1190d
> ("drm/i915: 
> Use struct vma_resource instead of struct vma_snapshot") removed the 
> need for holding the rq reference that "long" I guess? Adding Thomas
> and 
> Matt to perhaps comment.
> 
> Regards,
> 
> Tvrtko
> 
> 
> > John.
> > 
> > 
> > > } else {
> > >     kfree(ee);
> > >     ee = NULL;
> > > }
> > > 
> > > return ee;
> > > 
> > > No "if (rq) i915_request_put()" twice, and goto label can be 
> > > completely removed.
> > > 
> > > Regards,
> > > 
> > > Tvrtko
> > > 
> > > >         intel_engine_coredump_add_vma(ee, capture, compress);
> > > > -    i915_request_put(rq);
> > > > +    if (rq)
> > > > +        i915_request_put(rq);
> > > >         return ee;
> > 



* Re: [Intel-gfx] [PATCH 1/4] drm/i915: Allow error capture without a request
  2023-01-13 17:46         ` Hellstrom, Thomas
@ 2023-01-13 21:29           ` John Harrison
  2023-01-16 12:38             ` Tvrtko Ursulin
  2023-01-16 12:13           ` Tvrtko Ursulin
  1 sibling, 1 reply; 31+ messages in thread
From: John Harrison @ 2023-01-13 21:29 UTC (permalink / raw)
  To: Hellstrom, Thomas, tvrtko.ursulin, Intel-GFX; +Cc: Auld, Matthew, DRI-Devel

On 1/13/2023 09:46, Hellstrom, Thomas wrote:
> On Fri, 2023-01-13 at 09:51 +0000, Tvrtko Ursulin wrote:
>> On 12/01/2023 20:40, John Harrison wrote:
>>> On 1/12/2023 02:01, Tvrtko Ursulin wrote:
>>>> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>
>>>>> There was a report of error captures occurring without any hung
>>>>> context being indicated despite the capture being initiated by
>>>>> a 'hung
>>>>> context notification' from GuC. The problem was not
>>>>> reproducible.
>>>>> However, it is possible to happen if the context in question
>>>>> has no
>>>>> active requests. For example, if the hang was in the context
>>>>> switch
>>>>> itself then the breadcrumb write would have occurred and the
>>>>> KMD would
>>>>> see an idle context.
>>>>>
>>>>> In the interests of attempting to provide as much information
>>>>> as
>>>>> possible about a hang, it seems wise to include the engine info
>>>>> regardless of whether a request was found or not. As opposed to
>>>>> just
>>>>> pretending there was no hang at all.
>>>>>
>>>>> So update the error capture code to always record engine
>>>>> information
>>>>> if an engine is given. Which means updating record_context() to
>>>>> take a
>>>>> context instead of a request (which it only ever used to find
>>>>> the
>>>>> context anyway). And split the request agnostic parts of
>>>>> intel_engine_coredump_add_request() out into a separate
>>>>> function.
>>>>>
>>>>> v2: Remove a duplicate 'if' statement (Umesh) and fix a put of
>>>>> a null
>>>>> pointer.
>>>>>
>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>> Reviewed-by: Umesh Nerlige Ramappa
>>>>> <umesh.nerlige.ramappa@intel.com>
>>>>> ---
>>>>>    drivers/gpu/drm/i915/i915_gpu_error.c | 61
>>>>> +++++++++++++++++++--------
>>>>>    1 file changed, 43 insertions(+), 18 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>> index 9d5d5a397b64e..bd2cf7d235df0 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>> @@ -1370,14 +1370,14 @@ static void
>>>>> engine_record_execlists(struct
>>>>> intel_engine_coredump *ee)
>>>>>    }
>>>>>      static bool record_context(struct i915_gem_context_coredump
>>>>> *e,
>>>>> -               const struct i915_request *rq)
>>>>> +               struct intel_context *ce)
>>>>>    {
>>>>>        struct i915_gem_context *ctx;
>>>>>        struct task_struct *task;
>>>>>        bool simulated;
>>>>>          rcu_read_lock();
>>>>> -    ctx = rcu_dereference(rq->context->gem_context);
>>>>> +    ctx = rcu_dereference(ce->gem_context);
>>>>>        if (ctx && !kref_get_unless_zero(&ctx->ref))
>>>>>            ctx = NULL;
>>>>>        rcu_read_unlock();
>>>>> @@ -1396,8 +1396,8 @@ static bool record_context(struct
>>>>> i915_gem_context_coredump *e,
>>>>>        e->guilty = atomic_read(&ctx->guilty_count);
>>>>>        e->active = atomic_read(&ctx->active_count);
>>>>>    -    e->total_runtime =
>>>>> intel_context_get_total_runtime_ns(rq->context);
>>>>> -    e->avg_runtime = intel_context_get_avg_runtime_ns(rq-
>>>>>> context);
>>>>> +    e->total_runtime = intel_context_get_total_runtime_ns(ce);
>>>>> +    e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
>>>>>          simulated = i915_gem_context_no_error_capture(ctx);
>>>>>    @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct
>>>>> intel_engine_cs *engine, gfp_t gfp, u32 dump_
>>>>>        return ee;
>>>>>    }
>>>>>    +static struct intel_engine_capture_vma *
>>>>> +engine_coredump_add_context(struct intel_engine_coredump *ee,
>>>>> +                struct intel_context *ce,
>>>>> +                gfp_t gfp)
>>>>> +{
>>>>> +    struct intel_engine_capture_vma *vma = NULL;
>>>>> +
>>>>> +    ee->simulated |= record_context(&ee->context, ce);
>>>>> +    if (ee->simulated)
>>>>> +        return NULL;
>>>>> +
>>>>> +    /*
>>>>> +     * We need to copy these to an anonymous buffer
>>>>> +     * as the simplest method to avoid being overwritten
>>>>> +     * by userspace.
>>>>> +     */
>>>>> +    vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
>>>>> +    vma = capture_vma(vma, ce->state, "HW context", gfp);
>>>>> +
>>>>> +    return vma;
>>>>> +}
>>>>> +
>>>>>    struct intel_engine_capture_vma *
>>>>>    intel_engine_coredump_add_request(struct
>>>>> intel_engine_coredump *ee,
>>>>>                      struct i915_request *rq,
>>>>>                      gfp_t gfp)
>>>>>    {
>>>>> -    struct intel_engine_capture_vma *vma = NULL;
>>>>> +    struct intel_engine_capture_vma *vma;
>>>>>    -    ee->simulated |= record_context(&ee->context, rq);
>>>>> -    if (ee->simulated)
>>>>> +    vma = engine_coredump_add_context(ee, rq->context, gfp);
>>>>> +    if (!vma)
>>>>>            return NULL;
>>>>>          /*
>>>>> @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct
>>>>> intel_engine_coredump *ee,
>>>>>         */
>>>>>        vma = capture_vma_snapshot(vma, rq->batch_res, gfp,
>>>>> "batch");
>>>>>        vma = capture_user(vma, rq, gfp);
>>>>> -    vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
>>>>> -    vma = capture_vma(vma, rq->context->state, "HW context",
>>>>> gfp);
>>>>>          ee->rq_head = rq->head;
>>>>>        ee->rq_post = rq->postfix;
>>>>> @@ -1608,8 +1628,11 @@ capture_engine(struct intel_engine_cs
>>>>> *engine,
>>>>>        if (ce) {
>>>>>            intel_engine_clear_hung_context(engine);
>>>>>            rq = intel_context_find_active_request(ce);
>>>>> -        if (!rq || !i915_request_started(rq))
>>>>> -            goto no_request_capture;
>>>>> +        if (rq && !i915_request_started(rq)) {
>>>>> +            drm_info(&engine->gt->i915->drm, "Got hung context
>>>>> on %s
>>>>> with no active request!\n",
>>>> Suggest s/active/started/ since we have both i915_request_active
>>>> and
>>>> i915_request_started, so to align the terminology.
>>> The message text was based on the intent of the activity not the
>>> naming
>>> of some internal helper function. Can change it if you really want
>>> but
>>> "with no started request" just reads like bad English to me. Plus
>>> it
>>> gets removed in the next patch anyway...
>>>
>>>
>>>>> +                 engine->name);
>>>>> +            rq = NULL;
>>>>> +        }
>>>>>        } else {
>>>>>            /*
>>>>>             * Getting here with GuC enabled means it is a forced
>>>>> error
>>>>> capture
>>>>> @@ -1622,22 +1645,24 @@ capture_engine(struct intel_engine_cs
>>>>> *engine,
>>>>>                               flags);
>>>>>            }
>>>>>        }
>>>>> -    if (rq)
>>>>> +    if (rq) {
>>>>>            rq = i915_request_get_rcu(rq);
>>>>> +        capture = intel_engine_coredump_add_request(ee, rq,
>>>>> ATOMIC_MAYFAIL);
>>>>> +    } else if (ce) {
>>>>> +        capture = engine_coredump_add_context(ee, ce,
>>>>> ATOMIC_MAYFAIL);
>>>>> +    }
>>>>>    -    if (!rq)
>>>>> -        goto no_request_capture;
>>>>> -
>>>>> -    capture = intel_engine_coredump_add_request(ee, rq,
>>>>> ATOMIC_MAYFAIL);
>>>>>        if (!capture) {
>>>>> -        i915_request_put(rq);
>>>>> +        if (rq)
>>>>> +            i915_request_put(rq);
>>>>>            goto no_request_capture;
>>>>>        }
>>>>>        if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>>>>>            intel_guc_capture_get_matching_node(engine->gt, ee,
>>>>> ce);
>>>> This step requires non-NULL ce, so if you move it under the "else
>>>> if
>>>> (ce)" above then I *think* exit from the function can be
>>>> consolidated
>>>> to just:
>>>>
>>>> if (capture) {
>>>>      intel_engine_coredump_add_vma(ee, capture, compress);
>>>>      if (rq)
>>>>          i915_request_put(rq);
>>> Is there any reason the rq ref needs to be held during the add_vma
>>> call?
>>> Can it now just be moved earlier to be:
>>>       if (rq) {
>>>           rq = i915_request_get_rcu(rq);
>>>           capture = intel_engine_coredump_add_request(ee, rq,
>>> ATOMIC_MAYFAIL);
>>>           i915_request_put(rq);
>>>       }
>>>
>>> The internals of the request object are only touched in the above
>>> _add_request() code. The later _add_vma() call fiddles around with
>>> vmas
>>> that pulled from the request but the capture_vma code inside
>>> _add_request() has already copied everything, hasn't it? Or rather,
>>> it
>>> has grabbed its own private vma resource locks. So there is no
>>> requirement to keep the request itself around still?
> That sounds correct. It was some time ago since I worked with this code
> but when I started IIRC KASAN told me the request along with the whole
> capture list could disappear under us due to a parallel capture.
>
> So the request reference added then might cover a bit too much now that
> we also hold references on vma resources, which it looks like we do in
> intel_engine_coredump_add_vma().
So that means we end up with:
     rq = intel_context_find_active_request(ce);
     ...
     [test stuff like i915_request_started(rq)]
     ...
      if (rq) {
         rq = i915_request_get_rcu(rq);
         capture = intel_engine_coredump_add_request(ee, rq, 
ATOMIC_MAYFAIL);
         i915_request_put(rq);
     }

What is special about coredump_add_request() that it needs the request 
to be extra locked for that call and only that call? If the request can 
magically vanish after being found then what protects the _started() 
query? For that matter, what stops the request_get_rcu() itself being 
called on a pointer that is no longer valid? And if we do actually have 
sufficient locking in place to prevent that, why doesn't that cover the 
coredump_add_request() usage?

John.

>
> Another thing which is crappy with the current error capture code is
> that the request capture list needs to be freed with the request and
> not when the request signals (We can't block request signalling in the
> capture code to keep the capture list around). There might be many
> signaled requests hanging around in non-pruned dma_resv objects and
> thus many unused capture lists with many unused vma resources. :/
>
> /Thomas
>
>
>> Don't know.. it is a question if changes from 60dc43d1190d
>> ("drm/i915:
>> Use struct vma_resource instead of struct vma_snapshot") removed the
>> need for holding the rq reference that "long" I guess? Adding Thomas
>> and
>> Matt to perhaps comment.
>>
>> Regards,
>>
>> Tvrtko
>>
>>
>>> John.
>>>
>>>
>>>> } else {
>>>>      kfree(ee);
>>>>      ee = NULL;
>>>> }
>>>>
>>>> return ee;
>>>>
>>>> No "if (rq) i915_request_put()" twice, and goto label can be
>>>> completely removed.
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>>>          intel_engine_coredump_add_vma(ee, capture, compress);
>>>>> -    i915_request_put(rq);
>>>>> +    if (rq)
>>>>> +        i915_request_put(rq);
>>>>>          return ee;



* Re: [Intel-gfx] [PATCH 3/4] drm/i915/guc: Look for a guilty context when an engine reset fails
  2023-01-13  9:22       ` Tvrtko Ursulin
@ 2023-01-14  1:27         ` John Harrison
  2023-01-16 12:43           ` Tvrtko Ursulin
  0 siblings, 1 reply; 31+ messages in thread
From: John Harrison @ 2023-01-14  1:27 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-GFX; +Cc: DRI-Devel

On 1/13/2023 01:22, Tvrtko Ursulin wrote:
> On 12/01/2023 20:59, John Harrison wrote:
>> On 1/12/2023 02:15, Tvrtko Ursulin wrote:
>>> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>
>>>> Engine resets are supposed to never fail. But in the case when one
>>>> does (due to unknown reasons that normally come down to a missing
>>>> w/a), it is useful to get as much information out of the system as
>>>> possible. Given that the GuC effectively dies on such a situation, it
>>>> is not possible to get a guilty context notification back. So do a
>>>> manual search instead. Given that GuC is dead, this is safe because
>>>> GuC won't be changing the engine state asynchronously.
>>>>
>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>> ---
>>>>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 
>>>> +++++++++++++++--
>>>>   1 file changed, 15 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> index b436dd7f12e42..99d09e3394597 100644
>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> @@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct 
>>>> work_struct *w)
>>>>       guc->submission_state.reset_fail_mask = 0;
>>>> spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>>   -    if (likely(reset_fail_mask))
>>>> +    if (likely(reset_fail_mask)) {
>>>> +        struct intel_engine_cs *engine;
>>>> +        enum intel_engine_id id;
>>>> +
>>>> +        /*
>>>> +         * GuC is toast at this point - it dead loops after 
>>>> sending the failed
>>>> +         * reset notification. So need to manually determine the 
>>>> guilty context.
>>>> +         * Note that it should be safe/reliable to do this here 
>>>> because the GuC
>>>> +         * is toast and will not be scheduling behind the KMD's back.
>>>> +         */
>>>> +        for_each_engine_masked(engine, gt, reset_fail_mask, id)
>>>> +            intel_guc_find_hung_context(engine);
>>>> +
>>>>           intel_gt_handle_error(gt, reset_fail_mask,
>>>>                         I915_ERROR_CAPTURE,
>>>> -                      "GuC failed to reset engine mask=0x%x\n",
>>>> +                      "GuC failed to reset engine mask=0x%x",
>>>>                         reset_fail_mask);
>>>> +    }
>>>>   }
>>>>     int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>>>
>>> This one I don't feel "at home" enough to r-b. Just a question - can 
>>> we be sure at this point that GuC is 100% stuck and there isn't a 
>>> chance it somehow comes alive and starts running in parallel (being 
>>> driven in parallel by a different "thread" in i915), interfering 
>>> with the assumption made in the comment?
>> The GuC API definition for the engine reset failure notification is 
>> that GuC will dead loop itself after sending - to quote "This is a 
>> catastrophic failure that requires a full GT reset, or FLR to 
>> recover.". So yes, GuC is 100% stuck and is not going to self 
>> recover. Guaranteed. If that changes in the future then that would be 
>> a backwards breaking API change and would require a corresponding 
>> driver update to go with supporting the new GuC firmware version.
>>
>> There is the potential for a GT reset to maybe occur in parallel and 
>> resurrect the GuC that way. Not sure how that could happen though. 
>> The heartbeat timeout is significantly longer than the GuC's 
>> pre-emption timeout + engine reset timeout. That just leaves manual 
>> resets from the user or maybe from a selftest. If the user is 
>> manually poking reset debugfs files then it is already known that all 
>> bets are off in terms of getting an accurate error capture. And if a 
>> selftest is triggering GT resets in parallel with engine resets then 
>> either it is a broken test or it is attempting to test an evil corner 
>> case in which it is expected that error capture results will be 
>> unreliable. Having said all that, given that the submission_state 
>> lock is held here, such a GT reset would not get very far in bringing 
>> the GuC back up anyway. Certainly, it would not be able to get as far 
>> as submitting new work and thus potentially changing the engine state.
>>
>> So yes, if multiple impossible events occur back to back then the 
>> error capture may be wonky. Where wonky means a potentially innocent 
>> context/request gets blamed for breaking the hardware. Oh dear. I can 
>> live with that.
>
> Okay, so I was triggered by the "safe/reliable" qualification from the 
> comment. I agree "reliable" does not have to be and was mostly worried 
> about the "safe" part.
>
> From what you explain, the question is whether a short heartbeat, or manual 
> reset invocation, could actually mess up any of the data structures which the 
> added intel_guc_find_hung_context walks, and so crash the kernel.
>
> Looking inside, there is some lock dropping going on (and undocumented 
> irqsave games), and walking the list while unlocked. So the question is 
> whether or not that can go bang if a full reset happens in parallel and 
> re-activates the normal driver flows.
There is no walking of unlocked lists. The xa_lock is held whenever it 
looks at the xa structure itself. The release is only while analysing 
the context that was retrieved. And the context retrieval itself starts 
with a kref_get_unless_zero. So everything is only ever accessed while 
locked or reference counted. The unlock of the xa while analysing a 
context is because the xa object can be accessed from interrupt code and 
so we don't want to hold it locked unnecessarily while scanning through 
requests within a context (all code which has no connection to the GuC 
backend at all).

I can drop the word 'safe' if it makes you nervous. That was only meant 
to refer to the possibility of such a scan returning bogus results due 
to contexts switching in/out of the hardware before/during/after the 
scan. There is no way for it to go bang.

John.


>
> Regards,
>
> Tvrtko


^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [PATCH 1/4] drm/i915: Allow error capture without a request
  2023-01-13 17:46         ` Hellstrom, Thomas
  2023-01-13 21:29           ` John Harrison
@ 2023-01-16 12:13           ` Tvrtko Ursulin
  1 sibling, 0 replies; 31+ messages in thread
From: Tvrtko Ursulin @ 2023-01-16 12:13 UTC (permalink / raw)
  To: Hellstrom, Thomas, Harrison, John C, Intel-GFX; +Cc: Auld, Matthew, DRI-Devel


On 13/01/2023 17:46, Hellstrom, Thomas wrote:
> On Fri, 2023-01-13 at 09:51 +0000, Tvrtko Ursulin wrote:
>>
>> On 12/01/2023 20:40, John Harrison wrote:
>>> On 1/12/2023 02:01, Tvrtko Ursulin wrote:
>>>> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>
>>>>> There was a report of error captures occurring without any hung
>>>>> context being indicated despite the capture being initiated by
>>>>> a 'hung
>>>>> context notification' from GuC. The problem was not
>>>>> reproducible.
>>>>> However, it is possible to happen if the context in question
>>>>> has no
>>>>> active requests. For example, if the hang was in the context
>>>>> switch
>>>>> itself then the breadcrumb write would have occurred and the
>>>>> KMD would
>>>>> see an idle context.
>>>>>
>>>>> In the interests of attempting to provide as much information
>>>>> as
>>>>> possible about a hang, it seems wise to include the engine info
>>>>> regardless of whether a request was found or not. As opposed to
>>>>> just
>>>>> pretending there was no hang at all.
>>>>>
>>>>> So update the error capture code to always record engine
>>>>> information
>>>>> if an engine is given. Which means updating record_context() to
>>>>> take a
>>>>> context instead of a request (which it only ever used to find
>>>>> the
>>>>> context anyway). And split the request agnostic parts of
>>>>> intel_engine_coredump_add_request() out into a separate
>>>>> function.
>>>>>
>>>>> v2: Remove a duplicate 'if' statement (Umesh) and fix a put of
>>>>> a null
>>>>> pointer.
>>>>>
>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>> Reviewed-by: Umesh Nerlige Ramappa
>>>>> <umesh.nerlige.ramappa@intel.com>
>>>>> ---
>>>>>    drivers/gpu/drm/i915/i915_gpu_error.c | 61
>>>>> +++++++++++++++++++--------
>>>>>    1 file changed, 43 insertions(+), 18 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>> index 9d5d5a397b64e..bd2cf7d235df0 100644
>>>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>>>>> @@ -1370,14 +1370,14 @@ static void
>>>>> engine_record_execlists(struct
>>>>> intel_engine_coredump *ee)
>>>>>    }
>>>>>      static bool record_context(struct i915_gem_context_coredump
>>>>> *e,
>>>>> -               const struct i915_request *rq)
>>>>> +               struct intel_context *ce)
>>>>>    {
>>>>>        struct i915_gem_context *ctx;
>>>>>        struct task_struct *task;
>>>>>        bool simulated;
>>>>>          rcu_read_lock();
>>>>> -    ctx = rcu_dereference(rq->context->gem_context);
>>>>> +    ctx = rcu_dereference(ce->gem_context);
>>>>>        if (ctx && !kref_get_unless_zero(&ctx->ref))
>>>>>            ctx = NULL;
>>>>>        rcu_read_unlock();
>>>>> @@ -1396,8 +1396,8 @@ static bool record_context(struct
>>>>> i915_gem_context_coredump *e,
>>>>>        e->guilty = atomic_read(&ctx->guilty_count);
>>>>>        e->active = atomic_read(&ctx->active_count);
>>>>>    -    e->total_runtime =
>>>>> intel_context_get_total_runtime_ns(rq->context);
>>>>> -    e->avg_runtime = intel_context_get_avg_runtime_ns(rq-
>>>>>> context);
>>>>> +    e->total_runtime = intel_context_get_total_runtime_ns(ce);
>>>>> +    e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
>>>>>          simulated = i915_gem_context_no_error_capture(ctx);
>>>>>    @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct
>>>>> intel_engine_cs *engine, gfp_t gfp, u32 dump_
>>>>>        return ee;
>>>>>    }
>>>>>    +static struct intel_engine_capture_vma *
>>>>> +engine_coredump_add_context(struct intel_engine_coredump *ee,
>>>>> +                struct intel_context *ce,
>>>>> +                gfp_t gfp)
>>>>> +{
>>>>> +    struct intel_engine_capture_vma *vma = NULL;
>>>>> +
>>>>> +    ee->simulated |= record_context(&ee->context, ce);
>>>>> +    if (ee->simulated)
>>>>> +        return NULL;
>>>>> +
>>>>> +    /*
>>>>> +     * We need to copy these to an anonymous buffer
>>>>> +     * as the simplest method to avoid being overwritten
>>>>> +     * by userspace.
>>>>> +     */
>>>>> +    vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
>>>>> +    vma = capture_vma(vma, ce->state, "HW context", gfp);
>>>>> +
>>>>> +    return vma;
>>>>> +}
>>>>> +
>>>>>    struct intel_engine_capture_vma *
>>>>>    intel_engine_coredump_add_request(struct
>>>>> intel_engine_coredump *ee,
>>>>>                      struct i915_request *rq,
>>>>>                      gfp_t gfp)
>>>>>    {
>>>>> -    struct intel_engine_capture_vma *vma = NULL;
>>>>> +    struct intel_engine_capture_vma *vma;
>>>>>    -    ee->simulated |= record_context(&ee->context, rq);
>>>>> -    if (ee->simulated)
>>>>> +    vma = engine_coredump_add_context(ee, rq->context, gfp);
>>>>> +    if (!vma)
>>>>>            return NULL;
>>>>>          /*
>>>>> @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct
>>>>> intel_engine_coredump *ee,
>>>>>         */
>>>>>        vma = capture_vma_snapshot(vma, rq->batch_res, gfp,
>>>>> "batch");
>>>>>        vma = capture_user(vma, rq, gfp);
>>>>> -    vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
>>>>> -    vma = capture_vma(vma, rq->context->state, "HW context",
>>>>> gfp);
>>>>>          ee->rq_head = rq->head;
>>>>>        ee->rq_post = rq->postfix;
>>>>> @@ -1608,8 +1628,11 @@ capture_engine(struct intel_engine_cs
>>>>> *engine,
>>>>>        if (ce) {
>>>>>            intel_engine_clear_hung_context(engine);
>>>>>            rq = intel_context_find_active_request(ce);
>>>>> -        if (!rq || !i915_request_started(rq))
>>>>> -            goto no_request_capture;
>>>>> +        if (rq && !i915_request_started(rq)) {
>>>>> +            drm_info(&engine->gt->i915->drm, "Got hung context
>>>>> on %s
>>>>> with no active request!\n",
>>>>
>>>> Suggest s/active/started/ since we have both i915_request_active
>>>> and
>>>> i915_request_started, so to align the terminology.
>>> The message text was based on the intent of the activity not the
>>> naming
>>> of some internal helper function. Can change it if you really want
>>> but
>>> "with no started request" just reads like bad English to me. Plus
>>> it
>>> gets removed in the next patch anyway...
>>>
>>>
>>>>
>>>>> +                 engine->name);
>>>>> +            rq = NULL;
>>>>> +        }
>>>>>        } else {
>>>>>            /*
>>>>>             * Getting here with GuC enabled means it is a forced
>>>>> error
>>>>> capture
>>>>> @@ -1622,22 +1645,24 @@ capture_engine(struct intel_engine_cs
>>>>> *engine,
>>>>>                               flags);
>>>>>            }
>>>>>        }
>>>>> -    if (rq)
>>>>> +    if (rq) {
>>>>>            rq = i915_request_get_rcu(rq);
>>>>> +        capture = intel_engine_coredump_add_request(ee, rq,
>>>>> ATOMIC_MAYFAIL);
>>>>> +    } else if (ce) {
>>>>> +        capture = engine_coredump_add_context(ee, ce,
>>>>> ATOMIC_MAYFAIL);
>>>>> +    }
>>>>>    -    if (!rq)
>>>>> -        goto no_request_capture;
>>>>> -
>>>>> -    capture = intel_engine_coredump_add_request(ee, rq,
>>>>> ATOMIC_MAYFAIL);
>>>>>        if (!capture) {
>>>>> -        i915_request_put(rq);
>>>>> +        if (rq)
>>>>> +            i915_request_put(rq);
>>>>>            goto no_request_capture;
>>>>>        }
>>>>>        if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>>>>>            intel_guc_capture_get_matching_node(engine->gt, ee,
>>>>> ce);
>>>>
>>>> This step requires non-NULL ce, so if you move it under the "else
>>>> if
>>>> (ce)" above then I *think* exit from the function can be
>>>> consolidated
>>>> to just:
>>>>
>>>> if (capture) {
>>>>      intel_engine_coredump_add_vma(ee, capture, compress);
>>>>      if (rq)
>>>>          i915_request_put(rq);
>>> Is there any reason the rq ref needs to be held during the add_vma
>>> call?
>>> Can it now just be moved earlier to be:
>>>       if (rq) {
>>>           rq = i915_request_get_rcu(rq);
>>>           capture = intel_engine_coredump_add_request(ee, rq,
>>> ATOMIC_MAYFAIL);
>>>           i915_request_put(rq);
>>>       }
>>>
>>> The internals of the request object are only touched in the above
>>> _add_request() code. The later _add_vma() call fiddles around with
>>> vmas
>>> that pulled from the request but the capture_vma code inside
>>> _add_request() has already copied everything, hasn't it? Or rather,
>>> it
>>> has grabbed its own private vma resource locks. So there is no
>>> requirement to keep the request itself around still?
> 
> That sounds correct. It was some time ago since I worked with this code
> but when i started IIRC KASAN told me the request along with the whole
> capture list could disappear under us due to a parallel capture.
> 
> So the request reference added then might cover a bit too much now that
> we also hold references on vma resources, which it looks like we do in
> intel_engine_coredump_add_vma().

If you are not sure, maybe we should just leave the reference covering as much 
as it does? I don't think there is any harm in covering too much. Whether it is 
released immediately after the call to intel_engine_coredump_add_request(), or 
at the exit of capture_engine(), I mean - we can skip that clean up if in doubt.

> Another thing which is crappy with the current error capture code is
> that the request capture list needs to be freed with the request and
> not when the request signals (We can't block request signalling in the
> capture code to keep the capture list around). There might be many
> signaled requests hanging around in non-pruned dma_resv objects and
> thus many unused capture lists with many unused vma resources. :/

This last part sounds vaguely familiar - is it really a problem with the 
error capture code, or was it some other refactoring which removed the 
pruning of signaled fences from dma_resv?

Regards,

Tvrtko

> 
> /Thomas
> 
> 
>>
>> Don't know.. it is a question if changes from 60dc43d1190d
>> ("drm/i915:
>> Use struct vma_resource instead of struct vma_snapshot") removed the
>> need for holding the rq reference that "long" I guess? Adding Thomas
>> and
>> Matt to perhaps comment.
>>
>> Regards,
>>
>> Tvrtko
>>
>>
>>> John.
>>>
>>>
>>>> } else {
>>>>      kfree(ee);
>>>>      ee = NULL;
>>>> }
>>>>
>>>> return ee;
>>>>
>>>> No "if (rq) i915_request_put()" twice, and goto label can be
>>>> completely removed.
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>>
>>>>>          intel_engine_coredump_add_vma(ee, capture, compress);
>>>>> -    i915_request_put(rq);
>>>>> +    if (rq)
>>>>> +        i915_request_put(rq);
>>>>>          return ee;
>>>
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [PATCH 1/4] drm/i915: Allow error capture without a request
  2023-01-13 21:29           ` John Harrison
@ 2023-01-16 12:38             ` Tvrtko Ursulin
  2023-01-17 19:40               ` John Harrison
  0 siblings, 1 reply; 31+ messages in thread
From: Tvrtko Ursulin @ 2023-01-16 12:38 UTC (permalink / raw)
  To: John Harrison, Hellstrom, Thomas, Intel-GFX; +Cc: Auld, Matthew, DRI-Devel


On 13/01/2023 21:29, John Harrison wrote:
> On 1/13/2023 09:46, Hellstrom, Thomas wrote:
>> On Fri, 2023-01-13 at 09:51 +0000, Tvrtko Ursulin wrote:
>>> On 12/01/2023 20:40, John Harrison wrote:
>>>> On 1/12/2023 02:01, Tvrtko Ursulin wrote:
>>>>> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:

[snip]

>>>>>> +                 engine->name);
>>>>>> +            rq = NULL;
>>>>>> +        }
>>>>>>        } else {
>>>>>>            /*
>>>>>>             * Getting here with GuC enabled means it is a forced
>>>>>> error
>>>>>> capture
>>>>>> @@ -1622,22 +1645,24 @@ capture_engine(struct intel_engine_cs
>>>>>> *engine,
>>>>>>                               flags);
>>>>>>            }
>>>>>>        }
>>>>>> -    if (rq)
>>>>>> +    if (rq) {
>>>>>>            rq = i915_request_get_rcu(rq);
>>>>>> +        capture = intel_engine_coredump_add_request(ee, rq,
>>>>>> ATOMIC_MAYFAIL);
>>>>>> +    } else if (ce) {
>>>>>> +        capture = engine_coredump_add_context(ee, ce,
>>>>>> ATOMIC_MAYFAIL);
>>>>>> +    }
>>>>>>    -    if (!rq)
>>>>>> -        goto no_request_capture;
>>>>>> -
>>>>>> -    capture = intel_engine_coredump_add_request(ee, rq,
>>>>>> ATOMIC_MAYFAIL);
>>>>>>        if (!capture) {
>>>>>> -        i915_request_put(rq);
>>>>>> +        if (rq)
>>>>>> +            i915_request_put(rq);
>>>>>>            goto no_request_capture;
>>>>>>        }
>>>>>>        if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>>>>>>            intel_guc_capture_get_matching_node(engine->gt, ee,
>>>>>> ce);
>>>>> This step requires non-NULL ce, so if you move it under the "else
>>>>> if
>>>>> (ce)" above then I *think* exit from the function can be
>>>>> consolidated
>>>>> to just:
>>>>>
>>>>> if (capture) {
>>>>>      intel_engine_coredump_add_vma(ee, capture, compress);
>>>>>      if (rq)
>>>>>          i915_request_put(rq);
>>>> Is there any reason the rq ref needs to be held during the add_vma
>>>> call?
>>>> Can it now just be moved earlier to be:
>>>>       if (rq) {
>>>>           rq = i915_request_get_rcu(rq);
>>>>           capture = intel_engine_coredump_add_request(ee, rq,
>>>> ATOMIC_MAYFAIL);
>>>>           i915_request_put(rq);
>>>>       }
>>>>
>>>> The internals of the request object are only touched in the above
>>>> _add_request() code. The later _add_vma() call fiddles around with
>>>> vmas 
>>>> that are pulled from the request but the capture_vma code inside
>>>> _add_request() has already copied everything, hasn't it? Or rather,
>>>> it
>>>> has grabbed its own private vma resource locks. So there is no
>>>> requirement to keep the request itself around still?
>> That sounds correct. It was some time ago since I worked with this code
>> but when i started IIRC KASAN told me the request along with the whole
>> capture list could disappear under us due to a parallel capture.
>>
>> So the request reference added then might cover a bit too much now that
>> we also hold references on vma resources, which it looks like we do in
>> intel_engine_coredump_add_vma().
> So that means we end up with:
>      rq = intel_context_find_active_request(ce);
>      ...
>      [test stuff like i915_request_started(rq)]
>      ...
>       if (rq) {
>          rq = i915_request_get_rcu(rq);
>          capture = intel_engine_coredump_add_request(ee, rq, 
> ATOMIC_MAYFAIL);
>          i915_request_put(rq);
>      }
> 
> What is special about coredump_add_request() that it needs the request 
> to be extra locked for that call and only that call? If the request can 
> magically vanish after being found then what protects the _started() 
> query? For that matter, what stops the request_get_rcu() itself being 
> called on a pointer that is no longer valid? And if we do actually have 
> sufficient locking in place to prevent that, why doesn't that cover the 
> coredump_add_request() usage?

There is definitely a red flag there with the difference between the if 
and else blocks at the top of capture_engine(). And funnily enough, the 
first block appears to be GuC only. That is not obvious from the code 
and should probably have a comment, or function names made self-documenting.

I guess the special thing about intel_engine_coredump_add_request() is 
that it dereferences the rq. So it is possibly 573ba126aef3 
("drm/i915/guc: Capture error state on context reset") which added a bug 
where rq can be dereferenced without a reference held. Or perhaps with the 
GuC backend there is a guarantee request cannot be retired from 
elsewhere while error capture is examining it.

To unravel the error entry points into error capture, from execlists, 
debugfs, ringbuffer, I don't have the time to remind myself how all that 
works right now. Quite possibly at least some of those run async to the 
GPU so must be safe against parallel request retirement. So I don't know 
if the i915_request_get_rcu is safe in all those cases without spending 
some time to refresh my knowledge a bit.

Sounds like the best plan is not to change this too much - just leave the 
scope of the reference held as is and ideally eliminate the unnecessary 
goto labels. AFAIR that should be doable without changing anything real 
and unblock these improvements.
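(Roughly this shape, sketched below with stand-in userspace types - nothing in it is the real i915 API, every `toy_` name is invented; it just shows a single exit path with no goto labels and the request reference still held across the whole capture:)

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for the real structures and helpers; invented for the sketch. */
struct toy_request { int refs; };
struct toy_capture { int used; };

static struct toy_capture capture_storage;

static struct toy_request *toy_request_get(struct toy_request *rq)
{
	rq->refs++;
	return rq;
}

static void toy_request_put(struct toy_request *rq)
{
	rq->refs--;
}

static struct toy_capture *toy_add_request(struct toy_request *rq)
{
	(void)rq;
	return &capture_storage;
}

static struct toy_capture *toy_add_context(void)
{
	return &capture_storage;
}

/* Consolidated tail of capture_engine(): either branch produces 'capture',
 * there is one exit, no goto labels, and the request reference is held
 * until after the capture is processed, exactly as before. */
static bool toy_capture_engine_tail(struct toy_request *rq, bool have_ce)
{
	struct toy_capture *capture = NULL;

	if (rq) {
		rq = toy_request_get(rq);
		capture = toy_add_request(rq);
	} else if (have_ce) {
		capture = toy_add_context();
	}

	if (capture)
		capture->used = 1;	/* stands in for coredump_add_vma() */
	if (rq)
		toy_request_put(rq);

	return capture != NULL;
}
```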

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [PATCH 3/4] drm/i915/guc: Look for a guilty context when an engine reset fails
  2023-01-14  1:27         ` John Harrison
@ 2023-01-16 12:43           ` Tvrtko Ursulin
  2023-01-17 21:14             ` John Harrison
  0 siblings, 1 reply; 31+ messages in thread
From: Tvrtko Ursulin @ 2023-01-16 12:43 UTC (permalink / raw)
  To: John Harrison, Intel-GFX; +Cc: DRI-Devel


On 14/01/2023 01:27, John Harrison wrote:
> On 1/13/2023 01:22, Tvrtko Ursulin wrote:
>> On 12/01/2023 20:59, John Harrison wrote:
>>> On 1/12/2023 02:15, Tvrtko Ursulin wrote:
>>>> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>
>>>>> Engine resets are supposed to never fail. But in the case when one
>>>>> does (due to unknown reasons that normally come down to a missing
>>>>> w/a), it is useful to get as much information out of the system as
>>>>> possible. Given that the GuC effectively dies on such a situation, it
>>>>> is not possible to get a guilty context notification back. So do a
>>>>> manual search instead. Given that GuC is dead, this is safe because
>>>>> GuC won't be changing the engine state asynchronously.
>>>>>
>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>> ---
>>>>>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 
>>>>> +++++++++++++++--
>>>>>   1 file changed, 15 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> index b436dd7f12e42..99d09e3394597 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> @@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct 
>>>>> work_struct *w)
>>>>>       guc->submission_state.reset_fail_mask = 0;
>>>>> spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>>>   -    if (likely(reset_fail_mask))
>>>>> +    if (likely(reset_fail_mask)) {
>>>>> +        struct intel_engine_cs *engine;
>>>>> +        enum intel_engine_id id;
>>>>> +
>>>>> +        /*
>>>>> +         * GuC is toast at this point - it dead loops after 
>>>>> sending the failed
>>>>> +         * reset notification. So need to manually determine the 
>>>>> guilty context.
>>>>> +         * Note that it should be safe/reliable to do this here 
>>>>> because the GuC
>>>>> +         * is toast and will not be scheduling behind the KMD's back.
>>>>> +         */
>>>>> +        for_each_engine_masked(engine, gt, reset_fail_mask, id)
>>>>> +            intel_guc_find_hung_context(engine);
>>>>> +
>>>>>           intel_gt_handle_error(gt, reset_fail_mask,
>>>>>                         I915_ERROR_CAPTURE,
>>>>> -                      "GuC failed to reset engine mask=0x%x\n",
>>>>> +                      "GuC failed to reset engine mask=0x%x",
>>>>>                         reset_fail_mask);
>>>>> +    }
>>>>>   }
>>>>>     int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>>>>
>>>> This one I don't feel "at home" enough to r-b. Just a question - can 
>>>> we be sure at this point that GuC is 100% stuck and there isn't a 
>>>> chance it somehow comes alive and starts running in parallel (being 
>>>> driven in parallel by a different "thread" in i915), interfering 
>>>> with the assumption made in the comment?
>>> The GuC API definition for the engine reset failure notification is 
>>> that GuC will dead loop itself after sending - to quote "This is a 
>>> catastrophic failure that requires a full GT reset, or FLR to 
>>> recover.". So yes, GuC is 100% stuck and is not going to self 
>>> recover. Guaranteed. If that changes in the future then that would be 
>>> a backwards breaking API change and would require a corresponding 
>>> driver update to go with supporting the new GuC firmware version.
>>>
>>> There is the potential for a GT reset to maybe occur in parallel and 
>>> resurrect the GuC that way. Not sure how that could happen though. 
>>> The heartbeat timeout is significantly longer than the GuC's 
>>> pre-emption timeout + engine reset timeout. That just leaves manual 
>>> resets from the user or maybe from a selftest. If the user is 
>>> manually poking reset debugfs files then it is already known that all 
>>> bets are off in terms of getting an accurate error capture. And if a 
>>> selftest is triggering GT resets in parallel with engine resets then 
>>> either it is a broken test or it is attempting to test an evil corner 
>>> case in which it is expected that error capture results will be 
>>> unreliable. Having said all that, given that the submission_state 
>>> lock is held here, such a GT reset would not get very far in bringing 
>>> the GuC back up anyway. Certainly, it would not be able to get as far 
>>> as submitting new work and thus potentially changing the engine state.
>>>
>>> So yes, if multiple impossible events occur back to back then the 
>>> error capture may be wonky. Where wonky means a potentially innocent 
>>> context/request gets blamed for breaking the hardware. Oh dear. I can 
>>> live with that.
>>
>> Okay, so I was triggered by the "safe/reliable" qualification from the 
>> comment. I agree "reliable" does not have to be and was mostly worried 
>> about the "safe" part.
>>
>> From what you explain, the question is whether a short heartbeat, or manual 
>> reset invocation, could actually mess up any of the data structures which the 
>> added intel_guc_find_hung_context walks, and so crash the kernel.
>>
>> Looking inside, there is some lock dropping going on (and undocumented 
>> irqsave games), and walking the list while unlocked. So the question is 
>> whether or not that can go bang if a full reset happens in parallel and 
>> re-activates the normal driver flows.
> There is no walking of unlocked lists. The xa_lock is held whenever it 
> looks at the xa structure itself. The release is only while analysing 
> the context that was retrieved. And the context retrieval itself starts 
> with a kref_get_unless_zero. So everything is only ever accessed while 
> locked or reference counted. The unlock of the xa while analysing a 
> context is because the xa object can be accessed from interrupt code and 
> so we don't want to hold it locked unnecessarily while scanning through 
> requests within a context (all code which has no connection to the GuC 
> backend at all).

AFAICS intel_guc_find_hung_context walks &ce->guc_state.requests with no 
locks held. Other places in the code appear to use &ce->guc_state.lock, 
or maybe &sched_engine->lock, not sure. Then we have request submission, 
retirement and a few other places that modify that list. So *if* indeed a hung 
GuC can get resurrected by a parallel full reset while 
reset_fail_worker_func is running, why couldn't that list walk explode?
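(For comparison, this is the shape a locked walk would have, sketched with invented userspace names - `toy_guc_state_lock` plays the role &ce->guc_state.lock would play for &ce->guc_state.requests, and the point is that every walker and every modifier must take the same lock:)

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Toy request list; names invented for the sketch, not the real i915 ones. */
struct toy_req {
	int seqno;
	struct toy_req *next;
};

static pthread_mutex_t toy_guc_state_lock = PTHREAD_MUTEX_INITIALIZER;

/* Walk the list looking for a matching request. The list may be modified
 * concurrently (submission, retirement), so the walk itself holds the lock
 * for its whole duration. */
static struct toy_req *toy_find_request(struct toy_req *head, int want)
{
	struct toy_req *found = NULL;
	struct toy_req *r;

	pthread_mutex_lock(&toy_guc_state_lock);
	for (r = head; r; r = r->next) {
		if (r->seqno == want) {
			found = r;
			break;
		}
	}
	pthread_mutex_unlock(&toy_guc_state_lock);

	return found;
}
```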

Regards,

Tvrtko

> I can drop the word 'safe' if it makes you nervous. That was only meant 
> to refer to the possibility of such a scan returning bogus results due 
> to contexts switching in/out of the hardware before/during/after the 
> scan. There is no way for it to go bang.
> 
> John.
> 
> 
>>
>> Regards,
>>
>> Tvrtko
> 

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: [Intel-gfx] [PATCH 1/4] drm/i915: Allow error capture without a request
  2023-01-16 12:38             ` Tvrtko Ursulin
@ 2023-01-17 19:40               ` John Harrison
  0 siblings, 0 replies; 31+ messages in thread
From: John Harrison @ 2023-01-17 19:40 UTC (permalink / raw)
  To: Tvrtko Ursulin, Hellstrom, Thomas, Intel-GFX; +Cc: Auld, Matthew, DRI-Devel

On 1/16/2023 04:38, Tvrtko Ursulin wrote:
> On 13/01/2023 21:29, John Harrison wrote:
>> On 1/13/2023 09:46, Hellstrom, Thomas wrote:
>>> On Fri, 2023-01-13 at 09:51 +0000, Tvrtko Ursulin wrote:
>>>> On 12/01/2023 20:40, John Harrison wrote:
>>>>> On 1/12/2023 02:01, Tvrtko Ursulin wrote:
>>>>>> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>
> [snip]
>
>>>>>>> + engine->name);
>>>>>>> +            rq = NULL;
>>>>>>> +        }
>>>>>>>        } else {
>>>>>>>            /*
>>>>>>>             * Getting here with GuC enabled means it is a forced
>>>>>>> error
>>>>>>> capture
>>>>>>> @@ -1622,22 +1645,24 @@ capture_engine(struct intel_engine_cs
>>>>>>> *engine,
>>>>>>>                               flags);
>>>>>>>            }
>>>>>>>        }
>>>>>>> -    if (rq)
>>>>>>> +    if (rq) {
>>>>>>>            rq = i915_request_get_rcu(rq);
>>>>>>> +        capture = intel_engine_coredump_add_request(ee, rq,
>>>>>>> ATOMIC_MAYFAIL);
>>>>>>> +    } else if (ce) {
>>>>>>> +        capture = engine_coredump_add_context(ee, ce,
>>>>>>> ATOMIC_MAYFAIL);
>>>>>>> +    }
>>>>>>>    -    if (!rq)
>>>>>>> -        goto no_request_capture;
>>>>>>> -
>>>>>>> -    capture = intel_engine_coredump_add_request(ee, rq,
>>>>>>> ATOMIC_MAYFAIL);
>>>>>>>        if (!capture) {
>>>>>>> -        i915_request_put(rq);
>>>>>>> +        if (rq)
>>>>>>> +            i915_request_put(rq);
>>>>>>>            goto no_request_capture;
>>>>>>>        }
>>>>>>>        if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>>>>>>> intel_guc_capture_get_matching_node(engine->gt, ee,
>>>>>>> ce);
>>>>>> This step requires non-NULL ce, so if you move it under the "else
>>>>>> if
>>>>>> (ce)" above then I *think* exit from the function can be
>>>>>> consolidated
>>>>>> to just:
>>>>>>
>>>>>> if (capture) {
>>>>>>      intel_engine_coredump_add_vma(ee, capture, compress);
>>>>>>      if (rq)
>>>>>>          i915_request_put(rq);
>>>>> Is there any reason the rq ref needs to be held during the add_vma
>>>>> call?
>>>>> Can it now just be moved earlier to be:
>>>>>       if (rq) {
>>>>>           rq = i915_request_get_rcu(rq);
>>>>>           capture = intel_engine_coredump_add_request(ee, rq,
>>>>> ATOMIC_MAYFAIL);
>>>>>           i915_request_put(rq);
>>>>>       }
>>>>>
>>>>> The internals of the request object are only touched in the above
>>>>> _add_request() code. The later _add_vma() call fiddles around with
>>>>> vmas 
>>>>> that are pulled from the request but the capture_vma code inside
>>>>> _add_request() has already copied everything, hasn't it? Or rather,
>>>>> it
>>>>> has grabbed its own private vma resource locks. So there is no
>>>>> requirement to keep the request itself around still?
>>> That sounds correct. It was some time ago since I worked with this code
>>> but when i started IIRC KASAN told me the request along with the whole
>>> capture list could disappear under us due to a parallel capture.
>>>
>>> So the request reference added then might cover a bit too much now that
>>> we also hold references on vma resources, which it looks like we do in
>>> intel_engine_coredump_add_vma().
>> So that means we end up with:
>>      rq = intel_context_find_active_request(ce);
>>      ...
>>      [test stuff like i915_request_started(rq)]
>>      ...
>>       if (rq) {
>>          rq = i915_request_get_rcu(rq);
>>          capture = intel_engine_coredump_add_request(ee, rq, 
>> ATOMIC_MAYFAIL);
>>          i915_request_put(rq);
>>      }
>>
>> What is special about coredump_add_request() that it needs the 
>> request to be extra locked for that call and only that call? If the 
>> request can magically vanish after being found then what protects the 
>> _started() query? For that matter, what stops the request_get_rcu() 
>> itself being called on a pointer that is no longer valid? And if we 
>> do actually have sufficient locking in place to prevent that, why 
>> doesn't that cover the coredump_add_request() usage?
>
> There is definitely a red flag there with the difference between the 
> if and else blocks at the top of capture_engine(). And funnily enough, 
> the first block appears to be GuC only. That is not obvious from the 
> code and should probably have a comment, or function names made 
> self-documenting.
In terms of 'red flag', you mean the apparent difference in locking in 
this section?
         ce = intel_engine_get_hung_context(engine);
         if (ce) {
                 intel_engine_clear_hung_context(engine);
                 rq = intel_context_find_active_request(ce);
                 if (!rq || !i915_request_started(rq))
                         goto no_request_capture;
         } else {
                 /*
                  * Getting here with GuC enabled means it is a forced error
                  * capture with no actual hang. So, no need to attempt the
                  * execlist search.
                  */
                 if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
                         spin_lock_irqsave(&engine->sched_engine->lock, flags);
                         rq = intel_engine_execlist_find_hung_request(engine);
                         spin_unlock_irqrestore(&engine->sched_engine->lock,
                                                flags);
                 }
         }

There is no actual locking difference. The first thing 
intel_context_find_active_request() does is to acquire the relevant 
spinlock for the list it is about to traverse. I assume 
intel_engine_execlist_find_hung_request() must be called from other 
places that already hold the appropriate lock and hence the locking must be
done externally for all callers.

Technically, the first part does not have to be GuC only. It is entirely 
possible that a future improvement to execlists would add support for 
tagging the hanging context up front without requiring a fresh search 
here (based on pre-emption timeouts, heartbeat timeouts, or whatever it 
was that decided to call the error capture code in the first place). So 
there is no reason to add unnecessary enforcement of backend 
implementation details to this higher level by marking that as GuC only. 
Whereas, the search for an individual request after the hang has 
happened is an execlist only implementation detail. Hence the 
enforcement that we don't do that for GuC (with comment to explain).

>
> I guess the special thing about intel_engine_coredump_add_request() is 
> that it dereferences the rq. So it is possibly 573ba126aef3 
> ("drm/i915/guc: Capture error state on context reset") which added a 
> bug where rq can be dereferenced with a reference held. Or perhaps 
> with the GuC backend there is a guarantee request cannot be retired 
> from elsewhere while error capture is examining it.
"added a bug where rq can be dereferenced with a reference held." <-- 
did you mean 'with' or 'without'? Dereferencing with a reference held 
sounds correct to me.

You are meaning the loop in intel_context_find_active_request()? It gets 
the lock on the list of requests that it is scanning through. Presumably 
requests can't vanish while they are on that list, as they must have been
reference counted when adding. So dereferencing within the list has to 
be safe, yes? The issue is that there needs to be an extra get on the 
reference that is returned before dropping the list lock. And that 
reference would have to be released by the caller. Yes? And likewise, 
the request_get that currently exists in capture_engine() needs to be 
inside the execlist spinlock. Which is how it used to be before the GuC 
support was added in the above patch. Or rather, there was no explicit 
request_get at all but all the request processing was done inside the 
execlist spinlock (which seems bad given that it would be allocating 
memory and such within the spinlock?!).

Presumably engine_dump_active_requests() is also broken. It asserts that 
the execlist spinlock is held but does nothing about the GuC's 
equivalent spinlock despite pulling a request off the GuC state's list
and then doing lots of dereferencing on that. It will also need to have 
intel_context_find_active_request() return an extra reference count on 
the request (before dropping the GuC lock as described above) and then 
manually put the request iff it got it via the hung context route rather 
than the hung request route! Or more simply, also get a local reference 
in the hung request path and then just unconditionally put at the end.

Sound plausible?

>
> To unravel the error entry points into error capture, from execlists, 
> debugfs, ringbuffer, I don't have the time to remind myself how all 
> that works right now. Quite possibly at least some of those run async 
> to the GPU so must be safe against parallel request retirement. So I 
> don't know if the i915_request_get_rcu safe in all those cases without 
> spending some time to refresh my knowledge a bit.
>
> Sounds like the best plan is not to change this too much - just leave 
> the scope of reference held as is and ideally eliminate the necessary 
> goto labels. AFAIR that should be doable without changing anything 
> real and unblock these improvements.
Hmm. If you want it all left unchanged, I don't think you can eliminate 
the goto. But it seems like the fix is to move the existing get inside the 
execlist-only spinlock and add a new get inside 
intel_context_find_active_request(). Then everything should be 
guaranteed protected at all times.
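
I.e., something like the below (completely untested sketch; the _get suffix
on the find function is a made-up name just to show the changed contract of
returning with a reference already held):

```
	ce = intel_engine_get_hung_context(engine);
	if (ce) {
		intel_engine_clear_hung_context(engine);
		/* would return with an extra reference already held */
		rq = intel_context_find_active_request_get(ce);
		if (!rq || !i915_request_started(rq))
			goto no_request_capture;
	} else {
		if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
			spin_lock_irqsave(&engine->sched_engine->lock, flags);
			rq = intel_engine_execlist_find_hung_request(engine);
			if (rq)
				rq = i915_request_get_rcu(rq); /* get moved inside the lock */
			spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
		}
	}
	...
	if (rq) {
		capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
		i915_request_put(rq); /* unconditional put covers both paths */
	}
```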

John.

>
> Regards,
>
> Tvrtko



* Re: [Intel-gfx] [PATCH 3/4] drm/i915/guc: Look for a guilty context when an engine reset fails
  2023-01-16 12:43           ` Tvrtko Ursulin
@ 2023-01-17 21:14             ` John Harrison
  0 siblings, 0 replies; 31+ messages in thread
From: John Harrison @ 2023-01-17 21:14 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-GFX; +Cc: DRI-Devel

On 1/16/2023 04:43, Tvrtko Ursulin wrote:
> On 14/01/2023 01:27, John Harrison wrote:
>> On 1/13/2023 01:22, Tvrtko Ursulin wrote:
>>> On 12/01/2023 20:59, John Harrison wrote:
>>>> On 1/12/2023 02:15, Tvrtko Ursulin wrote:
>>>>> On 12/01/2023 02:53, John.C.Harrison@Intel.com wrote:
>>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>>
>>>>>> Engine resets are supposed to never fail. But in the case when one
>>>>>> does (due to unknown reasons that normally come down to a missing
>>>>>> w/a), it is useful to get as much information out of the system as
>>>>>> possible. Given that the GuC effectively dies on such a 
>>>>>> situation, it
>>>>>> is not possible to get a guilty context notification back. So do a
>>>>>> manual search instead. Given that GuC is dead, this is safe because
>>>>>> GuC won't be changing the engine state asynchronously.
>>>>>>
>>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>>> ---
>>>>>>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 
>>>>>> +++++++++++++++--
>>>>>>   1 file changed, 15 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>> index b436dd7f12e42..99d09e3394597 100644
>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>> @@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct 
>>>>>> work_struct *w)
>>>>>>       guc->submission_state.reset_fail_mask = 0;
>>>>>> spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>>>>   -    if (likely(reset_fail_mask))
>>>>>> +    if (likely(reset_fail_mask)) {
>>>>>> +        struct intel_engine_cs *engine;
>>>>>> +        enum intel_engine_id id;
>>>>>> +
>>>>>> +        /*
>>>>>> +         * GuC is toast at this point - it dead loops after 
>>>>>> sending the failed
>>>>>> +         * reset notification. So need to manually determine the 
>>>>>> guilty context.
>>>>>> +         * Note that it should be safe/reliable to do this here 
>>>>>> because the GuC
>>>>>> +         * is toast and will not be scheduling behind the KMD's 
>>>>>> back.
>>>>>> +         */
>>>>>> +        for_each_engine_masked(engine, gt, reset_fail_mask, id)
>>>>>> +            intel_guc_find_hung_context(engine);
>>>>>> +
>>>>>>           intel_gt_handle_error(gt, reset_fail_mask,
>>>>>>                         I915_ERROR_CAPTURE,
>>>>>> -                      "GuC failed to reset engine mask=0x%x\n",
>>>>>> +                      "GuC failed to reset engine mask=0x%x",
>>>>>>                         reset_fail_mask);
>>>>>> +    }
>>>>>>   }
>>>>>>     int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>>>>>
>>>>> This one I don't feel "at home" enough to r-b. Just a question - 
>>>>> can we be sure at this point that GuC is 100% stuck and there 
>>>>> isn't a chance it somehow comes alive and starts running in 
>>>>> parallel (being driven in parallel by a different "thread" in 
>>>>> i915), interfering with the assumption made in the comment?
>>>> The GuC API definition for the engine reset failure notification is 
>>>> that GuC will dead loop itself after sending - to quote "This is a 
>>>> catastrophic failure that requires a full GT reset, or FLR to 
>>>> recover.". So yes, GuC is 100% stuck and is not going to self 
>>>> recover. Guaranteed. If that changes in the future then that would 
>>>> be a backwards breaking API change and would require a 
>>>> corresponding driver update to go with supporting the new GuC 
>>>> firmware version.
>>>>
>>>> There is the potential for a GT reset to maybe occur in parallel 
>>>> and resurrect the GuC that way. Not sure how that could happen 
>>>> though. The heartbeat timeout is significantly longer than the 
>>>> GuC's pre-emption timeout + engine reset timeout. That just leaves 
>>>> manual resets from the user or maybe from a selftest. If the user 
>>>> is manually poking reset debugfs files then it is already known 
>>>> that all bets are off in terms of getting an accurate error 
>>>> capture. And if a selftest is triggering GT resets in parallel with 
>>>> engine resets then either it is a broken test or it is attempting 
>>>> to test an evil corner case in which it is expected that error 
>>>> capture results will be unreliable. Having said all that, given 
>>>> that the submission_state lock is held here, such a GT reset would 
>>>> not get very far in bring the GuC back up anyway. Certainly, it 
>>>> would not be able to get as far as submitting new work and thus 
>>>> potentially changing the engine state.
>>>>
>>>> So yes, if multiple impossible events occur back to back then the 
>>>> error capture may be wonky. Where wonky means a potentially 
>>>> innocent context/request gets blamed for breaking the hardware. Oh 
>>>> dear. I can live with that.
>>>
>>> Okay, so I was triggered by the "safe/reliable" qualification from 
>>> the comment. I agree "reliable" does not have to be and was mostly 
>>> worried about the "safe" part.
>>>
>>> From what you explain, the question is whether a short heartbeat, or a 
>>> manual reset invocation, could actually mess up any of the data structures 
>>> which the added intel_guc_find_hung_context walks, and so crash the kernel.
>>>
>>> Looking inside, there is some lock dropping going on (and 
>>> undocumented irqsave games), and walking the list while unlocked. So 
>>> whether or not that can go bang if a full reset happens in parallel 
>>> and re-activates the normal driver flows.
>> There is no walking of unlocked lists. The xa_lock is held whenever 
>> it looks at the xa structure itself. The release is only while 
>> analysing the context that was retrieved. And the context retrieval 
>> itself starts with a kref_get_unless_zero. So everything is only ever 
>> accessed while locked or reference counted. The unlock of the xa 
>> while analysing a context is because the xa object can be accessed 
>> from interrupt code and so we don't want to hold it locked 
>> unnecessarily while scanning through requests within a context (all 
>> code which has no connection to the GuC backend at all).
>
> AFAICS intel_guc_find_hung_context walks &ce->guc_state.requests with 
> no locks held. Other places in the code appear to use 
> &ce->guc_state.lock, or maybe &sched_engine->lock, not sure. Then we 
> have request submission, retirement and a few other places modify that 
> list. So *if* indeed hung GuC can get resurrected by a parallel full 
> reset while reset_fail_worker_func is running, why couldn't that list 
> walk explode?
Blurgh. Didn't even notice that loop somehow. Or rather, was assuming 
the context reference count covered everything for some dumb reason and 
didn't look closely at the fact it was scanning something else entirely. 
Yeah, there should be a guc_state lock around that loop.
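
I.e., something like this (untested, loop body elided; whether the irqsave
variant is needed depends on what contexts this can be called from):

```
	/*
	 * In intel_guc_find_hung_context(), around the walk of the
	 * context's request list:
	 */
	spin_lock(&ce->guc_state.lock);
	list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
		/* existing loop body */
	}
	spin_unlock(&ce->guc_state.lock);
```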

John.

>
> Regards,
>
> Tvrtko
>
>> I can drop the word 'safe' if it makes you nervous. That was only 
>> meant to refer to the possibility of such a scan returning bogus 
>> results due to contexts switching in/out of the hardware 
>> before/during/after the scan. There is no way for it to go bang.
>>
>> John.
>>
>>
>>>
>>> Regards,
>>>
>>> Tvrtko
>>



end of thread, other threads:[~2023-01-17 21:14 UTC | newest]

Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
2023-01-12  2:53 [PATCH 0/4] Allow error capture without a request / on reset failure John.C.Harrison
2023-01-12  2:53 ` [Intel-gfx] " John.C.Harrison
2023-01-12  2:53 ` [PATCH 1/4] drm/i915: Allow error capture without a request John.C.Harrison
2023-01-12  2:53   ` [Intel-gfx] " John.C.Harrison
2023-01-12 10:01   ` Tvrtko Ursulin
2023-01-12 20:40     ` John Harrison
2023-01-13  9:51       ` Tvrtko Ursulin
2023-01-13 17:46         ` Hellstrom, Thomas
2023-01-13 21:29           ` John Harrison
2023-01-16 12:38             ` Tvrtko Ursulin
2023-01-17 19:40               ` John Harrison
2023-01-16 12:13           ` Tvrtko Ursulin
2023-01-12  2:53 ` [PATCH 2/4] drm/i915: Allow error capture of a pending request John.C.Harrison
2023-01-12  2:53   ` [Intel-gfx] " John.C.Harrison
2023-01-12 10:06   ` Tvrtko Ursulin
2023-01-12 20:46     ` John Harrison
2023-01-13  9:10       ` Tvrtko Ursulin
2023-01-12  2:53 ` [PATCH 3/4] drm/i915/guc: Look for a guilty context when an engine reset fails John.C.Harrison
2023-01-12  2:53   ` [Intel-gfx] " John.C.Harrison
2023-01-12 10:15   ` Tvrtko Ursulin
2023-01-12 20:59     ` John Harrison
2023-01-13  9:22       ` Tvrtko Ursulin
2023-01-14  1:27         ` John Harrison
2023-01-16 12:43           ` Tvrtko Ursulin
2023-01-17 21:14             ` John Harrison
2023-01-12  2:53 ` [PATCH 4/4] drm/i915/guc: Add a debug print on GuC triggered reset John.C.Harrison
2023-01-12  2:53   ` [Intel-gfx] " John.C.Harrison
2023-01-12 10:11   ` Tvrtko Ursulin
2023-01-12  3:21 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Allow error capture without a request / on reset failure (rev2) Patchwork
2023-01-12  3:36 ` [Intel-gfx] ✓ Fi.CI.BAT: success " Patchwork
2023-01-12  5:36 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork
