All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/5] Allow error capture without a request / on reset failure
@ 2023-01-17 21:36 ` John.C.Harrison
  0 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-17 21:36 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

It is technically possible to get a hung context without a valid
request. In such a situation, try to provide as much information in
the error capture as possible rather than just aborting and capturing
nothing.

Similarly, in the case of an engine reset failure the GuC is not able
to report the guilty context. So try a manual search instead of
reporting nothing.

v2: Tidy up code flow in error capture. Reword some comments/messages.
(review feedback from Tvrtko)
Also fix up request locking issues from earlier changes noticed during
code review of this change.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>


John Harrison (5):
  drm/i915: Fix request locking during error capture & debugfs dump
  drm/i915: Allow error capture without a request
  drm/i915: Allow error capture of a pending request
  drm/i915/guc: Look for a guilty context when an engine reset fails
  drm/i915/guc: Add a debug print on GuC triggered reset

 drivers/gpu/drm/i915/gt/intel_context.c       |  1 +
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  7 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 32 ++++++++-
 drivers/gpu/drm/i915/i915_gpu_error.c         | 71 ++++++++++++-------
 4 files changed, 82 insertions(+), 29 deletions(-)

-- 
2.39.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Intel-gfx] [PATCH v2 0/5] Allow error capture without a request / on reset failure
@ 2023-01-17 21:36 ` John.C.Harrison
  0 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-17 21:36 UTC (permalink / raw)
  To: Intel-GFX; +Cc: DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

It is technically possible to get a hung context without a valid
request. In such a situation, try to provide as much information in
the error capture as possible rather than just aborting and capturing
nothing.

Similarly, in the case of an engine reset failure the GuC is not able
to report the guilty context. So try a manual search instead of
reporting nothing.

v2: Tidy up code flow in error capture. Reword some comments/messages.
(review feedback from Tvrtko)
Also fix up request locking issues from earlier changes noticed during
code review of this change.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>


John Harrison (5):
  drm/i915: Fix request locking during error capture & debugfs dump
  drm/i915: Allow error capture without a request
  drm/i915: Allow error capture of a pending request
  drm/i915/guc: Look for a guilty context when an engine reset fails
  drm/i915/guc: Add a debug print on GuC triggered reset

 drivers/gpu/drm/i915/gt/intel_context.c       |  1 +
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     |  7 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 32 ++++++++-
 drivers/gpu/drm/i915/i915_gpu_error.c         | 71 ++++++++++++-------
 4 files changed, 82 insertions(+), 29 deletions(-)

-- 
2.39.0


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump
  2023-01-17 21:36 ` [Intel-gfx] " John.C.Harrison
@ 2023-01-17 21:36   ` John.C.Harrison
  -1 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-17 21:36 UTC (permalink / raw)
  To: Intel-GFX
  Cc: Matthew Brost, Tvrtko Ursulin, Andy Shevchenko, Michael Cheng,
	Aravind Iddamsetty, Alan Previn, Umesh Nerlige Ramappa,
	intel-gfx, Lucas De Marchi, Chris Wilson, Bruce Chang,
	Daniele Ceraolo Spurio, DRI-Devel, Andrzej Hajda, Rodrigo Vivi,
	Tejas Upadhyay, John Harrison, Matthew Auld

From: John Harrison <John.C.Harrison@Intel.com>

When GuC support was added to error capture, the locking around the
request object was broken. Fix it up.

The context based search manages the spinlocking around the search
internally. So it needs to grab the reference count internally as
well. The execlist only request based search relies on external
locking, so it needs an external reference count. So no change to that
code itself but the context version does change.

The only other caller is the code for dumping engine state to debugfs.
That code wasn't previously getting an explicit reference at all as it
does everything while holding the execlist specific spinlock. So that
needs updaing as well as that spinlock doesn't help when using GuC
submission. Rather than trying to conditionally get/put depending on
submission model, just change it to always do the get/put.

In addition, intel_guc_find_hung_context() was not acquiring the
correct spinlock before searching the request list. So fix that up too.

Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset
with GuC")
Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Andrzej Hajda <andrzej.hajda@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Cc: Michael Cheng <michael.cheng@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: Bruce Chang <yu.bruce.chang@intel.com>
Cc: intel-gfx@lists.freedesktop.org
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c           |  1 +
 drivers/gpu/drm/i915/gt/intel_engine_cs.c         |  7 ++++++-
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 +++++++++++
 drivers/gpu/drm/i915/i915_gpu_error.c             |  5 ++---
 4 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index e94365b08f1ef..df64cf1954c1d 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -552,6 +552,7 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
 
 		active = rq;
 	}
+	active = i915_request_get_rcu(active);
 	spin_unlock_irqrestore(&parent->guc_state.lock, flags);
 
 	return active;
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 922f1bb22dc68..517d1fb7ae333 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -2236,10 +2236,13 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
 	guc = intel_uc_uses_guc_submission(&engine->gt->uc);
 	if (guc) {
 		ce = intel_engine_get_hung_context(engine);
-		if (ce)
+		if (ce) {
+			/* This will reference count the request (if found) */
 			hung_rq = intel_context_find_active_request(ce);
+		}
 	} else {
 		hung_rq = intel_engine_execlist_find_hung_request(engine);
+		hung_rq = i915_request_get_rcu(hung_rq);
 	}
 
 	if (hung_rq)
@@ -2250,6 +2253,8 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
 	else
 		intel_engine_dump_active_requests(&engine->sched_engine->requests,
 						  hung_rq, m);
+	if (hung_rq)
+		i915_request_put(hung_rq);
 }
 
 void intel_engine_dump(struct intel_engine_cs *engine,
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index b436dd7f12e42..3b34a82d692be 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4820,6 +4820,8 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 
 	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
+		bool found;
+
 		if (!kref_get_unless_zero(&ce->ref))
 			continue;
 
@@ -4836,10 +4838,18 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 				goto next;
 		}
 
+		found = false;
+		spin_lock(&ce->guc_state.lock);
 		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
 			if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
 				continue;
 
+			found = true;
+			break;
+		}
+		spin_unlock(&ce->guc_state.lock);
+
+		if (found) {
 			intel_engine_set_hung_context(engine, ce);
 
 			/* Can only cope with one hang at a time... */
@@ -4847,6 +4857,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 			xa_lock(&guc->context_lookup);
 			goto done;
 		}
+
 next:
 		intel_context_put(ce);
 		xa_lock(&guc->context_lookup);
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 9d5d5a397b64e..4107a0dfcca7d 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1607,6 +1607,7 @@ capture_engine(struct intel_engine_cs *engine,
 	ce = intel_engine_get_hung_context(engine);
 	if (ce) {
 		intel_engine_clear_hung_context(engine);
+		/* This will reference count the request (if found) */
 		rq = intel_context_find_active_request(ce);
 		if (!rq || !i915_request_started(rq))
 			goto no_request_capture;
@@ -1618,13 +1619,11 @@ capture_engine(struct intel_engine_cs *engine,
 		if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
 			spin_lock_irqsave(&engine->sched_engine->lock, flags);
 			rq = intel_engine_execlist_find_hung_request(engine);
+			rq = i915_request_get_rcu(rq);
 			spin_unlock_irqrestore(&engine->sched_engine->lock,
 					       flags);
 		}
 	}
-	if (rq)
-		rq = i915_request_get_rcu(rq);
-
 	if (!rq)
 		goto no_request_capture;
 
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [Intel-gfx] [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump
@ 2023-01-17 21:36   ` John.C.Harrison
  0 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-17 21:36 UTC (permalink / raw)
  To: Intel-GFX
  Cc: Andy Shevchenko, Michael Cheng, Alan Previn, intel-gfx,
	Lucas De Marchi, Chris Wilson, DRI-Devel, Andrzej Hajda,
	Rodrigo Vivi, Tejas Upadhyay, Matthew Auld

From: John Harrison <John.C.Harrison@Intel.com>

When GuC support was added to error capture, the locking around the
request object was broken. Fix it up.

The context based search manages the spinlocking around the search
internally. So it needs to grab the reference count internally as
well. The execlist only request based search relies on external
locking, so it needs an external reference count. So no change to that
code itself but the context version does change.

The only other caller is the code for dumping engine state to debugfs.
That code wasn't previously getting an explicit reference at all as it
does everything while holding the execlist specific spinlock. So that
needs updaing as well as that spinlock doesn't help when using GuC
submission. Rather than trying to conditionally get/put depending on
submission model, just change it to always do the get/put.

In addition, intel_guc_find_hung_context() was not acquiring the
correct spinlock before searching the request list. So fix that up too.

Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset
with GuC")
Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Andrzej Hajda <andrzej.hajda@intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Cc: Michael Cheng <michael.cheng@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: Bruce Chang <yu.bruce.chang@intel.com>
Cc: intel-gfx@lists.freedesktop.org
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c           |  1 +
 drivers/gpu/drm/i915/gt/intel_engine_cs.c         |  7 ++++++-
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 +++++++++++
 drivers/gpu/drm/i915/i915_gpu_error.c             |  5 ++---
 4 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index e94365b08f1ef..df64cf1954c1d 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -552,6 +552,7 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
 
 		active = rq;
 	}
+	active = i915_request_get_rcu(active);
 	spin_unlock_irqrestore(&parent->guc_state.lock, flags);
 
 	return active;
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 922f1bb22dc68..517d1fb7ae333 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -2236,10 +2236,13 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
 	guc = intel_uc_uses_guc_submission(&engine->gt->uc);
 	if (guc) {
 		ce = intel_engine_get_hung_context(engine);
-		if (ce)
+		if (ce) {
+			/* This will reference count the request (if found) */
 			hung_rq = intel_context_find_active_request(ce);
+		}
 	} else {
 		hung_rq = intel_engine_execlist_find_hung_request(engine);
+		hung_rq = i915_request_get_rcu(hung_rq);
 	}
 
 	if (hung_rq)
@@ -2250,6 +2253,8 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
 	else
 		intel_engine_dump_active_requests(&engine->sched_engine->requests,
 						  hung_rq, m);
+	if (hung_rq)
+		i915_request_put(hung_rq);
 }
 
 void intel_engine_dump(struct intel_engine_cs *engine,
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index b436dd7f12e42..3b34a82d692be 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4820,6 +4820,8 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 
 	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
+		bool found;
+
 		if (!kref_get_unless_zero(&ce->ref))
 			continue;
 
@@ -4836,10 +4838,18 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 				goto next;
 		}
 
+		found = false;
+		spin_lock(&ce->guc_state.lock);
 		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
 			if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
 				continue;
 
+			found = true;
+			break;
+		}
+		spin_unlock(&ce->guc_state.lock);
+
+		if (found) {
 			intel_engine_set_hung_context(engine, ce);
 
 			/* Can only cope with one hang at a time... */
@@ -4847,6 +4857,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 			xa_lock(&guc->context_lookup);
 			goto done;
 		}
+
 next:
 		intel_context_put(ce);
 		xa_lock(&guc->context_lookup);
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 9d5d5a397b64e..4107a0dfcca7d 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1607,6 +1607,7 @@ capture_engine(struct intel_engine_cs *engine,
 	ce = intel_engine_get_hung_context(engine);
 	if (ce) {
 		intel_engine_clear_hung_context(engine);
+		/* This will reference count the request (if found) */
 		rq = intel_context_find_active_request(ce);
 		if (!rq || !i915_request_started(rq))
 			goto no_request_capture;
@@ -1618,13 +1619,11 @@ capture_engine(struct intel_engine_cs *engine,
 		if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
 			spin_lock_irqsave(&engine->sched_engine->lock, flags);
 			rq = intel_engine_execlist_find_hung_request(engine);
+			rq = i915_request_get_rcu(rq);
 			spin_unlock_irqrestore(&engine->sched_engine->lock,
 					       flags);
 		}
 	}
-	if (rq)
-		rq = i915_request_get_rcu(rq);
-
 	if (!rq)
 		goto no_request_capture;
 
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 2/5] drm/i915: Allow error capture without a request
  2023-01-17 21:36 ` [Intel-gfx] " John.C.Harrison
@ 2023-01-17 21:36   ` John.C.Harrison
  -1 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-17 21:36 UTC (permalink / raw)
  To: Intel-GFX; +Cc: Umesh Nerlige Ramappa, John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

There was a report of error captures occurring without any hung
context being indicated despite the capture being initiated by a 'hung
context notification' from GuC. The problem was not reproducible.
However, it is possible to happen if the context in question has no
active requests. For example, if the hang was in the context switch
itself then the breadcrumb write would have occurred and the KMD would
see an idle context.

In the interests of attempting to provide as much information as
possible about a hang, it seems wise to include the engine info
regardless of whether a request was found or not. As opposed to just
prentending there was no hang at all.

So update the error capture code to always record engine information
if an engine is given. Which means updating record_context() to take a
context instead of a request (which it only ever used to find the
context anyway). And split the request agnostic parts of
intel_engine_coredump_add_request() out into a seaprate function.

v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null
pointer.
v3: Tidy up request locking code flow (Tvrtko)

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 69 ++++++++++++++++++---------
 1 file changed, 46 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 4107a0dfcca7d..461489d599a7e 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct intel_engine_coredump *ee)
 }
 
 static bool record_context(struct i915_gem_context_coredump *e,
-			   const struct i915_request *rq)
+			   struct intel_context *ce)
 {
 	struct i915_gem_context *ctx;
 	struct task_struct *task;
 	bool simulated;
 
 	rcu_read_lock();
-	ctx = rcu_dereference(rq->context->gem_context);
+	ctx = rcu_dereference(ce->gem_context);
 	if (ctx && !kref_get_unless_zero(&ctx->ref))
 		ctx = NULL;
 	rcu_read_unlock();
@@ -1396,8 +1396,8 @@ static bool record_context(struct i915_gem_context_coredump *e,
 	e->guilty = atomic_read(&ctx->guilty_count);
 	e->active = atomic_read(&ctx->active_count);
 
-	e->total_runtime = intel_context_get_total_runtime_ns(rq->context);
-	e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context);
+	e->total_runtime = intel_context_get_total_runtime_ns(ce);
+	e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
 
 	simulated = i915_gem_context_no_error_capture(ctx);
 
@@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp, u32 dump_
 	return ee;
 }
 
+static struct intel_engine_capture_vma *
+engine_coredump_add_context(struct intel_engine_coredump *ee,
+			    struct intel_context *ce,
+			    gfp_t gfp)
+{
+	struct intel_engine_capture_vma *vma = NULL;
+
+	ee->simulated |= record_context(&ee->context, ce);
+	if (ee->simulated)
+		return NULL;
+
+	/*
+	 * We need to copy these to an anonymous buffer
+	 * as the simplest method to avoid being overwritten
+	 * by userspace.
+	 */
+	vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
+	vma = capture_vma(vma, ce->state, "HW context", gfp);
+
+	return vma;
+}
+
 struct intel_engine_capture_vma *
 intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
 				  struct i915_request *rq,
 				  gfp_t gfp)
 {
-	struct intel_engine_capture_vma *vma = NULL;
+	struct intel_engine_capture_vma *vma;
 
-	ee->simulated |= record_context(&ee->context, rq);
-	if (ee->simulated)
+	vma = engine_coredump_add_context(ee, rq->context, gfp);
+	if (!vma)
 		return NULL;
 
 	/*
@@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
 	 */
 	vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch");
 	vma = capture_user(vma, rq, gfp);
-	vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
-	vma = capture_vma(vma, rq->context->state, "HW context", gfp);
 
 	ee->rq_head = rq->head;
 	ee->rq_post = rq->postfix;
@@ -1609,8 +1629,12 @@ capture_engine(struct intel_engine_cs *engine,
 		intel_engine_clear_hung_context(engine);
 		/* This will reference count the request (if found) */
 		rq = intel_context_find_active_request(ce);
-		if (!rq || !i915_request_started(rq))
-			goto no_request_capture;
+		if (rq && !i915_request_started(rq)) {
+			drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n",
+				 engine->name);
+			i915_request_put(rq);
+			rq = NULL;
+		}
 	} else {
 		/*
 		 * Getting here with GuC enabled means it is a forced error capture
@@ -1624,25 +1648,24 @@ capture_engine(struct intel_engine_cs *engine,
 					       flags);
 		}
 	}
-	if (!rq)
-		goto no_request_capture;
-
-	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
-	if (!capture) {
+	if (rq) {
+		capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
 		i915_request_put(rq);
-		goto no_request_capture;
+	} else if (ce) {
+		capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
 	}
+
 	if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
 		intel_guc_capture_get_matching_node(engine->gt, ee, ce);
 
-	intel_engine_coredump_add_vma(ee, capture, compress);
-	i915_request_put(rq);
+	if (capture) {
+		intel_engine_coredump_add_vma(ee, capture, compress);
+	} else {
+		kfree(ee);
+		ee = NULL;
+	}
 
 	return ee;
-
-no_request_capture:
-	kfree(ee);
-	return NULL;
 }
 
 static void
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [Intel-gfx] [PATCH v2 2/5] drm/i915: Allow error capture without a request
@ 2023-01-17 21:36   ` John.C.Harrison
  0 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-17 21:36 UTC (permalink / raw)
  To: Intel-GFX; +Cc: DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

There was a report of error captures occurring without any hung
context being indicated despite the capture being initiated by a 'hung
context notification' from GuC. The problem was not reproducible.
However, it is possible to happen if the context in question has no
active requests. For example, if the hang was in the context switch
itself then the breadcrumb write would have occurred and the KMD would
see an idle context.

In the interests of attempting to provide as much information as
possible about a hang, it seems wise to include the engine info
regardless of whether a request was found or not. As opposed to just
prentending there was no hang at all.

So update the error capture code to always record engine information
if an engine is given. Which means updating record_context() to take a
context instead of a request (which it only ever used to find the
context anyway). And split the request agnostic parts of
intel_engine_coredump_add_request() out into a seaprate function.

v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null
pointer.
v3: Tidy up request locking code flow (Tvrtko)

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 69 ++++++++++++++++++---------
 1 file changed, 46 insertions(+), 23 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 4107a0dfcca7d..461489d599a7e 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct intel_engine_coredump *ee)
 }
 
 static bool record_context(struct i915_gem_context_coredump *e,
-			   const struct i915_request *rq)
+			   struct intel_context *ce)
 {
 	struct i915_gem_context *ctx;
 	struct task_struct *task;
 	bool simulated;
 
 	rcu_read_lock();
-	ctx = rcu_dereference(rq->context->gem_context);
+	ctx = rcu_dereference(ce->gem_context);
 	if (ctx && !kref_get_unless_zero(&ctx->ref))
 		ctx = NULL;
 	rcu_read_unlock();
@@ -1396,8 +1396,8 @@ static bool record_context(struct i915_gem_context_coredump *e,
 	e->guilty = atomic_read(&ctx->guilty_count);
 	e->active = atomic_read(&ctx->active_count);
 
-	e->total_runtime = intel_context_get_total_runtime_ns(rq->context);
-	e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context);
+	e->total_runtime = intel_context_get_total_runtime_ns(ce);
+	e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
 
 	simulated = i915_gem_context_no_error_capture(ctx);
 
@@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp, u32 dump_
 	return ee;
 }
 
+static struct intel_engine_capture_vma *
+engine_coredump_add_context(struct intel_engine_coredump *ee,
+			    struct intel_context *ce,
+			    gfp_t gfp)
+{
+	struct intel_engine_capture_vma *vma = NULL;
+
+	ee->simulated |= record_context(&ee->context, ce);
+	if (ee->simulated)
+		return NULL;
+
+	/*
+	 * We need to copy these to an anonymous buffer
+	 * as the simplest method to avoid being overwritten
+	 * by userspace.
+	 */
+	vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
+	vma = capture_vma(vma, ce->state, "HW context", gfp);
+
+	return vma;
+}
+
 struct intel_engine_capture_vma *
 intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
 				  struct i915_request *rq,
 				  gfp_t gfp)
 {
-	struct intel_engine_capture_vma *vma = NULL;
+	struct intel_engine_capture_vma *vma;
 
-	ee->simulated |= record_context(&ee->context, rq);
-	if (ee->simulated)
+	vma = engine_coredump_add_context(ee, rq->context, gfp);
+	if (!vma)
 		return NULL;
 
 	/*
@@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
 	 */
 	vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch");
 	vma = capture_user(vma, rq, gfp);
-	vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
-	vma = capture_vma(vma, rq->context->state, "HW context", gfp);
 
 	ee->rq_head = rq->head;
 	ee->rq_post = rq->postfix;
@@ -1609,8 +1629,12 @@ capture_engine(struct intel_engine_cs *engine,
 		intel_engine_clear_hung_context(engine);
 		/* This will reference count the request (if found) */
 		rq = intel_context_find_active_request(ce);
-		if (!rq || !i915_request_started(rq))
-			goto no_request_capture;
+		if (rq && !i915_request_started(rq)) {
+			drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n",
+				 engine->name);
+			i915_request_put(rq);
+			rq = NULL;
+		}
 	} else {
 		/*
 		 * Getting here with GuC enabled means it is a forced error capture
@@ -1624,25 +1648,24 @@ capture_engine(struct intel_engine_cs *engine,
 					       flags);
 		}
 	}
-	if (!rq)
-		goto no_request_capture;
-
-	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
-	if (!capture) {
+	if (rq) {
+		capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
 		i915_request_put(rq);
-		goto no_request_capture;
+	} else if (ce) {
+		capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
 	}
+
 	if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
 		intel_guc_capture_get_matching_node(engine->gt, ee, ce);
 
-	intel_engine_coredump_add_vma(ee, capture, compress);
-	i915_request_put(rq);
+	if (capture) {
+		intel_engine_coredump_add_vma(ee, capture, compress);
+	} else {
+		kfree(ee);
+		ee = NULL;
+	}
 
 	return ee;
-
-no_request_capture:
-	kfree(ee);
-	return NULL;
 }
 
 static void
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 3/5] drm/i915: Allow error capture of a pending request
  2023-01-17 21:36 ` [Intel-gfx] " John.C.Harrison
@ 2023-01-17 21:36   ` John.C.Harrison
  -1 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-17 21:36 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

A hang situation has been observed where the only requests on the
context were either completed or not yet started according to the
breaadcrumbs. However, the register state claimed a batch was (maybe)
in progress. So, allow capture of the pending request on the grounds
that this might be better than nothing.

v2: Reword 'not started' warning message (Tvrtko)

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 461489d599a7e..1d33822a8ca23 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1629,12 +1629,9 @@ capture_engine(struct intel_engine_cs *engine,
 		intel_engine_clear_hung_context(engine);
 		/* This will reference count the request (if found) */
 		rq = intel_context_find_active_request(ce);
-		if (rq && !i915_request_started(rq)) {
-			drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n",
-				 engine->name);
-			i915_request_put(rq);
-			rq = NULL;
-		}
+		if (rq && !i915_request_started(rq))
+			drm_info(&engine->gt->i915->drm, "Got hung context on %s with active request %lld:%lld [0x%04X] not yet started\n",
+				 engine->name, rq->fence.context, rq->fence.seqno, ce->guc_id.id);
 	} else {
 		/*
 		 * Getting here with GuC enabled means it is a forced error capture
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [Intel-gfx] [PATCH v2 3/5] drm/i915: Allow error capture of a pending request
@ 2023-01-17 21:36   ` John.C.Harrison
  0 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-17 21:36 UTC (permalink / raw)
  To: Intel-GFX; +Cc: DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

A hang situation has been observed where the only requests on the
context were either completed or not yet started according to the
breaadcrumbs. However, the register state claimed a batch was (maybe)
in progress. So, allow capture of the pending request on the grounds
that this might be better than nothing.

v2: Reword 'not started' warning message (Tvrtko)

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 461489d599a7e..1d33822a8ca23 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1629,12 +1629,9 @@ capture_engine(struct intel_engine_cs *engine,
 		intel_engine_clear_hung_context(engine);
 		/* This will reference count the request (if found) */
 		rq = intel_context_find_active_request(ce);
-		if (rq && !i915_request_started(rq)) {
-			drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n",
-				 engine->name);
-			i915_request_put(rq);
-			rq = NULL;
-		}
+		if (rq && !i915_request_started(rq))
+			drm_info(&engine->gt->i915->drm, "Got hung context on %s with active request %lld:%lld [0x%04X] not yet started\n",
+				 engine->name, rq->fence.context, rq->fence.seqno, ce->guc_id.id);
 	} else {
 		/*
 		 * Getting here with GuC enabled means it is a forced error capture
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 4/5] drm/i915/guc: Look for a guilty context when an engine reset fails
  2023-01-17 21:36 ` [Intel-gfx] " John.C.Harrison
@ 2023-01-17 21:36   ` John.C.Harrison
  -1 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-17 21:36 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

Engine resets are supposed to never fail. But in the case when one
does (due to unknown reasons that normally come down to a missing
w/a), it is useful to get as much information out of the system as
possible. Given that the GuC effectively dies on such a situation, it
is not possible to get a guilty context notification back. So do a
manual search instead. Given that GuC is dead, this is safe because
GuC won't be changing the engine state asynchronously.

v2: Change comment to be less alarming (Tvrtko)

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 3b34a82d692be..9bc80b807dbcc 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct work_struct *w)
 	guc->submission_state.reset_fail_mask = 0;
 	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 
-	if (likely(reset_fail_mask))
+	if (likely(reset_fail_mask)) {
+		struct intel_engine_cs *engine;
+		enum intel_engine_id id;
+
+		/*
+		 * GuC is toast at this point - it dead loops after sending the failed
+		 * reset notification. So need to manually determine the guilty context.
+		 * Note that it should be reliable to do this here because the GuC is
+		 * toast and will not be scheduling behind the KMD's back.
+		 */
+		for_each_engine_masked(engine, gt, reset_fail_mask, id)
+			intel_guc_find_hung_context(engine);
+
 		intel_gt_handle_error(gt, reset_fail_mask,
 				      I915_ERROR_CAPTURE,
-				      "GuC failed to reset engine mask=0x%x\n",
+				      "GuC failed to reset engine mask=0x%x",
 				      reset_fail_mask);
+	}
 }
 
 int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [Intel-gfx] [PATCH v2 4/5] drm/i915/guc: Look for a guilty context when an engine reset fails
@ 2023-01-17 21:36   ` John.C.Harrison
  0 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-17 21:36 UTC (permalink / raw)
  To: Intel-GFX; +Cc: DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

Engine resets are supposed to never fail. But in the case when one
does (due to unknown reasons that normally come down to a missing
w/a), it is useful to get as much information out of the system as
possible. Given that the GuC effectively dies on such a situation, it
is not possible to get a guilty context notification back. So do a
manual search instead. Given that GuC is dead, this is safe because
GuC won't be changing the engine state asynchronously.

v2: Change comment to be less alarming (Tvrtko)

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 3b34a82d692be..9bc80b807dbcc 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct work_struct *w)
 	guc->submission_state.reset_fail_mask = 0;
 	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 
-	if (likely(reset_fail_mask))
+	if (likely(reset_fail_mask)) {
+		struct intel_engine_cs *engine;
+		enum intel_engine_id id;
+
+		/*
+		 * GuC is toast at this point - it dead loops after sending the failed
+		 * reset notification. So need to manually determine the guilty context.
+		 * Note that it should be reliable to do this here because the GuC is
+		 * toast and will not be scheduling behind the KMD's back.
+		 */
+		for_each_engine_masked(engine, gt, reset_fail_mask, id)
+			intel_guc_find_hung_context(engine);
+
 		intel_gt_handle_error(gt, reset_fail_mask,
 				      I915_ERROR_CAPTURE,
-				      "GuC failed to reset engine mask=0x%x\n",
+				      "GuC failed to reset engine mask=0x%x",
 				      reset_fail_mask);
+	}
 }
 
 int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH v2 5/5] drm/i915/guc: Add a debug print on GuC triggered reset
  2023-01-17 21:36 ` [Intel-gfx] " John.C.Harrison
@ 2023-01-17 21:36   ` John.C.Harrison
  -1 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-17 21:36 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel, Tvrtko Ursulin

From: John Harrison <John.C.Harrison@Intel.com>

For understanding bug reports, it can be useful to have an explicit
dmesg print when a reset notification is received from GuC. As opposed
to simply inferring that this happened from other messages.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 9bc80b807dbcc..6029dbab5feeb 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4665,6 +4665,10 @@ static void guc_handle_context_reset(struct intel_guc *guc,
 {
 	trace_intel_context_reset(ce);
 
+	drm_dbg(&guc_to_gt(guc)->i915->drm, "Got GuC reset of 0x%04X, exiting = %d, banned = %d\n",
+		ce->guc_id.id, test_bit(CONTEXT_EXITING, &ce->flags),
+		test_bit(CONTEXT_BANNED, &ce->flags));
+
 	if (likely(intel_context_is_schedulable(ce))) {
 		capture_error_state(guc, ce);
 		guc_context_replay(ce);
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [Intel-gfx] [PATCH v2 5/5] drm/i915/guc: Add a debug print on GuC triggered reset
@ 2023-01-17 21:36   ` John.C.Harrison
  0 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-17 21:36 UTC (permalink / raw)
  To: Intel-GFX; +Cc: DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

For understanding bug reports, it can be useful to have an explicit
dmesg print when a reset notification is received from GuC. As opposed
to simply inferring that this happened from other messages.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 9bc80b807dbcc..6029dbab5feeb 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4665,6 +4665,10 @@ static void guc_handle_context_reset(struct intel_guc *guc,
 {
 	trace_intel_context_reset(ce);
 
+	drm_dbg(&guc_to_gt(guc)->i915->drm, "Got GuC reset of 0x%04X, exiting = %d, banned = %d\n",
+		ce->guc_id.id, test_bit(CONTEXT_EXITING, &ce->flags),
+		test_bit(CONTEXT_BANNED, &ce->flags));
+
 	if (likely(intel_context_is_schedulable(ce))) {
 		capture_error_state(guc, ce);
 		guc_context_replay(ce);
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Allow error capture without a request / on reset failure (rev3)
  2023-01-17 21:36 ` [Intel-gfx] " John.C.Harrison
                   ` (5 preceding siblings ...)
  (?)
@ 2023-01-17 22:55 ` Patchwork
  -1 siblings, 0 replies; 30+ messages in thread
From: Patchwork @ 2023-01-17 22:55 UTC (permalink / raw)
  To: john.c.harrison; +Cc: intel-gfx

== Series Details ==

Series: Allow error capture without a request / on reset failure (rev3)
URL   : https://patchwork.freedesktop.org/series/111454/
State : warning

== Summary ==

Error: dim checkpatch failed
79d89c5b0cf9 drm/i915: Fix request locking during error capture & debugfs dump
-:26: WARNING:BAD_FIXES_TAG: Please use correct Fixes: style 'Fixes: <12 chars of sha1> ("<title line>")' - ie: 'Fixes: 98df76aeb29b ("drm/i915/guc: Add a debug print on GuC triggered reset")'
#26: 
Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset

total: 0 errors, 1 warnings, 0 checks, 83 lines checked
6ad402b58742 drm/i915: Allow error capture without a request
f5f448ea38a4 drm/i915: Allow error capture of a pending request
43da516a0b97 drm/i915/guc: Look for a guilty context when an engine reset fails
98df76aeb29b drm/i915/guc: Add a debug print on GuC triggered reset



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Allow error capture without a request / on reset failure (rev3)
  2023-01-17 21:36 ` [Intel-gfx] " John.C.Harrison
                   ` (6 preceding siblings ...)
  (?)
@ 2023-01-17 22:55 ` Patchwork
  -1 siblings, 0 replies; 30+ messages in thread
From: Patchwork @ 2023-01-17 22:55 UTC (permalink / raw)
  To: john.c.harrison; +Cc: intel-gfx

== Series Details ==

Series: Allow error capture without a request / on reset failure (rev3)
URL   : https://patchwork.freedesktop.org/series/111454/
State : warning

== Summary ==

Error: dim sparse failed
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.
+./arch/x86/include/asm/bitops.h:117:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:117:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:117:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:148:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:148:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:148:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:150:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:150:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:150:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:154:26: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:154:26: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:154:26: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:156:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:156:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:156:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:156:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:156:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:156:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:174:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:174:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:174:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:176:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:176:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:176:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:180:35: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:180:35: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:180:35: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:182:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:182:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:182:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:182:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:182:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:182:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:186:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:186:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:186:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:188:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:188:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:188:9: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:192:35: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:192:35: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:192:35: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:195:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:195:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:195:16: warning: unreplaced symbol 'oldbit'
+./arch/x86/include/asm/bitops.h:195:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:195:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:195:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:237:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:237:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:237:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:239:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:239:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:239:9: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:66:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:66:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:66:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:92:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:92:1: warning: unreplaced symbol 'return'
+./arch/x86/include/asm/bitops.h:92:1: warning: unreplaced symbol 'return'
+./drivers/gpu/drm/i915/intel_uncore.h:346:1: warning: trying to copy expression type 31
+./drivers/gpu/drm/i915/intel_uncore.h:346:1: warning: trying to copy expression type 31
+./drivers/gpu/drm/i915/intel_uncore.h:351:1: warning: trying to copy expression type 31
+./include/asm-generic/bitops/generic-non-atomic.h:100:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:100:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:100:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:100:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:100:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:100:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:100:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:100:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:100:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:105:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:105:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:105:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:107:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:107:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:107:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:108:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:108:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:108:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:109:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:109:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:109:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:111:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:111:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:111:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:111:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:111:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:111:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:111:20: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:111:20: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:111:20: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:112:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:112:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:112:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:112:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:112:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:112:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:112:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:112:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:112:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:121:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:121:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:121:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:128:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:128:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:128:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:166:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:166:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:166:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:168:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:168:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:168:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:169:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:169:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:169:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:170:9: warning: unreplaced symbol 'val'
+./include/asm-generic/bitops/generic-non-atomic.h:170:9: warning: unreplaced symbol 'val'
+./include/asm-generic/bitops/generic-non-atomic.h:170:9: warning: unreplaced symbol 'val'
+./include/asm-generic/bitops/generic-non-atomic.h:172:19: warning: unreplaced symbol 'val'
+./include/asm-generic/bitops/generic-non-atomic.h:172:19: warning: unreplaced symbol 'val'
+./include/asm-generic/bitops/generic-non-atomic.h:172:19: warning: unreplaced symbol 'val'
+./include/asm-generic/bitops/generic-non-atomic.h:172:25: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:172:25: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:172:25: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:172:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:172:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:172:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:28:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:28:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:28:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:30:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:30:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:30:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:31:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:31:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:31:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:33:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:33:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:33:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:33:16: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:33:16: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:33:16: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:37:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:37:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:37:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:39:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:39:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:39:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:40:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:40:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:40:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:42:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:42:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:42:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:42:16: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:42:16: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:42:16: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:55:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:55:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:55:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:57:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:57:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:57:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:58:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:58:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:58:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:60:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:60:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:60:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:60:15: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:60:15: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:60:15: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:73:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:73:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:73:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:75:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:75:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:75:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:76:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:76:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:76:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:77:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:77:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:77:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:79:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:79:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:79:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:79:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:79:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:79:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:79:20: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:79:20: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:79:20: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:80:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:80:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:80:17: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:80:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:80:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:80:23: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:80:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:80:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:80:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:93:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:93:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:93:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/generic-non-atomic.h:95:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:95:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:95:9: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:96:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:96:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:96:9: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:97:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:97:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:97:9: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:99:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:99:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:99:10: warning: unreplaced symbol 'p'
+./include/asm-generic/bitops/generic-non-atomic.h:99:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:99:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:99:14: warning: unreplaced symbol 'old'
+./include/asm-generic/bitops/generic-non-atomic.h:99:21: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:99:21: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/generic-non-atomic.h:99:21: warning: unreplaced symbol 'mask'
+./include/asm-generic/bitops/instrumented-non-atomic.h:100:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:100:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:100:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:112:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:112:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:112:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:115:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:115:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:115:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:127:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:127:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:127:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:130:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:130:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:130:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:139:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:139:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:139:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:142:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:142:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:142:9: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:26:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:26:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:26:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:42:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:42:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:42:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:58:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:58:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:58:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:97:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:97:1: warning: unreplaced symbol 'return'
+./include/asm-generic/bitops/instrumented-non-atomic.h:97:1: warning: unreplaced symbol 'return'



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BAT: failure for Allow error capture without a request / on reset failure (rev3)
  2023-01-17 21:36 ` [Intel-gfx] " John.C.Harrison
                   ` (7 preceding siblings ...)
  (?)
@ 2023-01-17 23:05 ` Patchwork
  -1 siblings, 0 replies; 30+ messages in thread
From: Patchwork @ 2023-01-17 23:05 UTC (permalink / raw)
  To: john.c.harrison; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 19478 bytes --]

== Series Details ==

Series: Allow error capture without a request / on reset failure (rev3)
URL   : https://patchwork.freedesktop.org/series/111454/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_12594 -> Patchwork_111454v3
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_111454v3 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_111454v3, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/index.html

Participating hosts (42 -> 44)
------------------------------

  Additional (3): fi-kbl-soraka bat-rpls-2 fi-skl-6700k2 
  Missing    (1): fi-snb-2520m 

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_111454v3:

### IGT changes ###

#### Possible regressions ####

  * igt@debugfs_test@read_all_entries:
    - fi-kbl-soraka:      NOTRUN -> [INCOMPLETE][1]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-kbl-soraka/igt@debugfs_test@read_all_entries.html
    - fi-skl-6700k2:      NOTRUN -> [INCOMPLETE][2]
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-skl-6700k2/igt@debugfs_test@read_all_entries.html
    - fi-cfl-8109u:       [PASS][3] -> [INCOMPLETE][4]
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-cfl-8109u/igt@debugfs_test@read_all_entries.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-cfl-8109u/igt@debugfs_test@read_all_entries.html
    - fi-kbl-7567u:       [PASS][5] -> [INCOMPLETE][6]
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-kbl-7567u/igt@debugfs_test@read_all_entries.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-kbl-7567u/igt@debugfs_test@read_all_entries.html
    - fi-kbl-8809g:       [PASS][7] -> [INCOMPLETE][8]
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-kbl-8809g/igt@debugfs_test@read_all_entries.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-kbl-8809g/igt@debugfs_test@read_all_entries.html
    - fi-ivb-3770:        [PASS][9] -> [INCOMPLETE][10]
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-ivb-3770/igt@debugfs_test@read_all_entries.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-ivb-3770/igt@debugfs_test@read_all_entries.html
    - fi-elk-e7500:       [PASS][11] -> [INCOMPLETE][12]
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-elk-e7500/igt@debugfs_test@read_all_entries.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-elk-e7500/igt@debugfs_test@read_all_entries.html
    - fi-ctg-p8600:       [PASS][13] -> [INCOMPLETE][14]
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-ctg-p8600/igt@debugfs_test@read_all_entries.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-ctg-p8600/igt@debugfs_test@read_all_entries.html
    - fi-blb-e6850:       [PASS][15] -> [INCOMPLETE][16]
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-blb-e6850/igt@debugfs_test@read_all_entries.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-blb-e6850/igt@debugfs_test@read_all_entries.html
    - fi-bsw-n3050:       [PASS][17] -> [INCOMPLETE][18]
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-bsw-n3050/igt@debugfs_test@read_all_entries.html
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-bsw-n3050/igt@debugfs_test@read_all_entries.html
    - fi-cfl-guc:         [PASS][19] -> [INCOMPLETE][20]
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-cfl-guc/igt@debugfs_test@read_all_entries.html
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-cfl-guc/igt@debugfs_test@read_all_entries.html
    - fi-skl-6600u:       [PASS][21] -> [INCOMPLETE][22]
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-skl-6600u/igt@debugfs_test@read_all_entries.html
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-skl-6600u/igt@debugfs_test@read_all_entries.html
    - fi-pnv-d510:        [PASS][23] -> [INCOMPLETE][24]
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-pnv-d510/igt@debugfs_test@read_all_entries.html
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-pnv-d510/igt@debugfs_test@read_all_entries.html
    - fi-glk-j4005:       [PASS][25] -> [INCOMPLETE][26]
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-glk-j4005/igt@debugfs_test@read_all_entries.html
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-glk-j4005/igt@debugfs_test@read_all_entries.html
    - fi-snb-2600:        [PASS][27] -> [INCOMPLETE][28]
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-snb-2600/igt@debugfs_test@read_all_entries.html
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-snb-2600/igt@debugfs_test@read_all_entries.html
    - fi-skl-guc:         [PASS][29] -> [INCOMPLETE][30]
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-skl-guc/igt@debugfs_test@read_all_entries.html
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-skl-guc/igt@debugfs_test@read_all_entries.html
    - fi-cfl-8700k:       [PASS][31] -> [INCOMPLETE][32]
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-cfl-8700k/igt@debugfs_test@read_all_entries.html
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-cfl-8700k/igt@debugfs_test@read_all_entries.html
    - fi-bsw-nick:        [PASS][33] -> [INCOMPLETE][34]
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-bsw-nick/igt@debugfs_test@read_all_entries.html
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-bsw-nick/igt@debugfs_test@read_all_entries.html
    - fi-rkl-11600:       [PASS][35] -> [INCOMPLETE][36]
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-rkl-11600/igt@debugfs_test@read_all_entries.html
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-rkl-11600/igt@debugfs_test@read_all_entries.html
    - fi-bsw-kefka:       [PASS][37] -> [INCOMPLETE][38]
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-bsw-kefka/igt@debugfs_test@read_all_entries.html
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-bsw-kefka/igt@debugfs_test@read_all_entries.html
    - fi-adl-ddr5:        [PASS][39] -> [INCOMPLETE][40]
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-adl-ddr5/igt@debugfs_test@read_all_entries.html
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-adl-ddr5/igt@debugfs_test@read_all_entries.html

  * igt@i915_hangman@error-state-basic:
    - bat-dg1-5:          [PASS][41] -> [INCOMPLETE][42]
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/bat-dg1-5/igt@i915_hangman@error-state-basic.html
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-dg1-5/igt@i915_hangman@error-state-basic.html
    - fi-rkl-guc:         [PASS][43] -> [INCOMPLETE][44]
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-rkl-guc/igt@i915_hangman@error-state-basic.html
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-rkl-guc/igt@i915_hangman@error-state-basic.html
    - bat-dg1-6:          [PASS][45] -> [INCOMPLETE][46]
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/bat-dg1-6/igt@i915_hangman@error-state-basic.html
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-dg1-6/igt@i915_hangman@error-state-basic.html

  * igt@runner@aborted:
    - fi-apl-guc:         NOTRUN -> [FAIL][47]
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-apl-guc/igt@runner@aborted.html
    - bat-dg1-5:          NOTRUN -> [FAIL][48]
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-dg1-5/igt@runner@aborted.html
    - fi-rkl-guc:         NOTRUN -> [FAIL][49]
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-rkl-guc/igt@runner@aborted.html
    - bat-dg1-6:          NOTRUN -> [FAIL][50]
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-dg1-6/igt@runner@aborted.html

  
#### Warnings ####

  * igt@debugfs_test@read_all_entries:
    - fi-icl-u2:          [ABORT][51] ([i915#7763]) -> [INCOMPLETE][52]
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-icl-u2/igt@debugfs_test@read_all_entries.html
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-icl-u2/igt@debugfs_test@read_all_entries.html

  
#### Suppressed ####

  The following results come from untrusted machines, tests, or statuses.
  They do not affect the overall result.

  * igt@debugfs_test@read_all_entries:
    - {bat-jsl-1}:        [PASS][53] -> [INCOMPLETE][54]
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/bat-jsl-1/igt@debugfs_test@read_all_entries.html
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-jsl-1/igt@debugfs_test@read_all_entries.html
    - {fi-jsl-1}:         [PASS][55] -> [INCOMPLETE][56]
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-jsl-1/igt@debugfs_test@read_all_entries.html
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-jsl-1/igt@debugfs_test@read_all_entries.html
    - {bat-kbl-2}:        [PASS][57] -> [INCOMPLETE][58]
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/bat-kbl-2/igt@debugfs_test@read_all_entries.html
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-kbl-2/igt@debugfs_test@read_all_entries.html
    - {bat-adls-5}:       [PASS][59] -> [INCOMPLETE][60]
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/bat-adls-5/igt@debugfs_test@read_all_entries.html
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-adls-5/igt@debugfs_test@read_all_entries.html
    - {bat-jsl-3}:        [PASS][61] -> [INCOMPLETE][62]
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/bat-jsl-3/igt@debugfs_test@read_all_entries.html
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-jsl-3/igt@debugfs_test@read_all_entries.html

  * igt@i915_hangman@error-state-basic:
    - {bat-dg2-9}:        [PASS][63] -> [INCOMPLETE][64]
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/bat-dg2-9/igt@i915_hangman@error-state-basic.html
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-dg2-9/igt@i915_hangman@error-state-basic.html
    - {bat-adln-1}:       [PASS][65] -> [INCOMPLETE][66]
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/bat-adln-1/igt@i915_hangman@error-state-basic.html
   [66]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-adln-1/igt@i915_hangman@error-state-basic.html
    - {bat-rpls-2}:       NOTRUN -> [INCOMPLETE][67]
   [67]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-rpls-2/igt@i915_hangman@error-state-basic.html
    - {bat-dg2-8}:        [PASS][68] -> [INCOMPLETE][69]
   [68]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/bat-dg2-8/igt@i915_hangman@error-state-basic.html
   [69]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-dg2-8/igt@i915_hangman@error-state-basic.html
    - {bat-adlm-1}:       [PASS][70] -> [INCOMPLETE][71]
   [70]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/bat-adlm-1/igt@i915_hangman@error-state-basic.html
   [71]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-adlm-1/igt@i915_hangman@error-state-basic.html
    - {bat-rpls-1}:       [PASS][72] -> [INCOMPLETE][73]
   [72]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/bat-rpls-1/igt@i915_hangman@error-state-basic.html
   [73]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-rpls-1/igt@i915_hangman@error-state-basic.html
    - {bat-adlp-6}:       [PASS][74] -> [INCOMPLETE][75]
   [74]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/bat-adlp-6/igt@i915_hangman@error-state-basic.html
   [75]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-adlp-6/igt@i915_hangman@error-state-basic.html
    - {bat-dg1-7}:        [PASS][76] -> [INCOMPLETE][77]
   [76]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/bat-dg1-7/igt@i915_hangman@error-state-basic.html
   [77]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-dg1-7/igt@i915_hangman@error-state-basic.html
    - {bat-adlp-9}:       [PASS][78] -> [INCOMPLETE][79]
   [78]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/bat-adlp-9/igt@i915_hangman@error-state-basic.html
   [79]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-adlp-9/igt@i915_hangman@error-state-basic.html
    - {bat-dg2-11}:       [PASS][80] -> [INCOMPLETE][81]
   [80]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/bat-dg2-11/igt@i915_hangman@error-state-basic.html
   [81]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-dg2-11/igt@i915_hangman@error-state-basic.html
    - {bat-dg2-oem1}:     [PASS][82] -> [INCOMPLETE][83]
   [82]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/bat-dg2-oem1/igt@i915_hangman@error-state-basic.html
   [83]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-dg2-oem1/igt@i915_hangman@error-state-basic.html
    - {bat-rplp-1}:       [PASS][84] -> [INCOMPLETE][85]
   [84]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/bat-rplp-1/igt@i915_hangman@error-state-basic.html
   [85]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-rplp-1/igt@i915_hangman@error-state-basic.html

  * igt@runner@aborted:
    - {bat-dg2-9}:        NOTRUN -> [FAIL][86]
   [86]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/bat-dg2-9/igt@runner@aborted.html

  
Known issues
------------

  Here are the changes found in Patchwork_111454v3 that come from known issues:

### CI changes ###

#### Issues hit ####

  * boot:
    - fi-ilk-650:         [PASS][87] -> [FAIL][88] ([i915#7350])
   [87]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12594/fi-ilk-650/boot.html
   [88]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-ilk-650/boot.html

  

### IGT changes ###

#### Issues hit ####

  * igt@runner@aborted:
    - fi-skl-6700k2:      NOTRUN -> [FAIL][89] ([i915#4312])
   [89]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-skl-6700k2/igt@runner@aborted.html
    - fi-cfl-8109u:       NOTRUN -> [FAIL][90] ([i915#4312])
   [90]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-cfl-8109u/igt@runner@aborted.html
    - fi-ivb-3770:        NOTRUN -> [FAIL][91] ([i915#4312])
   [91]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-ivb-3770/igt@runner@aborted.html
    - fi-elk-e7500:       NOTRUN -> [FAIL][92] ([i915#4312])
   [92]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-elk-e7500/igt@runner@aborted.html
    - fi-snb-2600:        NOTRUN -> [FAIL][93] ([i915#4312])
   [93]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-snb-2600/igt@runner@aborted.html
    - fi-ctg-p8600:       NOTRUN -> [FAIL][94] ([i915#4312])
   [94]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-ctg-p8600/igt@runner@aborted.html
    - fi-blb-e6850:       NOTRUN -> [FAIL][95] ([i915#4312])
   [95]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-blb-e6850/igt@runner@aborted.html
    - fi-bsw-n3050:       NOTRUN -> [FAIL][96] ([i915#4312])
   [96]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-bsw-n3050/igt@runner@aborted.html
    - fi-skl-6600u:       NOTRUN -> [FAIL][97] ([i915#4312])
   [97]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-skl-6600u/igt@runner@aborted.html
    - fi-pnv-d510:        NOTRUN -> [FAIL][98] ([i915#4312])
   [98]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-pnv-d510/igt@runner@aborted.html
    - fi-glk-j4005:       NOTRUN -> [FAIL][99] ([i915#4312])
   [99]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-glk-j4005/igt@runner@aborted.html
    - fi-skl-guc:         NOTRUN -> [FAIL][100] ([i915#4312])
   [100]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-skl-guc/igt@runner@aborted.html
    - fi-kbl-soraka:      NOTRUN -> [FAIL][101] ([i915#4312] / [i915#4991])
   [101]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-kbl-soraka/igt@runner@aborted.html
    - fi-kbl-7567u:       NOTRUN -> [FAIL][102] ([i915#4312])
   [102]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-kbl-7567u/igt@runner@aborted.html
    - fi-cfl-8700k:       NOTRUN -> [FAIL][103] ([i915#4312])
   [103]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-cfl-8700k/igt@runner@aborted.html
    - fi-bsw-nick:        NOTRUN -> [FAIL][104] ([i915#4312])
   [104]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-bsw-nick/igt@runner@aborted.html
    - fi-rkl-11600:       NOTRUN -> [FAIL][105] ([i915#4312])
   [105]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-rkl-11600/igt@runner@aborted.html
    - fi-bsw-kefka:       NOTRUN -> [FAIL][106] ([i915#4312])
   [106]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-bsw-kefka/igt@runner@aborted.html
    - fi-adl-ddr5:        NOTRUN -> [FAIL][107] ([i915#4312])
   [107]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-adl-ddr5/igt@runner@aborted.html
    - fi-cfl-guc:         NOTRUN -> [FAIL][108] ([i915#4312])
   [108]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/fi-cfl-guc/igt@runner@aborted.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [i915#2582]: https://gitlab.freedesktop.org/drm/intel/issues/2582
  [i915#3282]: https://gitlab.freedesktop.org/drm/intel/issues/3282
  [i915#4312]: https://gitlab.freedesktop.org/drm/intel/issues/4312
  [i915#4991]: https://gitlab.freedesktop.org/drm/intel/issues/4991
  [i915#7350]: https://gitlab.freedesktop.org/drm/intel/issues/7350
  [i915#7456]: https://gitlab.freedesktop.org/drm/intel/issues/7456
  [i915#7763]: https://gitlab.freedesktop.org/drm/intel/issues/7763


Build changes
-------------

  * Linux: CI_DRM_12594 -> Patchwork_111454v3

  CI-20190529: 20190529
  CI_DRM_12594: 5cec9cff5436577179bab7b52de0465ba169691a @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_7121: aa16e81259f59734230d441905b9d0f605e4a4b5 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_111454v3: 5cec9cff5436577179bab7b52de0465ba169691a @ git://anongit.freedesktop.org/gfx-ci/linux


### Linux commits

bbfe814bb44f drm/i915/guc: Add a debug print on GuC triggered reset
820c31964ca3 drm/i915/guc: Look for a guilty context when an engine reset fails
3c3cc3702cb7 drm/i915: Allow error capture of a pending request
7b9165e8a786 drm/i915: Allow error capture without a request
94efccec5768 drm/i915: Fix request locking during error capture & debugfs dump

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_111454v3/index.html

[-- Attachment #2: Type: text/html, Size: 22244 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump
  2023-01-17 21:36   ` [Intel-gfx] " John.C.Harrison
@ 2023-01-18  8:29     ` Andy Shevchenko
  -1 siblings, 0 replies; 30+ messages in thread
From: Andy Shevchenko @ 2023-01-18  8:29 UTC (permalink / raw)
  To: John.C.Harrison
  Cc: Matthew Brost, Tvrtko Ursulin, Bruce Chang, Michael Cheng,
	Aravind Iddamsetty, Alan Previn, Umesh Nerlige Ramappa,
	Intel-GFX, Lucas De Marchi, Chris Wilson, Daniele Ceraolo Spurio,
	DRI-Devel, Andrzej Hajda, Rodrigo Vivi, Tejas Upadhyay,
	Matthew Auld

On Tue, Jan 17, 2023 at 01:36:26PM -0800, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> When GuC support was added to error capture, the locking around the
> request object was broken. Fix it up.
> 
> The context based search manages the spinlocking around the search
> internally. So it needs to grab the reference count internally as
> well. The execlist only request based search relies on external
> locking, so it needs an external reference count. So no change to that
> code itself but the context version does change.
> 
> The only other caller is the code for dumping engine state to debugfs.
> That code wasn't previously getting an explicit reference at all as it
> does everything while holding the execlist specific spinlock. So that
> needs updaing as well as that spinlock doesn't help when using GuC
> submission. Rather than trying to conditionally get/put depending on
> submission model, just change it to always do the get/put.
> 
> In addition, intel_guc_find_hung_context() was not acquiring the
> correct spinlock before searching the request list. So fix that up too.

> Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset
> with GuC")

Must be one line.

> Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")

> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: John Harrison <John.C.Harrison@Intel.com>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> Cc: Michael Cheng <michael.cheng@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
> Cc: Bruce Chang <yu.bruce.chang@intel.com>
> Cc: intel-gfx@lists.freedesktop.org

Is it possible to utilize --to --cc parameters to git send-email instead of
noisy Cc list?

...

> +	if (hung_rq)
> +		i915_request_put(hung_rq);

In Linux kernel the idiom is that freeing resources APIs should be NULL-aware
(or ERR_PTR aware or both). Does i915 follows that? If so, the test should be
inside i915_request_put() rather than in any of the callers.

...

> @@ -4847,6 +4857,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>  			xa_lock(&guc->context_lookup);
>  			goto done;
>  		}
> +
>  next:
>  		intel_context_put(ce);
>  		xa_lock(&guc->context_lookup);

Stray change.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump
@ 2023-01-18  8:29     ` Andy Shevchenko
  0 siblings, 0 replies; 30+ messages in thread
From: Andy Shevchenko @ 2023-01-18  8:29 UTC (permalink / raw)
  To: John.C.Harrison
  Cc: Michael Cheng, Alan Previn, Intel-GFX, Lucas De Marchi,
	Chris Wilson, DRI-Devel, Andrzej Hajda, Rodrigo Vivi,
	Tejas Upadhyay, Matthew Auld

On Tue, Jan 17, 2023 at 01:36:26PM -0800, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> When GuC support was added to error capture, the locking around the
> request object was broken. Fix it up.
> 
> The context based search manages the spinlocking around the search
> internally. So it needs to grab the reference count internally as
> well. The execlist only request based search relies on external
> locking, so it needs an external reference count. So no change to that
> code itself but the context version does change.
> 
> The only other caller is the code for dumping engine state to debugfs.
> That code wasn't previously getting an explicit reference at all as it
> does everything while holding the execlist specific spinlock. So that
> needs updaing as well as that spinlock doesn't help when using GuC
> submission. Rather than trying to conditionally get/put depending on
> submission model, just change it to always do the get/put.
> 
> In addition, intel_guc_find_hung_context() was not acquiring the
> correct spinlock before searching the request list. So fix that up too.

> Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset
> with GuC")

Must be one line.

> Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")

> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: John Harrison <John.C.Harrison@Intel.com>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> Cc: Michael Cheng <michael.cheng@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
> Cc: Bruce Chang <yu.bruce.chang@intel.com>
> Cc: intel-gfx@lists.freedesktop.org

Is it possible to utilize --to --cc parameters to git send-email instead of
noisy Cc list?

...

> +	if (hung_rq)
> +		i915_request_put(hung_rq);

In Linux kernel the idiom is that freeing resources APIs should be NULL-aware
(or ERR_PTR aware or both). Does i915 follows that? If so, the test should be
inside i915_request_put() rather than in any of the callers.

...

> @@ -4847,6 +4857,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>  			xa_lock(&guc->context_lookup);
>  			goto done;
>  		}
> +
>  next:
>  		intel_context_put(ce);
>  		xa_lock(&guc->context_lookup);

Stray change.

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump
  2023-01-17 21:36   ` [Intel-gfx] " John.C.Harrison
@ 2023-01-18 16:22     ` Tvrtko Ursulin
  -1 siblings, 0 replies; 30+ messages in thread
From: Tvrtko Ursulin @ 2023-01-18 16:22 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX
  Cc: Matthew Brost, Andy Shevchenko, Michael Cheng,
	Aravind Iddamsetty, Alan Previn, Umesh Nerlige Ramappa,
	Lucas De Marchi, Chris Wilson, Bruce Chang,
	Daniele Ceraolo Spurio, DRI-Devel, Andrzej Hajda, Rodrigo Vivi,
	Tejas Upadhyay, Matthew Auld


On 17/01/2023 21:36, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> When GuC support was added to error capture, the locking around the
> request object was broken. Fix it up.
> 
> The context based search manages the spinlocking around the search
> internally. So it needs to grab the reference count internally as
> well. The execlist only request based search relies on external
> locking, so it needs an external reference count. So no change to that
> code itself but the context version does change.
> 
> The only other caller is the code for dumping engine state to debugfs.
> That code wasn't previously getting an explicit reference at all as it
> does everything while holding the execlist specific spinlock. So that
> needs updaing as well as that spinlock doesn't help when using GuC
> submission. Rather than trying to conditionally get/put depending on
> submission model, just change it to always do the get/put.
> 
> In addition, intel_guc_find_hung_context() was not acquiring the
> correct spinlock before searching the request list. So fix that up too.
> 
> Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset
> with GuC")
> Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: John Harrison <John.C.Harrison@Intel.com>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> Cc: Michael Cheng <michael.cheng@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
> Cc: Bruce Chang <yu.bruce.chang@intel.com>
> Cc: intel-gfx@lists.freedesktop.org
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c           |  1 +
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c         |  7 ++++++-
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 +++++++++++
>   drivers/gpu/drm/i915/i915_gpu_error.c             |  5 ++---
>   4 files changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index e94365b08f1ef..df64cf1954c1d 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -552,6 +552,7 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
>   
>   		active = rq;
>   	}
> +	active = i915_request_get_rcu(active);
>   	spin_unlock_irqrestore(&parent->guc_state.lock, flags);
>   
>   	return active;
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 922f1bb22dc68..517d1fb7ae333 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -2236,10 +2236,13 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
>   	guc = intel_uc_uses_guc_submission(&engine->gt->uc);
>   	if (guc) {
>   		ce = intel_engine_get_hung_context(engine);
> -		if (ce)
> +		if (ce) {
> +			/* This will reference count the request (if found) */
>   			hung_rq = intel_context_find_active_request(ce);
> +		}
>   	} else {
>   		hung_rq = intel_engine_execlist_find_hung_request(engine);
> +		hung_rq = i915_request_get_rcu(hung_rq);

Looks like intel_engine_execlist_find_hung_request can return NULL which 
i915_request_get_rcu will not handle.

Maybe it would come up simpler if intel_context_find_active_request 
wouldn't be getting the reference and then you can get one here at a 
single place for both branches?

>   	}
>   
>   	if (hung_rq)
> @@ -2250,6 +2253,8 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
>   	else
>   		intel_engine_dump_active_requests(&engine->sched_engine->requests,
>   						  hung_rq, m);
> +	if (hung_rq)
> +		i915_request_put(hung_rq);
>   }
>   
>   void intel_engine_dump(struct intel_engine_cs *engine,
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index b436dd7f12e42..3b34a82d692be 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -4820,6 +4820,8 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>   
>   	xa_lock_irqsave(&guc->context_lookup, flags);
>   	xa_for_each(&guc->context_lookup, index, ce) {
> +		bool found;
> +
>   		if (!kref_get_unless_zero(&ce->ref))
>   			continue;
>   
> @@ -4836,10 +4838,18 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>   				goto next;
>   		}
>   
> +		found = false;
> +		spin_lock(&ce->guc_state.lock);
>   		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
>   			if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
>   				continue;
>   
> +			found = true;
> +			break;
> +		}
> +		spin_unlock(&ce->guc_state.lock);
> +
> +		if (found) {
>   			intel_engine_set_hung_context(engine, ce);
>   
>   			/* Can only cope with one hang at a time... */
> @@ -4847,6 +4857,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>   			xa_lock(&guc->context_lookup);
>   			goto done;
>   		}
> +
>   next:
>   		intel_context_put(ce);
>   		xa_lock(&guc->context_lookup);

This hunk I have to leave for someone who know the GuC backend well.

Regards,

Tvrtko

> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 9d5d5a397b64e..4107a0dfcca7d 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1607,6 +1607,7 @@ capture_engine(struct intel_engine_cs *engine,
>   	ce = intel_engine_get_hung_context(engine);
>   	if (ce) {
>   		intel_engine_clear_hung_context(engine);
> +		/* This will reference count the request (if found) */
>   		rq = intel_context_find_active_request(ce);
>   		if (!rq || !i915_request_started(rq))
>   			goto no_request_capture;
> @@ -1618,13 +1619,11 @@ capture_engine(struct intel_engine_cs *engine,
>   		if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
>   			spin_lock_irqsave(&engine->sched_engine->lock, flags);
>   			rq = intel_engine_execlist_find_hung_request(engine);
> +			rq = i915_request_get_rcu(rq);
>   			spin_unlock_irqrestore(&engine->sched_engine->lock,
>   					       flags);
>   		}
>   	}
> -	if (rq)
> -		rq = i915_request_get_rcu(rq);
> -
>   	if (!rq)
>   		goto no_request_capture;
>   

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump
@ 2023-01-18 16:22     ` Tvrtko Ursulin
  0 siblings, 0 replies; 30+ messages in thread
From: Tvrtko Ursulin @ 2023-01-18 16:22 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX
  Cc: Andy Shevchenko, Michael Cheng, Alan Previn, Lucas De Marchi,
	Chris Wilson, DRI-Devel, Andrzej Hajda, Rodrigo Vivi,
	Tejas Upadhyay, Matthew Auld


On 17/01/2023 21:36, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> When GuC support was added to error capture, the locking around the
> request object was broken. Fix it up.
> 
> The context based search manages the spinlocking around the search
> internally. So it needs to grab the reference count internally as
> well. The execlist only request based search relies on external
> locking, so it needs an external reference count. So no change to that
> code itself but the context version does change.
> 
> The only other caller is the code for dumping engine state to debugfs.
> That code wasn't previously getting an explicit reference at all as it
> does everything while holding the execlist specific spinlock. So that
> needs updaing as well as that spinlock doesn't help when using GuC
> submission. Rather than trying to conditionally get/put depending on
> submission model, just change it to always do the get/put.
> 
> In addition, intel_guc_find_hung_context() was not acquiring the
> correct spinlock before searching the request list. So fix that up too.
> 
> Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset
> with GuC")
> Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: John Harrison <John.C.Harrison@Intel.com>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
> Cc: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> Cc: Michael Cheng <michael.cheng@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
> Cc: Bruce Chang <yu.bruce.chang@intel.com>
> Cc: intel-gfx@lists.freedesktop.org
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c           |  1 +
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c         |  7 ++++++-
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 +++++++++++
>   drivers/gpu/drm/i915/i915_gpu_error.c             |  5 ++---
>   4 files changed, 20 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index e94365b08f1ef..df64cf1954c1d 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -552,6 +552,7 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
>   
>   		active = rq;
>   	}
> +	active = i915_request_get_rcu(active);
>   	spin_unlock_irqrestore(&parent->guc_state.lock, flags);
>   
>   	return active;
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 922f1bb22dc68..517d1fb7ae333 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -2236,10 +2236,13 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
>   	guc = intel_uc_uses_guc_submission(&engine->gt->uc);
>   	if (guc) {
>   		ce = intel_engine_get_hung_context(engine);
> -		if (ce)
> +		if (ce) {
> +			/* This will reference count the request (if found) */
>   			hung_rq = intel_context_find_active_request(ce);
> +		}
>   	} else {
>   		hung_rq = intel_engine_execlist_find_hung_request(engine);
> +		hung_rq = i915_request_get_rcu(hung_rq);

Looks like intel_engine_execlist_find_hung_request can return NULL which 
i915_request_get_rcu will not handle.

Maybe it would come up simpler if intel_context_find_active_request 
wouldn't be getting the reference and then you can get one here at a 
single place for both branches?

>   	}
>   
>   	if (hung_rq)
> @@ -2250,6 +2253,8 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
>   	else
>   		intel_engine_dump_active_requests(&engine->sched_engine->requests,
>   						  hung_rq, m);
> +	if (hung_rq)
> +		i915_request_put(hung_rq);
>   }
>   
>   void intel_engine_dump(struct intel_engine_cs *engine,
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index b436dd7f12e42..3b34a82d692be 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -4820,6 +4820,8 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>   
>   	xa_lock_irqsave(&guc->context_lookup, flags);
>   	xa_for_each(&guc->context_lookup, index, ce) {
> +		bool found;
> +
>   		if (!kref_get_unless_zero(&ce->ref))
>   			continue;
>   
> @@ -4836,10 +4838,18 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>   				goto next;
>   		}
>   
> +		found = false;
> +		spin_lock(&ce->guc_state.lock);
>   		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
>   			if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
>   				continue;
>   
> +			found = true;
> +			break;
> +		}
> +		spin_unlock(&ce->guc_state.lock);
> +
> +		if (found) {
>   			intel_engine_set_hung_context(engine, ce);
>   
>   			/* Can only cope with one hang at a time... */
> @@ -4847,6 +4857,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>   			xa_lock(&guc->context_lookup);
>   			goto done;
>   		}
> +
>   next:
>   		intel_context_put(ce);
>   		xa_lock(&guc->context_lookup);

This hunk I have to leave for someone who know the GuC backend well.

Regards,

Tvrtko

> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 9d5d5a397b64e..4107a0dfcca7d 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1607,6 +1607,7 @@ capture_engine(struct intel_engine_cs *engine,
>   	ce = intel_engine_get_hung_context(engine);
>   	if (ce) {
>   		intel_engine_clear_hung_context(engine);
> +		/* This will reference count the request (if found) */
>   		rq = intel_context_find_active_request(ce);
>   		if (!rq || !i915_request_started(rq))
>   			goto no_request_capture;
> @@ -1618,13 +1619,11 @@ capture_engine(struct intel_engine_cs *engine,
>   		if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
>   			spin_lock_irqsave(&engine->sched_engine->lock, flags);
>   			rq = intel_engine_execlist_find_hung_request(engine);
> +			rq = i915_request_get_rcu(rq);
>   			spin_unlock_irqrestore(&engine->sched_engine->lock,
>   					       flags);
>   		}
>   	}
> -	if (rq)
> -		rq = i915_request_get_rcu(rq);
> -
>   	if (!rq)
>   		goto no_request_capture;
>   

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH v2 2/5] drm/i915: Allow error capture without a request
  2023-01-17 21:36   ` [Intel-gfx] " John.C.Harrison
  (?)
@ 2023-01-18 16:34   ` Tvrtko Ursulin
  -1 siblings, 0 replies; 30+ messages in thread
From: Tvrtko Ursulin @ 2023-01-18 16:34 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel


On 17/01/2023 21:36, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> There was a report of error captures occurring without any hung
> context being indicated despite the capture being initiated by a 'hung
> context notification' from GuC. The problem was not reproducible.
> However, it is possible to happen if the context in question has no
> active requests. For example, if the hang was in the context switch
> itself then the breadcrumb write would have occurred and the KMD would
> see an idle context.
> 
> In the interests of attempting to provide as much information as
> possible about a hang, it seems wise to include the engine info
> regardless of whether a request was found or not. As opposed to just
> prentending there was no hang at all.
> 
> So update the error capture code to always record engine information
> if an engine is given. Which means updating record_context() to take a
> context instead of a request (which it only ever used to find the
> context anyway). And split the request agnostic parts of
> intel_engine_coredump_add_request() out into a seaprate function.
> 
> v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null
> pointer.
> v3: Tidy up request locking code flow (Tvrtko)
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gpu_error.c | 69 ++++++++++++++++++---------
>   1 file changed, 46 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 4107a0dfcca7d..461489d599a7e 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct intel_engine_coredump *ee)
>   }
>   
>   static bool record_context(struct i915_gem_context_coredump *e,
> -			   const struct i915_request *rq)
> +			   struct intel_context *ce)
>   {
>   	struct i915_gem_context *ctx;
>   	struct task_struct *task;
>   	bool simulated;
>   
>   	rcu_read_lock();
> -	ctx = rcu_dereference(rq->context->gem_context);
> +	ctx = rcu_dereference(ce->gem_context);
>   	if (ctx && !kref_get_unless_zero(&ctx->ref))
>   		ctx = NULL;
>   	rcu_read_unlock();
> @@ -1396,8 +1396,8 @@ static bool record_context(struct i915_gem_context_coredump *e,
>   	e->guilty = atomic_read(&ctx->guilty_count);
>   	e->active = atomic_read(&ctx->active_count);
>   
> -	e->total_runtime = intel_context_get_total_runtime_ns(rq->context);
> -	e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context);
> +	e->total_runtime = intel_context_get_total_runtime_ns(ce);
> +	e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
>   
>   	simulated = i915_gem_context_no_error_capture(ctx);
>   
> @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp, u32 dump_
>   	return ee;
>   }
>   
> +static struct intel_engine_capture_vma *
> +engine_coredump_add_context(struct intel_engine_coredump *ee,
> +			    struct intel_context *ce,
> +			    gfp_t gfp)
> +{
> +	struct intel_engine_capture_vma *vma = NULL;
> +
> +	ee->simulated |= record_context(&ee->context, ce);
> +	if (ee->simulated)
> +		return NULL;
> +
> +	/*
> +	 * We need to copy these to an anonymous buffer
> +	 * as the simplest method to avoid being overwritten
> +	 * by userspace.
> +	 */
> +	vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
> +	vma = capture_vma(vma, ce->state, "HW context", gfp);
> +
> +	return vma;
> +}
> +
>   struct intel_engine_capture_vma *
>   intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
>   				  struct i915_request *rq,
>   				  gfp_t gfp)
>   {
> -	struct intel_engine_capture_vma *vma = NULL;
> +	struct intel_engine_capture_vma *vma;
>   
> -	ee->simulated |= record_context(&ee->context, rq);
> -	if (ee->simulated)
> +	vma = engine_coredump_add_context(ee, rq->context, gfp);
> +	if (!vma)
>   		return NULL;
>   
>   	/*
> @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
>   	 */
>   	vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch");
>   	vma = capture_user(vma, rq, gfp);
> -	vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
> -	vma = capture_vma(vma, rq->context->state, "HW context", gfp);
>   
>   	ee->rq_head = rq->head;
>   	ee->rq_post = rq->postfix;
> @@ -1609,8 +1629,12 @@ capture_engine(struct intel_engine_cs *engine,
>   		intel_engine_clear_hung_context(engine);
>   		/* This will reference count the request (if found) */
>   		rq = intel_context_find_active_request(ce);
> -		if (!rq || !i915_request_started(rq))
> -			goto no_request_capture;
> +		if (rq && !i915_request_started(rq)) {
> +			drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n",
> +				 engine->name);
> +			i915_request_put(rq);
> +			rq = NULL;
> +		}
>   	} else {
>   		/*
>   		 * Getting here with GuC enabled means it is a forced error capture
> @@ -1624,25 +1648,24 @@ capture_engine(struct intel_engine_cs *engine,
>   					       flags);
>   		}
>   	}
> -	if (!rq)
> -		goto no_request_capture;
> -
> -	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
> -	if (!capture) {
> +	if (rq) {
> +		capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
>   		i915_request_put(rq);
> -		goto no_request_capture;
> +	} else if (ce) {
> +		capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
>   	}
> +
>   	if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>   		intel_guc_capture_get_matching_node(engine->gt, ee, ce);
>   
> -	intel_engine_coredump_add_vma(ee, capture, compress);
> -	i915_request_put(rq);
> +	if (capture) {
> +		intel_engine_coredump_add_vma(ee, capture, compress);
> +	} else {
> +		kfree(ee);
> +		ee = NULL;
> +	}
>   
>   	return ee;
> -
> -no_request_capture:
> -	kfree(ee);
> -	return NULL;
>   }
>   
>   static void

LGTM - regardless of how i915_request_get_rcu flow ends up:

Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH v2 3/5] drm/i915: Allow error capture of a pending request
  2023-01-17 21:36   ` [Intel-gfx] " John.C.Harrison
  (?)
@ 2023-01-18 16:35   ` Tvrtko Ursulin
  -1 siblings, 0 replies; 30+ messages in thread
From: Tvrtko Ursulin @ 2023-01-18 16:35 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel


On 17/01/2023 21:36, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> A hang situation has been observed where the only requests on the
> context were either completed or not yet started according to the
> breaadcrumbs. However, the register state claimed a batch was (maybe)
> in progress. So, allow capture of the pending request on the grounds
> that this might be better than nothing.
> 
> v2: Reword 'not started' warning message (Tvrtko)
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gpu_error.c | 9 +++------
>   1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 461489d599a7e..1d33822a8ca23 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1629,12 +1629,9 @@ capture_engine(struct intel_engine_cs *engine,
>   		intel_engine_clear_hung_context(engine);
>   		/* This will reference count the request (if found) */
>   		rq = intel_context_find_active_request(ce);
> -		if (rq && !i915_request_started(rq)) {
> -			drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n",
> -				 engine->name);
> -			i915_request_put(rq);
> -			rq = NULL;
> -		}
> +		if (rq && !i915_request_started(rq))
> +			drm_info(&engine->gt->i915->drm, "Got hung context on %s with active request %lld:%lld [0x%04X] not yet started\n",
> +				 engine->name, rq->fence.context, rq->fence.seqno, ce->guc_id.id);
>   	} else {
>   		/*
>   		 * Getting here with GuC enabled means it is a forced error capture

Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH v2 4/5] drm/i915/guc: Look for a guilty context when an engine reset fails
  2023-01-17 21:36   ` [Intel-gfx] " John.C.Harrison
  (?)
@ 2023-01-18 16:37   ` Tvrtko Ursulin
  -1 siblings, 0 replies; 30+ messages in thread
From: Tvrtko Ursulin @ 2023-01-18 16:37 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel


On 17/01/2023 21:36, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> Engine resets are supposed to never fail. But in the case when one
> does (due to unknown reasons that normally come down to a missing
> w/a), it is useful to get as much information out of the system as
> possible. Given that the GuC effectively dies on such a situation, it
> is not possible to get a guilty context notification back. So do a
> manual search instead. Given that GuC is dead, this is safe because
> GuC won't be changing the engine state asynchronously.
> 
> v2: Change comment to be less alarming (Tvrtko)
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 +++++++++++++++--
>   1 file changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 3b34a82d692be..9bc80b807dbcc 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -4754,11 +4754,24 @@ static void reset_fail_worker_func(struct work_struct *w)
>   	guc->submission_state.reset_fail_mask = 0;
>   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   
> -	if (likely(reset_fail_mask))
> +	if (likely(reset_fail_mask)) {
> +		struct intel_engine_cs *engine;
> +		enum intel_engine_id id;
> +
> +		/*
> +		 * GuC is toast at this point - it dead loops after sending the failed
> +		 * reset notification. So need to manually determine the guilty context.
> +		 * Note that it should be reliable to do this here because the GuC is
> +		 * toast and will not be scheduling behind the KMD's back.
> +		 */
> +		for_each_engine_masked(engine, gt, reset_fail_mask, id)
> +			intel_guc_find_hung_context(engine);
> +
>   		intel_gt_handle_error(gt, reset_fail_mask,
>   				      I915_ERROR_CAPTURE,
> -				      "GuC failed to reset engine mask=0x%x\n",
> +				      "GuC failed to reset engine mask=0x%x",
>   				      reset_fail_mask);
> +	}
>   }
>   
>   int intel_guc_engine_failure_process_msg(struct intel_guc *guc,

Assuming 1/5 gets blessed by GuC experts this would then look safe to:

Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump
  2023-01-18  8:29     ` [Intel-gfx] " Andy Shevchenko
@ 2023-01-18 17:34       ` John Harrison
  -1 siblings, 0 replies; 30+ messages in thread
From: John Harrison @ 2023-01-18 17:34 UTC (permalink / raw)
  To: Andy Shevchenko; +Cc: Tvrtko Ursulin, Intel-GFX, DRI-Devel

On 1/18/2023 00:29, Andy Shevchenko wrote:
> On Tue, Jan 17, 2023 at 01:36:26PM -0800, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> When GuC support was added to error capture, the locking around the
>> request object was broken. Fix it up.
>>
>> The context based search manages the spinlocking around the search
>> internally. So it needs to grab the reference count internally as
>> well. The execlist only request based search relies on external
>> locking, so it needs an external reference count. So no change to that
>> code itself but the context version does change.
>>
>> The only other caller is the code for dumping engine state to debugfs.
>> That code wasn't previously getting an explicit reference at all as it
>> does everything while holding the execlist specific spinlock. So that
>> needs updaing as well as that spinlock doesn't help when using GuC
>> submission. Rather than trying to conditionally get/put depending on
>> submission model, just change it to always do the get/put.
>>
>> In addition, intel_guc_find_hung_context() was not acquiring the
>> correct spinlock before searching the request list. So fix that up too.
>> Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset
>> with GuC")
> Must be one line.
In my tree it is one line. git itself does the line wrap when creating 
the email. I missed that I need to manually unwrap it again before 
actually sending the email. Although the CI checkpatch also pointed this 
out in it's own obscure manner.

>
>> Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: John Harrison <John.C.Harrison@Intel.com>
>> Cc: Jani Nikula <jani.nikula@linux.intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Matthew Auld <matthew.auld@intel.com>
>> Cc: Matt Roper <matthew.d.roper@intel.com>
>> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>> Cc: Michael Cheng <michael.cheng@intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
>> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
>> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
>> Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
>> Cc: Bruce Chang <yu.bruce.chang@intel.com>
>> Cc: intel-gfx@lists.freedesktop.org
> Is it possible to utilize --to --cc parameters to git send-email instead of
> noisy Cc list?
This is the list auto-generated by the 'dim fixes' tool. I am told this 
is the officially correct way to create a fixes patch - copy the output 
from 'dim fixes' as is into the patch headers.

> ...
>
>> +	if (hung_rq)
>> +		i915_request_put(hung_rq);
> In Linux kernel the idiom is that freeing resources APIs should be NULL-aware
> (or ERR_PTR aware or both). Does i915 follows that? If so, the test should be
> inside i915_request_put() rather than in any of the callers.
That's as maybe. But this is how i915_request_put() currently works - it 
is simply a wrapper around 'dma_fence_put(&rq->fence);'. So passing in a 
null pointer will immediately cause a null pointer deref. If you want 
the put implementation to change and to re-work all its callers, that 
should be done in a separate patch and not piled on top of other changes.

>
> ...
>
>> @@ -4847,6 +4857,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>>   			xa_lock(&guc->context_lookup);
>>   			goto done;
>>   		}
>> +
>>   next:
>>   		intel_context_put(ce);
>>   		xa_lock(&guc->context_lookup);
> Stray change.
Intentional change to improve the readability of a function that is 
being modified by other changes in this patch.

John.

>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump
@ 2023-01-18 17:34       ` John Harrison
  0 siblings, 0 replies; 30+ messages in thread
From: John Harrison @ 2023-01-18 17:34 UTC (permalink / raw)
  To: Andy Shevchenko; +Cc: Intel-GFX, DRI-Devel

On 1/18/2023 00:29, Andy Shevchenko wrote:
> On Tue, Jan 17, 2023 at 01:36:26PM -0800, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> When GuC support was added to error capture, the locking around the
>> request object was broken. Fix it up.
>>
>> The context based search manages the spinlocking around the search
>> internally. So it needs to grab the reference count internally as
>> well. The execlist only request based search relies on external
>> locking, so it needs an external reference count. So no change to that
>> code itself but the context version does change.
>>
>> The only other caller is the code for dumping engine state to debugfs.
>> That code wasn't previously getting an explicit reference at all as it
>> does everything while holding the execlist specific spinlock. So that
>> needs updaing as well as that spinlock doesn't help when using GuC
>> submission. Rather than trying to conditionally get/put depending on
>> submission model, just change it to always do the get/put.
>>
>> In addition, intel_guc_find_hung_context() was not acquiring the
>> correct spinlock before searching the request list. So fix that up too.
>> Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset
>> with GuC")
> Must be one line.
In my tree it is one line. git itself does the line wrap when creating 
the email. I missed that I need to manually unwrap it again before 
actually sending the email. Although the CI checkpatch also pointed this 
out in it's own obscure manner.

>
>> Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: John Harrison <John.C.Harrison@Intel.com>
>> Cc: Jani Nikula <jani.nikula@linux.intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Matthew Auld <matthew.auld@intel.com>
>> Cc: Matt Roper <matthew.d.roper@intel.com>
>> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>> Cc: Michael Cheng <michael.cheng@intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
>> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
>> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
>> Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
>> Cc: Bruce Chang <yu.bruce.chang@intel.com>
>> Cc: intel-gfx@lists.freedesktop.org
> Is it possible to utilize --to --cc parameters to git send-email instead of
> noisy Cc list?
This is the list auto-generated by the 'dim fixes' tool. I am told this 
is the officially correct way to create a fixes patch - copy the output 
from 'dim fixes' as is into the patch headers.

> ...
>
>> +	if (hung_rq)
>> +		i915_request_put(hung_rq);
> In Linux kernel the idiom is that freeing resources APIs should be NULL-aware
> (or ERR_PTR aware or both). Does i915 follows that? If so, the test should be
> inside i915_request_put() rather than in any of the callers.
That's as maybe. But this is how i915_request_put() currently works - it 
is simply a wrapper around 'dma_fence_put(&rq->fence);'. So passing in a 
null pointer will immediately cause a null pointer deref. If you want 
the put implementation to change and to re-work all its callers, that 
should be done in a separate patch and not piled on top of other changes.

>
> ...
>
>> @@ -4847,6 +4857,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>>   			xa_lock(&guc->context_lookup);
>>   			goto done;
>>   		}
>> +
>>   next:
>>   		intel_context_put(ce);
>>   		xa_lock(&guc->context_lookup);
> Stray change.
Intentional change to improve the readability of a function that is 
being modified by other changes in this patch.

John.

>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump
  2023-01-18 17:34       ` [Intel-gfx] " John Harrison
@ 2023-01-18 17:54         ` Andy Shevchenko
  -1 siblings, 0 replies; 30+ messages in thread
From: Andy Shevchenko @ 2023-01-18 17:54 UTC (permalink / raw)
  To: John Harrison; +Cc: Tvrtko Ursulin, Intel-GFX, DRI-Devel

On Wed, Jan 18, 2023 at 09:34:47AM -0800, John Harrison wrote:
> On 1/18/2023 00:29, Andy Shevchenko wrote:
> > On Tue, Jan 17, 2023 at 01:36:26PM -0800, John.C.Harrison@Intel.com wrote:
> > > From: John Harrison <John.C.Harrison@Intel.com>
> > > 
> > > When GuC support was added to error capture, the locking around the
> > > request object was broken. Fix it up.
> > > 
> > > The context based search manages the spinlocking around the search
> > > internally. So it needs to grab the reference count internally as
> > > well. The execlist only request based search relies on external
> > > locking, so it needs an external reference count. So no change to that
> > > code itself but the context version does change.
> > > 
> > > The only other caller is the code for dumping engine state to debugfs.
> > > That code wasn't previously getting an explicit reference at all as it
> > > does everything while holding the execlist specific spinlock. So that
> > > needs updaing as well as that spinlock doesn't help when using GuC
> > > submission. Rather than trying to conditionally get/put depending on
> > > submission model, just change it to always do the get/put.
> > > 
> > > In addition, intel_guc_find_hung_context() was not acquiring the
> > > correct spinlock before searching the request list. So fix that up too.
> > > Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset
> > > with GuC")
> > Must be one line.
> In my tree it is one line. git itself does the line wrap when creating the
> email.

Can you elaborate? I never have had such issue with git send-email (starting
from v1.6.x of Git for sure).

> I missed that I need to manually unwrap it again before actually
> sending the email. Although the CI checkpatch also pointed this out in it's
> own obscure manner.

...

> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: John Harrison <John.C.Harrison@Intel.com>
> > > Cc: Jani Nikula <jani.nikula@linux.intel.com>
> > > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> > > Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> > > Cc: Andrzej Hajda <andrzej.hajda@intel.com>
> > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Matthew Auld <matthew.auld@intel.com>
> > > Cc: Matt Roper <matthew.d.roper@intel.com>
> > > Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> > > Cc: Michael Cheng <michael.cheng@intel.com>
> > > Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> > > Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
> > > Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> > > Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> > > Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
> > > Cc: Bruce Chang <yu.bruce.chang@intel.com>
> > > Cc: intel-gfx@lists.freedesktop.org
> > Is it possible to utilize --to --cc parameters to git send-email instead of
> > noisy Cc list?
> This is the list auto-generated by the 'dim fixes' tool. I am told this is
> the officially correct way to create a fixes patch - copy the output from
> 'dim fixes' as is into the patch headers.

Okay, so it may be question to the `dim` tool then...

...

> > Stray change.
> Intentional change to improve the readability of a function that is being
> modified by other changes in this patch.

But not described in the commit message. That's why "stray".

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump
@ 2023-01-18 17:54         ` Andy Shevchenko
  0 siblings, 0 replies; 30+ messages in thread
From: Andy Shevchenko @ 2023-01-18 17:54 UTC (permalink / raw)
  To: John Harrison; +Cc: Intel-GFX, DRI-Devel

On Wed, Jan 18, 2023 at 09:34:47AM -0800, John Harrison wrote:
> On 1/18/2023 00:29, Andy Shevchenko wrote:
> > On Tue, Jan 17, 2023 at 01:36:26PM -0800, John.C.Harrison@Intel.com wrote:
> > > From: John Harrison <John.C.Harrison@Intel.com>
> > > 
> > > When GuC support was added to error capture, the locking around the
> > > request object was broken. Fix it up.
> > > 
> > > The context based search manages the spinlocking around the search
> > > internally. So it needs to grab the reference count internally as
> > > well. The execlist only request based search relies on external
> > > locking, so it needs an external reference count. So no change to that
> > > code itself but the context version does change.
> > > 
> > > The only other caller is the code for dumping engine state to debugfs.
> > > That code wasn't previously getting an explicit reference at all as it
> > > does everything while holding the execlist specific spinlock. So that
> > > needs updaing as well as that spinlock doesn't help when using GuC
> > > submission. Rather than trying to conditionally get/put depending on
> > > submission model, just change it to always do the get/put.
> > > 
> > > In addition, intel_guc_find_hung_context() was not acquiring the
> > > correct spinlock before searching the request list. So fix that up too.
> > > Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset
> > > with GuC")
> > Must be one line.
> In my tree it is one line. git itself does the line wrap when creating the
> email.

Can you elaborate? I never have had such issue with git send-email (starting
from v1.6.x of Git for sure).

> I missed that I need to manually unwrap it again before actually
> sending the email. Although the CI checkpatch also pointed this out in it's
> own obscure manner.

...

> > > Cc: Matthew Brost <matthew.brost@intel.com>
> > > Cc: John Harrison <John.C.Harrison@Intel.com>
> > > Cc: Jani Nikula <jani.nikula@linux.intel.com>
> > > Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> > > Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> > > Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> > > Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> > > Cc: Andrzej Hajda <andrzej.hajda@intel.com>
> > > Cc: Chris Wilson <chris@chris-wilson.co.uk>
> > > Cc: Matthew Auld <matthew.auld@intel.com>
> > > Cc: Matt Roper <matthew.d.roper@intel.com>
> > > Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> > > Cc: Michael Cheng <michael.cheng@intel.com>
> > > Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> > > Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
> > > Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> > > Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> > > Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
> > > Cc: Bruce Chang <yu.bruce.chang@intel.com>
> > > Cc: intel-gfx@lists.freedesktop.org
> > Is it possible to utilize --to --cc parameters to git send-email instead of
> > noisy Cc list?
> This is the list auto-generated by the 'dim fixes' tool. I am told this is
> the officially correct way to create a fixes patch - copy the output from
> 'dim fixes' as is into the patch headers.

Okay, so it may be question to the `dim` tool then...

...

> > Stray change.
> Intentional change to improve the readability of a function that is being
> modified by other changes in this patch.

But not described in the commit message. That's why "stray".

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump
  2023-01-18 16:22     ` [Intel-gfx] " Tvrtko Ursulin
@ 2023-01-18 17:55       ` John Harrison
  -1 siblings, 0 replies; 30+ messages in thread
From: John Harrison @ 2023-01-18 17:55 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-GFX; +Cc: DRI-Devel

On 1/18/2023 08:22, Tvrtko Ursulin wrote:
> On 17/01/2023 21:36, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> When GuC support was added to error capture, the locking around the
>> request object was broken. Fix it up.
>>
>> The context based search manages the spinlocking around the search
>> internally. So it needs to grab the reference count internally as
>> well. The execlist only request based search relies on external
>> locking, so it needs an external reference count. So no change to that
>> code itself but the context version does change.
>>
>> The only other caller is the code for dumping engine state to debugfs.
>> That code wasn't previously getting an explicit reference at all as it
>> does everything while holding the execlist specific spinlock. So that
>> needs updaing as well as that spinlock doesn't help when using GuC
>> submission. Rather than trying to conditionally get/put depending on
>> submission model, just change it to always do the get/put.
>>
>> In addition, intel_guc_find_hung_context() was not acquiring the
>> correct spinlock before searching the request list. So fix that up too.
>>
>> Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full 
>> GPU reset
>> with GuC")
>> Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context 
>> reset")
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: John Harrison <John.C.Harrison@Intel.com>
>> Cc: Jani Nikula <jani.nikula@linux.intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Matthew Auld <matthew.auld@intel.com>
>> Cc: Matt Roper <matthew.d.roper@intel.com>
>> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>> Cc: Michael Cheng <michael.cheng@intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
>> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
>> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
>> Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
>> Cc: Bruce Chang <yu.bruce.chang@intel.com>
>> Cc: intel-gfx@lists.freedesktop.org
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_context.c           |  1 +
>>   drivers/gpu/drm/i915/gt/intel_engine_cs.c         |  7 ++++++-
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 +++++++++++
>>   drivers/gpu/drm/i915/i915_gpu_error.c             |  5 ++---
>>   4 files changed, 20 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c 
>> b/drivers/gpu/drm/i915/gt/intel_context.c
>> index e94365b08f1ef..df64cf1954c1d 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_context.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
>> @@ -552,6 +552,7 @@ struct i915_request 
>> *intel_context_find_active_request(struct intel_context *ce)
>>             active = rq;
>>       }
>> +    active = i915_request_get_rcu(active);
>>       spin_unlock_irqrestore(&parent->guc_state.lock, flags);
>>         return active;
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
>> b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> index 922f1bb22dc68..517d1fb7ae333 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> @@ -2236,10 +2236,13 @@ static void 
>> engine_dump_active_requests(struct intel_engine_cs *engine, struct d
>>       guc = intel_uc_uses_guc_submission(&engine->gt->uc);
>>       if (guc) {
>>           ce = intel_engine_get_hung_context(engine);
>> -        if (ce)
>> +        if (ce) {
>> +            /* This will reference count the request (if found) */
>>               hung_rq = intel_context_find_active_request(ce);
>> +        }
>>       } else {
>>           hung_rq = intel_engine_execlist_find_hung_request(engine);
>> +        hung_rq = i915_request_get_rcu(hung_rq);
>
> Looks like intel_engine_execlist_find_hung_request can return NULL 
> which i915_request_get_rcu will not handle.
Doh! That is correct.

>
> Maybe it would come up simpler if intel_context_find_active_request 
> wouldn't be getting the reference and then you can get one here at a 
> single place for both branches?
That would require moving the spinlock outside of 
intel_context_find_active_request so that it can be held while acquiring 
the request reference. And that means bleeding internal knowledge of 
which spinlock outside of the implementation and into the caller. As 
noted, the ideal would be extending the execlist implementation to do 
early tagging of the hung context/request at the point of hang 
detection. As opposed to rescanning the entire request list again at 
this point. And that will mean the lock being used inside 
'context_find_active' would be dependent upon GuC vs execlist backend. 
Which is an implementation detail we really should not be leaking out to 
the caller.

IMHO, it would be better to refactor engine_dump_active_requests() to 
acquire the sched_engine spinlock internally and only around the code 
which actually needs it (some of which is maybe execlist specific and 
not valid with GuC submission?). Certainly grabbing two independent 
spinlocks in a nested manner is not a good idea when there is no reason 
to do so.

John.

>
>>       }
>>         if (hung_rq)
>> @@ -2250,6 +2253,8 @@ static void engine_dump_active_requests(struct 
>> intel_engine_cs *engine, struct d
>>       else
>> intel_engine_dump_active_requests(&engine->sched_engine->requests,
>>                             hung_rq, m);
>> +    if (hung_rq)
>> +        i915_request_put(hung_rq);
>>   }
>>     void intel_engine_dump(struct intel_engine_cs *engine,
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index b436dd7f12e42..3b34a82d692be 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -4820,6 +4820,8 @@ void intel_guc_find_hung_context(struct 
>> intel_engine_cs *engine)
>>         xa_lock_irqsave(&guc->context_lookup, flags);
>>       xa_for_each(&guc->context_lookup, index, ce) {
>> +        bool found;
>> +
>>           if (!kref_get_unless_zero(&ce->ref))
>>               continue;
>>   @@ -4836,10 +4838,18 @@ void intel_guc_find_hung_context(struct 
>> intel_engine_cs *engine)
>>                   goto next;
>>           }
>>   +        found = false;
>> +        spin_lock(&ce->guc_state.lock);
>>           list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
>>               if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
>>                   continue;
>>   +            found = true;
>> +            break;
>> +        }
>> +        spin_unlock(&ce->guc_state.lock);
>> +
>> +        if (found) {
>>               intel_engine_set_hung_context(engine, ce);
>>                 /* Can only cope with one hang at a time... */
>> @@ -4847,6 +4857,7 @@ void intel_guc_find_hung_context(struct 
>> intel_engine_cs *engine)
>>               xa_lock(&guc->context_lookup);
>>               goto done;
>>           }
>> +
>>   next:
>>           intel_context_put(ce);
>>           xa_lock(&guc->context_lookup);
>
> This hunk I have to leave for someone who know the GuC backend well.
>
> Regards,
>
> Tvrtko
>
>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>> index 9d5d5a397b64e..4107a0dfcca7d 100644
>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>> @@ -1607,6 +1607,7 @@ capture_engine(struct intel_engine_cs *engine,
>>       ce = intel_engine_get_hung_context(engine);
>>       if (ce) {
>>           intel_engine_clear_hung_context(engine);
>> +        /* This will reference count the request (if found) */
>>           rq = intel_context_find_active_request(ce);
>>           if (!rq || !i915_request_started(rq))
>>               goto no_request_capture;
>> @@ -1618,13 +1619,11 @@ capture_engine(struct intel_engine_cs *engine,
>>           if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
>> spin_lock_irqsave(&engine->sched_engine->lock, flags);
>>               rq = intel_engine_execlist_find_hung_request(engine);
>> +            rq = i915_request_get_rcu(rq);
>> spin_unlock_irqrestore(&engine->sched_engine->lock,
>>                              flags);
>>           }
>>       }
>> -    if (rq)
>> -        rq = i915_request_get_rcu(rq);
>> -
>>       if (!rq)
>>           goto no_request_capture;


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump
@ 2023-01-18 17:55       ` John Harrison
  0 siblings, 0 replies; 30+ messages in thread
From: John Harrison @ 2023-01-18 17:55 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-GFX; +Cc: DRI-Devel

On 1/18/2023 08:22, Tvrtko Ursulin wrote:
> On 17/01/2023 21:36, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> When GuC support was added to error capture, the locking around the
>> request object was broken. Fix it up.
>>
>> The context based search manages the spinlocking around the search
>> internally. So it needs to grab the reference count internally as
>> well. The execlist only request based search relies on external
>> locking, so it needs an external reference count. So no change to that
>> code itself but the context version does change.
>>
>> The only other caller is the code for dumping engine state to debugfs.
>> That code wasn't previously getting an explicit reference at all as it
>> does everything while holding the execlist specific spinlock. So that
>> needs updaing as well as that spinlock doesn't help when using GuC
>> submission. Rather than trying to conditionally get/put depending on
>> submission model, just change it to always do the get/put.
>>
>> In addition, intel_guc_find_hung_context() was not acquiring the
>> correct spinlock before searching the request list. So fix that up too.
>>
>> Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full 
>> GPU reset
>> with GuC")
>> Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context 
>> reset")
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: John Harrison <John.C.Harrison@Intel.com>
>> Cc: Jani Nikula <jani.nikula@linux.intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>> Cc: Matthew Auld <matthew.auld@intel.com>
>> Cc: Matt Roper <matthew.d.roper@intel.com>
>> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>> Cc: Michael Cheng <michael.cheng@intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
>> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
>> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
>> Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
>> Cc: Bruce Chang <yu.bruce.chang@intel.com>
>> Cc: intel-gfx@lists.freedesktop.org
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_context.c           |  1 +
>>   drivers/gpu/drm/i915/gt/intel_engine_cs.c         |  7 ++++++-
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 11 +++++++++++
>>   drivers/gpu/drm/i915/i915_gpu_error.c             |  5 ++---
>>   4 files changed, 20 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c 
>> b/drivers/gpu/drm/i915/gt/intel_context.c
>> index e94365b08f1ef..df64cf1954c1d 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_context.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
>> @@ -552,6 +552,7 @@ struct i915_request 
>> *intel_context_find_active_request(struct intel_context *ce)
>>             active = rq;
>>       }
>> +    active = i915_request_get_rcu(active);
>>       spin_unlock_irqrestore(&parent->guc_state.lock, flags);
>>         return active;
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
>> b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> index 922f1bb22dc68..517d1fb7ae333 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> @@ -2236,10 +2236,13 @@ static void 
>> engine_dump_active_requests(struct intel_engine_cs *engine, struct d
>>       guc = intel_uc_uses_guc_submission(&engine->gt->uc);
>>       if (guc) {
>>           ce = intel_engine_get_hung_context(engine);
>> -        if (ce)
>> +        if (ce) {
>> +            /* This will reference count the request (if found) */
>>               hung_rq = intel_context_find_active_request(ce);
>> +        }
>>       } else {
>>           hung_rq = intel_engine_execlist_find_hung_request(engine);
>> +        hung_rq = i915_request_get_rcu(hung_rq);
>
> Looks like intel_engine_execlist_find_hung_request can return NULL 
> which i915_request_get_rcu will not handle.
Doh! That is correct.

>
> Maybe it would come up simpler if intel_context_find_active_request 
> wouldn't be getting the reference and then you can get one here at a 
> single place for both branches?
That would require moving the spinlock outside of 
intel_context_find_active_request so that it can be held while acquiring 
the request reference. And that means bleeding internal knowledge of 
which spinlock outside of the implementation and into the caller. As 
noted, the ideal would be extending the execlist implementation to do 
early tagging of the hung context/request at the point of hang 
detection. As opposed to rescanning the entire request list again at 
this point. And that will mean the lock being used inside 
'context_find_active' would be dependent upon GuC vs execlist backend. 
Which is an implementation detail we really should not be leaking out to 
the caller.

IMHO, it would be better to refactor engine_dump_active_requests() to 
acquire the sched_engine spinlock internally and only around the code 
which actually needs it (some of which is maybe execlist specific and 
not valid with GuC submission?). Certainly grabbing two independent 
spinlocks in a nested manner is not a good idea when there is no reason 
to do so.

John.

>
>>       }
>>         if (hung_rq)
>> @@ -2250,6 +2253,8 @@ static void engine_dump_active_requests(struct 
>> intel_engine_cs *engine, struct d
>>       else
>> intel_engine_dump_active_requests(&engine->sched_engine->requests,
>>                             hung_rq, m);
>> +    if (hung_rq)
>> +        i915_request_put(hung_rq);
>>   }
>>     void intel_engine_dump(struct intel_engine_cs *engine,
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index b436dd7f12e42..3b34a82d692be 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -4820,6 +4820,8 @@ void intel_guc_find_hung_context(struct 
>> intel_engine_cs *engine)
>>         xa_lock_irqsave(&guc->context_lookup, flags);
>>       xa_for_each(&guc->context_lookup, index, ce) {
>> +        bool found;
>> +
>>           if (!kref_get_unless_zero(&ce->ref))
>>               continue;
>>   @@ -4836,10 +4838,18 @@ void intel_guc_find_hung_context(struct 
>> intel_engine_cs *engine)
>>                   goto next;
>>           }
>>   +        found = false;
>> +        spin_lock(&ce->guc_state.lock);
>>           list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
>>               if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
>>                   continue;
>>   +            found = true;
>> +            break;
>> +        }
>> +        spin_unlock(&ce->guc_state.lock);
>> +
>> +        if (found) {
>>               intel_engine_set_hung_context(engine, ce);
>>                 /* Can only cope with one hang at a time... */
>> @@ -4847,6 +4857,7 @@ void intel_guc_find_hung_context(struct 
>> intel_engine_cs *engine)
>>               xa_lock(&guc->context_lookup);
>>               goto done;
>>           }
>> +
>>   next:
>>           intel_context_put(ce);
>>           xa_lock(&guc->context_lookup);
>
> This hunk I have to leave for someone who know the GuC backend well.
>
> Regards,
>
> Tvrtko
>
>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>> index 9d5d5a397b64e..4107a0dfcca7d 100644
>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>> @@ -1607,6 +1607,7 @@ capture_engine(struct intel_engine_cs *engine,
>>       ce = intel_engine_get_hung_context(engine);
>>       if (ce) {
>>           intel_engine_clear_hung_context(engine);
>> +        /* This will reference count the request (if found) */
>>           rq = intel_context_find_active_request(ce);
>>           if (!rq || !i915_request_started(rq))
>>               goto no_request_capture;
>> @@ -1618,13 +1619,11 @@ capture_engine(struct intel_engine_cs *engine,
>>           if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
>> spin_lock_irqsave(&engine->sched_engine->lock, flags);
>>               rq = intel_engine_execlist_find_hung_request(engine);
>> +            rq = i915_request_get_rcu(rq);
>> spin_unlock_irqrestore(&engine->sched_engine->lock,
>>                              flags);
>>           }
>>       }
>> -    if (rq)
>> -        rq = i915_request_get_rcu(rq);
>> -
>>       if (!rq)
>>           goto no_request_capture;


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump
  2023-01-18 17:54         ` [Intel-gfx] " Andy Shevchenko
@ 2023-01-18 18:18           ` John Harrison
  -1 siblings, 0 replies; 30+ messages in thread
From: John Harrison @ 2023-01-18 18:18 UTC (permalink / raw)
  To: Andy Shevchenko; +Cc: Tvrtko Ursulin, Intel-GFX, DRI-Devel

On 1/18/2023 09:54, Andy Shevchenko wrote:
> On Wed, Jan 18, 2023 at 09:34:47AM -0800, John Harrison wrote:
>> On 1/18/2023 00:29, Andy Shevchenko wrote:
>>> On Tue, Jan 17, 2023 at 01:36:26PM -0800, John.C.Harrison@Intel.com wrote:
>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>
>>>> When GuC support was added to error capture, the locking around the
>>>> request object was broken. Fix it up.
>>>>
>>>> The context based search manages the spinlocking around the search
>>>> internally. So it needs to grab the reference count internally as
>>>> well. The execlist only request based search relies on external
>>>> locking, so it needs an external reference count. So no change to that
>>>> code itself but the context version does change.
>>>>
>>>> The only other caller is the code for dumping engine state to debugfs.
>>>> That code wasn't previously getting an explicit reference at all as it
>>>> does everything while holding the execlist specific spinlock. So that
>>>> needs updaing as well as that spinlock doesn't help when using GuC
>>>> submission. Rather than trying to conditionally get/put depending on
>>>> submission model, just change it to always do the get/put.
>>>>
>>>> In addition, intel_guc_find_hung_context() was not acquiring the
>>>> correct spinlock before searching the request list. So fix that up too.
>>>> Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset
>>>> with GuC")
>>> Must be one line.
>> In my tree it is one line. git itself does the line wrap when creating the
>> email.
> Can you elaborate? I never have had such issue with git send-email (starting
> from v1.6.x of Git for sure).
Hmm. Confused. I think it must have been something accidental in a text 
editor when reviewing the patch. Re-creating the emails now isn't 
wrapping it.

>> I missed that I need to manually unwrap it again before actually
>> sending the email. Although the CI checkpatch also pointed this out in it's
>> own obscure manner.
> ...
>
>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>> Cc: John Harrison <John.C.Harrison@Intel.com>
>>>> Cc: Jani Nikula <jani.nikula@linux.intel.com>
>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>>>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>>>> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>>> Cc: Matthew Auld <matthew.auld@intel.com>
>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
>>>> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>>>> Cc: Michael Cheng <michael.cheng@intel.com>
>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>>>> Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
>>>> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
>>>> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
>>>> Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
>>>> Cc: Bruce Chang <yu.bruce.chang@intel.com>
>>>> Cc: intel-gfx@lists.freedesktop.org
>>> Is it possible to utilize --to --cc parameters to git send-email instead of
>>> noisy Cc list?
>> This is the list auto-generated by the 'dim fixes' tool. I am told this is
>> the officially correct way to create a fixes patch - copy the output from
>> 'dim fixes' as is into the patch headers.
> Okay, so it may be question to the `dim` tool then...
>
> ...
>
>>> Stray change.
>> Intentional change to improve the readability of a function that is being
>> modified by other changes in this patch.
> But not described in the commit message. That's why "stray".
Didn't seem worth mentioning. I can add a comment about it.

John.

>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump
@ 2023-01-18 18:18           ` John Harrison
  0 siblings, 0 replies; 30+ messages in thread
From: John Harrison @ 2023-01-18 18:18 UTC (permalink / raw)
  To: Andy Shevchenko; +Cc: Intel-GFX, DRI-Devel

On 1/18/2023 09:54, Andy Shevchenko wrote:
> On Wed, Jan 18, 2023 at 09:34:47AM -0800, John Harrison wrote:
>> On 1/18/2023 00:29, Andy Shevchenko wrote:
>>> On Tue, Jan 17, 2023 at 01:36:26PM -0800, John.C.Harrison@Intel.com wrote:
>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>
>>>> When GuC support was added to error capture, the locking around the
>>>> request object was broken. Fix it up.
>>>>
>>>> The context based search manages the spinlocking around the search
>>>> internally. So it needs to grab the reference count internally as
>>>> well. The execlist only request based search relies on external
>>>> locking, so it needs an external reference count. So no change to that
>>>> code itself but the context version does change.
>>>>
>>>> The only other caller is the code for dumping engine state to debugfs.
>>>> That code wasn't previously getting an explicit reference at all as it
>>>> does everything while holding the execlist specific spinlock. So that
>>>> needs updaing as well as that spinlock doesn't help when using GuC
>>>> submission. Rather than trying to conditionally get/put depending on
>>>> submission model, just change it to always do the get/put.
>>>>
>>>> In addition, intel_guc_find_hung_context() was not acquiring the
>>>> correct spinlock before searching the request list. So fix that up too.
>>>> Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset
>>>> with GuC")
>>> Must be one line.
>> In my tree it is one line. git itself does the line wrap when creating the
>> email.
> Can you elaborate? I never have had such issue with git send-email (starting
> from v1.6.x of Git for sure).
Hmm. Confused. I think it must have been something accidental in a text 
editor when reviewing the patch. Re-creating the emails now isn't 
wrapping it.

>> I missed that I need to manually unwrap it again before actually
>> sending the email. Although the CI checkpatch also pointed this out in it's
>> own obscure manner.
> ...
>
>>>> Cc: Matthew Brost <matthew.brost@intel.com>
>>>> Cc: John Harrison <John.C.Harrison@Intel.com>
>>>> Cc: Jani Nikula <jani.nikula@linux.intel.com>
>>>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>>>> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>>>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>>>> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
>>>> Cc: Chris Wilson <chris@chris-wilson.co.uk>
>>>> Cc: Matthew Auld <matthew.auld@intel.com>
>>>> Cc: Matt Roper <matthew.d.roper@intel.com>
>>>> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>>>> Cc: Michael Cheng <michael.cheng@intel.com>
>>>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>>>> Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
>>>> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
>>>> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
>>>> Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
>>>> Cc: Bruce Chang <yu.bruce.chang@intel.com>
>>>> Cc: intel-gfx@lists.freedesktop.org
>>> Is it possible to utilize --to --cc parameters to git send-email instead of
>>> noisy Cc list?
>> This is the list auto-generated by the 'dim fixes' tool. I am told this is
>> the officially correct way to create a fixes patch - copy the output from
>> 'dim fixes' as is into the patch headers.
> Okay, so it may be question to the `dim` tool then...
>
> ...
>
>>> Stray change.
>> Intentional change to improve the readability of a function that is being
>> modified by other changes in this patch.
> But not described in the commit message. That's why "stray".
Didn't seem worth mentioning. I can add a comment about it.

John.

>


^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2023-01-18 18:19 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-17 21:36 [PATCH v2 0/5] Allow error capture without a request / on reset failure John.C.Harrison
2023-01-17 21:36 ` [Intel-gfx] " John.C.Harrison
2023-01-17 21:36 ` [PATCH v2 1/5] drm/i915: Fix request locking during error capture & debugfs dump John.C.Harrison
2023-01-17 21:36   ` [Intel-gfx] " John.C.Harrison
2023-01-18  8:29   ` Andy Shevchenko
2023-01-18  8:29     ` [Intel-gfx] " Andy Shevchenko
2023-01-18 17:34     ` John Harrison
2023-01-18 17:34       ` [Intel-gfx] " John Harrison
2023-01-18 17:54       ` Andy Shevchenko
2023-01-18 17:54         ` [Intel-gfx] " Andy Shevchenko
2023-01-18 18:18         ` John Harrison
2023-01-18 18:18           ` [Intel-gfx] " John Harrison
2023-01-18 16:22   ` Tvrtko Ursulin
2023-01-18 16:22     ` [Intel-gfx] " Tvrtko Ursulin
2023-01-18 17:55     ` John Harrison
2023-01-18 17:55       ` [Intel-gfx] " John Harrison
2023-01-17 21:36 ` [PATCH v2 2/5] drm/i915: Allow error capture without a request John.C.Harrison
2023-01-17 21:36   ` [Intel-gfx] " John.C.Harrison
2023-01-18 16:34   ` Tvrtko Ursulin
2023-01-17 21:36 ` [PATCH v2 3/5] drm/i915: Allow error capture of a pending request John.C.Harrison
2023-01-17 21:36   ` [Intel-gfx] " John.C.Harrison
2023-01-18 16:35   ` Tvrtko Ursulin
2023-01-17 21:36 ` [PATCH v2 4/5] drm/i915/guc: Look for a guilty context when an engine reset fails John.C.Harrison
2023-01-17 21:36   ` [Intel-gfx] " John.C.Harrison
2023-01-18 16:37   ` Tvrtko Ursulin
2023-01-17 21:36 ` [PATCH v2 5/5] drm/i915/guc: Add a debug print on GuC triggered reset John.C.Harrison
2023-01-17 21:36   ` [Intel-gfx] " John.C.Harrison
2023-01-17 22:55 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for Allow error capture without a request / on reset failure (rev3) Patchwork
2023-01-17 22:55 ` [Intel-gfx] ✗ Fi.CI.SPARSE: " Patchwork
2023-01-17 23:05 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.