* [PATCH v4 0/7] Allow error capture without a request & fix locking issues
@ 2023-01-20 23:28 John.C.Harrison
  2023-01-20 23:28 ` [PATCH v4 1/7] drm/i915: Fix request locking during error capture & debugfs dump John.C.Harrison
                   ` (6 more replies)
  0 siblings, 7 replies; 20+ messages in thread
From: John.C.Harrison @ 2023-01-20 23:28 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

It is technically possible to get a hung context without a valid
request. In such a situation, try to provide as much information in
the error capture as possible rather than just aborting and capturing
nothing.

Similarly, in the case of an engine reset failure the GuC is not able
to report the guilty context. So try a manual search instead of
reporting nothing.

While doing all this, it was noticed that the locking was broken in a
number of places when searching for hung requests and dumping request
info. So fix all that up as well.

v2: Tidy up code flow in error capture. Reword some comments/messages.
(review feedback from Tvrtko)
Also fix up request locking issues from earlier changes noticed during
code review of this change.
v3: Fix some potential null pointer derefs and a reference leak.
Add new patch to refactor the duplicated hung request search code into
a common backend agnostic wrapper function and use the correct
spinlocks for the correct lists. Also tweak some of the patch
descriptions for better accuracy.
v4: Shuffle some code around to more appropriate source files. Fix
potential leak of GuC capture object after code flow re-org and pull
improved info message earlier (Daniele). Also rename the GuC capture
object to be more consistent.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>


John Harrison (7):
  drm/i915: Fix request locking during error capture & debugfs dump
  drm/i915: Fix up locking around dumping requests lists
  drm/i915: Allow error capture without a request
  drm/i915: Allow error capture of a pending request
  drm/i915/guc: Look for a guilty context when an engine reset fails
  drm/i915/guc: Add a debug print on GuC triggered reset
  drm/i915/guc: Rename GuC register state capture node to be more
    obvious

 drivers/gpu/drm/i915/gt/intel_context.c       |  4 +-
 drivers/gpu/drm/i915/gt/intel_context.h       |  3 +-
 drivers/gpu/drm/i915/gt/intel_engine.h        |  4 +-
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 74 ++++++++-------
 .../drm/i915/gt/intel_execlists_submission.c  | 27 ++++++
 .../drm/i915/gt/intel_execlists_submission.h  |  4 +
 .../gpu/drm/i915/gt/uc/intel_guc_capture.c    |  8 +-
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 35 ++++++-
 drivers/gpu/drm/i915/i915_gpu_error.c         | 92 ++++++++++---------
 drivers/gpu/drm/i915/i915_gpu_error.h         |  2 +-
 10 files changed, 160 insertions(+), 93 deletions(-)

-- 
2.39.0



* [PATCH v4 1/7] drm/i915: Fix request locking during error capture & debugfs dump
  2023-01-20 23:28 [PATCH v4 0/7] Allow error capture without a request & fix locking issues John.C.Harrison
@ 2023-01-20 23:28 ` John.C.Harrison
  2023-01-23 17:51   ` Tvrtko Ursulin
  2023-01-20 23:28 ` [PATCH v4 2/7] drm/i915: Fix up locking around dumping requests lists John.C.Harrison
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: John.C.Harrison @ 2023-01-20 23:28 UTC (permalink / raw)
  To: Intel-GFX
  Cc: Matthew Brost, Tvrtko Ursulin, Andy Shevchenko, Michael Cheng,
	Aravind Iddamsetty, Alan Previn, Umesh Nerlige Ramappa,
	intel-gfx, Lucas De Marchi, Bruce Chang, Daniele Ceraolo Spurio,
	DRI-Devel, Andrzej Hajda, Rodrigo Vivi, Tejas Upadhyay,
	John Harrison, Matthew Auld

From: John Harrison <John.C.Harrison@Intel.com>

When GuC support was added to error capture, the locking around the
request object was broken. Fix it up.

The context based search manages the spinlocking around the search
internally. So it needs to grab the reference count internally as
well. The execlist only request based search relies on external
locking, so it needs an external reference count but within the
spinlock not outside it.

The only other caller of the context based search is the code for
dumping engine state to debugfs. That code wasn't previously getting
an explicit reference at all as it does everything while holding the
execlist specific spinlock. So, that needs updating as well, as that
spinlock doesn't help when using GuC submission. Rather than trying to
conditionally get/put depending on submission model, just change it to
always do the get/put.

In addition, intel_guc_find_hung_context() was not acquiring the
correct spinlock before searching the request list. So fix that up
too. While at it, add some extra whitespace padding for readability.
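
For reference, the fixed execlist path in capture_engine() now takes
the reference while still holding the lock, roughly (a simplified
sketch of the diff below, not the verbatim code):

	spin_lock_irqsave(&engine->sched_engine->lock, flags);
	rq = intel_engine_execlist_find_hung_request(engine);
	if (rq)
		rq = i915_request_get_rcu(rq); /* ref taken while the lock keeps rq alive */
	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);

The context based search instead takes the reference itself, under its
own guc_state.lock, before returning the request.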

v2: Explicitly document adding an extra blank line in some dense code
(Andy Shevchenko). Fix multiple potential null pointer derefs in case
of no request found (some spotted by Tvrtko, but there were more!).
Also fix a leaked request in case of !started and another in
__guc_reset_context now that intel_context_find_active_request is
actually reference counting the returned request.
v3: Add a _get suffix to intel_context_find_active_request now that it
grabs a reference (Daniele).

Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC")
Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Andrzej Hajda <andrzej.hajda@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Cc: Michael Cheng <michael.cheng@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: Bruce Chang <yu.bruce.chang@intel.com>
Cc: intel-gfx@lists.freedesktop.org
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c           |  4 +++-
 drivers/gpu/drm/i915/gt/intel_context.h           |  3 +--
 drivers/gpu/drm/i915/gt/intel_engine_cs.c         |  6 +++++-
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 14 +++++++++++++-
 drivers/gpu/drm/i915/i915_gpu_error.c             | 13 ++++++-------
 5 files changed, 28 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index e94365b08f1ef..4285c1c71fa12 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -528,7 +528,7 @@ struct i915_request *intel_context_create_request(struct intel_context *ce)
 	return rq;
 }
 
-struct i915_request *intel_context_find_active_request(struct intel_context *ce)
+struct i915_request *intel_context_find_active_request_get(struct intel_context *ce)
 {
 	struct intel_context *parent = intel_context_to_parent(ce);
 	struct i915_request *rq, *active = NULL;
@@ -552,6 +552,8 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
 
 		active = rq;
 	}
+	if (active)
+		active = i915_request_get_rcu(active);
 	spin_unlock_irqrestore(&parent->guc_state.lock, flags);
 
 	return active;
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index fb62b7b8cbcda..ccc80c6607ca8 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -268,8 +268,7 @@ int intel_context_prepare_remote_request(struct intel_context *ce,
 
 struct i915_request *intel_context_create_request(struct intel_context *ce);
 
-struct i915_request *
-intel_context_find_active_request(struct intel_context *ce);
+struct i915_request *intel_context_find_active_request_get(struct intel_context *ce);
 
 static inline bool intel_context_is_barrier(const struct intel_context *ce)
 {
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 922f1bb22dc68..fbc0a81617e89 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -2237,9 +2237,11 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
 	if (guc) {
 		ce = intel_engine_get_hung_context(engine);
 		if (ce)
-			hung_rq = intel_context_find_active_request(ce);
+			hung_rq = intel_context_find_active_request_get(ce);
 	} else {
 		hung_rq = intel_engine_execlist_find_hung_request(engine);
+		if (hung_rq)
+			hung_rq = i915_request_get_rcu(hung_rq);
 	}
 
 	if (hung_rq)
@@ -2250,6 +2252,8 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
 	else
 		intel_engine_dump_active_requests(&engine->sched_engine->requests,
 						  hung_rq, m);
+	if (hung_rq)
+		i915_request_put(hung_rq);
 }
 
 void intel_engine_dump(struct intel_engine_cs *engine,
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index b436dd7f12e42..ad4b2848b0f83 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1702,7 +1702,7 @@ static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t st
 			goto next_context;
 
 		guilty = false;
-		rq = intel_context_find_active_request(ce);
+		rq = intel_context_find_active_request_get(ce);
 		if (!rq) {
 			head = ce->ring->tail;
 			goto out_replay;
@@ -1715,6 +1715,7 @@ static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t st
 		head = intel_ring_wrap(ce->ring, rq->head);
 
 		__i915_request_reset(rq, guilty);
+		i915_request_put(rq);
 out_replay:
 		guc_reset_state(ce, head, guilty);
 next_context:
@@ -4820,6 +4821,8 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 
 	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
+		bool found;
+
 		if (!kref_get_unless_zero(&ce->ref))
 			continue;
 
@@ -4836,10 +4839,18 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 				goto next;
 		}
 
+		found = false;
+		spin_lock(&ce->guc_state.lock);
 		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
 			if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
 				continue;
 
+			found = true;
+			break;
+		}
+		spin_unlock(&ce->guc_state.lock);
+
+		if (found) {
 			intel_engine_set_hung_context(engine, ce);
 
 			/* Can only cope with one hang at a time... */
@@ -4847,6 +4858,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 			xa_lock(&guc->context_lookup);
 			goto done;
 		}
+
 next:
 		intel_context_put(ce);
 		xa_lock(&guc->context_lookup);
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 9d5d5a397b64e..5c73dfa2fb3f6 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1607,7 +1607,7 @@ capture_engine(struct intel_engine_cs *engine,
 	ce = intel_engine_get_hung_context(engine);
 	if (ce) {
 		intel_engine_clear_hung_context(engine);
-		rq = intel_context_find_active_request(ce);
+		rq = intel_context_find_active_request_get(ce);
 		if (!rq || !i915_request_started(rq))
 			goto no_request_capture;
 	} else {
@@ -1618,21 +1618,18 @@ capture_engine(struct intel_engine_cs *engine,
 		if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
 			spin_lock_irqsave(&engine->sched_engine->lock, flags);
 			rq = intel_engine_execlist_find_hung_request(engine);
+			if (rq)
+				rq = i915_request_get_rcu(rq);
 			spin_unlock_irqrestore(&engine->sched_engine->lock,
 					       flags);
 		}
 	}
-	if (rq)
-		rq = i915_request_get_rcu(rq);
-
 	if (!rq)
 		goto no_request_capture;
 
 	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
-	if (!capture) {
-		i915_request_put(rq);
+	if (!capture)
 		goto no_request_capture;
-	}
 	if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
 		intel_guc_capture_get_matching_node(engine->gt, ee, ce);
 
@@ -1642,6 +1639,8 @@ capture_engine(struct intel_engine_cs *engine,
 	return ee;
 
 no_request_capture:
+	if (rq)
+		i915_request_put(rq);
 	kfree(ee);
 	return NULL;
 }
-- 
2.39.0



* [PATCH v4 2/7] drm/i915: Fix up locking around dumping requests lists
  2023-01-20 23:28 [PATCH v4 0/7] Allow error capture without a request & fix locking issues John.C.Harrison
  2023-01-20 23:28 ` [PATCH v4 1/7] drm/i915: Fix request locking during error capture & debugfs dump John.C.Harrison
@ 2023-01-20 23:28 ` John.C.Harrison
  2023-01-20 23:40   ` John Harrison
                     ` (2 more replies)
  2023-01-20 23:28 ` [PATCH v4 3/7] drm/i915: Allow error capture without a request John.C.Harrison
                   ` (4 subsequent siblings)
  6 siblings, 3 replies; 20+ messages in thread
From: John.C.Harrison @ 2023-01-20 23:28 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

The debugfs dump of requests was confused about what state requires
the execlist lock versus the GuC lock. There was also a bunch of
duplicated messy code between it and the error capture code.

So refactor the hung request search into a re-usable function. And
reduce the span of the execlist state lock to only the execlist
specific code paths. In order to do that, also move the report of hold
count (which is an execlist only concept) from the top level dump
function to the lower level execlist specific function. Also, move the
execlist specific code into the execlist source file.
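
With the helper in place, both callers reduce to roughly the same
shape (a simplified sketch, not the verbatim code):

	struct intel_context *hung_ce = NULL;
	struct i915_request *hung_rq = NULL;

	intel_engine_get_hung_entity(engine, &hung_ce, &hung_rq);
	/* ... dump or capture using hung_ce / hung_rq ... */
	if (hung_rq)
		i915_request_put(hung_rq);

with any backend specific locking handled inside the helper.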

v2: Rename some functions and move to more appropriate files (Daniele).

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/intel_engine.h        |  4 +-
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 74 +++++++++----------
 .../drm/i915/gt/intel_execlists_submission.c  | 27 +++++++
 .../drm/i915/gt/intel_execlists_submission.h  |  4 +
 drivers/gpu/drm/i915/i915_gpu_error.c         | 26 +------
 5 files changed, 73 insertions(+), 62 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
index 0e24af5efee9c..b58c30ac8ef02 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine.h
+++ b/drivers/gpu/drm/i915/gt/intel_engine.h
@@ -250,8 +250,8 @@ void intel_engine_dump_active_requests(struct list_head *requests,
 ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine,
 				   ktime_t *now);
 
-struct i915_request *
-intel_engine_execlist_find_hung_request(struct intel_engine_cs *engine);
+void intel_engine_get_hung_entity(struct intel_engine_cs *engine,
+				  struct intel_context **ce, struct i915_request **rq);
 
 u32 intel_engine_context_size(struct intel_gt *gt, u8 class);
 struct intel_context *
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index fbc0a81617e89..1d77e27801bce 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -2114,17 +2114,6 @@ static void print_request_ring(struct drm_printer *m, struct i915_request *rq)
 	}
 }
 
-static unsigned long list_count(struct list_head *list)
-{
-	struct list_head *pos;
-	unsigned long count = 0;
-
-	list_for_each(pos, list)
-		count++;
-
-	return count;
-}
-
 static unsigned long read_ul(void *p, size_t x)
 {
 	return *(unsigned long *)(p + x);
@@ -2216,11 +2205,11 @@ void intel_engine_dump_active_requests(struct list_head *requests,
 	}
 }
 
-static void engine_dump_active_requests(struct intel_engine_cs *engine, struct drm_printer *m)
+static void engine_dump_active_requests(struct intel_engine_cs *engine,
+					struct drm_printer *m)
 {
+	struct intel_context *hung_ce = NULL;
 	struct i915_request *hung_rq = NULL;
-	struct intel_context *ce;
-	bool guc;
 
 	/*
 	 * No need for an engine->irq_seqno_barrier() before the seqno reads.
@@ -2229,29 +2218,20 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
 	 * But the intention here is just to report an instantaneous snapshot
 	 * so that's fine.
 	 */
-	lockdep_assert_held(&engine->sched_engine->lock);
+	intel_engine_get_hung_entity(engine, &hung_ce, &hung_rq);
 
 	drm_printf(m, "\tRequests:\n");
 
-	guc = intel_uc_uses_guc_submission(&engine->gt->uc);
-	if (guc) {
-		ce = intel_engine_get_hung_context(engine);
-		if (ce)
-			hung_rq = intel_context_find_active_request_get(ce);
-	} else {
-		hung_rq = intel_engine_execlist_find_hung_request(engine);
-		if (hung_rq)
-			hung_rq = i915_request_get_rcu(hung_rq);
-	}
-
 	if (hung_rq)
 		engine_dump_request(hung_rq, m, "\t\thung");
+	else if (hung_ce)
+		drm_printf(m, "\t\tGot hung ce but no hung rq!\n");
 
-	if (guc)
+	if (intel_uc_uses_guc_submission(&engine->gt->uc))
 		intel_guc_dump_active_requests(engine, hung_rq, m);
 	else
-		intel_engine_dump_active_requests(&engine->sched_engine->requests,
-						  hung_rq, m);
+		intel_execlist_dump_active_requests(engine, hung_rq, m);
+
 	if (hung_rq)
 		i915_request_put(hung_rq);
 }
@@ -2263,7 +2243,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
 	struct i915_gpu_error * const error = &engine->i915->gpu_error;
 	struct i915_request *rq;
 	intel_wakeref_t wakeref;
-	unsigned long flags;
 	ktime_t dummy;
 
 	if (header) {
@@ -2300,13 +2279,8 @@ void intel_engine_dump(struct intel_engine_cs *engine,
 		   i915_reset_count(error));
 	print_properties(engine, m);
 
-	spin_lock_irqsave(&engine->sched_engine->lock, flags);
 	engine_dump_active_requests(engine, m);
 
-	drm_printf(m, "\tOn hold?: %lu\n",
-		   list_count(&engine->sched_engine->hold));
-	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
-
 	drm_printf(m, "\tMMIO base:  0x%08x\n", engine->mmio_base);
 	wakeref = intel_runtime_pm_get_if_in_use(engine->uncore->rpm);
 	if (wakeref) {
@@ -2352,8 +2326,7 @@ intel_engine_create_virtual(struct intel_engine_cs **siblings,
 	return siblings[0]->cops->create_virtual(siblings, count, flags);
 }
 
-struct i915_request *
-intel_engine_execlist_find_hung_request(struct intel_engine_cs *engine)
+static struct i915_request *engine_execlist_find_hung_request(struct intel_engine_cs *engine)
 {
 	struct i915_request *request, *active = NULL;
 
@@ -2405,6 +2378,33 @@ intel_engine_execlist_find_hung_request(struct intel_engine_cs *engine)
 	return active;
 }
 
+void intel_engine_get_hung_entity(struct intel_engine_cs *engine,
+				  struct intel_context **ce, struct i915_request **rq)
+{
+	unsigned long flags;
+
+	*ce = intel_engine_get_hung_context(engine);
+	if (*ce) {
+		intel_engine_clear_hung_context(engine);
+
+		*rq = intel_context_find_active_request_get(*ce);
+		return;
+	}
+
+	/*
+	 * Getting here with GuC enabled means it is a forced error capture
+	 * with no actual hang. So, no need to attempt the execlist search.
+	 */
+	if (intel_uc_uses_guc_submission(&engine->gt->uc))
+		return;
+
+	spin_lock_irqsave(&engine->sched_engine->lock, flags);
+	*rq = engine_execlist_find_hung_request(engine);
+	if (*rq)
+		*rq = i915_request_get_rcu(*rq);
+	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
+}
+
 void xehp_enable_ccs_engines(struct intel_engine_cs *engine)
 {
 	/*
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
index 18ffe55282e59..05995c8577bef 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
@@ -4150,6 +4150,33 @@ void intel_execlists_show_requests(struct intel_engine_cs *engine,
 	spin_unlock_irqrestore(&sched_engine->lock, flags);
 }
 
+static unsigned long list_count(struct list_head *list)
+{
+	struct list_head *pos;
+	unsigned long count = 0;
+
+	list_for_each(pos, list)
+		count++;
+
+	return count;
+}
+
+void intel_execlist_dump_active_requests(struct intel_engine_cs *engine,
+					 struct i915_request *hung_rq,
+					 struct drm_printer *m)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&engine->sched_engine->lock, flags);
+
+	intel_engine_dump_active_requests(&engine->sched_engine->requests, hung_rq, m);
+
+	drm_printf(m, "\tOn hold?: %lu\n",
+		   list_count(&engine->sched_engine->hold));
+
+	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
+}
+
 #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
 #include "selftest_execlists.c"
 #endif
diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
index a1aa92c983a51..cb07488a03764 100644
--- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
+++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
@@ -32,6 +32,10 @@ void intel_execlists_show_requests(struct intel_engine_cs *engine,
 							int indent),
 				   unsigned int max);
 
+void intel_execlist_dump_active_requests(struct intel_engine_cs *engine,
+					 struct i915_request *hung_rq,
+					 struct drm_printer *m);
+
 bool
 intel_engine_in_execlists_submission_mode(const struct intel_engine_cs *engine);
 
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 5c73dfa2fb3f6..b20bd6365615b 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1596,35 +1596,15 @@ capture_engine(struct intel_engine_cs *engine,
 {
 	struct intel_engine_capture_vma *capture = NULL;
 	struct intel_engine_coredump *ee;
-	struct intel_context *ce;
+	struct intel_context *ce = NULL;
 	struct i915_request *rq = NULL;
-	unsigned long flags;
 
 	ee = intel_engine_coredump_alloc(engine, ALLOW_FAIL, dump_flags);
 	if (!ee)
 		return NULL;
 
-	ce = intel_engine_get_hung_context(engine);
-	if (ce) {
-		intel_engine_clear_hung_context(engine);
-		rq = intel_context_find_active_request_get(ce);
-		if (!rq || !i915_request_started(rq))
-			goto no_request_capture;
-	} else {
-		/*
-		 * Getting here with GuC enabled means it is a forced error capture
-		 * with no actual hang. So, no need to attempt the execlist search.
-		 */
-		if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
-			spin_lock_irqsave(&engine->sched_engine->lock, flags);
-			rq = intel_engine_execlist_find_hung_request(engine);
-			if (rq)
-				rq = i915_request_get_rcu(rq);
-			spin_unlock_irqrestore(&engine->sched_engine->lock,
-					       flags);
-		}
-	}
-	if (!rq)
+	intel_engine_get_hung_entity(engine, &ce, &rq);
+	if (!rq || !i915_request_started(rq))
 		goto no_request_capture;
 
 	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
-- 
2.39.0



* [PATCH v4 3/7] drm/i915: Allow error capture without a request
  2023-01-20 23:28 [PATCH v4 0/7] Allow error capture without a request & fix locking issues John.C.Harrison
  2023-01-20 23:28 ` [PATCH v4 1/7] drm/i915: Fix request locking during error capture & debugfs dump John.C.Harrison
  2023-01-20 23:28 ` [PATCH v4 2/7] drm/i915: Fix up locking around dumping requests lists John.C.Harrison
@ 2023-01-20 23:28 ` John.C.Harrison
  2023-01-25  0:39   ` [Intel-gfx] " Ceraolo Spurio, Daniele
  2023-01-20 23:28 ` [PATCH v4 4/7] drm/i915: Allow error capture of a pending request John.C.Harrison
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 20+ messages in thread
From: John.C.Harrison @ 2023-01-20 23:28 UTC (permalink / raw)
  To: Intel-GFX; +Cc: Umesh Nerlige Ramappa, John Harrison, DRI-Devel, Tvrtko Ursulin

From: John Harrison <John.C.Harrison@Intel.com>

There was a report of error captures occurring without any hung
context being indicated despite the capture being initiated by a 'hung
context notification' from GuC. The problem was not reproducible.
However, it is possible for this to happen if the context in question
has no active requests. For example, if the hang was in the context
switch itself then the breadcrumb write would have occurred and the
KMD would see an idle context.

In the interests of attempting to provide as much information as
possible about a hang, it seems wise to include the engine info
regardless of whether a request was found or not, as opposed to just
pretending there was no hang at all.

So update the error capture code to always record engine information
if a context is given. That means updating record_context() to take a
context instead of a request (which it only ever used to find the
context anyway), and splitting the request agnostic parts of
intel_engine_coredump_add_request() out into a separate function.
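
The resulting capture flow is then, in outline (see the diff for the
full version):

	intel_engine_get_hung_entity(engine, &ce, &rq);
	if (rq) {
		capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
		i915_request_put(rq);
	} else if (ce) {
		capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
	}

so a context with no findable request still gets its engine and
context state recorded.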

v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null
pointer.
v3: Tidy up request locking code flow (Tvrtko)
v4: Pull in improved info message from next patch and fix up potential
leak of GuC register state (Daniele)

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 74 ++++++++++++++++++---------
 1 file changed, 50 insertions(+), 24 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index b20bd6365615b..225f1b11a6b93 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct intel_engine_coredump *ee)
 }
 
 static bool record_context(struct i915_gem_context_coredump *e,
-			   const struct i915_request *rq)
+			   struct intel_context *ce)
 {
 	struct i915_gem_context *ctx;
 	struct task_struct *task;
 	bool simulated;
 
 	rcu_read_lock();
-	ctx = rcu_dereference(rq->context->gem_context);
+	ctx = rcu_dereference(ce->gem_context);
 	if (ctx && !kref_get_unless_zero(&ctx->ref))
 		ctx = NULL;
 	rcu_read_unlock();
@@ -1396,8 +1396,8 @@ static bool record_context(struct i915_gem_context_coredump *e,
 	e->guilty = atomic_read(&ctx->guilty_count);
 	e->active = atomic_read(&ctx->active_count);
 
-	e->total_runtime = intel_context_get_total_runtime_ns(rq->context);
-	e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context);
+	e->total_runtime = intel_context_get_total_runtime_ns(ce);
+	e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
 
 	simulated = i915_gem_context_no_error_capture(ctx);
 
@@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp, u32 dump_
 	return ee;
 }
 
+static struct intel_engine_capture_vma *
+engine_coredump_add_context(struct intel_engine_coredump *ee,
+			    struct intel_context *ce,
+			    gfp_t gfp)
+{
+	struct intel_engine_capture_vma *vma = NULL;
+
+	ee->simulated |= record_context(&ee->context, ce);
+	if (ee->simulated)
+		return NULL;
+
+	/*
+	 * We need to copy these to an anonymous buffer
+	 * as the simplest method to avoid being overwritten
+	 * by userspace.
+	 */
+	vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
+	vma = capture_vma(vma, ce->state, "HW context", gfp);
+
+	return vma;
+}
+
 struct intel_engine_capture_vma *
 intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
 				  struct i915_request *rq,
 				  gfp_t gfp)
 {
-	struct intel_engine_capture_vma *vma = NULL;
+	struct intel_engine_capture_vma *vma;
 
-	ee->simulated |= record_context(&ee->context, rq);
-	if (ee->simulated)
+	vma = engine_coredump_add_context(ee, rq->context, gfp);
+	if (!vma)
 		return NULL;
 
 	/*
@@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
 	 */
 	vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch");
 	vma = capture_user(vma, rq, gfp);
-	vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
-	vma = capture_vma(vma, rq->context->state, "HW context", gfp);
 
 	ee->rq_head = rq->head;
 	ee->rq_post = rq->postfix;
@@ -1604,25 +1624,31 @@ capture_engine(struct intel_engine_cs *engine,
 		return NULL;
 
 	intel_engine_get_hung_entity(engine, &ce, &rq);
-	if (!rq || !i915_request_started(rq))
-		goto no_request_capture;
+	if (rq && !i915_request_started(rq)) {
+		drm_info(&engine->gt->i915->drm, "Got hung context on %s with active request %lld:%lld [0x%04X] not yet started\n",
+			 engine->name, rq->fence.context, rq->fence.seqno, ce->guc_id.id);
+		i915_request_put(rq);
+		rq = NULL;
+	}
 
-	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
-	if (!capture)
-		goto no_request_capture;
-	if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
-		intel_guc_capture_get_matching_node(engine->gt, ee, ce);
+	if (rq) {
+		capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
+		i915_request_put(rq);
+	} else if (ce) {
+		capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
+	}
 
-	intel_engine_coredump_add_vma(ee, capture, compress);
-	i915_request_put(rq);
+	if (capture) {
+		intel_engine_coredump_add_vma(ee, capture, compress);
 
-	return ee;
+		if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
+			intel_guc_capture_get_matching_node(engine->gt, ee, ce);
+	} else {
+		kfree(ee);
+		ee = NULL;
+	}
 
-no_request_capture:
-	if (rq)
-		i915_request_put(rq);
-	kfree(ee);
-	return NULL;
+	return ee;
 }
 
 static void
-- 
2.39.0



* [PATCH v4 4/7] drm/i915: Allow error capture of a pending request
  2023-01-20 23:28 [PATCH v4 0/7] Allow error capture without a request & fix locking issues John.C.Harrison
                   ` (2 preceding siblings ...)
  2023-01-20 23:28 ` [PATCH v4 3/7] drm/i915: Allow error capture without a request John.C.Harrison
@ 2023-01-20 23:28 ` John.C.Harrison
  2023-01-20 23:28 ` [PATCH v4 5/7] drm/i915/guc: Look for a guilty context when an engine reset fails John.C.Harrison
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 20+ messages in thread
From: John.C.Harrison @ 2023-01-20 23:28 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel, Tvrtko Ursulin

From: John Harrison <John.C.Harrison@Intel.com>

A hang situation has been observed where the only requests on the
context were either completed or not yet started according to the
breadcrumbs. However, the register state claimed a batch was (maybe)
in progress. So, allow capture of the pending request on the grounds
that this might be better than nothing.

v2: Reword 'not started' warning message (Tvrtko)

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 225f1b11a6b93..904f21e1380cd 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1624,12 +1624,9 @@ capture_engine(struct intel_engine_cs *engine,
 		return NULL;
 
 	intel_engine_get_hung_entity(engine, &ce, &rq);
-	if (rq && !i915_request_started(rq)) {
+	if (rq && !i915_request_started(rq))
 		drm_info(&engine->gt->i915->drm, "Got hung context on %s with active request %lld:%lld [0x%04X] not yet started\n",
 			 engine->name, rq->fence.context, rq->fence.seqno, ce->guc_id.id);
-		i915_request_put(rq);
-		rq = NULL;
-	}
 
 	if (rq) {
 		capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
-- 
2.39.0



* [PATCH v4 5/7] drm/i915/guc: Look for a guilty context when an engine reset fails
  2023-01-20 23:28 [PATCH v4 0/7] Allow error capture without a request & fix locking issues John.C.Harrison
                   ` (3 preceding siblings ...)
  2023-01-20 23:28 ` [PATCH v4 4/7] drm/i915: Allow error capture of a pending request John.C.Harrison
@ 2023-01-20 23:28 ` John.C.Harrison
  2023-01-20 23:28 ` [PATCH v4 6/7] drm/i915/guc: Add a debug print on GuC triggered reset John.C.Harrison
  2023-01-20 23:28 ` [PATCH v4 7/7] drm/i915/guc: Rename GuC register state capture node to be more obvious John.C.Harrison
  6 siblings, 0 replies; 20+ messages in thread
From: John.C.Harrison @ 2023-01-20 23:28 UTC (permalink / raw)
  To: Intel-GFX
  Cc: Daniele Ceraolo Spurio, John Harrison, DRI-Devel, Tvrtko Ursulin

From: John Harrison <John.C.Harrison@Intel.com>

Engine resets are supposed to never fail. But in the case when one
does (due to unknown reasons that normally come down to a missing
w/a), it is useful to get as much information out of the system as
possible. Given that the GuC intentionally dies on such a situation,
it is not possible to get a guilty context notification back. So do a
manual search instead. Given that GuC is dead, this is safe because
GuC won't be changing the engine state asynchronously.

v2: Change comment to be less alarming (Tvrtko)

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index ad4b2848b0f83..b6b4061e4b633 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4755,11 +4755,24 @@ static void reset_fail_worker_func(struct work_struct *w)
 	guc->submission_state.reset_fail_mask = 0;
 	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 
-	if (likely(reset_fail_mask))
+	if (likely(reset_fail_mask)) {
+		struct intel_engine_cs *engine;
+		enum intel_engine_id id;
+
+		/*
+		 * GuC is toast at this point - it dead loops after sending the failed
+		 * reset notification. So need to manually determine the guilty context.
+		 * Note that it should be reliable to do this here because the GuC is
+		 * toast and will not be scheduling behind the KMD's back.
+		 */
+		for_each_engine_masked(engine, gt, reset_fail_mask, id)
+			intel_guc_find_hung_context(engine);
+
 		intel_gt_handle_error(gt, reset_fail_mask,
 				      I915_ERROR_CAPTURE,
-				      "GuC failed to reset engine mask=0x%x\n",
+				      "GuC failed to reset engine mask=0x%x",
 				      reset_fail_mask);
+	}
 }
 
 int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
-- 
2.39.0



* [PATCH v4 6/7] drm/i915/guc: Add a debug print on GuC triggered reset
  2023-01-20 23:28 [PATCH v4 0/7] Allow error capture without a request & fix locking issues John.C.Harrison
                   ` (4 preceding siblings ...)
  2023-01-20 23:28 ` [PATCH v4 5/7] drm/i915/guc: Look for a guilty context when an engine reset fails John.C.Harrison
@ 2023-01-20 23:28 ` John.C.Harrison
  2023-01-20 23:28 ` [PATCH v4 7/7] drm/i915/guc: Rename GuC register state capture node to be more obvious John.C.Harrison
  6 siblings, 0 replies; 20+ messages in thread
From: John.C.Harrison @ 2023-01-20 23:28 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel, Tvrtko Ursulin

From: John Harrison <John.C.Harrison@Intel.com>

For understanding bug reports, it can be useful to have an explicit
dmesg print when a reset notification is received from GuC, as opposed
to simply inferring that this happened from other messages.
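
For example (hypothetical values, minus the usual drm debug prefix):

	Got GuC reset of 0x0042, exiting = 0, banned = 1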

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index b6b4061e4b633..b11d98092ffd1 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4666,6 +4666,10 @@ static void guc_handle_context_reset(struct intel_guc *guc,
 {
 	trace_intel_context_reset(ce);
 
+	drm_dbg(&guc_to_gt(guc)->i915->drm, "Got GuC reset of 0x%04X, exiting = %d, banned = %d\n",
+		ce->guc_id.id, test_bit(CONTEXT_EXITING, &ce->flags),
+		test_bit(CONTEXT_BANNED, &ce->flags));
+
 	if (likely(intel_context_is_schedulable(ce))) {
 		capture_error_state(guc, ce);
 		guc_context_replay(ce);
-- 
2.39.0



* [PATCH v4 7/7] drm/i915/guc: Rename GuC register state capture node to be more obvious
  2023-01-20 23:28 [PATCH v4 0/7] Allow error capture without a request & fix locking issues John.C.Harrison
                   ` (5 preceding siblings ...)
  2023-01-20 23:28 ` [PATCH v4 6/7] drm/i915/guc: Add a debug print on GuC triggered reset John.C.Harrison
@ 2023-01-20 23:28 ` John.C.Harrison
  2023-01-25  0:44   ` [Intel-gfx] " Ceraolo Spurio, Daniele
  6 siblings, 1 reply; 20+ messages in thread
From: John.C.Harrison @ 2023-01-20 23:28 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

The GuC specific register state entry in the error capture object was
just called 'capture', although the companion 'node' entry was called
'guc_capture_node'. Rename the base entry to be 'guc_capture' instead
so that it is a) more consistent and b) more obvious what it is.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c | 8 ++++----
 drivers/gpu/drm/i915/i915_gpu_error.h          | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
index 1c1b85073b4bd..fc3b994626a4f 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
@@ -1506,7 +1506,7 @@ int intel_guc_capture_print_engine_node(struct drm_i915_error_state_buf *ebuf,
 
 	if (!ebuf || !ee)
 		return -EINVAL;
-	cap = ee->capture;
+	cap = ee->guc_capture;
 	if (!cap || !ee->engine)
 		return -ENODEV;
 
@@ -1576,8 +1576,8 @@ void intel_guc_capture_free_node(struct intel_engine_coredump *ee)
 	if (!ee || !ee->guc_capture_node)
 		return;
 
-	guc_capture_add_node_to_cachelist(ee->capture, ee->guc_capture_node);
-	ee->capture = NULL;
+	guc_capture_add_node_to_cachelist(ee->guc_capture, ee->guc_capture_node);
+	ee->guc_capture = NULL;
 	ee->guc_capture_node = NULL;
 }
 
@@ -1611,7 +1611,7 @@ void intel_guc_capture_get_matching_node(struct intel_gt *gt,
 		    (ce->lrc.lrca & CTX_GTT_ADDRESS_MASK)) {
 			list_del(&n->link);
 			ee->guc_capture_node = n;
-			ee->capture = guc->capture;
+			ee->guc_capture = guc->capture;
 			return;
 		}
 	}
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
index efc75cc2ffdb9..56027ffbce51f 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.h
+++ b/drivers/gpu/drm/i915/i915_gpu_error.h
@@ -94,7 +94,7 @@ struct intel_engine_coredump {
 	struct intel_instdone instdone;
 
 	/* GuC matched capture-lists info */
-	struct intel_guc_state_capture *capture;
+	struct intel_guc_state_capture *guc_capture;
 	struct __guc_capture_parsed_output *guc_capture_node;
 
 	struct i915_gem_context_coredump {
-- 
2.39.0



* Re: [PATCH v4 2/7] drm/i915: Fix up locking around dumping requests lists
  2023-01-20 23:28 ` [PATCH v4 2/7] drm/i915: Fix up locking around dumping requests lists John.C.Harrison
@ 2023-01-20 23:40   ` John Harrison
  2023-01-24 14:40   ` [Intel-gfx] " Tvrtko Ursulin
  2023-01-25  0:31   ` Ceraolo Spurio, Daniele
  2 siblings, 0 replies; 20+ messages in thread
From: John Harrison @ 2023-01-20 23:40 UTC (permalink / raw)
  To: Intel-GFX; +Cc: DRI-Devel

On 1/20/2023 15:28, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> The debugfs dump of requests was confused about what state requires
> the execlist lock versus the GuC lock. There was also a bunch of
> duplicated messy code between it and the error capture code.
>
> So refactor the hung request search into a re-usable function. And
> reduce the span of the execlist state lock to only the execlist
> specific code paths. In order to do that, also move the report of hold
> count (which is an execlist only concept) from the top level dump
> function to the lower level execlist specific function. Also, move the
> execlist specific code into the execlist source file.
>
> v2: Rename some functions and move to more appropriate files (Daniele).
>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU 
reset with GuC")
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Cc: Michael Cheng <michael.cheng@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Bruce Chang <yu.bruce.chang@intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: intel-gfx@lists.freedesktop.org

John.



* Re: [PATCH v4 1/7] drm/i915: Fix request locking during error capture & debugfs dump
  2023-01-20 23:28 ` [PATCH v4 1/7] drm/i915: Fix request locking during error capture & debugfs dump John.C.Harrison
@ 2023-01-23 17:51   ` Tvrtko Ursulin
  2023-01-23 20:35     ` John Harrison
  2023-01-25 22:04     ` John Harrison
  0 siblings, 2 replies; 20+ messages in thread
From: Tvrtko Ursulin @ 2023-01-23 17:51 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX
  Cc: Matthew Brost, Andy Shevchenko, Michael Cheng,
	Aravind Iddamsetty, Alan Previn, Umesh Nerlige Ramappa,
	Lucas De Marchi, Bruce Chang, Daniele Ceraolo Spurio, DRI-Devel,
	Andrzej Hajda, Rodrigo Vivi, Tejas Upadhyay, Matthew Auld


On 20/01/2023 23:28, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> When GuC support was added to error capture, the locking around the
> request object was broken. Fix it up.
> 
> The context based search manages the spinlocking around the search
> internally. So it needs to grab the reference count internally as
> well. The execlist only request based search relies on external
> locking, so it needs an external reference count but within the
> spinlock not outside it.
> 
> The only other caller of the context based search is the code for
> dumping engine state to debugfs. That code wasn't previously getting
> an explicit reference at all as it does everything while holding the
> execlist specific spinlock. So, that needs updating as well, as that
> spinlock doesn't help when using GuC submission. Rather than trying to
> conditionally get/put depending on submission model, just change it to
> always do the get/put.
> 
> In addition, intel_guc_find_hung_context() was not acquiring the
> correct spinlock before searching the request list. So fix that up
> too. While at it, add some extra whitespace padding for readability.

Is this part splittable into a separate patch?

> 
> v2: Explicitly document adding an extra blank line in some dense code
> (Andy Shevchenko). Fix multiple potential null pointer derefs in case
> of no request found (some spotted by Tvrtko, but there were more!).
> Also fix a leaked request in case of !started and another in
> __guc_reset_context now that intel_context_find_active_request is
> actually reference counting the returned request.
> v3: Add a _get suffix to intel_context_find_active_request now that it
> grabs a reference (Daniele).
> 
> Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC")
> Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: John Harrison <John.C.Harrison@Intel.com>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> Cc: Michael Cheng <michael.cheng@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
> Cc: Bruce Chang <yu.bruce.chang@intel.com>
> Cc: intel-gfx@lists.freedesktop.org
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c           |  4 +++-
>   drivers/gpu/drm/i915/gt/intel_context.h           |  3 +--
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c         |  6 +++++-
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 14 +++++++++++++-
>   drivers/gpu/drm/i915/i915_gpu_error.c             | 13 ++++++-------
>   5 files changed, 28 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index e94365b08f1ef..4285c1c71fa12 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -528,7 +528,7 @@ struct i915_request *intel_context_create_request(struct intel_context *ce)
>   	return rq;
>   }
>   
> -struct i915_request *intel_context_find_active_request(struct intel_context *ce)
> +struct i915_request *intel_context_find_active_request_get(struct intel_context *ce)

TBH I don't "dig" this name, it's a bit on the long side and feels out of character. I won't insist it be changed, but if get really has to be included in the name I would be happy with intel_context_get_active_request().

>   {
>   	struct intel_context *parent = intel_context_to_parent(ce);
>   	struct i915_request *rq, *active = NULL;
> @@ -552,6 +552,8 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
>   
>   		active = rq;
>   	}
> +	if (active)
> +		active = i915_request_get_rcu(active);
>   	spin_unlock_irqrestore(&parent->guc_state.lock, flags);
>   
>   	return active;
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> index fb62b7b8cbcda..ccc80c6607ca8 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> @@ -268,8 +268,7 @@ int intel_context_prepare_remote_request(struct intel_context *ce,
>   
>   struct i915_request *intel_context_create_request(struct intel_context *ce);
>   
> -struct i915_request *
> -intel_context_find_active_request(struct intel_context *ce);
> +struct i915_request *intel_context_find_active_request_get(struct intel_context *ce);
>   
>   static inline bool intel_context_is_barrier(const struct intel_context *ce)
>   {
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 922f1bb22dc68..fbc0a81617e89 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -2237,9 +2237,11 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
>   	if (guc) {
>   		ce = intel_engine_get_hung_context(engine);
>   		if (ce)
> -			hung_rq = intel_context_find_active_request(ce);
> +			hung_rq = intel_context_find_active_request_get(ce);
>   	} else {
>   		hung_rq = intel_engine_execlist_find_hung_request(engine);
> +		if (hung_rq)
> +			hung_rq = i915_request_get_rcu(hung_rq);
>   	}
>   
>   	if (hung_rq)
> @@ -2250,6 +2252,8 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
>   	else
>   		intel_engine_dump_active_requests(&engine->sched_engine->requests,
>   						  hung_rq, m);
> +	if (hung_rq)
> +		i915_request_put(hung_rq);

Argh... this is so horrible - not your patch - but the existing way the GuC backend was plugged in. I honestly don't know what to suggest here at this point... Above we have:

	if (guc)
		intel_guc_dump_active_requests(engine, hung_rq, m);
	else
		intel_engine_dump_active_requests(&engine->sched_engine->requests,
						  hung_rq, m);

As per your analysis the execlists code wants one lock held over that, especially when it calls intel_engine_dump_active_requests, which the GuC backend will also call from intel_guc_dump_active_requests (!); it just needs a different lock held around it.

Is the lock held by intel_engine_dump over the call to engine_dump_active_requests truly useless in case of GuC? Or just wrong scope (too wide)?

>   }
>   
>   void intel_engine_dump(struct intel_engine_cs *engine,
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index b436dd7f12e42..ad4b2848b0f83 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1702,7 +1702,7 @@ static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t st
>   			goto next_context;
>   
>   		guilty = false;
> -		rq = intel_context_find_active_request(ce);
> +		rq = intel_context_find_active_request_get(ce);
>   		if (!rq) {
>   			head = ce->ring->tail;
>   			goto out_replay;
> @@ -1715,6 +1715,7 @@ static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t st
>   		head = intel_ring_wrap(ce->ring, rq->head);
>   
>   		__i915_request_reset(rq, guilty);
> +		i915_request_put(rq);
>   out_replay:
>   		guc_reset_state(ce, head, guilty);
>   next_context:
> @@ -4820,6 +4821,8 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>   
>   	xa_lock_irqsave(&guc->context_lookup, flags);
>   	xa_for_each(&guc->context_lookup, index, ce) {
> +		bool found;
> +
>   		if (!kref_get_unless_zero(&ce->ref))
>   			continue;
>   
> @@ -4836,10 +4839,18 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>   				goto next;
>   		}
>   
> +		found = false;
> +		spin_lock(&ce->guc_state.lock);
>   		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
>   			if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
>   				continue;
>   
> +			found = true;
> +			break;
> +		}
> +		spin_unlock(&ce->guc_state.lock);
> +
> +		if (found) {
>   			intel_engine_set_hung_context(engine, ce);
>   
>   			/* Can only cope with one hang at a time... */
> @@ -4847,6 +4858,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>   			xa_lock(&guc->context_lookup);
>   			goto done;
>   		}
> +
>   next:
>   		intel_context_put(ce);
>   		xa_lock(&guc->context_lookup);
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 9d5d5a397b64e..5c73dfa2fb3f6 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1607,7 +1607,7 @@ capture_engine(struct intel_engine_cs *engine,
>   	ce = intel_engine_get_hung_context(engine);
>   	if (ce) {
>   		intel_engine_clear_hung_context(engine);
> -		rq = intel_context_find_active_request(ce);
> +		rq = intel_context_find_active_request_get(ce);
>   		if (!rq || !i915_request_started(rq))
>   			goto no_request_capture;
>   	} else {
> @@ -1618,21 +1618,18 @@ capture_engine(struct intel_engine_cs *engine,
>   		if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
>   			spin_lock_irqsave(&engine->sched_engine->lock, flags);
>   			rq = intel_engine_execlist_find_hung_request(engine);
> +			if (rq)
> +				rq = i915_request_get_rcu(rq);
>   			spin_unlock_irqrestore(&engine->sched_engine->lock,
>   					       flags);

Is it possible to consolidate this block with the one in engine_dump_active_requests? They seem identical...

	guc = intel_uc_uses_guc_submission(&engine->gt->uc);
	if (guc) {
		ce = intel_engine_get_hung_context(engine);
		if (ce)
			hung_rq = intel_context_find_active_request(ce);
	} else {
		hung_rq = intel_engine_execlist_find_hung_request(engine);
	}


vs

	ce = intel_engine_get_hung_context(engine);
	if (ce) {
		intel_engine_clear_hung_context(engine);
		rq = intel_context_find_active_request(ce);
		if (!rq || !i915_request_started(rq))
			goto no_request_capture;
	} else {
		/*
		 * Getting here with GuC enabled means it is a forced error capture
		 * with no actual hang. So, no need to attempt the execlist search.
		 */
		if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
			spin_lock_irqsave(&engine->sched_engine->lock, flags);
			rq = intel_engine_execlist_find_hung_request(engine);
			spin_unlock_irqrestore(&engine->sched_engine->lock,
					       flags);
		}
	}

We'd need a backend agnostic helper like:

intel_engine_get_hung_request(...)
{
...
	guc = intel_uc_uses_guc_submission(&engine->gt->uc);
	if (guc) {
		ce = intel_engine_get_hung_context(engine);
		if (ce)
			hung_rq = intel_context_find_active_request(ce);
	} else {
		hung_rq = intel_engine_execlist_find_hung_request(engine);
	}

If locking can be untangled to work correctly for both callers.

Looks like I can't do a quick review on this but need to set aside a larger chunk of time. I'll try tomorrow.

Regards,

Tvrtko

>   		}
>   	}
> -	if (rq)
> -		rq = i915_request_get_rcu(rq);
> -
>   	if (!rq)
>   		goto no_request_capture;
>   
>   	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
> -	if (!capture) {
> -		i915_request_put(rq);
> +	if (!capture)
>   		goto no_request_capture;
> -	}
>   	if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>   		intel_guc_capture_get_matching_node(engine->gt, ee, ce);
>   
> @@ -1642,6 +1639,8 @@ capture_engine(struct intel_engine_cs *engine,
>   	return ee;
>   
>   no_request_capture:
> +	if (rq)
> +		i915_request_put(rq);
>   	kfree(ee);
>   	return NULL;
>   }


* Re: [PATCH v4 1/7] drm/i915: Fix request locking during error capture & debugfs dump
  2023-01-23 17:51   ` Tvrtko Ursulin
@ 2023-01-23 20:35     ` John Harrison
  2023-01-25 22:04     ` John Harrison
  1 sibling, 0 replies; 20+ messages in thread
From: John Harrison @ 2023-01-23 20:35 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-GFX; +Cc: Daniele Ceraolo Spurio, DRI-Devel

On 1/23/2023 09:51, Tvrtko Ursulin wrote:
> On 20/01/2023 23:28, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> When GuC support was added to error capture, the locking around the
>> request object was broken. Fix it up.
>>
>> The context based search manages the spinlocking around the search
>> internally. So it needs to grab the reference count internally as
>> well. The execlist only request based search relies on external
>> locking, so it needs an external reference count but within the
>> spinlock not outside it.
>>
>> The only other caller of the context based search is the code for
>> dumping engine state to debugfs. That code wasn't previously getting
>> an explicit reference at all as it does everything while holding the
>> execlist specific spinlock. So, that needs updating as well, as that
>> spinlock doesn't help when using GuC submission. Rather than trying to
>> conditionally get/put depending on submission model, just change it to
>> always do the get/put.
>>
>> In addition, intel_guc_find_hung_context() was not acquiring the
>> correct spinlock before searching the request list. So fix that up
>> too. While at it, add some extra whitespace padding for readability.
>
> Is this part splittable into a separate patch?
I guess it could, but it seems closely related to all the other locking 
fix-ups in this patch.

>
>>
>> v2: Explicitly document adding an extra blank line in some dense code
>> (Andy Shevchenko). Fix multiple potential null pointer derefs in case
>> of no request found (some spotted by Tvrtko, but there were more!).
>> Also fix a leaked request in case of !started and another in
>> __guc_reset_context now that intel_context_find_active_request is
>> actually reference counting the returned request.
>> v3: Add a _get suffix to intel_context_find_active_request now that it
>> grabs a reference (Daniele).
>>
>> Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full 
>> GPU reset with GuC")
>> Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context 
>> reset")
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: John Harrison <John.C.Harrison@Intel.com>
>> Cc: Jani Nikula <jani.nikula@linux.intel.com>
>> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
>> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
>> Cc: Matthew Auld <matthew.auld@intel.com>
>> Cc: Matt Roper <matthew.d.roper@intel.com>
>> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>> Cc: Michael Cheng <michael.cheng@intel.com>
>> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
>> Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
>> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
>> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
>> Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
>> Cc: Bruce Chang <yu.bruce.chang@intel.com>
>> Cc: intel-gfx@lists.freedesktop.org
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_context.c           |  4 +++-
>>   drivers/gpu/drm/i915/gt/intel_context.h           |  3 +--
>>   drivers/gpu/drm/i915/gt/intel_engine_cs.c         |  6 +++++-
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 14 +++++++++++++-
>>   drivers/gpu/drm/i915/i915_gpu_error.c             | 13 ++++++-------
>>   5 files changed, 28 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c 
>> b/drivers/gpu/drm/i915/gt/intel_context.c
>> index e94365b08f1ef..4285c1c71fa12 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_context.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
>> @@ -528,7 +528,7 @@ struct i915_request 
>> *intel_context_create_request(struct intel_context *ce)
>>       return rq;
>>   }
>>   -struct i915_request *intel_context_find_active_request(struct 
>> intel_context *ce)
>> +struct i915_request *intel_context_find_active_request_get(struct 
>> intel_context *ce)
>
> TBH I don't "dig" this name, it's a bit on the long side and feels out 
> of character. I won't insist it be changed, but if get really has to 
> be included in the name I would be happy with 
> intel_context_get_active_request().

Personally, I see the 'find' component as meaning it is a search, not
just a dereference of an existing pointer, and therefore a useful part
of the name. I don't think there is a simple name that encapsulates
everything that is going on here. But I don't feel too strongly about
it if you really think the shorter version is better.

One could add some kerneldoc... but it would be almost the only function
in the whole of intel_context.h with any. Not sure if that is intentional
(the view that "obviously it should be obvious what a function is doing
by reading the code, and documentation is a waste of space that gets out
of date and inaccurate", i.e. that we aren't meant to kerneldoc internal
behaviour) or if it's just the general lack of documentation for any
driver code.
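
For example, something like this would do it (illustrative wording only,
not actual proposed kerneldoc):

/**
 * intel_context_find_active_request_get - find a context's active request
 * @ce: the context to search
 *
 * Walk the context's list of outstanding requests, under the GuC state
 * lock, and return the active request, if any, with a reference held.
 * Returns NULL if no active request is found. The caller must release
 * the reference with i915_request_put().
 */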


>
>>   {
>>       struct intel_context *parent = intel_context_to_parent(ce);
>>       struct i915_request *rq, *active = NULL;
>> @@ -552,6 +552,8 @@ struct i915_request 
>> *intel_context_find_active_request(struct intel_context *ce)
>>             active = rq;
>>       }
>> +    if (active)
>> +        active = i915_request_get_rcu(active);
>>       spin_unlock_irqrestore(&parent->guc_state.lock, flags);
>>         return active;
>> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h 
>> b/drivers/gpu/drm/i915/gt/intel_context.h
>> index fb62b7b8cbcda..ccc80c6607ca8 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_context.h
>> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
>> @@ -268,8 +268,7 @@ int intel_context_prepare_remote_request(struct 
>> intel_context *ce,
>>     struct i915_request *intel_context_create_request(struct 
>> intel_context *ce);
>>   -struct i915_request *
>> -intel_context_find_active_request(struct intel_context *ce);
>> +struct i915_request *intel_context_find_active_request_get(struct 
>> intel_context *ce);
>>     static inline bool intel_context_is_barrier(const struct 
>> intel_context *ce)
>>   {
>> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c 
>> b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> index 922f1bb22dc68..fbc0a81617e89 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
>> @@ -2237,9 +2237,11 @@ static void engine_dump_active_requests(struct 
>> intel_engine_cs *engine, struct d
>>       if (guc) {
>>           ce = intel_engine_get_hung_context(engine);
>>           if (ce)
>> -            hung_rq = intel_context_find_active_request(ce);
>> +            hung_rq = intel_context_find_active_request_get(ce);
>>       } else {
>>           hung_rq = intel_engine_execlist_find_hung_request(engine);
>> +        if (hung_rq)
>> +            hung_rq = i915_request_get_rcu(hung_rq);
>>       }
>>         if (hung_rq)
>> @@ -2250,6 +2252,8 @@ static void engine_dump_active_requests(struct 
>> intel_engine_cs *engine, struct d
>>       else
>> intel_engine_dump_active_requests(&engine->sched_engine->requests,
>>                             hung_rq, m);
>> +    if (hung_rq)
>> +        i915_request_put(hung_rq);
>
> Argh... this is so horrible - not your patch - but the existing state
> of how the GuC backend was plugged in. I honestly don't know what to
> suggest here at this point... Above we have:
>
>     if (guc)
>         intel_guc_dump_active_requests(engine, hung_rq, m);
>     else
> intel_engine_dump_active_requests(&engine->sched_engine->requests,
>                           hung_rq, m);
>
> As per your analysis the execlists code wants one lock held over that,
> especially when it calls intel_engine_dump_active_requests, which the
> GuC backend will also call from intel_guc_dump_active_requests (!), it
> just needs a different lock held around it.
Because the lock is effectively the backend implementation lock, not a 
top level driver global lock. So each backend has its own private lock 
around its own private data. It just so happens that both backends have 
a vaguely common list of tracked requests that can therefore be dumped 
by a common helper, even though the lists are managed completely 
differently.
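
Roughly speaking, the same dump helper ends up being called under a
different lock depending on the backend. A simplified sketch of where
this series lands (see patch 2 for the real thing):

	/* execlists: the list lives on the engine, under sched_engine->lock */
	spin_lock_irqsave(&engine->sched_engine->lock, flags);
	intel_engine_dump_active_requests(&engine->sched_engine->requests,
					  hung_rq, m);
	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);

	/* GuC: each list lives on a context, under ce->guc_state.lock */
	spin_lock(&ce->guc_state.lock);
	intel_engine_dump_active_requests(&ce->guc_state.requests,
					  hung_rq, m);
	spin_unlock(&ce->guc_state.lock);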

>
> Is the lock held by intel_engine_dump over the call to 
> engine_dump_active_requests truly useless in case of GuC? Or just 
> wrong scope (too wide)?
Basically, yes. So far as I can tell, it is useless. It is locking a 
list that is only used by the execlist backend. The whole thing is a 
mess. Execlists was the only way to be and so ruled the universe. Then 
GuC came along and said 'hang on, that doesn't work for me'. Much 
horridness ensued.

Roll on Xe with its correct layering...

>
>>   }
>>     void intel_engine_dump(struct intel_engine_cs *engine,
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index b436dd7f12e42..ad4b2848b0f83 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -1702,7 +1702,7 @@ static void __guc_reset_context(struct 
>> intel_context *ce, intel_engine_mask_t st
>>               goto next_context;
>>             guilty = false;
>> -        rq = intel_context_find_active_request(ce);
>> +        rq = intel_context_find_active_request_get(ce);
>>           if (!rq) {
>>               head = ce->ring->tail;
>>               goto out_replay;
>> @@ -1715,6 +1715,7 @@ static void __guc_reset_context(struct 
>> intel_context *ce, intel_engine_mask_t st
>>           head = intel_ring_wrap(ce->ring, rq->head);
>>             __i915_request_reset(rq, guilty);
>> +        i915_request_put(rq);
>>   out_replay:
>>           guc_reset_state(ce, head, guilty);
>>   next_context:
>> @@ -4820,6 +4821,8 @@ void intel_guc_find_hung_context(struct 
>> intel_engine_cs *engine)
>>         xa_lock_irqsave(&guc->context_lookup, flags);
>>       xa_for_each(&guc->context_lookup, index, ce) {
>> +        bool found;
>> +
>>           if (!kref_get_unless_zero(&ce->ref))
>>               continue;
>>   @@ -4836,10 +4839,18 @@ void intel_guc_find_hung_context(struct 
>> intel_engine_cs *engine)
>>                   goto next;
>>           }
>>   +        found = false;
>> +        spin_lock(&ce->guc_state.lock);
>>           list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
>>               if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
>>                   continue;
>>   +            found = true;
>> +            break;
>> +        }
>> +        spin_unlock(&ce->guc_state.lock);
>> +
>> +        if (found) {
>>               intel_engine_set_hung_context(engine, ce);
>>                 /* Can only cope with one hang at a time... */
>> @@ -4847,6 +4858,7 @@ void intel_guc_find_hung_context(struct 
>> intel_engine_cs *engine)
>>               xa_lock(&guc->context_lookup);
>>               goto done;
>>           }
>> +
>>   next:
>>           intel_context_put(ce);
>>           xa_lock(&guc->context_lookup);
>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>> index 9d5d5a397b64e..5c73dfa2fb3f6 100644
>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>> @@ -1607,7 +1607,7 @@ capture_engine(struct intel_engine_cs *engine,
>>       ce = intel_engine_get_hung_context(engine);
>>       if (ce) {
>>           intel_engine_clear_hung_context(engine);
>> -        rq = intel_context_find_active_request(ce);
>> +        rq = intel_context_find_active_request_get(ce);
>>           if (!rq || !i915_request_started(rq))
>>               goto no_request_capture;
>>       } else {
>> @@ -1618,21 +1618,18 @@ capture_engine(struct intel_engine_cs *engine,
>>           if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
>> spin_lock_irqsave(&engine->sched_engine->lock, flags);
>>               rq = intel_engine_execlist_find_hung_request(engine);
>> +            if (rq)
>> +                rq = i915_request_get_rcu(rq);
>> spin_unlock_irqrestore(&engine->sched_engine->lock,
>>                              flags);
>
> Is it possible to consolidate this block with the one in 
> engine_dump_active_requests? They seem identical..
You mean as per the next patch that replaces both blocks with a common 
helper function - intel_engine_get_hung_entity?

> <snip>
> If locking can be untangled to work correctly for both callers.
The next patch reworks the debugfs dump code to have correct and minimal 
locking. That allows the search to be extracted into a common helper.

John.

>
> Looks like I can't do a quick review on this but need to set aside a 
> larger chunk of time. I'll try tomorrow.
>
> Regards,
>
> Tvrtko
>
>>           }
>>       }
>> -    if (rq)
>> -        rq = i915_request_get_rcu(rq);
>> -
>>       if (!rq)
>>           goto no_request_capture;
>>         capture = intel_engine_coredump_add_request(ee, rq, 
>> ATOMIC_MAYFAIL);
>> -    if (!capture) {
>> -        i915_request_put(rq);
>> +    if (!capture)
>>           goto no_request_capture;
>> -    }
>>       if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>>           intel_guc_capture_get_matching_node(engine->gt, ee, ce);
>>   @@ -1642,6 +1639,8 @@ capture_engine(struct intel_engine_cs *engine,
>>       return ee;
>>     no_request_capture:
>> +    if (rq)
>> +        i915_request_put(rq);
>>       kfree(ee);
>>       return NULL;
>>   }


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Intel-gfx] [PATCH v4 2/7] drm/i915: Fix up locking around dumping requests lists
  2023-01-20 23:28 ` [PATCH v4 2/7] drm/i915: Fix up locking around dumping requests lists John.C.Harrison
  2023-01-20 23:40   ` John Harrison
@ 2023-01-24 14:40   ` Tvrtko Ursulin
  2023-01-25 18:00     ` John Harrison
  2023-01-25  0:31   ` Ceraolo Spurio, Daniele
  2 siblings, 1 reply; 20+ messages in thread
From: Tvrtko Ursulin @ 2023-01-24 14:40 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel


On 20/01/2023 23:28, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> The debugfs dump of requests was confused about what state requires
> the execlist lock versus the GuC lock. There was also a bunch of
> duplicated messy code between it and the error capture code.
> 
> So refactor the hung request search into a re-usable function. And
> reduce the span of the execlist state lock to only the execlist
> specific code paths. In order to do that, also move the report of hold
> count (which is an execlist only concept) from the top level dump
> function to the lower level execlist specific function. Also, move the
> execlist specific code into the execlist source file.
> 
> v2: Rename some functions and move to more appropriate files (Daniele).

Continuing from yesterday where you pointed out 2/7 exists, after I 
declared capitulation on 1/7.. I think this refactor makes sense and 
definitely improves things a lot.

On the high level I am only unsure if the patch split could be improved. 
There seem to be three separate things, correct me if I missed something:

1) Locking fix in intel_guc_find_hung_context
2) Ref counting change throughout
3) Locking refactor / helper consolidation

(Or 2 and 3 swapped around, not sure.)

That IMO might be a bit easier to read because the first patch wouldn't
have two logical changes in it. Maybe easier to backport too, if it comes
to that?

On the low level it all looks fine to me - hopefully Daniele can do a 
detailed pass.

Regards,

Tvrtko

P.S. Only that intel_context_find_active_request_get hurts my eyes, and
inflates the diff. I wouldn't rename it, but if you guys insist, okay.

> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_engine.h        |  4 +-
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 74 +++++++++----------
>   .../drm/i915/gt/intel_execlists_submission.c  | 27 +++++++
>   .../drm/i915/gt/intel_execlists_submission.h  |  4 +
>   drivers/gpu/drm/i915/i915_gpu_error.c         | 26 +------
>   5 files changed, 73 insertions(+), 62 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
> index 0e24af5efee9c..b58c30ac8ef02 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> @@ -250,8 +250,8 @@ void intel_engine_dump_active_requests(struct list_head *requests,
>   ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine,
>   				   ktime_t *now);
>   
> -struct i915_request *
> -intel_engine_execlist_find_hung_request(struct intel_engine_cs *engine);
> +void intel_engine_get_hung_entity(struct intel_engine_cs *engine,
> +				  struct intel_context **ce, struct i915_request **rq);
>   
>   u32 intel_engine_context_size(struct intel_gt *gt, u8 class);
>   struct intel_context *
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index fbc0a81617e89..1d77e27801bce 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -2114,17 +2114,6 @@ static void print_request_ring(struct drm_printer *m, struct i915_request *rq)
>   	}
>   }
>   
> -static unsigned long list_count(struct list_head *list)
> -{
> -	struct list_head *pos;
> -	unsigned long count = 0;
> -
> -	list_for_each(pos, list)
> -		count++;
> -
> -	return count;
> -}
> -
>   static unsigned long read_ul(void *p, size_t x)
>   {
>   	return *(unsigned long *)(p + x);
> @@ -2216,11 +2205,11 @@ void intel_engine_dump_active_requests(struct list_head *requests,
>   	}
>   }
>   
> -static void engine_dump_active_requests(struct intel_engine_cs *engine, struct drm_printer *m)
> +static void engine_dump_active_requests(struct intel_engine_cs *engine,
> +					struct drm_printer *m)
>   {
> +	struct intel_context *hung_ce = NULL;
>   	struct i915_request *hung_rq = NULL;
> -	struct intel_context *ce;
> -	bool guc;
>   
>   	/*
>   	 * No need for an engine->irq_seqno_barrier() before the seqno reads.
> @@ -2229,29 +2218,20 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
>   	 * But the intention here is just to report an instantaneous snapshot
>   	 * so that's fine.
>   	 */
> -	lockdep_assert_held(&engine->sched_engine->lock);
> +	intel_engine_get_hung_entity(engine, &hung_ce, &hung_rq);
>   
>   	drm_printf(m, "\tRequests:\n");
>   
> -	guc = intel_uc_uses_guc_submission(&engine->gt->uc);
> -	if (guc) {
> -		ce = intel_engine_get_hung_context(engine);
> -		if (ce)
> -			hung_rq = intel_context_find_active_request_get(ce);
> -	} else {
> -		hung_rq = intel_engine_execlist_find_hung_request(engine);
> -		if (hung_rq)
> -			hung_rq = i915_request_get_rcu(hung_rq);
> -	}
> -
>   	if (hung_rq)
>   		engine_dump_request(hung_rq, m, "\t\thung");
> +	else if (hung_ce)
> +		drm_printf(m, "\t\tGot hung ce but no hung rq!\n");
>   
> -	if (guc)
> +	if (intel_uc_uses_guc_submission(&engine->gt->uc))
>   		intel_guc_dump_active_requests(engine, hung_rq, m);
>   	else
> -		intel_engine_dump_active_requests(&engine->sched_engine->requests,
> -						  hung_rq, m);
> +		intel_execlist_dump_active_requests(engine, hung_rq, m);
> +
>   	if (hung_rq)
>   		i915_request_put(hung_rq);
>   }
> @@ -2263,7 +2243,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>   	struct i915_gpu_error * const error = &engine->i915->gpu_error;
>   	struct i915_request *rq;
>   	intel_wakeref_t wakeref;
> -	unsigned long flags;
>   	ktime_t dummy;
>   
>   	if (header) {
> @@ -2300,13 +2279,8 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>   		   i915_reset_count(error));
>   	print_properties(engine, m);
>   
> -	spin_lock_irqsave(&engine->sched_engine->lock, flags);
>   	engine_dump_active_requests(engine, m);
>   
> -	drm_printf(m, "\tOn hold?: %lu\n",
> -		   list_count(&engine->sched_engine->hold));
> -	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
> -
>   	drm_printf(m, "\tMMIO base:  0x%08x\n", engine->mmio_base);
>   	wakeref = intel_runtime_pm_get_if_in_use(engine->uncore->rpm);
>   	if (wakeref) {
> @@ -2352,8 +2326,7 @@ intel_engine_create_virtual(struct intel_engine_cs **siblings,
>   	return siblings[0]->cops->create_virtual(siblings, count, flags);
>   }
>   
> -struct i915_request *
> -intel_engine_execlist_find_hung_request(struct intel_engine_cs *engine)
> +static struct i915_request *engine_execlist_find_hung_request(struct intel_engine_cs *engine)
>   {
>   	struct i915_request *request, *active = NULL;
>   
> @@ -2405,6 +2378,33 @@ intel_engine_execlist_find_hung_request(struct intel_engine_cs *engine)
>   	return active;
>   }
>   
> +void intel_engine_get_hung_entity(struct intel_engine_cs *engine,
> +				  struct intel_context **ce, struct i915_request **rq)
> +{
> +	unsigned long flags;
> +
> +	*ce = intel_engine_get_hung_context(engine);
> +	if (*ce) {
> +		intel_engine_clear_hung_context(engine);
> +
> +		*rq = intel_context_find_active_request_get(*ce);
> +		return;
> +	}
> +
> +	/*
> +	 * Getting here with GuC enabled means it is a forced error capture
> +	 * with no actual hang. So, no need to attempt the execlist search.
> +	 */
> +	if (intel_uc_uses_guc_submission(&engine->gt->uc))
> +		return;
> +
> +	spin_lock_irqsave(&engine->sched_engine->lock, flags);
> +	*rq = engine_execlist_find_hung_request(engine);
> +	if (*rq)
> +		*rq = i915_request_get_rcu(*rq);
> +	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
> +}
> +
>   void xehp_enable_ccs_engines(struct intel_engine_cs *engine)
>   {
>   	/*
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index 18ffe55282e59..05995c8577bef 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -4150,6 +4150,33 @@ void intel_execlists_show_requests(struct intel_engine_cs *engine,
>   	spin_unlock_irqrestore(&sched_engine->lock, flags);
>   }
>   
> +static unsigned long list_count(struct list_head *list)
> +{
> +	struct list_head *pos;
> +	unsigned long count = 0;
> +
> +	list_for_each(pos, list)
> +		count++;
> +
> +	return count;
> +}
> +
> +void intel_execlist_dump_active_requests(struct intel_engine_cs *engine,
> +					 struct i915_request *hung_rq,
> +					 struct drm_printer *m)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&engine->sched_engine->lock, flags);
> +
> +	intel_engine_dump_active_requests(&engine->sched_engine->requests, hung_rq, m);
> +
> +	drm_printf(m, "\tOn hold?: %lu\n",
> +		   list_count(&engine->sched_engine->hold));
> +
> +	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
> +}
> +
>   #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
>   #include "selftest_execlists.c"
>   #endif
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
> index a1aa92c983a51..cb07488a03764 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
> @@ -32,6 +32,10 @@ void intel_execlists_show_requests(struct intel_engine_cs *engine,
>   							int indent),
>   				   unsigned int max);
>   
> +void intel_execlist_dump_active_requests(struct intel_engine_cs *engine,
> +					 struct i915_request *hung_rq,
> +					 struct drm_printer *m);
> +
>   bool
>   intel_engine_in_execlists_submission_mode(const struct intel_engine_cs *engine);
>   
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 5c73dfa2fb3f6..b20bd6365615b 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1596,35 +1596,15 @@ capture_engine(struct intel_engine_cs *engine,
>   {
>   	struct intel_engine_capture_vma *capture = NULL;
>   	struct intel_engine_coredump *ee;
> -	struct intel_context *ce;
> +	struct intel_context *ce = NULL;
>   	struct i915_request *rq = NULL;
> -	unsigned long flags;
>   
>   	ee = intel_engine_coredump_alloc(engine, ALLOW_FAIL, dump_flags);
>   	if (!ee)
>   		return NULL;
>   
> -	ce = intel_engine_get_hung_context(engine);
> -	if (ce) {
> -		intel_engine_clear_hung_context(engine);
> -		rq = intel_context_find_active_request_get(ce);
> -		if (!rq || !i915_request_started(rq))
> -			goto no_request_capture;
> -	} else {
> -		/*
> -		 * Getting here with GuC enabled means it is a forced error capture
> -		 * with no actual hang. So, no need to attempt the execlist search.
> -		 */
> -		if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
> -			spin_lock_irqsave(&engine->sched_engine->lock, flags);
> -			rq = intel_engine_execlist_find_hung_request(engine);
> -			if (rq)
> -				rq = i915_request_get_rcu(rq);
> -			spin_unlock_irqrestore(&engine->sched_engine->lock,
> -					       flags);
> -		}
> -	}
> -	if (!rq)
> +	intel_engine_get_hung_entity(engine, &ce, &rq);
> +	if (!rq || !i915_request_started(rq))
>   		goto no_request_capture;
>   
>   	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Intel-gfx] [PATCH v4 2/7] drm/i915: Fix up locking around dumping requests lists
  2023-01-20 23:28 ` [PATCH v4 2/7] drm/i915: Fix up locking around dumping requests lists John.C.Harrison
  2023-01-20 23:40   ` John Harrison
  2023-01-24 14:40   ` [Intel-gfx] " Tvrtko Ursulin
@ 2023-01-25  0:31   ` Ceraolo Spurio, Daniele
  2 siblings, 0 replies; 20+ messages in thread
From: Ceraolo Spurio, Daniele @ 2023-01-25  0:31 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel



On 1/20/2023 3:28 PM, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> The debugfs dump of requests was confused about what state requires
> the execlist lock versus the GuC lock. There was also a bunch of
> duplicated messy code between it and the error capture code.
>
> So refactor the hung request search into a re-usable function. And
> reduce the span of the execlist state lock to only the execlist
> specific code paths. In order to do that, also move the report of hold
> count (which is an execlist only concept) from the top level dump
> function to the lower level execlist specific function. Also, move the
> execlist specific code into the execlist source file.
>
> v2: Rename some functions and move to more appropriate files (Daniele).
>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_engine.h        |  4 +-
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 74 +++++++++----------
>   .../drm/i915/gt/intel_execlists_submission.c  | 27 +++++++
>   .../drm/i915/gt/intel_execlists_submission.h  |  4 +
>   drivers/gpu/drm/i915/i915_gpu_error.c         | 26 +------
>   5 files changed, 73 insertions(+), 62 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine.h b/drivers/gpu/drm/i915/gt/intel_engine.h
> index 0e24af5efee9c..b58c30ac8ef02 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine.h
> +++ b/drivers/gpu/drm/i915/gt/intel_engine.h
> @@ -250,8 +250,8 @@ void intel_engine_dump_active_requests(struct list_head *requests,
>   ktime_t intel_engine_get_busy_time(struct intel_engine_cs *engine,
>   				   ktime_t *now);
>   
> -struct i915_request *
> -intel_engine_execlist_find_hung_request(struct intel_engine_cs *engine);
> +void intel_engine_get_hung_entity(struct intel_engine_cs *engine,
> +				  struct intel_context **ce, struct i915_request **rq);
>   
>   u32 intel_engine_context_size(struct intel_gt *gt, u8 class);
>   struct intel_context *
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index fbc0a81617e89..1d77e27801bce 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -2114,17 +2114,6 @@ static void print_request_ring(struct drm_printer *m, struct i915_request *rq)
>   	}
>   }
>   
> -static unsigned long list_count(struct list_head *list)
> -{
> -	struct list_head *pos;
> -	unsigned long count = 0;
> -
> -	list_for_each(pos, list)
> -		count++;
> -
> -	return count;
> -}
> -
>   static unsigned long read_ul(void *p, size_t x)
>   {
>   	return *(unsigned long *)(p + x);
> @@ -2216,11 +2205,11 @@ void intel_engine_dump_active_requests(struct list_head *requests,
>   	}
>   }
>   
> -static void engine_dump_active_requests(struct intel_engine_cs *engine, struct drm_printer *m)
> +static void engine_dump_active_requests(struct intel_engine_cs *engine,
> +					struct drm_printer *m)
>   {
> +	struct intel_context *hung_ce = NULL;
>   	struct i915_request *hung_rq = NULL;
> -	struct intel_context *ce;
> -	bool guc;
>   
>   	/*
>   	 * No need for an engine->irq_seqno_barrier() before the seqno reads.
> @@ -2229,29 +2218,20 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
>   	 * But the intention here is just to report an instantaneous snapshot
>   	 * so that's fine.
>   	 */
> -	lockdep_assert_held(&engine->sched_engine->lock);
> +	intel_engine_get_hung_entity(engine, &hung_ce, &hung_rq);
>   
>   	drm_printf(m, "\tRequests:\n");
>   
> -	guc = intel_uc_uses_guc_submission(&engine->gt->uc);
> -	if (guc) {
> -		ce = intel_engine_get_hung_context(engine);
> -		if (ce)
> -			hung_rq = intel_context_find_active_request_get(ce);
> -	} else {
> -		hung_rq = intel_engine_execlist_find_hung_request(engine);
> -		if (hung_rq)
> -			hung_rq = i915_request_get_rcu(hung_rq);
> -	}
> -
>   	if (hung_rq)
>   		engine_dump_request(hung_rq, m, "\t\thung");
> +	else if (hung_ce)
> +		drm_printf(m, "\t\tGot hung ce but no hung rq!\n");
>   
> -	if (guc)
> +	if (intel_uc_uses_guc_submission(&engine->gt->uc))
>   		intel_guc_dump_active_requests(engine, hung_rq, m);
>   	else
> -		intel_engine_dump_active_requests(&engine->sched_engine->requests,
> -						  hung_rq, m);
> +		intel_execlist_dump_active_requests(engine, hung_rq, m);
> +
>   	if (hung_rq)
>   		i915_request_put(hung_rq);
>   }
> @@ -2263,7 +2243,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>   	struct i915_gpu_error * const error = &engine->i915->gpu_error;
>   	struct i915_request *rq;
>   	intel_wakeref_t wakeref;
> -	unsigned long flags;
>   	ktime_t dummy;
>   
>   	if (header) {
> @@ -2300,13 +2279,8 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>   		   i915_reset_count(error));
>   	print_properties(engine, m);
>   
> -	spin_lock_irqsave(&engine->sched_engine->lock, flags);
>   	engine_dump_active_requests(engine, m);
>   
> -	drm_printf(m, "\tOn hold?: %lu\n",
> -		   list_count(&engine->sched_engine->hold));
> -	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
> -
>   	drm_printf(m, "\tMMIO base:  0x%08x\n", engine->mmio_base);
>   	wakeref = intel_runtime_pm_get_if_in_use(engine->uncore->rpm);
>   	if (wakeref) {
> @@ -2352,8 +2326,7 @@ intel_engine_create_virtual(struct intel_engine_cs **siblings,
>   	return siblings[0]->cops->create_virtual(siblings, count, flags);
>   }
>   
> -struct i915_request *
> -intel_engine_execlist_find_hung_request(struct intel_engine_cs *engine)
> +static struct i915_request *engine_execlist_find_hung_request(struct intel_engine_cs *engine)
>   {
>   	struct i915_request *request, *active = NULL;
>   
> @@ -2405,6 +2378,33 @@ intel_engine_execlist_find_hung_request(struct intel_engine_cs *engine)
>   	return active;
>   }
>   
> +void intel_engine_get_hung_entity(struct intel_engine_cs *engine,
> +				  struct intel_context **ce, struct i915_request **rq)
> +{
> +	unsigned long flags;
> +
> +	*ce = intel_engine_get_hung_context(engine);
> +	if (*ce) {
> +		intel_engine_clear_hung_context(engine);
> +
> +		*rq = intel_context_find_active_request_get(*ce);
> +		return;
> +	}
> +
> +	/*
> +	 * Getting here with GuC enabled means it is a forced error capture
> +	 * with no actual hang. So, no need to attempt the execlist search.
> +	 */
> +	if (intel_uc_uses_guc_submission(&engine->gt->uc))
> +		return;
> +
> +	spin_lock_irqsave(&engine->sched_engine->lock, flags);
> +	*rq = engine_execlist_find_hung_request(engine);
> +	if (*rq)
> +		*rq = i915_request_get_rcu(*rq);
> +	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
> +}
> +
>   void xehp_enable_ccs_engines(struct intel_engine_cs *engine)
>   {
>   	/*
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> index 18ffe55282e59..05995c8577bef 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.c
> @@ -4150,6 +4150,33 @@ void intel_execlists_show_requests(struct intel_engine_cs *engine,
>   	spin_unlock_irqrestore(&sched_engine->lock, flags);
>   }
>   
> +static unsigned long list_count(struct list_head *list)
> +{
> +	struct list_head *pos;
> +	unsigned long count = 0;
> +
> +	list_for_each(pos, list)
> +		count++;
> +
> +	return count;
> +}
> +
> +void intel_execlist_dump_active_requests(struct intel_engine_cs *engine,

nit: we usually use "execlists" and not "execlist".
Apart from this the patch LGTM.

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

> +					 struct i915_request *hung_rq,
> +					 struct drm_printer *m)
> +{
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&engine->sched_engine->lock, flags);
> +
> +	intel_engine_dump_active_requests(&engine->sched_engine->requests, hung_rq, m);
> +
> +	drm_printf(m, "\tOn hold?: %lu\n",
> +		   list_count(&engine->sched_engine->hold));
> +
> +	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
> +}
> +
>   #if IS_ENABLED(CONFIG_DRM_I915_SELFTEST)
>   #include "selftest_execlists.c"
>   #endif
> diff --git a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
> index a1aa92c983a51..cb07488a03764 100644
> --- a/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
> +++ b/drivers/gpu/drm/i915/gt/intel_execlists_submission.h
> @@ -32,6 +32,10 @@ void intel_execlists_show_requests(struct intel_engine_cs *engine,
>   							int indent),
>   				   unsigned int max);
>   
> +void intel_execlist_dump_active_requests(struct intel_engine_cs *engine,
> +					 struct i915_request *hung_rq,
> +					 struct drm_printer *m);
> +
>   bool
>   intel_engine_in_execlists_submission_mode(const struct intel_engine_cs *engine);
>   
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 5c73dfa2fb3f6..b20bd6365615b 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1596,35 +1596,15 @@ capture_engine(struct intel_engine_cs *engine,
>   {
>   	struct intel_engine_capture_vma *capture = NULL;
>   	struct intel_engine_coredump *ee;
> -	struct intel_context *ce;
> +	struct intel_context *ce = NULL;
>   	struct i915_request *rq = NULL;
> -	unsigned long flags;
>   
>   	ee = intel_engine_coredump_alloc(engine, ALLOW_FAIL, dump_flags);
>   	if (!ee)
>   		return NULL;
>   
> -	ce = intel_engine_get_hung_context(engine);
> -	if (ce) {
> -		intel_engine_clear_hung_context(engine);
> -		rq = intel_context_find_active_request_get(ce);
> -		if (!rq || !i915_request_started(rq))
> -			goto no_request_capture;
> -	} else {
> -		/*
> -		 * Getting here with GuC enabled means it is a forced error capture
> -		 * with no actual hang. So, no need to attempt the execlist search.
> -		 */
> -		if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
> -			spin_lock_irqsave(&engine->sched_engine->lock, flags);
> -			rq = intel_engine_execlist_find_hung_request(engine);
> -			if (rq)
> -				rq = i915_request_get_rcu(rq);
> -			spin_unlock_irqrestore(&engine->sched_engine->lock,
> -					       flags);
> -		}
> -	}
> -	if (!rq)
> +	intel_engine_get_hung_entity(engine, &ce, &rq);
> +	if (!rq || !i915_request_started(rq))
>   		goto no_request_capture;
>   
>   	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Intel-gfx] [PATCH v4 3/7] drm/i915: Allow error capture without a request
  2023-01-20 23:28 ` [PATCH v4 3/7] drm/i915: Allow error capture without a request John.C.Harrison
@ 2023-01-25  0:39   ` Ceraolo Spurio, Daniele
  2023-01-25  0:56     ` John Harrison
  0 siblings, 1 reply; 20+ messages in thread
From: Ceraolo Spurio, Daniele @ 2023-01-25  0:39 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel



On 1/20/2023 3:28 PM, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> There was a report of error captures occurring without any hung
> context being indicated despite the capture being initiated by a 'hung
> context notification' from GuC. The problem was not reproducible.
> However, it can happen if the context in question has no
> active requests. For example, if the hang was in the context switch
> itself then the breadcrumb write would have occurred and the KMD would
> see an idle context.
>
> In the interests of attempting to provide as much information as
> possible about a hang, it seems wise to include the engine info
> regardless of whether a request was found or not. As opposed to just
> pretending there was no hang at all.
>
> So update the error capture code to always record engine information
> if a context is given. Which means updating record_context() to take a
> context instead of a request (which it only ever used to find the
> context anyway). And split the request agnostic parts of
> intel_engine_coredump_add_request() out into a separate function.
>
> v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null
> pointer.
> v3: Tidy up request locking code flow (Tvrtko)
> v4: Pull in improved info message from next patch and fix up potential
> leak of GuC register state (Daniele)

In the very unlikely case that the capture fails, we're leaving the data 
inside the GuC buffer. This is not new with this patch and not a bug 
(that buffer is a ring and the stale data will be overwritten if it gets 
full), but maybe something that can be improved as a follow-up.

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gpu_error.c | 74 ++++++++++++++++++---------
>   1 file changed, 50 insertions(+), 24 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index b20bd6365615b..225f1b11a6b93 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct intel_engine_coredump *ee)
>   }
>   
>   static bool record_context(struct i915_gem_context_coredump *e,
> -			   const struct i915_request *rq)
> +			   struct intel_context *ce)
>   {
>   	struct i915_gem_context *ctx;
>   	struct task_struct *task;
>   	bool simulated;
>   
>   	rcu_read_lock();
> -	ctx = rcu_dereference(rq->context->gem_context);
> +	ctx = rcu_dereference(ce->gem_context);
>   	if (ctx && !kref_get_unless_zero(&ctx->ref))
>   		ctx = NULL;
>   	rcu_read_unlock();
> @@ -1396,8 +1396,8 @@ static bool record_context(struct i915_gem_context_coredump *e,
>   	e->guilty = atomic_read(&ctx->guilty_count);
>   	e->active = atomic_read(&ctx->active_count);
>   
> -	e->total_runtime = intel_context_get_total_runtime_ns(rq->context);
> -	e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context);
> +	e->total_runtime = intel_context_get_total_runtime_ns(ce);
> +	e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
>   
>   	simulated = i915_gem_context_no_error_capture(ctx);
>   
> @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp, u32 dump_
>   	return ee;
>   }
>   
> +static struct intel_engine_capture_vma *
> +engine_coredump_add_context(struct intel_engine_coredump *ee,
> +			    struct intel_context *ce,
> +			    gfp_t gfp)
> +{
> +	struct intel_engine_capture_vma *vma = NULL;
> +
> +	ee->simulated |= record_context(&ee->context, ce);
> +	if (ee->simulated)
> +		return NULL;
> +
> +	/*
> +	 * We need to copy these to an anonymous buffer
> +	 * as the simplest method to avoid being overwritten
> +	 * by userspace.
> +	 */
> +	vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
> +	vma = capture_vma(vma, ce->state, "HW context", gfp);
> +
> +	return vma;
> +}
> +
>   struct intel_engine_capture_vma *
>   intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
>   				  struct i915_request *rq,
>   				  gfp_t gfp)
>   {
> -	struct intel_engine_capture_vma *vma = NULL;
> +	struct intel_engine_capture_vma *vma;
>   
> -	ee->simulated |= record_context(&ee->context, rq);
> -	if (ee->simulated)
> +	vma = engine_coredump_add_context(ee, rq->context, gfp);
> +	if (!vma)
>   		return NULL;
>   
>   	/*
> @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
>   	 */
>   	vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch");
>   	vma = capture_user(vma, rq, gfp);
> -	vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
> -	vma = capture_vma(vma, rq->context->state, "HW context", gfp);
>   
>   	ee->rq_head = rq->head;
>   	ee->rq_post = rq->postfix;
> @@ -1604,25 +1624,31 @@ capture_engine(struct intel_engine_cs *engine,
>   		return NULL;
>   
>   	intel_engine_get_hung_entity(engine, &ce, &rq);
> -	if (!rq || !i915_request_started(rq))
> -		goto no_request_capture;
> +	if (rq && !i915_request_started(rq)) {
> +		drm_info(&engine->gt->i915->drm, "Got hung context on %s with active request %lld:%lld [0x%04X] not yet started\n",
> +			 engine->name, rq->fence.context, rq->fence.seqno, ce->guc_id.id);
> +		i915_request_put(rq);
> +		rq = NULL;
> +	}
>   
> -	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
> -	if (!capture)
> -		goto no_request_capture;
> -	if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
> -		intel_guc_capture_get_matching_node(engine->gt, ee, ce);
> +	if (rq) {
> +		capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
> +		i915_request_put(rq);
> +	} else if (ce) {
> +		capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
> +	}
>   
> -	intel_engine_coredump_add_vma(ee, capture, compress);
> -	i915_request_put(rq);
> +	if (capture) {
> +		intel_engine_coredump_add_vma(ee, capture, compress);
>   
> -	return ee;
> +		if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
> +			intel_guc_capture_get_matching_node(engine->gt, ee, ce);
> +	} else {
> +		kfree(ee);
> +		ee = NULL;
> +	}
>   
> -no_request_capture:
> -	if (rq)
> -		i915_request_put(rq);
> -	kfree(ee);
> -	return NULL;
> +	return ee;
>   }
>   
>   static void


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Intel-gfx] [PATCH v4 7/7] drm/i915/guc: Rename GuC register state capture node to be more obvious
  2023-01-20 23:28 ` [PATCH v4 7/7] drm/i915/guc: Rename GuC register state capture node to be more obvious John.C.Harrison
@ 2023-01-25  0:44   ` Ceraolo Spurio, Daniele
  0 siblings, 0 replies; 20+ messages in thread
From: Ceraolo Spurio, Daniele @ 2023-01-25  0:44 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel



On 1/20/2023 3:28 PM, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> The GuC specific register state entry in the error capture object was
> just called 'capture', although the companion 'node' entry was called
> 'guc_capture_node'. Rename the base entry to be 'guc_capture' instead
> so that it is a) more consistent and b) more obvious what it is.
>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c | 8 ++++----
>   drivers/gpu/drm/i915/i915_gpu_error.h          | 2 +-
>   2 files changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> index 1c1b85073b4bd..fc3b994626a4f 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_capture.c
> @@ -1506,7 +1506,7 @@ int intel_guc_capture_print_engine_node(struct drm_i915_error_state_buf *ebuf,
>   
>   	if (!ebuf || !ee)
>   		return -EINVAL;
> -	cap = ee->capture;
> +	cap = ee->guc_capture;
>   	if (!cap || !ee->engine)
>   		return -ENODEV;
>   
> @@ -1576,8 +1576,8 @@ void intel_guc_capture_free_node(struct intel_engine_coredump *ee)
>   	if (!ee || !ee->guc_capture_node)
>   		return;
>   
> -	guc_capture_add_node_to_cachelist(ee->capture, ee->guc_capture_node);
> -	ee->capture = NULL;
> +	guc_capture_add_node_to_cachelist(ee->guc_capture, ee->guc_capture_node);
> +	ee->guc_capture = NULL;
>   	ee->guc_capture_node = NULL;
>   }
>   
> @@ -1611,7 +1611,7 @@ void intel_guc_capture_get_matching_node(struct intel_gt *gt,
>   		    (ce->lrc.lrca & CTX_GTT_ADDRESS_MASK)) {
>   			list_del(&n->link);
>   			ee->guc_capture_node = n;
> -			ee->capture = guc->capture;
> +			ee->guc_capture = guc->capture;
>   			return;
>   		}
>   	}
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.h b/drivers/gpu/drm/i915/i915_gpu_error.h
> index efc75cc2ffdb9..56027ffbce51f 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.h
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.h
> @@ -94,7 +94,7 @@ struct intel_engine_coredump {
>   	struct intel_instdone instdone;
>   
>   	/* GuC matched capture-lists info */
> -	struct intel_guc_state_capture *capture;
> +	struct intel_guc_state_capture *guc_capture;
>   	struct __guc_capture_parsed_output *guc_capture_node;
>   
>   	struct i915_gem_context_coredump {


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Intel-gfx] [PATCH v4 3/7] drm/i915: Allow error capture without a request
  2023-01-25  0:39   ` [Intel-gfx] " Ceraolo Spurio, Daniele
@ 2023-01-25  0:56     ` John Harrison
  0 siblings, 0 replies; 20+ messages in thread
From: John Harrison @ 2023-01-25  0:56 UTC (permalink / raw)
  To: Ceraolo Spurio, Daniele, Intel-GFX; +Cc: DRI-Devel

On 1/24/2023 16:39, Ceraolo Spurio, Daniele wrote:
> On 1/20/2023 3:28 PM, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> There was a report of error captures occurring without any hung
>> context being indicated despite the capture being initiated by a 'hung
>> context notification' from GuC. The problem was not reproducible.
>> However, it can happen if the context in question has no
>> active requests. For example, if the hang was in the context switch
>> itself then the breadcrumb write would have occurred and the KMD would
>> see an idle context.
>>
>> In the interests of attempting to provide as much information as
>> possible about a hang, it seems wise to include the engine info
>> regardless of whether a request was found or not. As opposed to just
>> pretending there was no hang at all.
>>
>> So update the error capture code to always record engine information
>> if a context is given. Which means updating record_context() to take a
>> context instead of a request (which it only ever used to find the
>> context anyway). And split the request agnostic parts of
>> intel_engine_coredump_add_request() out into a separate function.
>>
>> v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null
>> pointer.
>> v3: Tidy up request locking code flow (Tvrtko)
>> v4: Pull in improved info message from next patch and fix up potential
>> leak of GuC register state (Daniele)
>
> In the very unlikely case that the capture fails, we're leaving the 
> data inside the GuC buffer. This is not new with this patch and not a 
> bug (that buffer is a ring and the stale data will be overwritten if 
> it gets full), but maybe something that can be improved as a follow-up.
Correct and correct.

John.

>
> Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>
> Daniele
>
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>> Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_gpu_error.c | 74 ++++++++++++++++++---------
>>   1 file changed, 50 insertions(+), 24 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>> index b20bd6365615b..225f1b11a6b93 100644
>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>> @@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct 
>> intel_engine_coredump *ee)
>>   }
>>     static bool record_context(struct i915_gem_context_coredump *e,
>> -               const struct i915_request *rq)
>> +               struct intel_context *ce)
>>   {
>>       struct i915_gem_context *ctx;
>>       struct task_struct *task;
>>       bool simulated;
>>         rcu_read_lock();
>> -    ctx = rcu_dereference(rq->context->gem_context);
>> +    ctx = rcu_dereference(ce->gem_context);
>>       if (ctx && !kref_get_unless_zero(&ctx->ref))
>>           ctx = NULL;
>>       rcu_read_unlock();
>> @@ -1396,8 +1396,8 @@ static bool record_context(struct 
>> i915_gem_context_coredump *e,
>>       e->guilty = atomic_read(&ctx->guilty_count);
>>       e->active = atomic_read(&ctx->active_count);
>>   -    e->total_runtime = 
>> intel_context_get_total_runtime_ns(rq->context);
>> -    e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context);
>> +    e->total_runtime = intel_context_get_total_runtime_ns(ce);
>> +    e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
>>         simulated = i915_gem_context_no_error_capture(ctx);
>>   @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct 
>> intel_engine_cs *engine, gfp_t gfp, u32 dump_
>>       return ee;
>>   }
>>   +static struct intel_engine_capture_vma *
>> +engine_coredump_add_context(struct intel_engine_coredump *ee,
>> +                struct intel_context *ce,
>> +                gfp_t gfp)
>> +{
>> +    struct intel_engine_capture_vma *vma = NULL;
>> +
>> +    ee->simulated |= record_context(&ee->context, ce);
>> +    if (ee->simulated)
>> +        return NULL;
>> +
>> +    /*
>> +     * We need to copy these to an anonymous buffer
>> +     * as the simplest method to avoid being overwritten
>> +     * by userspace.
>> +     */
>> +    vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
>> +    vma = capture_vma(vma, ce->state, "HW context", gfp);
>> +
>> +    return vma;
>> +}
>> +
>>   struct intel_engine_capture_vma *
>>   intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
>>                     struct i915_request *rq,
>>                     gfp_t gfp)
>>   {
>> -    struct intel_engine_capture_vma *vma = NULL;
>> +    struct intel_engine_capture_vma *vma;
>>   -    ee->simulated |= record_context(&ee->context, rq);
>> -    if (ee->simulated)
>> +    vma = engine_coredump_add_context(ee, rq->context, gfp);
>> +    if (!vma)
>>           return NULL;
>>         /*
>> @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct 
>> intel_engine_coredump *ee,
>>        */
>>       vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch");
>>       vma = capture_user(vma, rq, gfp);
>> -    vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
>> -    vma = capture_vma(vma, rq->context->state, "HW context", gfp);
>>         ee->rq_head = rq->head;
>>       ee->rq_post = rq->postfix;
>> @@ -1604,25 +1624,31 @@ capture_engine(struct intel_engine_cs *engine,
>>           return NULL;
>>         intel_engine_get_hung_entity(engine, &ce, &rq);
>> -    if (!rq || !i915_request_started(rq))
>> -        goto no_request_capture;
>> +    if (rq && !i915_request_started(rq)) {
>> +        drm_info(&engine->gt->i915->drm, "Got hung context on %s 
>> with active request %lld:%lld [0x%04X] not yet started\n",
>> +             engine->name, rq->fence.context, rq->fence.seqno, 
>> ce->guc_id.id);
>> +        i915_request_put(rq);
>> +        rq = NULL;
>> +    }
>>   -    capture = intel_engine_coredump_add_request(ee, rq, 
>> ATOMIC_MAYFAIL);
>> -    if (!capture)
>> -        goto no_request_capture;
>> -    if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>> -        intel_guc_capture_get_matching_node(engine->gt, ee, ce);
>> +    if (rq) {
>> +        capture = intel_engine_coredump_add_request(ee, rq, 
>> ATOMIC_MAYFAIL);
>> +        i915_request_put(rq);
>> +    } else if (ce) {
>> +        capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
>> +    }
>>   -    intel_engine_coredump_add_vma(ee, capture, compress);
>> -    i915_request_put(rq);
>> +    if (capture) {
>> +        intel_engine_coredump_add_vma(ee, capture, compress);
>>   -    return ee;
>> +        if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>> +            intel_guc_capture_get_matching_node(engine->gt, ee, ce);
>> +    } else {
>> +        kfree(ee);
>> +        ee = NULL;
>> +    }
>>   -no_request_capture:
>> -    if (rq)
>> -        i915_request_put(rq);
>> -    kfree(ee);
>> -    return NULL;
>> +    return ee;
>>   }
>>     static void
>


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Intel-gfx] [PATCH v4 2/7] drm/i915: Fix up locking around dumping requests lists
  2023-01-24 14:40   ` [Intel-gfx] " Tvrtko Ursulin
@ 2023-01-25 18:00     ` John Harrison
  2023-01-25 18:12       ` Tvrtko Ursulin
  0 siblings, 1 reply; 20+ messages in thread
From: John Harrison @ 2023-01-25 18:00 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-GFX; +Cc: DRI-Devel

On 1/24/2023 06:40, Tvrtko Ursulin wrote:
> On 20/01/2023 23:28, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> The debugfs dump of requests was confused about what state requires
>> the execlist lock versus the GuC lock. There was also a bunch of
>> duplicated messy code between it and the error capture code.
>>
>> So refactor the hung request search into a re-usable function. And
>> reduce the span of the execlist state lock to only the execlist
>> specific code paths. In order to do that, also move the report of hold
>> count (which is an execlist only concept) from the top level dump
>> function to the lower level execlist specific function. Also, move the
>> execlist specific code into the execlist source file.
>>
>> v2: Rename some functions and move to more appropriate files (Daniele).
>
> Continuing from yesterday where you pointed out 2/7 exists, after I 
> declared capitulation on 1/7.. I think this refactor makes sense and 
> definitely improves things a lot.
>
> On the high level I am only unsure if the patch split could be 
> improved. There seem to be three separate things, correct me if I 
> missed something:
>
> 1) Locking fix in intel_guc_find_hung_context
This change is already its own patch - #1/7. Can't really split that
one up any further. Changing the internal GuC code requires changing
the external common code to match.

> 2) Ref counting change throughout
> 3) Locking refactor / helper consolidation
These two being the changes in this patch - #2/7, yes?

The problem is that the reference counting fixes can only be done once 
the code has been refactored/reordered. And the refactor/reorder can 
only be done if the reference counting is fixed. I guess there would be 
some way to do the re-order first but it would require making even more 
of a mess of the spinlock activity to keep it all correct around that 
intermediate stage. So I don't think it would noticeably simplify the patch.

>
> (Or 2 and 3 swapped around, not sure.)
>
> That IMO might be a bit easier to read because the first patch wouldn't
> have two logical changes in it. Maybe easier to backport too, if it
> comes to that?
I'm not seeing 'two logical changes' in the first patch. Patch #1 fixes
the reference counting around finding the hung request. That involves
adding a reference count internally within the spinlock on the GuC side,
moving the external reference count to within the spinlock on the
execlist side, and then doing a put in all cases. That really is a
single change. It can't be split without either a) introducing a get/put
mismatch bug or b) making the code really ugly as an intermediate step
(while still leaving one or other side broken).
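
To spell the pattern out, the execlists half ends up as (roughly, as per
the diff in patch 1):

	spin_lock_irqsave(&engine->sched_engine->lock, flags);
	rq = intel_engine_execlist_find_hung_request(engine);
	/* get inside the lock, before the list can change under us */
	if (rq)
		rq = i915_request_get_rcu(rq);
	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
	...
	/* and a matching put on every exit path */
	if (rq)
		i915_request_put(rq);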

John.

>
> On the low level it all looks fine to me - hopefully Daniele can do a 
> detailed pass.
>
> Regards,
>
> Tvrtko
>
> P.S. Only that intel_context_find_active_request_get hurts my eyes, 
> and inflates the diff. I wouldn't rename it but if you guys insist okay.
>
>> <snip>

* Re: [Intel-gfx] [PATCH v4 2/7] drm/i915: Fix up locking around dumping requests lists
  2023-01-25 18:00     ` John Harrison
@ 2023-01-25 18:12       ` Tvrtko Ursulin
  2023-01-25 18:17         ` John Harrison
  0 siblings, 1 reply; 20+ messages in thread
From: Tvrtko Ursulin @ 2023-01-25 18:12 UTC (permalink / raw)
  To: John Harrison, Intel-GFX; +Cc: DRI-Devel


On 25/01/2023 18:00, John Harrison wrote:
> On 1/24/2023 06:40, Tvrtko Ursulin wrote:
>> On 20/01/2023 23:28, John.C.Harrison@Intel.com wrote:
>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>
>>> The debugfs dump of requests was confused about what state requires
>>> the execlist lock versus the GuC lock. There was also a bunch of
>>> duplicated messy code between it and the error capture code.
>>>
>>> So refactor the hung request search into a re-usable function. And
>>> reduce the span of the execlist state lock to only the execlist
>>> specific code paths. In order to do that, also move the report of hold
>>> count (which is an execlist only concept) from the top level dump
>>> function to the lower level execlist specific function. Also, move the
>>> execlist specific code into the execlist source file.
>>>
>>> v2: Rename some functions and move to more appropriate files (Daniele).
>>
>> Continuing from yesterday where you pointed out 2/7 exists, after I 
>> declared capitulation on 1/7.. I think this refactor makes sense and 
>> definitely improves things a lot.
>>
>> On the high level I am only unsure if the patch split could be 
>> improved. There seem to be three separate things, correct me if I 
>> missed something:
>>
>> 1) Locking fix in intel_guc_find_hung_context
> This change is already its own patch - #1/7. Can't really split that 
> one up any further. Changing the internal GuC code requires changing 
> the external common code to match.
> 
>> 2) Ref counting change throughout
>> 3) Locking refactor / helper consolidation
> These two being the changes in this patch - #2/7, yes?
> 
> The problem is that the reference counting fixes can only be done once 
> the code has been refactored/reordered. And the refactor/reorder can 
> only be done if the reference counting is fixed. I guess there would be 
> some way to do the re-order first but it would require making even more 
> of a mess of the spinlock activity to keep it all correct around that 
> intermediate stage. So I don't think it would noticeably simplify the 
> patch.
> 
>>
>> (Or 2 and 3 swapped around, not sure.)
>>
>> That IMO might be a bit easier to read because first patch wouldn't 
>> have two logical changes in it. Maybe easier to backport too if it 
>> comes to that?
> I'm not seeing 'two logical changes' in the first patch. Patch #1 fixes 
> the reference counting around finding the hung request. That involves 
> taking a reference internally, within the spinlock, on the GuC side, 
> moving the external reference acquisition to within the spinlock on the 
> execlist side, and then doing a put in all cases. That really is a single 
> change. It can't be split without either a) introducing a get/put 
> mismatch bug or b) making the code really ugly as an intermediate step 
> (while still leaving one side or the other broken).

I was thinking this part is wholly standalone:

@@ -4820,6 +4821,8 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
  
  	xa_lock_irqsave(&guc->context_lookup, flags);
  	xa_for_each(&guc->context_lookup, index, ce) {
+		bool found;
+
  		if (!kref_get_unless_zero(&ce->ref))
  			continue;
  
@@ -4836,10 +4839,18 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
  				goto next;
  		}
  
+		found = false;
+		spin_lock(&ce->guc_state.lock);
  		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
  			if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
  				continue;
  
+			found = true;
+			break;
+		}
+		spin_unlock(&ce->guc_state.lock);
+
+		if (found) {
  			intel_engine_set_hung_context(engine, ce);
  
  			/* Can only cope with one hang at a time... */
@@ -4847,6 +4858,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
  			xa_lock(&guc->context_lookup);
  			goto done;
  		}
+
  next:
  		intel_context_put(ce);
  		xa_lock(&guc->context_lookup);

Am I missing something?
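
For the record, here is a compact userspace sketch of the pattern that
hunk introduces: walk the request list under the lock that protects it,
record only whether a match was found, and perform the heavier
set-hung-context step after the lock is dropped. All names and the
pthread mutex are illustrative stand-ins rather than the driver's
actual code.

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

struct req {
	bool active;
	struct req *next;
};

struct ctx {
	pthread_mutex_t state_lock;
	struct req *requests;	/* protected by state_lock */
};

/* Stand-in for work that must not run under state_lock. */
static void set_hung_context(struct ctx *c)
{
	printf("context %p marked as hung\n", (void *)c);
}

static bool ctx_has_active_request(struct ctx *c)
{
	struct req *r;
	bool found = false;

	/* Walk the list under the lock that protects it... */
	pthread_mutex_lock(&c->state_lock);
	for (r = c->requests; r; r = r->next) {
		if (r->active) {
			found = true;
			break;
		}
	}
	pthread_mutex_unlock(&c->state_lock);

	/* ...but only report the result; the caller acts on it unlocked. */
	return found;
}

int main(void)
{
	struct req r = { .active = true, .next = NULL };
	struct ctx c = {
		.state_lock = PTHREAD_MUTEX_INITIALIZER,
		.requests = &r,
	};

	if (ctx_has_active_request(&c))
		set_hung_context(&c);	/* runs with state_lock dropped */

	return 0;
}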

Regards,

Tvrtko


* Re: [Intel-gfx] [PATCH v4 2/7] drm/i915: Fix up locking around dumping requests lists
  2023-01-25 18:12       ` Tvrtko Ursulin
@ 2023-01-25 18:17         ` John Harrison
  0 siblings, 0 replies; 20+ messages in thread
From: John Harrison @ 2023-01-25 18:17 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-GFX; +Cc: DRI-Devel

On 1/25/2023 10:12, Tvrtko Ursulin wrote:
> On 25/01/2023 18:00, John Harrison wrote:
>> On 1/24/2023 06:40, Tvrtko Ursulin wrote:
>>> On 20/01/2023 23:28, John.C.Harrison@Intel.com wrote:
>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>
>>>> The debugfs dump of requests was confused about what state requires
>>>> the execlist lock versus the GuC lock. There was also a bunch of
>>>> duplicated messy code between it and the error capture code.
>>>>
>>>> So refactor the hung request search into a re-usable function. And
>>>> reduce the span of the execlist state lock to only the execlist
>>>> specific code paths. In order to do that, also move the report of hold
>>>> count (which is an execlist only concept) from the top level dump
>>>> function to the lower level execlist specific function. Also, move the
>>>> execlist specific code into the execlist source file.
>>>>
>>>> v2: Rename some functions and move to more appropriate files 
>>>> (Daniele).
>>>
>>> Continuing from yesterday where you pointed out 2/7 exists, after I 
>>> declared capitulation on 1/7.. I think this refactor makes sense and 
>>> definitely improves things a lot.
>>>
>>> On the high level I am only unsure if the patch split could be 
>>> improved. There seem to be three separate things, correct me if I 
>>> missed something:
>>>
>>> 1) Locking fix in intel_guc_find_hung_context
>> This change is already its own patch - #1/7. Can't really split that 
>> one up any further. Changing the internal GuC code requires changing 
>> the external common code to match.
>>
>>> 2) Ref counting change throughout
>>> 3) Locking refactor / helper consolidation
>> These two being the changes in this patch - #2/7, yes?
>>
>> The problem is that the reference counting fixes can only be done 
>> once the code has been refactored/reordered. And the refactor/reorder 
>> can only be done if the reference counting is fixed. I guess there 
>> would be some way to do the re-order first but it would require 
>> making even more of a mess of the spinlock activity to keep it all 
>> correct around that intermediate stage. So I don't think it would 
>> noticeably simplify the patch.
>>
>>>
>>> (Or 2 and 3 swapped around, not sure.)
>>>
>>> That IMO might be a bit easier to read because first patch wouldn't 
>>> have two logical changes in it. Maybe easier to backport too if it 
>>> comes to that?
>> I'm not seeing 'two logical changes' in the first patch. Patch #1 
>> fixes the reference counting around finding the hung request. That 
>> involves taking a reference internally, within the spinlock, on the 
>> GuC side, moving the external reference acquisition to within the 
>> spinlock on the execlist side, and then doing a put in all cases. That 
>> really is a single change. It can't be split without either a) 
>> introducing a get/put mismatch bug or b) making the code really ugly 
>> as an intermediate step (while still leaving one side or the other broken).
>
> I was thinking this part is wholly standalone:
>
> @@ -4820,6 +4821,8 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>
>      xa_lock_irqsave(&guc->context_lookup, flags);
>      xa_for_each(&guc->context_lookup, index, ce) {
> +        bool found;
> +
>          if (!kref_get_unless_zero(&ce->ref))
>              continue;
>
> @@ -4836,10 +4839,18 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>                  goto next;
>          }
>
> +        found = false;
> +        spin_lock(&ce->guc_state.lock);
>          list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
>              if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
>                  continue;
>
> +            found = true;
> +            break;
> +        }
> +        spin_unlock(&ce->guc_state.lock);
> +
> +        if (found) {
>              intel_engine_set_hung_context(engine, ce);
>
>              /* Can only cope with one hang at a time... */
> @@ -4847,6 +4858,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>              xa_lock(&guc->context_lookup);
>              goto done;
>          }
> +
>  next:
>          intel_context_put(ce);
>          xa_lock(&guc->context_lookup);
>
> Am I missing something?
Doh.

Yes, I guess that part is standalone. I was getting myself confused and 
thinking it was part of moving a get inside the spinlock. But you are 
right, that part is just about using the correct spinlock for that loop.

So yeah, I can split that chunk out into a separate patch. But that is 
splitting patch #1 into #1a and #1b. It doesn't help with patch #2, 
which is the one I thought you were complaining about being too complex. 
Which it is :(. But I'm really not seeing any way to simplify it given 
how much of a mess the code is in.

John.



* Re: [PATCH v4 1/7] drm/i915: Fix request locking during error capture & debugfs dump
  2023-01-23 17:51   ` Tvrtko Ursulin
  2023-01-23 20:35     ` John Harrison
@ 2023-01-25 22:04     ` John Harrison
  1 sibling, 0 replies; 20+ messages in thread
From: John Harrison @ 2023-01-25 22:04 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-GFX; +Cc: Daniele Ceraolo Spurio, DRI-Devel

On 1/23/2023 09:51, Tvrtko Ursulin wrote:
> On 20/01/2023 23:28, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> <snip>
>>
>>   -struct i915_request *intel_context_find_active_request(struct intel_context *ce)
>> +struct i915_request *intel_context_find_active_request_get(struct intel_context *ce)
>
> TBH I don't "dig" this name, it's a bit on the long side and feels out 
> of character. I won't insist it be changed, but if get really has to 
> be included in the name I would be happy with 
> intel_context_get_active_request().
Daniele sided with you on this one. Will use your naming.

John.



end of thread

Thread overview: 20+ messages
2023-01-20 23:28 [PATCH v4 0/7] Allow error capture without a request & fix locking issues John.C.Harrison
2023-01-20 23:28 ` [PATCH v4 1/7] drm/i915: Fix request locking during error capture & debugfs dump John.C.Harrison
2023-01-23 17:51   ` Tvrtko Ursulin
2023-01-23 20:35     ` John Harrison
2023-01-25 22:04     ` John Harrison
2023-01-20 23:28 ` [PATCH v4 2/7] drm/i915: Fix up locking around dumping requests lists John.C.Harrison
2023-01-20 23:40   ` John Harrison
2023-01-24 14:40   ` [Intel-gfx] " Tvrtko Ursulin
2023-01-25 18:00     ` John Harrison
2023-01-25 18:12       ` Tvrtko Ursulin
2023-01-25 18:17         ` John Harrison
2023-01-25  0:31   ` Ceraolo Spurio, Daniele
2023-01-20 23:28 ` [PATCH v4 3/7] drm/i915: Allow error capture without a request John.C.Harrison
2023-01-25  0:39   ` [Intel-gfx] " Ceraolo Spurio, Daniele
2023-01-25  0:56     ` John Harrison
2023-01-20 23:28 ` [PATCH v4 4/7] drm/i915: Allow error capture of a pending request John.C.Harrison
2023-01-20 23:28 ` [PATCH v4 5/7] drm/i915/guc: Look for a guilty context when an engine reset fails John.C.Harrison
2023-01-20 23:28 ` [PATCH v4 6/7] drm/i915/guc: Add a debug print on GuC triggered reset John.C.Harrison
2023-01-20 23:28 ` [PATCH v4 7/7] drm/i915/guc: Rename GuC register state capture node to be more obvious John.C.Harrison
2023-01-25  0:44   ` [Intel-gfx] " Ceraolo Spurio, Daniele
