All of lore.kernel.org
* [PATCH v3 0/6] Allow error capture without a request & fix locking issues
@ 2023-01-19  6:49 ` John.C.Harrison
  0 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-19  6:49 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

It is technically possible to get a hung context without a valid
request. In such a situation, try to provide as much information in
the error capture as possible rather than just aborting and capturing
nothing.

Similarly, in the case of an engine reset failure the GuC is not able
to report the guilty context. So try a manual search instead of
reporting nothing.

While doing all this, it was noticed that the locking was broken in a
number of places when searching for hung requests and dumping request
info. So fix all that up as well.

v2: Tidy up code flow in error capture. Reword some comments/messages.
(review feedback from Tvrtko)
Also fix up request locking issues from earlier changes noticed during
code review of this change.
v3: Fix some potential null pointer derefs and a reference leak.
Add a new patch to refactor the duplicated hung request search code
into a common backend-agnostic wrapper function and to use the correct
spinlocks for the correct lists. Also tweak some of the patch
descriptions for better accuracy.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>


John Harrison (6):
  drm/i915: Fix request locking during error capture & debugfs dump
  drm/i915: Fix up locking around dumping requests lists
  drm/i915: Allow error capture without a request
  drm/i915: Allow error capture of a pending request
  drm/i915/guc: Look for a guilty context when an engine reset fails
  drm/i915/guc: Add a debug print on GuC triggered reset

 drivers/gpu/drm/i915/gt/intel_context.c       | 31 +++++++
 drivers/gpu/drm/i915/gt/intel_context.h       |  3 +
 drivers/gpu/drm/i915/gt/intel_engine_cs.c     | 49 ++++++-----
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 34 ++++++-
 drivers/gpu/drm/i915/i915_gpu_error.c         | 88 ++++++++++---------
 5 files changed, 138 insertions(+), 67 deletions(-)

-- 
2.39.0



* [PATCH v3 1/6] drm/i915: Fix request locking during error capture & debugfs dump
  2023-01-19  6:49 ` [Intel-gfx] " John.C.Harrison
@ 2023-01-19  6:49   ` John.C.Harrison
  -1 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-19  6:49 UTC (permalink / raw)
  To: Intel-GFX
  Cc: Matthew Brost, Tvrtko Ursulin, Andy Shevchenko, Michael Cheng,
	Aravind Iddamsetty, Alan Previn, Umesh Nerlige Ramappa,
	intel-gfx, Lucas De Marchi, Bruce Chang, Daniele Ceraolo Spurio,
	DRI-Devel, Andrzej Hajda, Rodrigo Vivi, Tejas Upadhyay,
	John Harrison, Matthew Auld

From: John Harrison <John.C.Harrison@Intel.com>

When GuC support was added to error capture, the locking around the
request object was broken. Fix it up.

The context-based search manages the spinlocking around the search
internally, so it needs to grab the reference count internally as
well. The execlist-only, request-based search relies on external
locking, so it needs an external reference count, but that reference
must be taken inside the spinlock, not outside it.
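
To illustrate (a simplified sketch of the two patterns, using the
functions touched by this patch rather than the literal diff):

	/* Context based search: lock and reference both handled internally */
	rq = intel_context_find_active_request(ce);	/* now returns a referenced request, or NULL */

	/* Execlist request based search: caller holds the lock and takes the reference */
	spin_lock_irqsave(&engine->sched_engine->lock, flags);
	rq = intel_engine_execlist_find_hung_request(engine);
	if (rq)
		rq = i915_request_get_rcu(rq);	/* must happen before the lock is dropped */
	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);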

The only other caller of the context-based search is the code for
dumping engine state to debugfs. That code wasn't previously getting
an explicit reference at all, as it does everything while holding the
execlist-specific spinlock. So that needs updating as well, because
that spinlock doesn't help when using GuC submission. Rather than
trying to conditionally get/put depending on the submission model,
just change it to always do the get/put.

In addition, intel_guc_find_hung_context() was not acquiring the
correct spinlock before searching the request list. So fix that up
too. While at it, add some extra whitespace padding for readability.
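
In sketch form (simplified from the actual diff below), the corrected
search becomes:

	bool found = false;

	spin_lock(&ce->guc_state.lock);
	list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
		if (i915_test_request_state(rq) == I915_REQUEST_ACTIVE) {
			found = true;
			break;
		}
	}
	spin_unlock(&ce->guc_state.lock);

	if (found)
		intel_engine_set_hung_context(engine, ce);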

v2: Explicitly document adding an extra blank line in some dense code
(Andy Shevchenko). Fix multiple potential null pointer derefs in the
case of no request being found (some spotted by Tvrtko, but there were
more!). Also fix a leaked request in the case of !started and another
in __guc_reset_context now that intel_context_find_active_request() is
actually reference counting the returned request.

Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC")
Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")
Cc: Matthew Brost <matthew.brost@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Jani Nikula <jani.nikula@linux.intel.com>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Cc: Andrzej Hajda <andrzej.hajda@intel.com>
Cc: Matthew Auld <matthew.auld@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Cc: Michael Cheng <michael.cheng@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
Cc: Bruce Chang <yu.bruce.chang@intel.com>
Cc: intel-gfx@lists.freedesktop.org
Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c           |  2 ++
 drivers/gpu/drm/i915/gt/intel_engine_cs.c         |  8 +++++++-
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 13 +++++++++++++
 drivers/gpu/drm/i915/i915_gpu_error.c             | 12 ++++++------
 4 files changed, 28 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index e94365b08f1ef..e7c5509c48ef1 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -552,6 +552,8 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
 
 		active = rq;
 	}
+	if (active)
+		active = i915_request_get_rcu(active);
 	spin_unlock_irqrestore(&parent->guc_state.lock, flags);
 
 	return active;
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 922f1bb22dc68..6a082658d0082 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -2236,10 +2236,14 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
 	guc = intel_uc_uses_guc_submission(&engine->gt->uc);
 	if (guc) {
 		ce = intel_engine_get_hung_context(engine);
-		if (ce)
+		if (ce) {
+			/* This will reference count the request (if found) */
 			hung_rq = intel_context_find_active_request(ce);
+		}
 	} else {
 		hung_rq = intel_engine_execlist_find_hung_request(engine);
+		if (hung_rq)
+			hung_rq = i915_request_get_rcu(hung_rq);
 	}
 
 	if (hung_rq)
@@ -2250,6 +2254,8 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
 	else
 		intel_engine_dump_active_requests(&engine->sched_engine->requests,
 						  hung_rq, m);
+	if (hung_rq)
+		i915_request_put(hung_rq);
 }
 
 void intel_engine_dump(struct intel_engine_cs *engine,
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index b436dd7f12e42..d123cbd90a919 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1702,6 +1702,7 @@ static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t st
 			goto next_context;
 
 		guilty = false;
+		/* NB: This gets a reference to the request */
 		rq = intel_context_find_active_request(ce);
 		if (!rq) {
 			head = ce->ring->tail;
@@ -1715,6 +1716,7 @@ static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t st
 		head = intel_ring_wrap(ce->ring, rq->head);
 
 		__i915_request_reset(rq, guilty);
+		i915_request_put(rq);
 out_replay:
 		guc_reset_state(ce, head, guilty);
 next_context:
@@ -4820,6 +4822,8 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 
 	xa_lock_irqsave(&guc->context_lookup, flags);
 	xa_for_each(&guc->context_lookup, index, ce) {
+		bool found;
+
 		if (!kref_get_unless_zero(&ce->ref))
 			continue;
 
@@ -4836,10 +4840,18 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 				goto next;
 		}
 
+		found = false;
+		spin_lock(&ce->guc_state.lock);
 		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
 			if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
 				continue;
 
+			found = true;
+			break;
+		}
+		spin_unlock(&ce->guc_state.lock);
+
+		if (found) {
 			intel_engine_set_hung_context(engine, ce);
 
 			/* Can only cope with one hang at a time... */
@@ -4847,6 +4859,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
 			xa_lock(&guc->context_lookup);
 			goto done;
 		}
+
 next:
 		intel_context_put(ce);
 		xa_lock(&guc->context_lookup);
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 9d5d5a397b64e..7ea36478ee52d 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1607,6 +1607,7 @@ capture_engine(struct intel_engine_cs *engine,
 	ce = intel_engine_get_hung_context(engine);
 	if (ce) {
 		intel_engine_clear_hung_context(engine);
+		/* This will reference count the request (if found) */
 		rq = intel_context_find_active_request(ce);
 		if (!rq || !i915_request_started(rq))
 			goto no_request_capture;
@@ -1618,21 +1619,18 @@ capture_engine(struct intel_engine_cs *engine,
 		if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
 			spin_lock_irqsave(&engine->sched_engine->lock, flags);
 			rq = intel_engine_execlist_find_hung_request(engine);
+			if (rq)
+				rq = i915_request_get_rcu(rq);
 			spin_unlock_irqrestore(&engine->sched_engine->lock,
 					       flags);
 		}
 	}
-	if (rq)
-		rq = i915_request_get_rcu(rq);
-
 	if (!rq)
 		goto no_request_capture;
 
 	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
-	if (!capture) {
-		i915_request_put(rq);
+	if (!capture)
 		goto no_request_capture;
-	}
 	if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
 		intel_guc_capture_get_matching_node(engine->gt, ee, ce);
 
@@ -1642,6 +1640,8 @@ capture_engine(struct intel_engine_cs *engine,
 	return ee;
 
 no_request_capture:
+	if (rq)
+		i915_request_put(rq);
 	kfree(ee);
 	return NULL;
 }
-- 
2.39.0



* [PATCH v3 2/6] drm/i915: Fix up locking around dumping requests lists
  2023-01-19  6:49 ` [Intel-gfx] " John.C.Harrison
@ 2023-01-19  6:49   ` John.C.Harrison
  -1 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-19  6:49 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

The debugfs dump of requests was confused about which state requires
the execlist lock versus the GuC lock. There was also a bunch of
duplicated, messy code shared between it and the error capture code.

So refactor the hung request search into a re-usable function, and
reduce the span of the execlist state lock to cover only the
execlist-specific code paths.
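
With the new helper, both the debugfs dump and the error capture code
reduce to roughly this call pattern (a sketch, not the literal diff):

	struct intel_context *hung_ce = NULL;
	struct i915_request *hung_rq = NULL;

	/* Backend agnostic: returns the hung context and a referenced hung request, if any */
	intel_get_hung_entity(engine, &hung_ce, &hung_rq);

	/* ... dump or capture using hung_ce / hung_rq ... */

	if (hung_rq)
		i915_request_put(hung_rq);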

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/intel_context.c   | 29 +++++++++++++
 drivers/gpu/drm/i915/gt/intel_context.h   |  3 ++
 drivers/gpu/drm/i915/gt/intel_engine_cs.c | 51 +++++++++++------------
 drivers/gpu/drm/i915/i915_gpu_error.c     | 27 ++----------
 4 files changed, 60 insertions(+), 50 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
index e7c5509c48ef1..a61f052092ed9 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.c
+++ b/drivers/gpu/drm/i915/gt/intel_context.c
@@ -559,6 +559,35 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
 	return active;
 }
 
+void intel_get_hung_entity(struct intel_engine_cs *engine,
+			   struct intel_context **ce, struct i915_request **rq)
+{
+	unsigned long flags;
+
+	*ce = intel_engine_get_hung_context(engine);
+	if (*ce) {
+		intel_engine_clear_hung_context(engine);
+
+		/* This will reference count the request (if found) */
+		*rq = intel_context_find_active_request(*ce);
+
+		return;
+	}
+
+	/*
+	 * Getting here with GuC enabled means it is a forced error capture
+	 * with no actual hang. So, no need to attempt the execlist search.
+	 */
+	if (intel_uc_uses_guc_submission(&engine->gt->uc))
+		return;
+
+	spin_lock_irqsave(&engine->sched_engine->lock, flags);
+	*rq = intel_engine_execlist_find_hung_request(engine);
+	if (*rq)
+		*rq = i915_request_get_rcu(*rq);
+	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
+}
+
 void intel_context_bind_parent_child(struct intel_context *parent,
 				     struct intel_context *child)
 {
diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
index fb62b7b8cbcda..ca50f3312a941 100644
--- a/drivers/gpu/drm/i915/gt/intel_context.h
+++ b/drivers/gpu/drm/i915/gt/intel_context.h
@@ -271,6 +271,9 @@ struct i915_request *intel_context_create_request(struct intel_context *ce);
 struct i915_request *
 intel_context_find_active_request(struct intel_context *ce);
 
+void intel_get_hung_entity(struct intel_engine_cs *engine,
+			   struct intel_context **ce, struct i915_request **rq);
+
 static inline bool intel_context_is_barrier(const struct intel_context *ce)
 {
 	return test_bit(CONTEXT_BARRIER_BIT, &ce->flags);
diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
index 6a082658d0082..5e173dfc8849e 100644
--- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
+++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
@@ -2216,11 +2216,27 @@ void intel_engine_dump_active_requests(struct list_head *requests,
 	}
 }
 
-static void engine_dump_active_requests(struct intel_engine_cs *engine, struct drm_printer *m)
+static void execlist_dump_active_requests(struct intel_engine_cs *engine,
+					  struct i915_request *hung_rq,
+					  struct drm_printer *m)
 {
+	unsigned long flags;
+
+	spin_lock_irqsave(&engine->sched_engine->lock, flags);
+
+	intel_engine_dump_active_requests(&engine->sched_engine->requests, hung_rq, m);
+
+	drm_printf(m, "\tOn hold?: %lu\n",
+		   list_count(&engine->sched_engine->hold));
+
+	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
+}
+
+static void engine_dump_active_requests(struct intel_engine_cs *engine,
+					struct drm_printer *m)
+{
+	struct intel_context *hung_ce = NULL;
 	struct i915_request *hung_rq = NULL;
-	struct intel_context *ce;
-	bool guc;
 
 	/*
 	 * No need for an engine->irq_seqno_barrier() before the seqno reads.
@@ -2229,31 +2245,20 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
 	 * But the intention here is just to report an instantaneous snapshot
 	 * so that's fine.
 	 */
-	lockdep_assert_held(&engine->sched_engine->lock);
+	intel_get_hung_entity(engine, &hung_ce, &hung_rq);
 
 	drm_printf(m, "\tRequests:\n");
 
-	guc = intel_uc_uses_guc_submission(&engine->gt->uc);
-	if (guc) {
-		ce = intel_engine_get_hung_context(engine);
-		if (ce) {
-			/* This will reference count the request (if found) */
-			hung_rq = intel_context_find_active_request(ce);
-		}
-	} else {
-		hung_rq = intel_engine_execlist_find_hung_request(engine);
-		if (hung_rq)
-			hung_rq = i915_request_get_rcu(hung_rq);
-	}
-
 	if (hung_rq)
 		engine_dump_request(hung_rq, m, "\t\thung");
+	else if (hung_ce)
+		drm_printf(m, "\t\tGot hung ce but no hung rq!\n");
 
-	if (guc)
+	if (intel_uc_uses_guc_submission(&engine->gt->uc))
 		intel_guc_dump_active_requests(engine, hung_rq, m);
 	else
-		intel_engine_dump_active_requests(&engine->sched_engine->requests,
-						  hung_rq, m);
+		execlist_dump_active_requests(engine, hung_rq, m);
+
 	if (hung_rq)
 		i915_request_put(hung_rq);
 }
@@ -2265,7 +2270,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
 	struct i915_gpu_error * const error = &engine->i915->gpu_error;
 	struct i915_request *rq;
 	intel_wakeref_t wakeref;
-	unsigned long flags;
 	ktime_t dummy;
 
 	if (header) {
@@ -2302,13 +2306,8 @@ void intel_engine_dump(struct intel_engine_cs *engine,
 		   i915_reset_count(error));
 	print_properties(engine, m);
 
-	spin_lock_irqsave(&engine->sched_engine->lock, flags);
 	engine_dump_active_requests(engine, m);
 
-	drm_printf(m, "\tOn hold?: %lu\n",
-		   list_count(&engine->sched_engine->hold));
-	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
-
 	drm_printf(m, "\tMMIO base:  0x%08x\n", engine->mmio_base);
 	wakeref = intel_runtime_pm_get_if_in_use(engine->uncore->rpm);
 	if (wakeref) {
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 7ea36478ee52d..78cf95e4dd230 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1596,36 +1596,15 @@ capture_engine(struct intel_engine_cs *engine,
 {
 	struct intel_engine_capture_vma *capture = NULL;
 	struct intel_engine_coredump *ee;
-	struct intel_context *ce;
+	struct intel_context *ce = NULL;
 	struct i915_request *rq = NULL;
-	unsigned long flags;
 
 	ee = intel_engine_coredump_alloc(engine, ALLOW_FAIL, dump_flags);
 	if (!ee)
 		return NULL;
 
-	ce = intel_engine_get_hung_context(engine);
-	if (ce) {
-		intel_engine_clear_hung_context(engine);
-		/* This will reference count the request (if found) */
-		rq = intel_context_find_active_request(ce);
-		if (!rq || !i915_request_started(rq))
-			goto no_request_capture;
-	} else {
-		/*
-		 * Getting here with GuC enabled means it is a forced error capture
-		 * with no actual hang. So, no need to attempt the execlist search.
-		 */
-		if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
-			spin_lock_irqsave(&engine->sched_engine->lock, flags);
-			rq = intel_engine_execlist_find_hung_request(engine);
-			if (rq)
-				rq = i915_request_get_rcu(rq);
-			spin_unlock_irqrestore(&engine->sched_engine->lock,
-					       flags);
-		}
-	}
-	if (!rq)
+	intel_get_hung_entity(engine, &ce, &rq);
+	if (!rq || !i915_request_started(rq))
 		goto no_request_capture;
 
 	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
-- 
2.39.0



* [PATCH v3 3/6] drm/i915: Allow error capture without a request
  2023-01-19  6:49 ` [Intel-gfx] " John.C.Harrison
@ 2023-01-19  6:49   ` John.C.Harrison
  -1 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-19  6:49 UTC (permalink / raw)
  To: Intel-GFX; +Cc: Umesh Nerlige Ramappa, John Harrison, DRI-Devel, Tvrtko Ursulin

From: John Harrison <John.C.Harrison@Intel.com>

There was a report of error captures occurring without any hung
context being indicated despite the capture being initiated by a 'hung
context notification' from GuC. The problem was not reproducible.
However, it can happen if the context in question has no active
requests. For example, if the hang was in the context switch itself
then the breadcrumb write would have occurred and the KMD would see an
idle context.

In the interests of attempting to provide as much information as
possible about a hang, it seems wise to include the engine info
regardless of whether a request was found or not, as opposed to just
pretending there was no hang at all.

So update the error capture code to always record engine information
if a context is given. This means updating record_context() to take a
context instead of a request (which it only ever used to find the
context anyway) and splitting the request-agnostic parts of
intel_engine_coredump_add_request() out into a separate function.
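
In outline (simplified from the diff below), the capture path becomes:

	intel_get_hung_entity(engine, &ce, &rq);

	if (rq)
		capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
	else if (ce)
		capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);

so that engine and context state is still recorded even when no
request can be found.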

v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null
pointer.
v3: Tidy up request locking code flow (Tvrtko)

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 70 ++++++++++++++++++---------
 1 file changed, 48 insertions(+), 22 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 78cf95e4dd230..743614fff5472 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct intel_engine_coredump *ee)
 }
 
 static bool record_context(struct i915_gem_context_coredump *e,
-			   const struct i915_request *rq)
+			   struct intel_context *ce)
 {
 	struct i915_gem_context *ctx;
 	struct task_struct *task;
 	bool simulated;
 
 	rcu_read_lock();
-	ctx = rcu_dereference(rq->context->gem_context);
+	ctx = rcu_dereference(ce->gem_context);
 	if (ctx && !kref_get_unless_zero(&ctx->ref))
 		ctx = NULL;
 	rcu_read_unlock();
@@ -1396,8 +1396,8 @@ static bool record_context(struct i915_gem_context_coredump *e,
 	e->guilty = atomic_read(&ctx->guilty_count);
 	e->active = atomic_read(&ctx->active_count);
 
-	e->total_runtime = intel_context_get_total_runtime_ns(rq->context);
-	e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context);
+	e->total_runtime = intel_context_get_total_runtime_ns(ce);
+	e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
 
 	simulated = i915_gem_context_no_error_capture(ctx);
 
@@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp, u32 dump_
 	return ee;
 }
 
+static struct intel_engine_capture_vma *
+engine_coredump_add_context(struct intel_engine_coredump *ee,
+			    struct intel_context *ce,
+			    gfp_t gfp)
+{
+	struct intel_engine_capture_vma *vma = NULL;
+
+	ee->simulated |= record_context(&ee->context, ce);
+	if (ee->simulated)
+		return NULL;
+
+	/*
+	 * We need to copy these to an anonymous buffer
+	 * as the simplest method to avoid being overwritten
+	 * by userspace.
+	 */
+	vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
+	vma = capture_vma(vma, ce->state, "HW context", gfp);
+
+	return vma;
+}
+
 struct intel_engine_capture_vma *
 intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
 				  struct i915_request *rq,
 				  gfp_t gfp)
 {
-	struct intel_engine_capture_vma *vma = NULL;
+	struct intel_engine_capture_vma *vma;
 
-	ee->simulated |= record_context(&ee->context, rq);
-	if (ee->simulated)
+	vma = engine_coredump_add_context(ee, rq->context, gfp);
+	if (!vma)
 		return NULL;
 
 	/*
@@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
 	 */
 	vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch");
 	vma = capture_user(vma, rq, gfp);
-	vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
-	vma = capture_vma(vma, rq->context->state, "HW context", gfp);
 
 	ee->rq_head = rq->head;
 	ee->rq_post = rq->postfix;
@@ -1604,25 +1624,31 @@ capture_engine(struct intel_engine_cs *engine,
 		return NULL;
 
 	intel_get_hung_entity(engine, &ce, &rq);
-	if (!rq || !i915_request_started(rq))
-		goto no_request_capture;
+	if (rq && !i915_request_started(rq)) {
+		drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n",
+			 engine->name);
+		i915_request_put(rq);
+		rq = NULL;
+	}
+
+	if (rq) {
+		capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
+		i915_request_put(rq);
+	} else if (ce) {
+		capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
+	}
 
-	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
-	if (!capture)
-		goto no_request_capture;
 	if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
 		intel_guc_capture_get_matching_node(engine->gt, ee, ce);
 
-	intel_engine_coredump_add_vma(ee, capture, compress);
-	i915_request_put(rq);
+	if (capture) {
+		intel_engine_coredump_add_vma(ee, capture, compress);
+	} else {
+		kfree(ee);
+		ee = NULL;
+	}
 
 	return ee;
-
-no_request_capture:
-	if (rq)
-		i915_request_put(rq);
-	kfree(ee);
-	return NULL;
 }
 
 static void
-- 
2.39.0



* [PATCH v3 4/6] drm/i915: Allow error capture of a pending request
  2023-01-19  6:49 ` [Intel-gfx] " John.C.Harrison
@ 2023-01-19  6:49   ` John.C.Harrison
  -1 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-19  6:49 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel, Tvrtko Ursulin

From: John Harrison <John.C.Harrison@Intel.com>

A hang situation has been observed where the only requests on the
context were either completed or not yet started according to the
breadcrumbs. However, the register state claimed a batch was (maybe)
in progress. So, allow capture of the pending request on the grounds
that this might be better than nothing.

v2: Reword 'not started' warning message (Tvrtko)

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 743614fff5472..0d1e68830b75e 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1624,12 +1624,9 @@ capture_engine(struct intel_engine_cs *engine,
 		return NULL;
 
 	intel_get_hung_entity(engine, &ce, &rq);
-	if (rq && !i915_request_started(rq)) {
-		drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n",
-			 engine->name);
-		i915_request_put(rq);
-		rq = NULL;
-	}
+	if (rq && !i915_request_started(rq))
+		drm_info(&engine->gt->i915->drm, "Got hung context on %s with active request %lld:%lld [0x%04X] not yet started\n",
+			 engine->name, rq->fence.context, rq->fence.seqno, ce->guc_id.id);
 
 	if (rq) {
 		capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
-- 
2.39.0



* [PATCH v3 5/6] drm/i915/guc: Look for a guilty context when an engine reset fails
  2023-01-19  6:49 ` [Intel-gfx] " John.C.Harrison
@ 2023-01-19  6:49   ` John.C.Harrison
  -1 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-19  6:49 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel, Tvrtko Ursulin

From: John Harrison <John.C.Harrison@Intel.com>

Engine resets are supposed to never fail. But when one does (due to
unknown reasons that normally come down to a missing w/a), it is
useful to get as much information out of the system as possible. Given
that the GuC intentionally dies in such a situation, it is not
possible to get a guilty context notification back. So do a manual
search instead. Since the GuC is dead, this is safe because it won't
be changing the engine state asynchronously.
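
In sketch form, the fallback added here is simply:

	for_each_engine_masked(engine, gt, reset_fail_mask, id)
		intel_guc_find_hung_context(engine);

	intel_gt_handle_error(gt, reset_fail_mask, I915_ERROR_CAPTURE,
			      "GuC failed to reset engine mask=0x%x",
			      reset_fail_mask);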

v2: Change comment to be less alarming (Tvrtko)

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index d123cbd90a919..7c5ea66218443 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4756,11 +4756,24 @@ static void reset_fail_worker_func(struct work_struct *w)
 	guc->submission_state.reset_fail_mask = 0;
 	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 
-	if (likely(reset_fail_mask))
+	if (likely(reset_fail_mask)) {
+		struct intel_engine_cs *engine;
+		enum intel_engine_id id;
+
+		/*
+		 * GuC is toast at this point - it dead loops after sending the failed
+		 * reset notification. So need to manually determine the guilty context.
+		 * Note that it should be reliable to do this here because the GuC is
+		 * toast and will not be scheduling behind the KMD's back.
+		 */
+		for_each_engine_masked(engine, gt, reset_fail_mask, id)
+			intel_guc_find_hung_context(engine);
+
 		intel_gt_handle_error(gt, reset_fail_mask,
 				      I915_ERROR_CAPTURE,
-				      "GuC failed to reset engine mask=0x%x\n",
+				      "GuC failed to reset engine mask=0x%x",
 				      reset_fail_mask);
+	}
 }
 
 int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
-- 
2.39.0



* [PATCH v3 6/6] drm/i915/guc: Add a debug print on GuC triggered reset
  2023-01-19  6:49 ` [Intel-gfx] " John.C.Harrison
@ 2023-01-19  6:50   ` John.C.Harrison
  -1 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-19  6:50 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel, Tvrtko Ursulin

From: John Harrison <John.C.Harrison@Intel.com>

For understanding bug reports, it can be useful to have an explicit
dmesg print when a reset notification is received from GuC, as opposed
to simply inferring that this happened from other messages.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 7c5ea66218443..30f79d333ae9b 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4667,6 +4667,10 @@ static void guc_handle_context_reset(struct intel_guc *guc,
 {
 	trace_intel_context_reset(ce);
 
+	drm_dbg(&guc_to_gt(guc)->i915->drm, "Got GuC reset of 0x%04X, exiting = %d, banned = %d\n",
+		ce->guc_id.id, test_bit(CONTEXT_EXITING, &ce->flags),
+		test_bit(CONTEXT_BANNED, &ce->flags));
+
 	if (likely(intel_context_is_schedulable(ce))) {
 		capture_error_state(guc, ce);
 		guc_context_replay(ce);
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [Intel-gfx] [PATCH v3 6/6] drm/i915/guc: Add a debug print on GuC triggered reset
@ 2023-01-19  6:50   ` John.C.Harrison
  0 siblings, 0 replies; 30+ messages in thread
From: John.C.Harrison @ 2023-01-19  6:50 UTC (permalink / raw)
  To: Intel-GFX; +Cc: DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

For understanding bug reports, it can be useful to have an explicit
dmesg print when a reset notification is received from GuC, as opposed
to simply inferring that this happened from other messages.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 7c5ea66218443..30f79d333ae9b 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4667,6 +4667,10 @@ static void guc_handle_context_reset(struct intel_guc *guc,
 {
 	trace_intel_context_reset(ce);
 
+	drm_dbg(&guc_to_gt(guc)->i915->drm, "Got GuC reset of 0x%04X, exiting = %d, banned = %d\n",
+		ce->guc_id.id, test_bit(CONTEXT_EXITING, &ce->flags),
+		test_bit(CONTEXT_BANNED, &ce->flags));
+
 	if (likely(intel_context_is_schedulable(ce))) {
 		capture_error_state(guc, ce);
 		guc_context_replay(ce);
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [Intel-gfx] ✓ Fi.CI.BAT: success for Allow error capture without a request & fix locking issues
  2023-01-19  6:49 ` [Intel-gfx] " John.C.Harrison
                   ` (6 preceding siblings ...)
  (?)
@ 2023-01-19  9:30 ` Patchwork
  -1 siblings, 0 replies; 30+ messages in thread
From: Patchwork @ 2023-01-19  9:30 UTC (permalink / raw)
  To: john.c.harrison; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 7788 bytes --]

== Series Details ==

Series: Allow error capture without a request & fix locking issues
URL   : https://patchwork.freedesktop.org/series/113078/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_12608 -> Patchwork_113078v1
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/index.html

Participating hosts (39 -> 39)
------------------------------

  Additional (2): fi-bsw-kefka fi-rkl-11600 
  Missing    (2): fi-kbl-soraka fi-snb-2520m 

Known issues
------------

  Here are the changes found in Patchwork_113078v1 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@debugfs_test@basic-hwmon:
    - fi-rkl-11600:       NOTRUN -> [SKIP][1] ([i915#7456])
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/fi-rkl-11600/igt@debugfs_test@basic-hwmon.html

  * igt@gem_huc_copy@huc-copy:
    - fi-rkl-11600:       NOTRUN -> [SKIP][2] ([i915#2190])
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/fi-rkl-11600/igt@gem_huc_copy@huc-copy.html

  * igt@gem_lmem_swapping@parallel-random-engines:
    - fi-rkl-11600:       NOTRUN -> [SKIP][3] ([i915#4613]) +3 similar issues
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/fi-rkl-11600/igt@gem_lmem_swapping@parallel-random-engines.html

  * igt@gem_tiled_pread_basic:
    - fi-rkl-11600:       NOTRUN -> [SKIP][4] ([i915#3282])
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/fi-rkl-11600/igt@gem_tiled_pread_basic.html

  * igt@i915_pm_backlight@basic-brightness:
    - fi-rkl-11600:       NOTRUN -> [SKIP][5] ([i915#7561])
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/fi-rkl-11600/igt@i915_pm_backlight@basic-brightness.html

  * igt@i915_selftest@live@gt_heartbeat:
    - fi-glk-j4005:       [PASS][6] -> [DMESG-FAIL][7] ([i915#5334])
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/fi-glk-j4005/igt@i915_selftest@live@gt_heartbeat.html
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/fi-glk-j4005/igt@i915_selftest@live@gt_heartbeat.html

  * igt@i915_suspend@basic-s3-without-i915:
    - fi-rkl-11600:       NOTRUN -> [INCOMPLETE][8] ([i915#4817])
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/fi-rkl-11600/igt@i915_suspend@basic-s3-without-i915.html

  * igt@kms_chamelium_hpd@dp-hpd-fast:
    - fi-rkl-11600:       NOTRUN -> [SKIP][9] ([i915#7828]) +7 similar issues
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/fi-rkl-11600/igt@kms_chamelium_hpd@dp-hpd-fast.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor:
    - fi-rkl-11600:       NOTRUN -> [SKIP][10] ([i915#4103])
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/fi-rkl-11600/igt@kms_cursor_legacy@basic-busy-flip-before-cursor.html

  * igt@kms_force_connector_basic@force-load-detect:
    - fi-rkl-11600:       NOTRUN -> [SKIP][11] ([fdo#109285] / [i915#4098])
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/fi-rkl-11600/igt@kms_force_connector_basic@force-load-detect.html

  * igt@kms_psr@sprite_plane_onoff:
    - fi-rkl-11600:       NOTRUN -> [SKIP][12] ([i915#1072]) +3 similar issues
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/fi-rkl-11600/igt@kms_psr@sprite_plane_onoff.html

  * igt@kms_setmode@basic-clone-single-crtc:
    - fi-rkl-11600:       NOTRUN -> [SKIP][13] ([i915#3555] / [i915#4098])
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/fi-rkl-11600/igt@kms_setmode@basic-clone-single-crtc.html

  * igt@prime_vgem@basic-fence-flip:
    - fi-bsw-kefka:       NOTRUN -> [SKIP][14] ([fdo#109271]) +25 similar issues
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/fi-bsw-kefka/igt@prime_vgem@basic-fence-flip.html

  * igt@prime_vgem@basic-read:
    - fi-rkl-11600:       NOTRUN -> [SKIP][15] ([fdo#109295] / [i915#3291] / [i915#3708]) +2 similar issues
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/fi-rkl-11600/igt@prime_vgem@basic-read.html

  * igt@prime_vgem@basic-userptr:
    - fi-rkl-11600:       NOTRUN -> [SKIP][16] ([fdo#109295] / [i915#3301] / [i915#3708])
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/fi-rkl-11600/igt@prime_vgem@basic-userptr.html

  
#### Possible fixes ####

  * igt@fbdev@write:
    - fi-blb-e6850:       [SKIP][17] ([fdo#109271]) -> [PASS][18] +4 similar issues
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/fi-blb-e6850/igt@fbdev@write.html
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/fi-blb-e6850/igt@fbdev@write.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor@atomic-transitions-varying-size:
    - fi-bsw-n3050:       [FAIL][19] ([i915#6298]) -> [PASS][20]
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/fi-bsw-n3050/igt@kms_cursor_legacy@basic-busy-flip-before-cursor@atomic-transitions-varying-size.html
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/fi-bsw-n3050/igt@kms_cursor_legacy@basic-busy-flip-before-cursor@atomic-transitions-varying-size.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109285]: https://bugs.freedesktop.org/show_bug.cgi?id=109285
  [fdo#109295]: https://bugs.freedesktop.org/show_bug.cgi?id=109295
  [i915#1072]: https://gitlab.freedesktop.org/drm/intel/issues/1072
  [i915#2190]: https://gitlab.freedesktop.org/drm/intel/issues/2190
  [i915#3282]: https://gitlab.freedesktop.org/drm/intel/issues/3282
  [i915#3291]: https://gitlab.freedesktop.org/drm/intel/issues/3291
  [i915#3301]: https://gitlab.freedesktop.org/drm/intel/issues/3301
  [i915#3555]: https://gitlab.freedesktop.org/drm/intel/issues/3555
  [i915#3708]: https://gitlab.freedesktop.org/drm/intel/issues/3708
  [i915#4098]: https://gitlab.freedesktop.org/drm/intel/issues/4098
  [i915#4103]: https://gitlab.freedesktop.org/drm/intel/issues/4103
  [i915#4312]: https://gitlab.freedesktop.org/drm/intel/issues/4312
  [i915#4613]: https://gitlab.freedesktop.org/drm/intel/issues/4613
  [i915#4817]: https://gitlab.freedesktop.org/drm/intel/issues/4817
  [i915#4983]: https://gitlab.freedesktop.org/drm/intel/issues/4983
  [i915#5334]: https://gitlab.freedesktop.org/drm/intel/issues/5334
  [i915#6298]: https://gitlab.freedesktop.org/drm/intel/issues/6298
  [i915#7456]: https://gitlab.freedesktop.org/drm/intel/issues/7456
  [i915#7561]: https://gitlab.freedesktop.org/drm/intel/issues/7561
  [i915#7828]: https://gitlab.freedesktop.org/drm/intel/issues/7828


Build changes
-------------

  * Linux: CI_DRM_12608 -> Patchwork_113078v1

  CI-20190529: 20190529
  CI_DRM_12608: 9bc304ac2c7586f2679ad346d4627db64328dc9b @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_7124: 6090d0a62e9c087429320c6fd3b6735ff68fde71 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_113078v1: 9bc304ac2c7586f2679ad346d4627db64328dc9b @ git://anongit.freedesktop.org/gfx-ci/linux


### Linux commits

62bd65201cb7 drm/i915/guc: Add a debug print on GuC triggered reset
5efe90cb242d drm/i915/guc: Look for a guilty context when an engine reset fails
91d6c0010c3d drm/i915: Allow error capture of a pending request
b0edb20cb9c1 drm/i915: Allow error capture without a request
ecef793b43fc drm/i915: Fix up locking around dumping requests lists
d16b8160c701 drm/i915: Fix request locking during error capture & debugfs dump

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/index.html

[-- Attachment #2: Type: text/html, Size: 8993 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 1/6] drm/i915: Fix request locking during error capture & debugfs dump
  2023-01-19  6:49   ` [Intel-gfx] " John.C.Harrison
@ 2023-01-19 15:16     ` Andy Shevchenko
  -1 siblings, 0 replies; 30+ messages in thread
From: Andy Shevchenko @ 2023-01-19 15:16 UTC (permalink / raw)
  To: John.C.Harrison
  Cc: Matthew Brost, Tvrtko Ursulin, Bruce Chang, Michael Cheng,
	Aravind Iddamsetty, Alan Previn, Umesh Nerlige Ramappa,
	Intel-GFX, Lucas De Marchi, Daniele Ceraolo Spurio, DRI-Devel,
	Andrzej Hajda, Rodrigo Vivi, Tejas Upadhyay, Matthew Auld

On Wed, Jan 18, 2023 at 10:49:55PM -0800, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> When GuC support was added to error capture, the locking around the
> request object was broken. Fix it up.
> 
> The context based search manages the spinlocking around the search
> internally. So it needs to grab the reference count internally as
> well. The execlist only request based search relies on external
> locking, so it needs an external reference count but within the
> spinlock not outside it.
> 
> The only other caller of the context based search is the code for
> dumping engine state to debugfs. That code wasn't previously getting
> an explicit reference at all as it does everything while holding the
> execlist specific spinlock. So, that needs updating as well, as that
> spinlock doesn't help when using GuC submission. Rather than trying to
> conditionally get/put depending on submission model, just change it to
> always do the get/put.
> 
> In addition, intel_guc_find_hung_context() was not acquiring the
> correct spinlock before searching the request list. So fix that up
> too. While at it, add some extra whitespace padding for readability.

...

> +		found = false;
> +		spin_lock(&ce->guc_state.lock);
>  		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
>  			if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
>  				continue;
>  
> +			found = true;
> +			break;
> +		}

This can be combined to (see also below)

		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
			if (i915_test_request_state(rq) == I915_REQUEST_ACTIVE)
				break;
		}

> +		spin_unlock(&ce->guc_state.lock);

Instead of 'found' you can check the current entry pointer

		if (!list_entry_is_head(...))

And because requests can only be messed with under the guc_state lock itself,
I think you don't need to perform the above check under the spinlock, so it's
safe.

> +		if (found) {
>  			intel_engine_set_hung_context(engine, ce);
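
Putting the two together, something like this perhaps (untested and purely
illustrative; assumes the stock list_entry_is_head() helper from
<linux/list.h> and the same names as the quoted hunk):

		spin_lock(&ce->guc_state.lock);
		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
			/* stop at the first request that is still active */
			if (i915_test_request_state(rq) == I915_REQUEST_ACTIVE)
				break;
		}
		spin_unlock(&ce->guc_state.lock);

		/* loop ran to completion => cursor is back at the list head */
		if (!list_entry_is_head(rq, &ce->guc_state.requests, sched.link)) {
			intel_engine_set_hung_context(engine, ce);
			/* ... same "one hang at a time" handling as in the patch ... */
		}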

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH v3 1/6] drm/i915: Fix request locking during error capture & debugfs dump
@ 2023-01-19 15:16     ` Andy Shevchenko
  0 siblings, 0 replies; 30+ messages in thread
From: Andy Shevchenko @ 2023-01-19 15:16 UTC (permalink / raw)
  To: John.C.Harrison
  Cc: Michael Cheng, Alan Previn, Intel-GFX, Lucas De Marchi,
	DRI-Devel, Andrzej Hajda, Rodrigo Vivi, Tejas Upadhyay,
	Matthew Auld

On Wed, Jan 18, 2023 at 10:49:55PM -0800, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> When GuC support was added to error capture, the locking around the
> request object was broken. Fix it up.
> 
> The context based search manages the spinlocking around the search
> internally. So it needs to grab the reference count internally as
> well. The execlist only request based search relies on external
> locking, so it needs an external reference count but within the
> spinlock not outside it.
> 
> The only other caller of the context based search is the code for
> dumping engine state to debugfs. That code wasn't previously getting
> an explicit reference at all as it does everything while holding the
> execlist specific spinlock. So, that needs updating as well, as that
> spinlock doesn't help when using GuC submission. Rather than trying to
> conditionally get/put depending on submission model, just change it to
> always do the get/put.
> 
> In addition, intel_guc_find_hung_context() was not acquiring the
> correct spinlock before searching the request list. So fix that up
> too. While at it, add some extra whitespace padding for readability.

...

> +		found = false;
> +		spin_lock(&ce->guc_state.lock);
>  		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
>  			if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
>  				continue;
>  
> +			found = true;
> +			break;
> +		}

This can be combined to (see also below)

		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
			if (i915_test_request_state(rq) == I915_REQUEST_ACTIVE)
				break;
		}

> +		spin_unlock(&ce->guc_state.lock);

Instead of 'found' you can check the current entry pointer

		if (!list_entry_is_head(...))

And because requests can only be messed with under the guc_state lock itself,
I think you don't need to perform the above check under the spinlock, so it's
safe.

> +		if (found) {
>  			intel_engine_set_hung_context(engine, ce);
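
Putting the two together, something like this perhaps (untested and purely
illustrative; assumes the stock list_entry_is_head() helper from
<linux/list.h> and the same names as the quoted hunk):

		spin_lock(&ce->guc_state.lock);
		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
			/* stop at the first request that is still active */
			if (i915_test_request_state(rq) == I915_REQUEST_ACTIVE)
				break;
		}
		spin_unlock(&ce->guc_state.lock);

		/* loop ran to completion => cursor is back at the list head */
		if (!list_entry_is_head(rq, &ce->guc_state.requests, sched.link)) {
			intel_engine_set_hung_context(engine, ce);
			/* ... same "one hang at a time" handling as in the patch ... */
		}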

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 1/6] drm/i915: Fix request locking during error capture & debugfs dump
  2023-01-19  6:49   ` [Intel-gfx] " John.C.Harrison
@ 2023-01-20  1:12     ` Ceraolo Spurio, Daniele
  -1 siblings, 0 replies; 30+ messages in thread
From: Ceraolo Spurio, Daniele @ 2023-01-20  1:12 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX
  Cc: Matthew Brost, Tvrtko Ursulin, Andy Shevchenko, Michael Cheng,
	Aravind Iddamsetty, Alan Previn, Umesh Nerlige Ramappa,
	Lucas De Marchi, Bruce Chang, DRI-Devel, Andrzej Hajda,
	Rodrigo Vivi, Tejas Upadhyay, Matthew Auld



On 1/18/2023 10:49 PM, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> When GuC support was added to error capture, the locking around the
> request object was broken. Fix it up.
>
> The context based search manages the spinlocking around the search
> internally. So it needs to grab the reference count internally as
> well. The execlist only request based search relies on external
> locking, so it needs an external reference count but within the
> spinlock not outside it.
>
> The only other caller of the context based search is the code for
> dumping engine state to debugfs. That code wasn't previously getting
> an explicit reference at all as it does everything while holding the
> execlist specific spinlock. So, that needs updating as well, as that
> spinlock doesn't help when using GuC submission. Rather than trying to
> conditionally get/put depending on submission model, just change it to
> always do the get/put.
>
> In addition, intel_guc_find_hung_context() was not acquiring the
> correct spinlock before searching the request list. So fix that up
> too. While at it, add some extra whitespace padding for readability.
>
> v2: Explicitly document adding an extra blank line in some dense code
> (Andy Shevchenko). Fix multiple potential null pointer derefs in case
> of no request found (some spotted by Tvrtko, but there were more!).
> Also fix a leaked request in case of !started and another in
> __guc_reset_context now that intel_context_find_active_request is
> actually reference counting the returned request.
>
> Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC")
> Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: John Harrison <John.C.Harrison@Intel.com>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> Cc: Michael Cheng <michael.cheng@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
> Cc: Bruce Chang <yu.bruce.chang@intel.com>
> Cc: intel-gfx@lists.freedesktop.org
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c           |  2 ++
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c         |  8 +++++++-
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 13 +++++++++++++
>   drivers/gpu/drm/i915/i915_gpu_error.c             | 12 ++++++------
>   4 files changed, 28 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index e94365b08f1ef..e7c5509c48ef1 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -552,6 +552,8 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)

Might be worth renaming this to intel_context_active_request_get(..), to 
make it clear that it does take a reference. Not a blocker.

>   
>   		active = rq;
>   	}
> +	if (active)
> +		active = i915_request_get_rcu(active);
>   	spin_unlock_irqrestore(&parent->guc_state.lock, flags);
>   
>   	return active;
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 922f1bb22dc68..6a082658d0082 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -2236,10 +2236,14 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
>   	guc = intel_uc_uses_guc_submission(&engine->gt->uc);
>   	if (guc) {
>   		ce = intel_engine_get_hung_context(engine);
> -		if (ce)
> +		if (ce) {
> +			/* This will reference count the request (if found) */
>   			hung_rq = intel_context_find_active_request(ce);
> +		}
>   	} else {
>   		hung_rq = intel_engine_execlist_find_hung_request(engine);
> +		if (hung_rq)
> +			hung_rq = i915_request_get_rcu(hung_rq);
>   	}

Not a fan of the behavior asymmetry, but given that the 2 flows are 
completely different I can't think of anything cleaner, so:

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

>   
>   	if (hung_rq)
> @@ -2250,6 +2254,8 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
>   	else
>   		intel_engine_dump_active_requests(&engine->sched_engine->requests,
>   						  hung_rq, m);
> +	if (hung_rq)
> +		i915_request_put(hung_rq);
>   }
>   
>   void intel_engine_dump(struct intel_engine_cs *engine,
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index b436dd7f12e42..d123cbd90a919 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1702,6 +1702,7 @@ static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t st
>   			goto next_context;
>   
>   		guilty = false;
> +		/* NB: This gets a reference to the request */
>   		rq = intel_context_find_active_request(ce);
>   		if (!rq) {
>   			head = ce->ring->tail;
> @@ -1715,6 +1716,7 @@ static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t st
>   		head = intel_ring_wrap(ce->ring, rq->head);
>   
>   		__i915_request_reset(rq, guilty);
> +		i915_request_put(rq);
>   out_replay:
>   		guc_reset_state(ce, head, guilty);
>   next_context:
> @@ -4820,6 +4822,8 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>   
>   	xa_lock_irqsave(&guc->context_lookup, flags);
>   	xa_for_each(&guc->context_lookup, index, ce) {
> +		bool found;
> +
>   		if (!kref_get_unless_zero(&ce->ref))
>   			continue;
>   
> @@ -4836,10 +4840,18 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>   				goto next;
>   		}
>   
> +		found = false;
> +		spin_lock(&ce->guc_state.lock);
>   		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
>   			if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
>   				continue;
>   
> +			found = true;
> +			break;
> +		}
> +		spin_unlock(&ce->guc_state.lock);
> +
> +		if (found) {
>   			intel_engine_set_hung_context(engine, ce);
>   
>   			/* Can only cope with one hang at a time... */
> @@ -4847,6 +4859,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>   			xa_lock(&guc->context_lookup);
>   			goto done;
>   		}
> +
>   next:
>   		intel_context_put(ce);
>   		xa_lock(&guc->context_lookup);
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 9d5d5a397b64e..7ea36478ee52d 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1607,6 +1607,7 @@ capture_engine(struct intel_engine_cs *engine,
>   	ce = intel_engine_get_hung_context(engine);
>   	if (ce) {
>   		intel_engine_clear_hung_context(engine);
> +		/* This will reference count the request (if found) */
>   		rq = intel_context_find_active_request(ce);
>   		if (!rq || !i915_request_started(rq))
>   			goto no_request_capture;
> @@ -1618,21 +1619,18 @@ capture_engine(struct intel_engine_cs *engine,
>   		if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
>   			spin_lock_irqsave(&engine->sched_engine->lock, flags);
>   			rq = intel_engine_execlist_find_hung_request(engine);
> +			if (rq)
> +				rq = i915_request_get_rcu(rq);
>   			spin_unlock_irqrestore(&engine->sched_engine->lock,
>   					       flags);
>   		}
>   	}
> -	if (rq)
> -		rq = i915_request_get_rcu(rq);
> -
>   	if (!rq)
>   		goto no_request_capture;
>   
>   	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
> -	if (!capture) {
> -		i915_request_put(rq);
> +	if (!capture)
>   		goto no_request_capture;
> -	}
>   	if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>   		intel_guc_capture_get_matching_node(engine->gt, ee, ce);
>   
> @@ -1642,6 +1640,8 @@ capture_engine(struct intel_engine_cs *engine,
>   	return ee;
>   
>   no_request_capture:
> +	if (rq)
> +		i915_request_put(rq);
>   	kfree(ee);
>   	return NULL;
>   }


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH v3 1/6] drm/i915: Fix request locking during error capture & debugfs dump
@ 2023-01-20  1:12     ` Ceraolo Spurio, Daniele
  0 siblings, 0 replies; 30+ messages in thread
From: Ceraolo Spurio, Daniele @ 2023-01-20  1:12 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX
  Cc: Andy Shevchenko, Michael Cheng, Alan Previn, Lucas De Marchi,
	DRI-Devel, Andrzej Hajda, Rodrigo Vivi, Tejas Upadhyay,
	Matthew Auld



On 1/18/2023 10:49 PM, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> When GuC support was added to error capture, the locking around the
> request object was broken. Fix it up.
>
> The context based search manages the spinlocking around the search
> internally. So it needs to grab the reference count internally as
> well. The execlist only request based search relies on external
> locking, so it needs an external reference count but within the
> spinlock not outside it.
>
> The only other caller of the context based search is the code for
> dumping engine state to debugfs. That code wasn't previously getting
> an explicit reference at all as it does everything while holding the
> execlist specific spinlock. So, that needs updating as well, as that
> spinlock doesn't help when using GuC submission. Rather than trying to
> conditionally get/put depending on submission model, just change it to
> always do the get/put.
>
> In addition, intel_guc_find_hung_context() was not acquiring the
> correct spinlock before searching the request list. So fix that up
> too. While at it, add some extra whitespace padding for readability.
>
> v2: Explicitly document adding an extra blank line in some dense code
> (Andy Shevchenko). Fix multiple potential null pointer derefs in case
> of no request found (some spotted by Tvrtko, but there were more!).
> Also fix a leaked request in case of !started and another in
> __guc_reset_context now that intel_context_find_active_request is
> actually reference counting the returned request.
>
> Fixes: dc0dad365c5e ("drm/i915/guc: Fix for error capture after full GPU reset with GuC")
> Fixes: 573ba126aef3 ("drm/i915/guc: Capture error state on context reset")
> Cc: Matthew Brost <matthew.brost@intel.com>
> Cc: John Harrison <John.C.Harrison@Intel.com>
> Cc: Jani Nikula <jani.nikula@linux.intel.com>
> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
> Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
> Cc: Matthew Auld <matthew.auld@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> Cc: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> Cc: Michael Cheng <michael.cheng@intel.com>
> Cc: Lucas De Marchi <lucas.demarchi@intel.com>
> Cc: Tejas Upadhyay <tejaskumarx.surendrakumar.upadhyay@intel.com>
> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Cc: Aravind Iddamsetty <aravind.iddamsetty@intel.com>
> Cc: Alan Previn <alan.previn.teres.alexis@intel.com>
> Cc: Bruce Chang <yu.bruce.chang@intel.com>
> Cc: intel-gfx@lists.freedesktop.org
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c           |  2 ++
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c         |  8 +++++++-
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 13 +++++++++++++
>   drivers/gpu/drm/i915/i915_gpu_error.c             | 12 ++++++------
>   4 files changed, 28 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index e94365b08f1ef..e7c5509c48ef1 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -552,6 +552,8 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)

Might be worth renaming this to intel_context_active_request_get(..), to 
make it clear that it does take a reference. Not a blocker.

>   
>   		active = rq;
>   	}
> +	if (active)
> +		active = i915_request_get_rcu(active);
>   	spin_unlock_irqrestore(&parent->guc_state.lock, flags);
>   
>   	return active;
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 922f1bb22dc68..6a082658d0082 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -2236,10 +2236,14 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
>   	guc = intel_uc_uses_guc_submission(&engine->gt->uc);
>   	if (guc) {
>   		ce = intel_engine_get_hung_context(engine);
> -		if (ce)
> +		if (ce) {
> +			/* This will reference count the request (if found) */
>   			hung_rq = intel_context_find_active_request(ce);
> +		}
>   	} else {
>   		hung_rq = intel_engine_execlist_find_hung_request(engine);
> +		if (hung_rq)
> +			hung_rq = i915_request_get_rcu(hung_rq);
>   	}

Not a fan of the behavior asymmetry, but given that the 2 flows are 
completely different I can't think of anything cleaner, so:

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

>   
>   	if (hung_rq)
> @@ -2250,6 +2254,8 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
>   	else
>   		intel_engine_dump_active_requests(&engine->sched_engine->requests,
>   						  hung_rq, m);
> +	if (hung_rq)
> +		i915_request_put(hung_rq);
>   }
>   
>   void intel_engine_dump(struct intel_engine_cs *engine,
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index b436dd7f12e42..d123cbd90a919 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1702,6 +1702,7 @@ static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t st
>   			goto next_context;
>   
>   		guilty = false;
> +		/* NB: This gets a reference to the request */
>   		rq = intel_context_find_active_request(ce);
>   		if (!rq) {
>   			head = ce->ring->tail;
> @@ -1715,6 +1716,7 @@ static void __guc_reset_context(struct intel_context *ce, intel_engine_mask_t st
>   		head = intel_ring_wrap(ce->ring, rq->head);
>   
>   		__i915_request_reset(rq, guilty);
> +		i915_request_put(rq);
>   out_replay:
>   		guc_reset_state(ce, head, guilty);
>   next_context:
> @@ -4820,6 +4822,8 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>   
>   	xa_lock_irqsave(&guc->context_lookup, flags);
>   	xa_for_each(&guc->context_lookup, index, ce) {
> +		bool found;
> +
>   		if (!kref_get_unless_zero(&ce->ref))
>   			continue;
>   
> @@ -4836,10 +4840,18 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>   				goto next;
>   		}
>   
> +		found = false;
> +		spin_lock(&ce->guc_state.lock);
>   		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
>   			if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
>   				continue;
>   
> +			found = true;
> +			break;
> +		}
> +		spin_unlock(&ce->guc_state.lock);
> +
> +		if (found) {
>   			intel_engine_set_hung_context(engine, ce);
>   
>   			/* Can only cope with one hang at a time... */
> @@ -4847,6 +4859,7 @@ void intel_guc_find_hung_context(struct intel_engine_cs *engine)
>   			xa_lock(&guc->context_lookup);
>   			goto done;
>   		}
> +
>   next:
>   		intel_context_put(ce);
>   		xa_lock(&guc->context_lookup);
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 9d5d5a397b64e..7ea36478ee52d 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1607,6 +1607,7 @@ capture_engine(struct intel_engine_cs *engine,
>   	ce = intel_engine_get_hung_context(engine);
>   	if (ce) {
>   		intel_engine_clear_hung_context(engine);
> +		/* This will reference count the request (if found) */
>   		rq = intel_context_find_active_request(ce);
>   		if (!rq || !i915_request_started(rq))
>   			goto no_request_capture;
> @@ -1618,21 +1619,18 @@ capture_engine(struct intel_engine_cs *engine,
>   		if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
>   			spin_lock_irqsave(&engine->sched_engine->lock, flags);
>   			rq = intel_engine_execlist_find_hung_request(engine);
> +			if (rq)
> +				rq = i915_request_get_rcu(rq);
>   			spin_unlock_irqrestore(&engine->sched_engine->lock,
>   					       flags);
>   		}
>   	}
> -	if (rq)
> -		rq = i915_request_get_rcu(rq);
> -
>   	if (!rq)
>   		goto no_request_capture;
>   
>   	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
> -	if (!capture) {
> -		i915_request_put(rq);
> +	if (!capture)
>   		goto no_request_capture;
> -	}
>   	if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>   		intel_guc_capture_get_matching_node(engine->gt, ee, ce);
>   
> @@ -1642,6 +1640,8 @@ capture_engine(struct intel_engine_cs *engine,
>   	return ee;
>   
>   no_request_capture:
> +	if (rq)
> +		i915_request_put(rq);
>   	kfree(ee);
>   	return NULL;
>   }


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH v3 2/6] drm/i915: Fix up locking around dumping requests lists
  2023-01-19  6:49   ` [Intel-gfx] " John.C.Harrison
  (?)
@ 2023-01-20  1:41   ` Ceraolo Spurio, Daniele
  -1 siblings, 0 replies; 30+ messages in thread
From: Ceraolo Spurio, Daniele @ 2023-01-20  1:41 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel



On 1/18/2023 10:49 PM, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> The debugfs dump of requests was confused about what state requires
> the execlist lock versus the GuC lock. There was also a bunch of
> duplicated messy code between it and the error capture code.
>
> So refactor the hung request search into a re-usable function. And
> reduce the span of the execlist state lock to only the execlist
> specific code paths.
>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_context.c   | 29 +++++++++++++
>   drivers/gpu/drm/i915/gt/intel_context.h   |  3 ++
>   drivers/gpu/drm/i915/gt/intel_engine_cs.c | 51 +++++++++++------------
>   drivers/gpu/drm/i915/i915_gpu_error.c     | 27 ++----------
>   4 files changed, 60 insertions(+), 50 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.c b/drivers/gpu/drm/i915/gt/intel_context.c
> index e7c5509c48ef1..a61f052092ed9 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.c
> +++ b/drivers/gpu/drm/i915/gt/intel_context.c
> @@ -559,6 +559,35 @@ struct i915_request *intel_context_find_active_request(struct intel_context *ce)
>   	return active;
>   }
>   
> +void intel_get_hung_entity(struct intel_engine_cs *engine,
> +			   struct intel_context **ce, struct i915_request **rq)
IMO this belongs in the engine_cs.c file, given that the engine is the 
input. Might also be worth renaming to intel_engine_*

> +{
> +	unsigned long flags;
> +
> +	*ce = intel_engine_get_hung_context(engine);
> +	if (*ce) {
> +		intel_engine_clear_hung_context(engine);
> +
> +		/* This will reference count the request (if found) */
> +		*rq = intel_context_find_active_request(*ce);
> +
> +		return;
> +	}
> +
> +	/*
> +	 * Getting here with GuC enabled means it is a forced error capture
> +	 * with no actual hang. So, no need to attempt the execlist search.
> +	 */
> +	if (intel_uc_uses_guc_submission(&engine->gt->uc))
> +		return;
> +
> +	spin_lock_irqsave(&engine->sched_engine->lock, flags);
> +	*rq = intel_engine_execlist_find_hung_request(engine);
> +	if (*rq)
> +		*rq = i915_request_get_rcu(*rq);
> +	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
> +}
> +
>   void intel_context_bind_parent_child(struct intel_context *parent,
>   				     struct intel_context *child)
>   {
> diff --git a/drivers/gpu/drm/i915/gt/intel_context.h b/drivers/gpu/drm/i915/gt/intel_context.h
> index fb62b7b8cbcda..ca50f3312a941 100644
> --- a/drivers/gpu/drm/i915/gt/intel_context.h
> +++ b/drivers/gpu/drm/i915/gt/intel_context.h
> @@ -271,6 +271,9 @@ struct i915_request *intel_context_create_request(struct intel_context *ce);
>   struct i915_request *
>   intel_context_find_active_request(struct intel_context *ce);
>   
> +void intel_get_hung_entity(struct intel_engine_cs *engine,
> +			   struct intel_context **ce, struct i915_request **rq);
> +
>   static inline bool intel_context_is_barrier(const struct intel_context *ce)
>   {
>   	return test_bit(CONTEXT_BARRIER_BIT, &ce->flags);
> diff --git a/drivers/gpu/drm/i915/gt/intel_engine_cs.c b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> index 6a082658d0082..5e173dfc8849e 100644
> --- a/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> +++ b/drivers/gpu/drm/i915/gt/intel_engine_cs.c
> @@ -2216,11 +2216,27 @@ void intel_engine_dump_active_requests(struct list_head *requests,
>   	}
>   }
>   
> -static void engine_dump_active_requests(struct intel_engine_cs *engine, struct drm_printer *m)
> +static void execlist_dump_active_requests(struct intel_engine_cs *engine,

Might be worth moving this to the execlists submission file, to be 
symmetrical with the GuC submission one. Other execlists functions are 
in this file though, so not a blocker.

> +					  struct i915_request *hung_rq,
> +					  struct drm_printer *m)
>   {
> +	unsigned long flags;
> +
> +	spin_lock_irqsave(&engine->sched_engine->lock, flags);
> +
> +	intel_engine_dump_active_requests(&engine->sched_engine->requests, hung_rq, m);
> +
> +	drm_printf(m, "\tOn hold?: %lu\n",
> +		   list_count(&engine->sched_engine->hold));
> +
> +	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
> +}
> +
> +static void engine_dump_active_requests(struct intel_engine_cs *engine,
> +					struct drm_printer *m)
> +{
> +	struct intel_context *hung_ce = NULL;
>   	struct i915_request *hung_rq = NULL;
> -	struct intel_context *ce;
> -	bool guc;
>   
>   	/*
>   	 * No need for an engine->irq_seqno_barrier() before the seqno reads.
> @@ -2229,31 +2245,20 @@ static void engine_dump_active_requests(struct intel_engine_cs *engine, struct d
>   	 * But the intention here is just to report an instantaneous snapshot
>   	 * so that's fine.
>   	 */
> -	lockdep_assert_held(&engine->sched_engine->lock);
> +	intel_get_hung_entity(engine, &hung_ce, &hung_rq);
>   
>   	drm_printf(m, "\tRequests:\n");
>   
> -	guc = intel_uc_uses_guc_submission(&engine->gt->uc);
> -	if (guc) {
> -		ce = intel_engine_get_hung_context(engine);
> -		if (ce) {
> -			/* This will reference count the request (if found) */
> -			hung_rq = intel_context_find_active_request(ce);
> -		}
> -	} else {
> -		hung_rq = intel_engine_execlist_find_hung_request(engine);
> -		if (hung_rq)
> -			hung_rq = i915_request_get_rcu(hung_rq);
> -	}
> -
>   	if (hung_rq)
>   		engine_dump_request(hung_rq, m, "\t\thung");
> +	else if (hung_ce)
> +		drm_printf(m, "\t\tGot hung ce but no hung rq!\n");
>   
> -	if (guc)
> +	if (intel_uc_uses_guc_submission(&engine->gt->uc))
>   		intel_guc_dump_active_requests(engine, hung_rq, m);
>   	else
> -		intel_engine_dump_active_requests(&engine->sched_engine->requests,
> -						  hung_rq, m);
> +		execlist_dump_active_requests(engine, hung_rq, m);
> +
>   	if (hung_rq)
>   		i915_request_put(hung_rq);
>   }
> @@ -2265,7 +2270,6 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>   	struct i915_gpu_error * const error = &engine->i915->gpu_error;
>   	struct i915_request *rq;
>   	intel_wakeref_t wakeref;
> -	unsigned long flags;
>   	ktime_t dummy;
>   
>   	if (header) {
> @@ -2302,13 +2306,8 @@ void intel_engine_dump(struct intel_engine_cs *engine,
>   		   i915_reset_count(error));
>   	print_properties(engine, m);
>   
> -	spin_lock_irqsave(&engine->sched_engine->lock, flags);
>   	engine_dump_active_requests(engine, m);
>   
> -	drm_printf(m, "\tOn hold?: %lu\n",
> -		   list_count(&engine->sched_engine->hold));

This print moves from the global function to the execlists submission 
one. AFAICS this is only used in execlists mode so not an issue, but a 
note in the commit message would've been nice.

> -	spin_unlock_irqrestore(&engine->sched_engine->lock, flags);
> -
>   	drm_printf(m, "\tMMIO base:  0x%08x\n", engine->mmio_base);
>   	wakeref = intel_runtime_pm_get_if_in_use(engine->uncore->rpm);
>   	if (wakeref) {
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 7ea36478ee52d..78cf95e4dd230 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1596,36 +1596,15 @@ capture_engine(struct intel_engine_cs *engine,
>   {
>   	struct intel_engine_capture_vma *capture = NULL;
>   	struct intel_engine_coredump *ee;
> -	struct intel_context *ce;
> +	struct intel_context *ce = NULL;
>   	struct i915_request *rq = NULL;
> -	unsigned long flags;
>   
>   	ee = intel_engine_coredump_alloc(engine, ALLOW_FAIL, dump_flags);
>   	if (!ee)
>   		return NULL;
>   
> -	ce = intel_engine_get_hung_context(engine);
> -	if (ce) {
> -		intel_engine_clear_hung_context(engine);
> -		/* This will reference count the request (if found) */
> -		rq = intel_context_find_active_request(ce);
> -		if (!rq || !i915_request_started(rq))
> -			goto no_request_capture;
> -	} else {
> -		/*
> -		 * Getting here with GuC enabled means it is a forced error capture
> -		 * with no actual hang. So, no need to attempt the execlist search.
> -		 */
> -		if (!intel_uc_uses_guc_submission(&engine->gt->uc)) {
> -			spin_lock_irqsave(&engine->sched_engine->lock, flags);
> -			rq = intel_engine_execlist_find_hung_request(engine);
> -			if (rq)
> -				rq = i915_request_get_rcu(rq);
> -			spin_unlock_irqrestore(&engine->sched_engine->lock,
> -					       flags);
> -		}
> -	}
> -	if (!rq)
> +	intel_get_hung_entity(engine, &ce, &rq);
> +	if (!rq || !i915_request_started(rq))
>   		goto no_request_capture;

This has been a pain to review, but AFAICS all the locking is correct, 
so with the get_hung function moved to the intel_engine file this is:

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

>   
>   	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH v3 3/6] drm/i915: Allow error capture without a request
  2023-01-19  6:49   ` [Intel-gfx] " John.C.Harrison
  (?)
@ 2023-01-20  1:54   ` Ceraolo Spurio, Daniele
  2023-01-20 17:36     ` John Harrison
  -1 siblings, 1 reply; 30+ messages in thread
From: Ceraolo Spurio, Daniele @ 2023-01-20  1:54 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel



On 1/18/2023 10:49 PM, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> There was a report of error captures occurring without any hung
> context being indicated despite the capture being initiated by a 'hung
> context notification' from GuC. The problem was not reproducible.
> However, it is possible to happen if the context in question has no
> active requests. For example, if the hang was in the context switch
> itself then the breadcrumb write would have occurred and the KMD would
> see an idle context.
>
> In the interests of attempting to provide as much information as
> possible about a hang, it seems wise to include the engine info
> regardless of whether a request was found or not, as opposed to just
> pretending there was no hang at all.
>
> So update the error capture code to always record engine information
> if a context is given. Which means updating record_context() to take a
> context instead of a request (which it only ever used to find the
> context anyway). And split the request agnostic parts of
> intel_engine_coredump_add_request() out into a separate function.
>
> v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null
> pointer.
> v3: Tidy up request locking code flow (Tvrtko)
>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   drivers/gpu/drm/i915/i915_gpu_error.c | 70 ++++++++++++++++++---------
>   1 file changed, 48 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
> index 78cf95e4dd230..743614fff5472 100644
> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
> @@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct intel_engine_coredump *ee)
>   }
>   
>   static bool record_context(struct i915_gem_context_coredump *e,
> -			   const struct i915_request *rq)
> +			   struct intel_context *ce)
>   {
>   	struct i915_gem_context *ctx;
>   	struct task_struct *task;
>   	bool simulated;
>   
>   	rcu_read_lock();
> -	ctx = rcu_dereference(rq->context->gem_context);
> +	ctx = rcu_dereference(ce->gem_context);
>   	if (ctx && !kref_get_unless_zero(&ctx->ref))
>   		ctx = NULL;
>   	rcu_read_unlock();
> @@ -1396,8 +1396,8 @@ static bool record_context(struct i915_gem_context_coredump *e,
>   	e->guilty = atomic_read(&ctx->guilty_count);
>   	e->active = atomic_read(&ctx->active_count);
>   
> -	e->total_runtime = intel_context_get_total_runtime_ns(rq->context);
> -	e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context);
> +	e->total_runtime = intel_context_get_total_runtime_ns(ce);
> +	e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
>   
>   	simulated = i915_gem_context_no_error_capture(ctx);
>   
> @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp, u32 dump_
>   	return ee;
>   }
>   
> +static struct intel_engine_capture_vma *
> +engine_coredump_add_context(struct intel_engine_coredump *ee,
> +			    struct intel_context *ce,
> +			    gfp_t gfp)
> +{
> +	struct intel_engine_capture_vma *vma = NULL;
> +
> +	ee->simulated |= record_context(&ee->context, ce);
> +	if (ee->simulated)
> +		return NULL;
> +
> +	/*
> +	 * We need to copy these to an anonymous buffer
> +	 * as the simplest method to avoid being overwritten
> +	 * by userspace.
> +	 */
> +	vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
> +	vma = capture_vma(vma, ce->state, "HW context", gfp);
> +
> +	return vma;
> +}
> +
>   struct intel_engine_capture_vma *
>   intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
>   				  struct i915_request *rq,
>   				  gfp_t gfp)
>   {
> -	struct intel_engine_capture_vma *vma = NULL;
> +	struct intel_engine_capture_vma *vma;
>   
> -	ee->simulated |= record_context(&ee->context, rq);
> -	if (ee->simulated)
> +	vma = engine_coredump_add_context(ee, rq->context, gfp);
> +	if (!vma)
>   		return NULL;
>   
>   	/*
> @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
>   	 */
>   	vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch");
>   	vma = capture_user(vma, rq, gfp);
> -	vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
> -	vma = capture_vma(vma, rq->context->state, "HW context", gfp);
>   
>   	ee->rq_head = rq->head;
>   	ee->rq_post = rq->postfix;
> @@ -1604,25 +1624,31 @@ capture_engine(struct intel_engine_cs *engine,
>   		return NULL;
>   
>   	intel_get_hung_entity(engine, &ce, &rq);
> -	if (!rq || !i915_request_started(rq))
> -		goto no_request_capture;
> +	if (rq && !i915_request_started(rq)) {
> +		drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n",
> +			 engine->name);

Shouldn't this print be inside the "else if" case below? Otherwise, if we
don't have an rq at all, we won't see it.
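
I.e. something along these lines, purely to illustrate the question (untested,
not a request for this exact code):

	if (rq && !i915_request_started(rq)) {
		i915_request_put(rq);
		rq = NULL;
	}

	if (rq) {
		capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
		i915_request_put(rq);
	} else if (ce) {
		/* covers both "no rq at all" and "rq never started" */
		drm_info(&engine->gt->i915->drm,
			 "Got hung context on %s with no active request!\n",
			 engine->name);
		capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
	}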

> +		i915_request_put(rq);
> +		rq = NULL;
> +	}
> +
> +	if (rq) {
> +		capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
> +		i915_request_put(rq);
> +	} else if (ce) {
> +		capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
> +	}
>   
> -	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
> -	if (!capture)
> -		goto no_request_capture;
>   	if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>   		intel_guc_capture_get_matching_node(engine->gt, ee, ce);

Are you keeping this outside the "if (capture)" below to make sure we 
consume the GuC engine capture even if we fail to produce the error 
state? If so, a comment might be useful.

Daniele

>   
> -	intel_engine_coredump_add_vma(ee, capture, compress);
> -	i915_request_put(rq);
> +	if (capture) {
> +		intel_engine_coredump_add_vma(ee, capture, compress);
> +	} else {
> +		kfree(ee);
> +		ee = NULL;
> +	}
>   
>   	return ee;
> -
> -no_request_capture:
> -	if (rq)
> -		i915_request_put(rq);
> -	kfree(ee);
> -	return NULL;
>   }
>   
>   static void


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 5/6] drm/i915/guc: Look for a guilty context when an engine reset fails
  2023-01-19  6:49   ` [Intel-gfx] " John.C.Harrison
@ 2023-01-20  2:06     ` Ceraolo Spurio, Daniele
  -1 siblings, 0 replies; 30+ messages in thread
From: Ceraolo Spurio, Daniele @ 2023-01-20  2:06 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel, Tvrtko Ursulin



On 1/18/2023 10:49 PM, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> Engine resets are supposed to never fail. But in the case when one
> does (due to unknown reasons that normally come down to a missing
> w/a), it is useful to get as much information out of the system as
> possible. Given that the GuC intentionally dies in such a situation,
> it is not possible to get a guilty context notification back. So do a
> manual search instead. Given that GuC is dead, this is safe because
> GuC won't be changing the engine state asynchronously.
>
> v2: Change comment to be less alarming (Tvrtko)
>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 +++++++++++++++--
>   1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index d123cbd90a919..7c5ea66218443 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -4756,11 +4756,24 @@ static void reset_fail_worker_func(struct work_struct *w)
>   	guc->submission_state.reset_fail_mask = 0;
>   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   
> -	if (likely(reset_fail_mask))
> +	if (likely(reset_fail_mask)) {
> +		struct intel_engine_cs *engine;
> +		enum intel_engine_id id;
> +
> +		/*
> +		 * GuC is toast at this point - it dead loops after sending the failed
> +		 * reset notification. So need to manually determine the guilty context.
> +		 * Note that it should be reliable to do this here because the GuC is
> +		 * toast and will not be scheduling behind the KMD's back.
> +		 */
> +		for_each_engine_masked(engine, gt, reset_fail_mask, id)
> +			intel_guc_find_hung_context(engine);


I double checked and AFAICS this behavior applies to all our supported 
GuC versions, so:

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

> +
>   		intel_gt_handle_error(gt, reset_fail_mask,
>   				      I915_ERROR_CAPTURE,
> -				      "GuC failed to reset engine mask=0x%x\n",
> +				      "GuC failed to reset engine mask=0x%x",
>   				      reset_fail_mask);
> +	}
>   }
>   
>   int intel_guc_engine_failure_process_msg(struct intel_guc *guc,


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH v3 5/6] drm/i915/guc: Look for a guilty context when an engine reset fails
@ 2023-01-20  2:06     ` Ceraolo Spurio, Daniele
  0 siblings, 0 replies; 30+ messages in thread
From: Ceraolo Spurio, Daniele @ 2023-01-20  2:06 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel



On 1/18/2023 10:49 PM, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
>
> Engine resets are supposed to never fail. But in the case when one
> does (due to unknown reasons that normally come down to a missing
> w/a), it is useful to get as much information out of the system as
> possible. Given that the GuC intentionally dies in such a situation,
> it is not possible to get a guilty context notification back. So do a
> manual search instead. Given that GuC is dead, this is safe because
> GuC won't be changing the engine state asynchronously.
>
> v2: Change comment to be less alarming (Tvrtko)
>
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> ---
>   .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 +++++++++++++++--
>   1 file changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index d123cbd90a919..7c5ea66218443 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -4756,11 +4756,24 @@ static void reset_fail_worker_func(struct work_struct *w)
>   	guc->submission_state.reset_fail_mask = 0;
>   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   
> -	if (likely(reset_fail_mask))
> +	if (likely(reset_fail_mask)) {
> +		struct intel_engine_cs *engine;
> +		enum intel_engine_id id;
> +
> +		/*
> +		 * GuC is toast at this point - it dead loops after sending the failed
> +		 * reset notification. So need to manually determine the guilty context.
> +		 * Note that it should be reliable to do this here because the GuC is
> +		 * toast and will not be scheduling behind the KMD's back.
> +		 */
> +		for_each_engine_masked(engine, gt, reset_fail_mask, id)
> +			intel_guc_find_hung_context(engine);


I double checked and AFAICS this behavior applies to all our supported 
GuC versions, so:

Reviewed-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>

Daniele

> +
>   		intel_gt_handle_error(gt, reset_fail_mask,
>   				      I915_ERROR_CAPTURE,
> -				      "GuC failed to reset engine mask=0x%x\n",
> +				      "GuC failed to reset engine mask=0x%x",
>   				      reset_fail_mask);
> +	}
>   }
>   
>   int intel_guc_engine_failure_process_msg(struct intel_guc *guc,


^ permalink raw reply	[flat|nested] 30+ messages in thread

* [Intel-gfx] ✓ Fi.CI.IGT: success for Allow error capture without a request & fix locking issues
  2023-01-19  6:49 ` [Intel-gfx] " John.C.Harrison
                   ` (7 preceding siblings ...)
  (?)
@ 2023-01-20  6:29 ` Patchwork
  -1 siblings, 0 replies; 30+ messages in thread
From: Patchwork @ 2023-01-20  6:29 UTC (permalink / raw)
  To: john.c.harrison; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 19554 bytes --]

== Series Details ==

Series: Allow error capture without a request & fix locking issues
URL   : https://patchwork.freedesktop.org/series/113078/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_12608_full -> Patchwork_113078v1_full
====================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/index.html

Participating hosts (12 -> 10)
------------------------------

  Missing    (2): pig-skl-6260u pig-kbl-iris 

Known issues
------------

  Here are the changes found in Patchwork_113078v1_full that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_exec_fair@basic-pace-share@rcs0:
    - shard-glk:          [PASS][1] -> [FAIL][2] ([i915#2842]) +1 similar issue
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-glk5/igt@gem_exec_fair@basic-pace-share@rcs0.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-glk3/igt@gem_exec_fair@basic-pace-share@rcs0.html

  
#### Possible fixes ####

  * igt@drm_fdinfo@idle@rcs0:
    - {shard-rkl}:        [FAIL][3] ([i915#7742]) -> [PASS][4]
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-rkl-1/igt@drm_fdinfo@idle@rcs0.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-rkl-4/igt@drm_fdinfo@idle@rcs0.html

  * igt@fbdev@pan:
    - {shard-rkl}:        [SKIP][5] ([i915#2582]) -> [PASS][6]
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-rkl-4/igt@fbdev@pan.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-rkl-6/igt@fbdev@pan.html

  * igt@feature_discovery@psr1:
    - {shard-rkl}:        [SKIP][7] ([i915#658]) -> [PASS][8]
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-rkl-1/igt@feature_discovery@psr1.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-rkl-6/igt@feature_discovery@psr1.html

  * igt@gem_exec_fair@basic-none@vecs0:
    - shard-glk:          [FAIL][9] ([i915#2842]) -> [PASS][10]
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-glk6/igt@gem_exec_fair@basic-none@vecs0.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-glk8/igt@gem_exec_fair@basic-none@vecs0.html

  * igt@gem_exec_fair@basic-pace@rcs0:
    - {shard-rkl}:        [FAIL][11] ([i915#2842]) -> [PASS][12] +2 similar issues
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-rkl-1/igt@gem_exec_fair@basic-pace@rcs0.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-rkl-5/igt@gem_exec_fair@basic-pace@rcs0.html

  * igt@gem_exec_reloc@basic-gtt-read:
    - {shard-rkl}:        [SKIP][13] ([i915#3281]) -> [PASS][14] +8 similar issues
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-rkl-4/igt@gem_exec_reloc@basic-gtt-read.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-rkl-5/igt@gem_exec_reloc@basic-gtt-read.html

  * igt@gem_pwrite_snooped:
    - {shard-rkl}:        [SKIP][15] ([i915#3282]) -> [PASS][16] +5 similar issues
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-rkl-1/igt@gem_pwrite_snooped.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-rkl-5/igt@gem_pwrite_snooped.html

  * igt@gen9_exec_parse@cmd-crossing-page:
    - {shard-rkl}:        [SKIP][17] ([i915#2527]) -> [PASS][18]
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-rkl-1/igt@gen9_exec_parse@cmd-crossing-page.html
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-rkl-5/igt@gen9_exec_parse@cmd-crossing-page.html

  * igt@i915_pm_rpm@system-suspend-modeset:
    - {shard-rkl}:        [SKIP][19] ([fdo#109308]) -> [PASS][20]
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-rkl-1/igt@i915_pm_rpm@system-suspend-modeset.html
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-rkl-6/igt@i915_pm_rpm@system-suspend-modeset.html

  * igt@kms_atomic@atomic_plane_damage:
    - {shard-rkl}:        [SKIP][21] ([i915#4098]) -> [PASS][22]
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-rkl-1/igt@kms_atomic@atomic_plane_damage.html
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-rkl-6/igt@kms_atomic@atomic_plane_damage.html

  * igt@kms_big_fb@x-tiled-16bpp-rotate-0:
    - {shard-tglu}:       [SKIP][23] ([i915#1845] / [i915#7651]) -> [PASS][24] +5 similar issues
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-tglu-6/igt@kms_big_fb@x-tiled-16bpp-rotate-0.html
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-tglu-7/igt@kms_big_fb@x-tiled-16bpp-rotate-0.html

  * igt@kms_frontbuffer_tracking@fbc-1p-primscrn-pri-indfb-draw-mmap-gtt:
    - {shard-tglu}:       [SKIP][25] ([i915#1849]) -> [PASS][26] +1 similar issue
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-tglu-6/igt@kms_frontbuffer_tracking@fbc-1p-primscrn-pri-indfb-draw-mmap-gtt.html
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-tglu-7/igt@kms_frontbuffer_tracking@fbc-1p-primscrn-pri-indfb-draw-mmap-gtt.html

  * igt@kms_frontbuffer_tracking@psr-1p-primscrn-pri-shrfb-draw-mmap-gtt:
    - {shard-rkl}:        [SKIP][27] ([i915#1849] / [i915#4098]) -> [PASS][28] +6 similar issues
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-rkl-1/igt@kms_frontbuffer_tracking@psr-1p-primscrn-pri-shrfb-draw-mmap-gtt.html
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-rkl-6/igt@kms_frontbuffer_tracking@psr-1p-primscrn-pri-shrfb-draw-mmap-gtt.html

  * igt@kms_plane@plane-panning-bottom-right@pipe-a-planes:
    - {shard-tglu}:       [SKIP][29] ([i915#1849] / [i915#3558]) -> [PASS][30] +1 similar issue
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-tglu-6/igt@kms_plane@plane-panning-bottom-right@pipe-a-planes.html
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-tglu-7/igt@kms_plane@plane-panning-bottom-right@pipe-a-planes.html

  * igt@kms_psr@primary_mmap_gtt:
    - {shard-rkl}:        [SKIP][31] ([i915#1072]) -> [PASS][32] +2 similar issues
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-rkl-4/igt@kms_psr@primary_mmap_gtt.html
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-rkl-6/igt@kms_psr@primary_mmap_gtt.html

  * igt@kms_universal_plane@universal-plane-pageflip-windowed-pipe-c:
    - {shard-tglu}:       [SKIP][33] ([fdo#109274]) -> [PASS][34]
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-tglu-6/igt@kms_universal_plane@universal-plane-pageflip-windowed-pipe-c.html
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-tglu-7/igt@kms_universal_plane@universal-plane-pageflip-windowed-pipe-c.html

  * igt@kms_vblank@pipe-a-wait-forked-busy-hang:
    - {shard-rkl}:        [SKIP][35] ([i915#1845] / [i915#4098]) -> [PASS][36] +16 similar issues
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-rkl-3/igt@kms_vblank@pipe-a-wait-forked-busy-hang.html
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-rkl-6/igt@kms_vblank@pipe-a-wait-forked-busy-hang.html

  * igt@kms_vblank@pipe-c-ts-continuation-idle-hang:
    - {shard-tglu}:       [SKIP][37] ([i915#7651]) -> [PASS][38] +4 similar issues
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-tglu-6/igt@kms_vblank@pipe-c-ts-continuation-idle-hang.html
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-tglu-7/igt@kms_vblank@pipe-c-ts-continuation-idle-hang.html

  * igt@perf@gen12-unprivileged-single-ctx-counters:
    - {shard-rkl}:        [SKIP][39] ([fdo#109289]) -> [PASS][40]
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12608/shard-rkl-5/igt@perf@gen12-unprivileged-single-ctx-counters.html
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/shard-rkl-1/igt@perf@gen12-unprivileged-single-ctx-counters.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#103375]: https://bugs.freedesktop.org/show_bug.cgi?id=103375
  [fdo#109274]: https://bugs.freedesktop.org/show_bug.cgi?id=109274
  [fdo#109280]: https://bugs.freedesktop.org/show_bug.cgi?id=109280
  [fdo#109283]: https://bugs.freedesktop.org/show_bug.cgi?id=109283
  [fdo#109285]: https://bugs.freedesktop.org/show_bug.cgi?id=109285
  [fdo#109289]: https://bugs.freedesktop.org/show_bug.cgi?id=109289
  [fdo#109291]: https://bugs.freedesktop.org/show_bug.cgi?id=109291
  [fdo#109295]: https://bugs.freedesktop.org/show_bug.cgi?id=109295
  [fdo#109303]: https://bugs.freedesktop.org/show_bug.cgi?id=109303
  [fdo#109307]: https://bugs.freedesktop.org/show_bug.cgi?id=109307
  [fdo#109308]: https://bugs.freedesktop.org/show_bug.cgi?id=109308
  [fdo#109309]: https://bugs.freedesktop.org/show_bug.cgi?id=109309
  [fdo#109312]: https://bugs.freedesktop.org/show_bug.cgi?id=109312
  [fdo#109314]: https://bugs.freedesktop.org/show_bug.cgi?id=109314
  [fdo#109315]: https://bugs.freedesktop.org/show_bug.cgi?id=109315
  [fdo#109506]: https://bugs.freedesktop.org/show_bug.cgi?id=109506
  [fdo#110189]: https://bugs.freedesktop.org/show_bug.cgi?id=110189
  [fdo#110542]: https://bugs.freedesktop.org/show_bug.cgi?id=110542
  [fdo#110723]: https://bugs.freedesktop.org/show_bug.cgi?id=110723
  [fdo#111068]: https://bugs.freedesktop.org/show_bug.cgi?id=111068
  [fdo#111614]: https://bugs.freedesktop.org/show_bug.cgi?id=111614
  [fdo#111615]: https://bugs.freedesktop.org/show_bug.cgi?id=111615
  [fdo#111644]: https://bugs.freedesktop.org/show_bug.cgi?id=111644
  [fdo#111656]: https://bugs.freedesktop.org/show_bug.cgi?id=111656
  [fdo#111825]: https://bugs.freedesktop.org/show_bug.cgi?id=111825
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [fdo#112054]: https://bugs.freedesktop.org/show_bug.cgi?id=112054
  [fdo#112283]: https://bugs.freedesktop.org/show_bug.cgi?id=112283
  [i915#1072]: https://gitlab.freedesktop.org/drm/intel/issues/1072
  [i915#132]: https://gitlab.freedesktop.org/drm/intel/issues/132
  [i915#1397]: https://gitlab.freedesktop.org/drm/intel/issues/1397
  [i915#1755]: https://gitlab.freedesktop.org/drm/intel/issues/1755
  [i915#1769]: https://gitlab.freedesktop.org/drm/intel/issues/1769
  [i915#1825]: https://gitlab.freedesktop.org/drm/intel/issues/1825
  [i915#1839]: https://gitlab.freedesktop.org/drm/intel/issues/1839
  [i915#1845]: https://gitlab.freedesktop.org/drm/intel/issues/1845
  [i915#1849]: https://gitlab.freedesktop.org/drm/intel/issues/1849
  [i915#2433]: https://gitlab.freedesktop.org/drm/intel/issues/2433
  [i915#2434]: https://gitlab.freedesktop.org/drm/intel/issues/2434
  [i915#2437]: https://gitlab.freedesktop.org/drm/intel/issues/2437
  [i915#2527]: https://gitlab.freedesktop.org/drm/intel/issues/2527
  [i915#2532]: https://gitlab.freedesktop.org/drm/intel/issues/2532
  [i915#2575]: https://gitlab.freedesktop.org/drm/intel/issues/2575
  [i915#2582]: https://gitlab.freedesktop.org/drm/intel/issues/2582
  [i915#2587]: https://gitlab.freedesktop.org/drm/intel/issues/2587
  [i915#2658]: https://gitlab.freedesktop.org/drm/intel/issues/2658
  [i915#2672]: https://gitlab.freedesktop.org/drm/intel/issues/2672
  [i915#2681]: https://gitlab.freedesktop.org/drm/intel/issues/2681
  [i915#2705]: https://gitlab.freedesktop.org/drm/intel/issues/2705
  [i915#280]: https://gitlab.freedesktop.org/drm/intel/issues/280
  [i915#2842]: https://gitlab.freedesktop.org/drm/intel/issues/2842
  [i915#2856]: https://gitlab.freedesktop.org/drm/intel/issues/2856
  [i915#2920]: https://gitlab.freedesktop.org/drm/intel/issues/2920
  [i915#2994]: https://gitlab.freedesktop.org/drm/intel/issues/2994
  [i915#3063]: https://gitlab.freedesktop.org/drm/intel/issues/3063
  [i915#3116]: https://gitlab.freedesktop.org/drm/intel/issues/3116
  [i915#315]: https://gitlab.freedesktop.org/drm/intel/issues/315
  [i915#3281]: https://gitlab.freedesktop.org/drm/intel/issues/3281
  [i915#3282]: https://gitlab.freedesktop.org/drm/intel/issues/3282
  [i915#3291]: https://gitlab.freedesktop.org/drm/intel/issues/3291
  [i915#3297]: https://gitlab.freedesktop.org/drm/intel/issues/3297
  [i915#3299]: https://gitlab.freedesktop.org/drm/intel/issues/3299
  [i915#3318]: https://gitlab.freedesktop.org/drm/intel/issues/3318
  [i915#3323]: https://gitlab.freedesktop.org/drm/intel/issues/3323
  [i915#3359]: https://gitlab.freedesktop.org/drm/intel/issues/3359
  [i915#3361]: https://gitlab.freedesktop.org/drm/intel/issues/3361
  [i915#3458]: https://gitlab.freedesktop.org/drm/intel/issues/3458
  [i915#3469]: https://gitlab.freedesktop.org/drm/intel/issues/3469
  [i915#3528]: https://gitlab.freedesktop.org/drm/intel/issues/3528
  [i915#3536]: https://gitlab.freedesktop.org/drm/intel/issues/3536
  [i915#3539]: https://gitlab.freedesktop.org/drm/intel/issues/3539
  [i915#3546]: https://gitlab.freedesktop.org/drm/intel/issues/3546
  [i915#3547]: https://gitlab.freedesktop.org/drm/intel/issues/3547
  [i915#3555]: https://gitlab.freedesktop.org/drm/intel/issues/3555
  [i915#3558]: https://gitlab.freedesktop.org/drm/intel/issues/3558
  [i915#3591]: https://gitlab.freedesktop.org/drm/intel/issues/3591
  [i915#3637]: https://gitlab.freedesktop.org/drm/intel/issues/3637
  [i915#3638]: https://gitlab.freedesktop.org/drm/intel/issues/3638
  [i915#3689]: https://gitlab.freedesktop.org/drm/intel/issues/3689
  [i915#3708]: https://gitlab.freedesktop.org/drm/intel/issues/3708
  [i915#3734]: https://gitlab.freedesktop.org/drm/intel/issues/3734
  [i915#3742]: https://gitlab.freedesktop.org/drm/intel/issues/3742
  [i915#3804]: https://gitlab.freedesktop.org/drm/intel/issues/3804
  [i915#3826]: https://gitlab.freedesktop.org/drm/intel/issues/3826
  [i915#3840]: https://gitlab.freedesktop.org/drm/intel/issues/3840
  [i915#3886]: https://gitlab.freedesktop.org/drm/intel/issues/3886
  [i915#3952]: https://gitlab.freedesktop.org/drm/intel/issues/3952
  [i915#3989]: https://gitlab.freedesktop.org/drm/intel/issues/3989
  [i915#4070]: https://gitlab.freedesktop.org/drm/intel/issues/4070
  [i915#4077]: https://gitlab.freedesktop.org/drm/intel/issues/4077
  [i915#4078]: https://gitlab.freedesktop.org/drm/intel/issues/4078
  [i915#4083]: https://gitlab.freedesktop.org/drm/intel/issues/4083
  [i915#4098]: https://gitlab.freedesktop.org/drm/intel/issues/4098
  [i915#4103]: https://gitlab.freedesktop.org/drm/intel/issues/4103
  [i915#4212]: https://gitlab.freedesktop.org/drm/intel/issues/4212
  [i915#4213]: https://gitlab.freedesktop.org/drm/intel/issues/4213
  [i915#4270]: https://gitlab.freedesktop.org/drm/intel/issues/4270
  [i915#4349]: https://gitlab.freedesktop.org/drm/intel/issues/4349
  [i915#4538]: https://gitlab.freedesktop.org/drm/intel/issues/4538
  [i915#454]: https://gitlab.freedesktop.org/drm/intel/issues/454
  [i915#4613]: https://gitlab.freedesktop.org/drm/intel/issues/4613
  [i915#4767]: https://gitlab.freedesktop.org/drm/intel/issues/4767
  [i915#4771]: https://gitlab.freedesktop.org/drm/intel/issues/4771
  [i915#4818]: https://gitlab.freedesktop.org/drm/intel/issues/4818
  [i915#4833]: https://gitlab.freedesktop.org/drm/intel/issues/4833
  [i915#4852]: https://gitlab.freedesktop.org/drm/intel/issues/4852
  [i915#4859]: https://gitlab.freedesktop.org/drm/intel/issues/4859
  [i915#4860]: https://gitlab.freedesktop.org/drm/intel/issues/4860
  [i915#4880]: https://gitlab.freedesktop.org/drm/intel/issues/4880
  [i915#5176]: https://gitlab.freedesktop.org/drm/intel/issues/5176
  [i915#5235]: https://gitlab.freedesktop.org/drm/intel/issues/5235
  [i915#5286]: https://gitlab.freedesktop.org/drm/intel/issues/5286
  [i915#5288]: https://gitlab.freedesktop.org/drm/intel/issues/5288
  [i915#5289]: https://gitlab.freedesktop.org/drm/intel/issues/5289
  [i915#5325]: https://gitlab.freedesktop.org/drm/intel/issues/5325
  [i915#5327]: https://gitlab.freedesktop.org/drm/intel/issues/5327
  [i915#533]: https://gitlab.freedesktop.org/drm/intel/issues/533
  [i915#5439]: https://gitlab.freedesktop.org/drm/intel/issues/5439
  [i915#5461]: https://gitlab.freedesktop.org/drm/intel/issues/5461
  [i915#5563]: https://gitlab.freedesktop.org/drm/intel/issues/5563
  [i915#5784]: https://gitlab.freedesktop.org/drm/intel/issues/5784
  [i915#6095]: https://gitlab.freedesktop.org/drm/intel/issues/6095
  [i915#6117]: https://gitlab.freedesktop.org/drm/intel/issues/6117
  [i915#6227]: https://gitlab.freedesktop.org/drm/intel/issues/6227
  [i915#6230]: https://gitlab.freedesktop.org/drm/intel/issues/6230
  [i915#6245]: https://gitlab.freedesktop.org/drm/intel/issues/6245
  [i915#6247]: https://gitlab.freedesktop.org/drm/intel/issues/6247
  [i915#6248]: https://gitlab.freedesktop.org/drm/intel/issues/6248
  [i915#6268]: https://gitlab.freedesktop.org/drm/intel/issues/6268
  [i915#6301]: https://gitlab.freedesktop.org/drm/intel/issues/6301
  [i915#6335]: https://gitlab.freedesktop.org/drm/intel/issues/6335
  [i915#6344]: https://gitlab.freedesktop.org/drm/intel/issues/6344
  [i915#6355]: https://gitlab.freedesktop.org/drm/intel/issues/6355
  [i915#6412]: https://gitlab.freedesktop.org/drm/intel/issues/6412
  [i915#6433]: https://gitlab.freedesktop.org/drm/intel/issues/6433
  [i915#6497]: https://gitlab.freedesktop.org/drm/intel/issues/6497
  [i915#6524]: https://gitlab.freedesktop.org/drm/intel/issues/6524
  [i915#658]: https://gitlab.freedesktop.org/drm/intel/issues/658
  [i915#6590]: https://gitlab.freedesktop.org/drm/intel/issues/6590
  [i915#6768]: https://gitlab.freedesktop.org/drm/intel/issues/6768
  [i915#6944]: https://gitlab.freedesktop.org/drm/intel/issues/6944
  [i915#6953]: https://gitlab.freedesktop.org/drm/intel/issues/6953
  [i915#7116]: https://gitlab.freedesktop.org/drm/intel/issues/7116
  [i915#7118]: https://gitlab.freedesktop.org/drm/intel/issues/7118
  [i915#7128]: https://gitlab.freedesktop.org/drm/intel/issues/7128
  [i915#7294]: https://gitlab.freedesktop.org/drm/intel/issues/7294
  [i915#7456]: https://gitlab.freedesktop.org/drm/intel/issues/7456
  [i915#7561]: https://gitlab.freedesktop.org/drm/intel/issues/7561
  [i915#7582]: https://gitlab.freedesktop.org/drm/intel/issues/7582
  [i915#7651]: https://gitlab.freedesktop.org/drm/intel/issues/7651
  [i915#7697]: https://gitlab.freedesktop.org/drm/intel/issues/7697
  [i915#7701]: https://gitlab.freedesktop.org/drm/intel/issues/7701
  [i915#7707]: https://gitlab.freedesktop.org/drm/intel/issues/7707
  [i915#7711]: https://gitlab.freedesktop.org/drm/intel/issues/7711
  [i915#7742]: https://gitlab.freedesktop.org/drm/intel/issues/7742
  [i915#7828]: https://gitlab.freedesktop.org/drm/intel/issues/7828


Build changes
-------------

  * Linux: CI_DRM_12608 -> Patchwork_113078v1
  * Piglit: piglit_4509 -> None

  CI-20190529: 20190529
  CI_DRM_12608: 9bc304ac2c7586f2679ad346d4627db64328dc9b @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_7124: 6090d0a62e9c087429320c6fd3b6735ff68fde71 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_113078v1: 9bc304ac2c7586f2679ad346d4627db64328dc9b @ git://anongit.freedesktop.org/gfx-ci/linux
  piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_113078v1/index.html


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH v3 3/6] drm/i915: Allow error capture without a request
  2023-01-20  1:54   ` Ceraolo Spurio, Daniele
@ 2023-01-20 17:36     ` John Harrison
  2023-01-20 17:55       ` John Harrison
  0 siblings, 1 reply; 30+ messages in thread
From: John Harrison @ 2023-01-20 17:36 UTC (permalink / raw)
  To: Ceraolo Spurio, Daniele, Intel-GFX; +Cc: DRI-Devel

On 1/19/2023 17:54, Ceraolo Spurio, Daniele wrote:
> On 1/18/2023 10:49 PM, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> There was a report of error captures occurring without any hung
>> context being indicated despite the capture being initiated by a 'hung
>> context notification' from GuC. The problem was not reproducible.
>> However, it is possible for this to happen if the context in question has no
>> active requests. For example, if the hang was in the context switch
>> itself then the breadcrumb write would have occurred and the KMD would
>> see an idle context.
>>
>> In the interests of attempting to provide as much information as
>> possible about a hang, it seems wise to include the engine info
>> regardless of whether a request was found or not, as opposed to just
>> pretending there was no hang at all.
>>
>> So update the error capture code to always record engine information
>> if a context is given. Which means updating record_context() to take a
>> context instead of a request (which it only ever used to find the
>> context anyway). And split the request agnostic parts of
>> intel_engine_coredump_add_request() out into a separate function.
>>
>> v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null
>> pointer.
>> v3: Tidy up request locking code flow (Tvrtko)
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>> Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> ---
>>   drivers/gpu/drm/i915/i915_gpu_error.c | 70 ++++++++++++++++++---------
>>   1 file changed, 48 insertions(+), 22 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>> index 78cf95e4dd230..743614fff5472 100644
>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>> @@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct 
>> intel_engine_coredump *ee)
>>   }
>>     static bool record_context(struct i915_gem_context_coredump *e,
>> -               const struct i915_request *rq)
>> +               struct intel_context *ce)
>>   {
>>       struct i915_gem_context *ctx;
>>       struct task_struct *task;
>>       bool simulated;
>>         rcu_read_lock();
>> -    ctx = rcu_dereference(rq->context->gem_context);
>> +    ctx = rcu_dereference(ce->gem_context);
>>       if (ctx && !kref_get_unless_zero(&ctx->ref))
>>           ctx = NULL;
>>       rcu_read_unlock();
>> @@ -1396,8 +1396,8 @@ static bool record_context(struct 
>> i915_gem_context_coredump *e,
>>       e->guilty = atomic_read(&ctx->guilty_count);
>>       e->active = atomic_read(&ctx->active_count);
>>   -    e->total_runtime = 
>> intel_context_get_total_runtime_ns(rq->context);
>> -    e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context);
>> +    e->total_runtime = intel_context_get_total_runtime_ns(ce);
>> +    e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
>>         simulated = i915_gem_context_no_error_capture(ctx);
>>   @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct 
>> intel_engine_cs *engine, gfp_t gfp, u32 dump_
>>       return ee;
>>   }
>>   +static struct intel_engine_capture_vma *
>> +engine_coredump_add_context(struct intel_engine_coredump *ee,
>> +                struct intel_context *ce,
>> +                gfp_t gfp)
>> +{
>> +    struct intel_engine_capture_vma *vma = NULL;
>> +
>> +    ee->simulated |= record_context(&ee->context, ce);
>> +    if (ee->simulated)
>> +        return NULL;
>> +
>> +    /*
>> +     * We need to copy these to an anonymous buffer
>> +     * as the simplest method to avoid being overwritten
>> +     * by userspace.
>> +     */
>> +    vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
>> +    vma = capture_vma(vma, ce->state, "HW context", gfp);
>> +
>> +    return vma;
>> +}
>> +
>>   struct intel_engine_capture_vma *
>>   intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
>>                     struct i915_request *rq,
>>                     gfp_t gfp)
>>   {
>> -    struct intel_engine_capture_vma *vma = NULL;
>> +    struct intel_engine_capture_vma *vma;
>>   -    ee->simulated |= record_context(&ee->context, rq);
>> -    if (ee->simulated)
>> +    vma = engine_coredump_add_context(ee, rq->context, gfp);
>> +    if (!vma)
>>           return NULL;
>>         /*
>> @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct 
>> intel_engine_coredump *ee,
>>        */
>>       vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch");
>>       vma = capture_user(vma, rq, gfp);
>> -    vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
>> -    vma = capture_vma(vma, rq->context->state, "HW context", gfp);
>>         ee->rq_head = rq->head;
>>       ee->rq_post = rq->postfix;
>> @@ -1604,25 +1624,31 @@ capture_engine(struct intel_engine_cs *engine,
>>           return NULL;
>>         intel_get_hung_entity(engine, &ce, &rq);
>> -    if (!rq || !i915_request_started(rq))
>> -        goto no_request_capture;
>> +    if (rq && !i915_request_started(rq)) {
>> +        drm_info(&engine->gt->i915->drm, "Got hung context on %s 
>> with no active request!\n",
>> +             engine->name);
>
> Shouldn't this print be inside the "else if" case below? otherwise if 
> we don't have a rq at all we won't see it.
The intention was that the message is specifically for the case where a 
request exists but is being ignored. It is obvious when looking at the 
capture if no request was found - there just isn't one in there. But 
there is no way to distinguish that case from the situation where a 
request was found but just was not considered 'active'. Hence wanting to 
add a dmesg log instead.

Could pull the better message in from the next patch to this one to make 
it more clear - "Got hung context on %s with active request %lld:%lld 
[0x%04X] not yet started".
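
As a rough sketch of how that would look when combined (the exact values 
printed - fence context/seqno and the GuC id - are illustrative guesses 
here, not the final wording):

	if (rq && !i915_request_started(rq)) {
		drm_info(&engine->gt->i915->drm,
			 "Got hung context on %s with active request %lld:%lld [0x%04X] not yet started\n",
			 engine->name, rq->fence.context, rq->fence.seqno,
			 ce->guc_id.id);
		i915_request_put(rq);
		rq = NULL;
	}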

>
>> +        i915_request_put(rq);
>> +        rq = NULL;
>> +    }
>> +
>> +    if (rq) {
>> +        capture = intel_engine_coredump_add_request(ee, rq, 
>> ATOMIC_MAYFAIL);
>> +        i915_request_put(rq);
>> +    } else if (ce) {
>> +        capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
>> +    }
>>   -    capture = intel_engine_coredump_add_request(ee, rq, 
>> ATOMIC_MAYFAIL);
>> -    if (!capture)
>> -        goto no_request_capture;
>>       if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>>           intel_guc_capture_get_matching_node(engine->gt, ee, ce);
>
> Are you keeping this outside the "if (capture)" below to make sure we 
> consume the GuC engine capture even if we fail to produce the error 
> state? if so, a comment might be useful.
Yeah, hadn't really noticed that the original version was effectively 
leaking the GuC capture in the case of a failure. The buffer would 
overflow and discard the oldest entries eventually. But it seems better 
to deliberately free the buffer if the related capture is done whether 
the capture worked or not.

Can add a comment to the patch description about it.

John.

>
> Daniele
>
>>   -    intel_engine_coredump_add_vma(ee, capture, compress);
>> -    i915_request_put(rq);
>> +    if (capture) {
>> +        intel_engine_coredump_add_vma(ee, capture, compress);
>> +    } else {
>> +        kfree(ee);
>> +        ee = NULL;
>> +    }
>>         return ee;
>> -
>> -no_request_capture:
>> -    if (rq)
>> -        i915_request_put(rq);
>> -    kfree(ee);
>> -    return NULL;
>>   }
>>     static void
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [Intel-gfx] [PATCH v3 3/6] drm/i915: Allow error capture without a request
  2023-01-20 17:36     ` John Harrison
@ 2023-01-20 17:55       ` John Harrison
  0 siblings, 0 replies; 30+ messages in thread
From: John Harrison @ 2023-01-20 17:55 UTC (permalink / raw)
  To: Ceraolo Spurio, Daniele, Intel-GFX; +Cc: DRI-Devel

On 1/20/2023 09:36, John Harrison wrote:
> On 1/19/2023 17:54, Ceraolo Spurio, Daniele wrote:
>> On 1/18/2023 10:49 PM, John.C.Harrison@Intel.com wrote:
>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>
>>> There was a report of error captures occurring without any hung
>>> context being indicated despite the capture being initiated by a 'hung
>>> context notification' from GuC. The problem was not reproducible.
>>> However, it is possible for this to happen if the context in question has no
>>> active requests. For example, if the hang was in the context switch
>>> itself then the breadcrumb write would have occurred and the KMD would
>>> see an idle context.
>>>
>>> In the interests of attempting to provide as much information as
>>> possible about a hang, it seems wise to include the engine info
>>> regardless of whether a request was found or not, as opposed to just
>>> pretending there was no hang at all.
>>>
>>> So update the error capture code to always record engine information
>>> if a context is given. Which means updating record_context() to take a
>>> context instead of a request (which it only ever used to find the
>>> context anyway). And split the request agnostic parts of
>>> intel_engine_coredump_add_request() out into a separate function.
>>>
>>> v2: Remove a duplicate 'if' statement (Umesh) and fix a put of a null
>>> pointer.
>>> v3: Tidy up request locking code flow (Tvrtko)
>>>
>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>>> Acked-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/i915_gpu_error.c | 70 
>>> ++++++++++++++++++---------
>>>   1 file changed, 48 insertions(+), 22 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
>>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>>> index 78cf95e4dd230..743614fff5472 100644
>>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>>> @@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct 
>>> intel_engine_coredump *ee)
>>>   }
>>>     static bool record_context(struct i915_gem_context_coredump *e,
>>> -               const struct i915_request *rq)
>>> +               struct intel_context *ce)
>>>   {
>>>       struct i915_gem_context *ctx;
>>>       struct task_struct *task;
>>>       bool simulated;
>>>         rcu_read_lock();
>>> -    ctx = rcu_dereference(rq->context->gem_context);
>>> +    ctx = rcu_dereference(ce->gem_context);
>>>       if (ctx && !kref_get_unless_zero(&ctx->ref))
>>>           ctx = NULL;
>>>       rcu_read_unlock();
>>> @@ -1396,8 +1396,8 @@ static bool record_context(struct 
>>> i915_gem_context_coredump *e,
>>>       e->guilty = atomic_read(&ctx->guilty_count);
>>>       e->active = atomic_read(&ctx->active_count);
>>>   -    e->total_runtime = 
>>> intel_context_get_total_runtime_ns(rq->context);
>>> -    e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context);
>>> +    e->total_runtime = intel_context_get_total_runtime_ns(ce);
>>> +    e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
>>>         simulated = i915_gem_context_no_error_capture(ctx);
>>>   @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct 
>>> intel_engine_cs *engine, gfp_t gfp, u32 dump_
>>>       return ee;
>>>   }
>>>   +static struct intel_engine_capture_vma *
>>> +engine_coredump_add_context(struct intel_engine_coredump *ee,
>>> +                struct intel_context *ce,
>>> +                gfp_t gfp)
>>> +{
>>> +    struct intel_engine_capture_vma *vma = NULL;
>>> +
>>> +    ee->simulated |= record_context(&ee->context, ce);
>>> +    if (ee->simulated)
>>> +        return NULL;
>>> +
>>> +    /*
>>> +     * We need to copy these to an anonymous buffer
>>> +     * as the simplest method to avoid being overwritten
>>> +     * by userspace.
>>> +     */
>>> +    vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
>>> +    vma = capture_vma(vma, ce->state, "HW context", gfp);
>>> +
>>> +    return vma;
>>> +}
>>> +
>>>   struct intel_engine_capture_vma *
>>>   intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
>>>                     struct i915_request *rq,
>>>                     gfp_t gfp)
>>>   {
>>> -    struct intel_engine_capture_vma *vma = NULL;
>>> +    struct intel_engine_capture_vma *vma;
>>>   -    ee->simulated |= record_context(&ee->context, rq);
>>> -    if (ee->simulated)
>>> +    vma = engine_coredump_add_context(ee, rq->context, gfp);
>>> +    if (!vma)
>>>           return NULL;
>>>         /*
>>> @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct 
>>> intel_engine_coredump *ee,
>>>        */
>>>       vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch");
>>>       vma = capture_user(vma, rq, gfp);
>>> -    vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
>>> -    vma = capture_vma(vma, rq->context->state, "HW context", gfp);
>>>         ee->rq_head = rq->head;
>>>       ee->rq_post = rq->postfix;
>>> @@ -1604,25 +1624,31 @@ capture_engine(struct intel_engine_cs *engine,
>>>           return NULL;
>>>         intel_get_hung_entity(engine, &ce, &rq);
>>> -    if (!rq || !i915_request_started(rq))
>>> -        goto no_request_capture;
>>> +    if (rq && !i915_request_started(rq)) {
>>> +        drm_info(&engine->gt->i915->drm, "Got hung context on %s 
>>> with no active request!\n",
>>> +             engine->name);
>>
>> Shouldn't this print be inside the "else if" case below? otherwise if 
>> we don't have a rq at all we won't see it.
> The intention was that the message is specifically for the case where 
> a request exists but is being ignored. It is obvious when looking at 
> the capture if no request was found - there just isn't one in there. 
> But there is no way to distinguish that case from the situation where 
> a request was found but just was not considered 'active'. Hence 
> wanting to add a dmesg log instead.
>
> Could pull the better message in from the next patch to this one to 
> make it more clear - "Got hung context on %s with active request 
> %lld:%lld [0x%04X] not yet started".
>
>>
>>> +        i915_request_put(rq);
>>> +        rq = NULL;
>>> +    }
>>> +
>>> +    if (rq) {
>>> +        capture = intel_engine_coredump_add_request(ee, rq, 
>>> ATOMIC_MAYFAIL);
>>> +        i915_request_put(rq);
>>> +    } else if (ce) {
>>> +        capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
>>> +    }
>>>   -    capture = intel_engine_coredump_add_request(ee, rq, 
>>> ATOMIC_MAYFAIL);
>>> -    if (!capture)
>>> -        goto no_request_capture;
>>>       if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>>>           intel_guc_capture_get_matching_node(engine->gt, ee, ce);
>>
>> Are you keeping this outside the "if (capture)" below to make sure we 
>> consume the GuC engine capture even if we fail to produce the error 
>> state? if so, a comment might be useful.
> Yeah, hadn't really noticed that the original version was effectively 
> leaking the GuC capture in the case of a failure. The buffer would 
> overflow and discard the oldest entries eventually. But it seems 
> better to deliberately free the buffer if the related capture is done 
> whether the capture worked or not.
>
> Can add a comment to the patch description about it.
Hmm. Actually, this will now genuinely leak the capture state if the 
vma capture fails, I think. It looks like ownership of the state object 
gets transferred to the ee, but then the ee gets freed below without 
freeing up any owned resources. So we need to either not pull out the 
register state if !capture, or explicitly free it on failure.
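
Purely as a sketch of the second option (explicitly freeing on failure), 
and assuming a helper along the lines of intel_guc_capture_free_node() is 
available to release a node that has already been detached from the GuC 
log buffer (that helper name is an assumption, not something this series 
adds), the tail of capture_engine() could end up looking roughly like:

	if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
		intel_guc_capture_get_matching_node(engine->gt, ee, ce);

	if (capture) {
		intel_engine_coredump_add_vma(ee, capture, compress);
	} else {
		/* Do not leak the node that was just attached to ee */
		if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
			intel_guc_capture_free_node(ee);	/* assumed helper */
		kfree(ee);
		ee = NULL;
	}

	return ee;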

John.


>
> John.
>
>>
>> Daniele
>>
>>>   -    intel_engine_coredump_add_vma(ee, capture, compress);
>>> -    i915_request_put(rq);
>>> +    if (capture) {
>>> +        intel_engine_coredump_add_vma(ee, capture, compress);
>>> +    } else {
>>> +        kfree(ee);
>>> +        ee = NULL;
>>> +    }
>>>         return ee;
>>> -
>>> -no_request_capture:
>>> -    if (rq)
>>> -        i915_request_put(rq);
>>> -    kfree(ee);
>>> -    return NULL;
>>>   }
>>>     static void
>>
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v3 1/6] drm/i915: Fix request locking during error capture & debugfs dump
  2023-01-19 15:16     ` [Intel-gfx] " Andy Shevchenko
@ 2023-01-20 23:06       ` John Harrison
  -1 siblings, 0 replies; 30+ messages in thread
From: John Harrison @ 2023-01-20 23:06 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Tvrtko Ursulin, Intel-GFX, Daniele Ceraolo Spurio, DRI-Devel

On 1/19/2023 07:16, Andy Shevchenko wrote:
> On Wed, Jan 18, 2023 at 10:49:55PM -0800, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> When GuC support was added to error capture, the locking around the
>> request object was broken. Fix it up.
>>
>> The context based search manages the spinlocking around the search
>> internally. So it needs to grab the reference count internally as
>> well. The execlist only request based search relies on external
>> locking, so it needs an external reference count but within the
>> spinlock not outside it.
>>
>> The only other caller of the context based search is the code for
>> dumping engine state to debugfs. That code wasn't previously getting
>> an explicit reference at all as it does everything while holding the
>> execlist specific spinlock. So, that needs updating as well, as that
>> spinlock doesn't help when using GuC submission. Rather than trying to
>> conditionally get/put depending on submission model, just change it to
>> always do the get/put.
>>
>> In addition, intel_guc_find_hung_context() was not acquiring the
>> correct spinlock before searching the request list. So fix that up
>> too. While at it, add some extra whitespace padding for readability.
> ...
>
>> +		found = false;
>> +		spin_lock(&ce->guc_state.lock);
>>   		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
>>   			if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
>>   				continue;
>>   
>> +			found = true;
>> +			break;
>> +		}
> This can be combined to (see also below)
>
> 		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
> 			if (i915_test_request_state(rq) == I915_REQUEST_ACTIVE)
> 				break;
> 		}
>
>> +		spin_unlock(&ce->guc_state.lock);
> Instead of 'found' you can check the current entry pointer
>
> 		if (!list_entry_is_head(...))
>
> And because requests can only be messed up with the guc_state itself, I think
> you don't need to perform the above check under spinlock, so it's safe.
I'm not following the argument as to why it is safe to test a guc_state 
owned list outside of holding the guc_state spinlock.

I also think that having an explicit 'found' flag makes the code more 
readable and immediately obvious as to what is going on. For the sake of 
one bool (which the compiler would optimise out anyway), I don't think 
it is worth the obfuscation of behaviour and the risk of "I think this 
will work".

John.


>
>> +		if (found) {
>>   			intel_engine_set_hung_context(engine, ce);


^ permalink raw reply	[flat|nested] 30+ messages in thread


* Re: [PATCH v3 1/6] drm/i915: Fix request locking during error capture & debugfs dump
  2023-01-20 23:06       ` [Intel-gfx] " John Harrison
@ 2023-01-21 11:50         ` Andy Shevchenko
  -1 siblings, 0 replies; 30+ messages in thread
From: Andy Shevchenko @ 2023-01-21 11:50 UTC (permalink / raw)
  To: John Harrison
  Cc: Tvrtko Ursulin, Intel-GFX, Daniele Ceraolo Spurio, DRI-Devel

On Fri, Jan 20, 2023 at 03:06:02PM -0800, John Harrison wrote:
> On 1/19/2023 07:16, Andy Shevchenko wrote:
> > On Wed, Jan 18, 2023 at 10:49:55PM -0800, John.C.Harrison@Intel.com wrote:

...

> > > +		found = false;
> > > +		spin_lock(&ce->guc_state.lock);
> > >   		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
> > >   			if (i915_test_request_state(rq) != I915_REQUEST_ACTIVE)
> > >   				continue;
> > > +			found = true;
> > > +			break;
> > > +		}
> > This can be combined to (see also below)
> > 
> > 		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
> > 			if (i915_test_request_state(rq) == I915_REQUEST_ACTIVE)
> > 				break;
> > 		}
> > 
> > > +		spin_unlock(&ce->guc_state.lock);
> > Instead of 'found' you can check the current entry pointer
> > 
> > 		if (!list_entry_is_head(...))
> > 
> > And because requests can only be messed up with the guc_state itself, I think
> > you don't need to perform the above check under spinlock, so it's safe.
> I'm not following the argument as to why it is safe to test a guc_state
> owned list outside of holding the guc_state spinlock.

For the very same reason that 'found' is not checked inside the lock.
If something bad happens to the list head pointer, it would mean
that we have much bigger issues. And list_entry_is_head() exists specifically
to test the loop exit condition.
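
For reference, putting the suggestion together the hunk would then read 
roughly as below (just a sketch; whether testing the list outside the 
lock really is safe is exactly the point being debated):

		spin_lock(&ce->guc_state.lock);
		list_for_each_entry(rq, &ce->guc_state.requests, sched.link) {
			if (i915_test_request_state(rq) == I915_REQUEST_ACTIVE)
				break;
		}
		spin_unlock(&ce->guc_state.lock);

		if (!list_entry_is_head(rq, &ce->guc_state.requests, sched.link)) {
			intel_engine_set_hung_context(engine, ce);
			/* ... rest unchanged ... */
		}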

> I also think that having an explicit 'found' flag makes the code more
> readable and immediately obvious as to what is going on.

It depends on perception. With a boolean I have to go look somewhere else
to be sure that 'found' is still false when the loop runs to completion.
(Sometimes it may be an inverted loop, like

    found = true;
    for (...loop...) {
        if (...cond...) {
	   found = false;
	   break;
        }
    }

whereas with the helper it's obvious.)

> For the sake of one
> bool (which the compiler would optimise out anyway),

Is it really optimized away?

> I don't think it is worth the obfuscation of behaviour and the risk of "I
> think this will work".

Whatever, not a big deal :)

> > > +		if (found) {
> > >   			intel_engine_set_hung_context(engine, ce);

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply	[flat|nested] 30+ messages in thread


Thread overview: 30+ messages
2023-01-19  6:49 [PATCH v3 0/6] Allow error capture without a request & fix locking issues John.C.Harrison
2023-01-19  6:49 ` [Intel-gfx] " John.C.Harrison
2023-01-19  6:49 ` [PATCH v3 1/6] drm/i915: Fix request locking during error capture & debugfs dump John.C.Harrison
2023-01-19  6:49   ` [Intel-gfx] " John.C.Harrison
2023-01-19 15:16   ` Andy Shevchenko
2023-01-19 15:16     ` [Intel-gfx] " Andy Shevchenko
2023-01-20 23:06     ` John Harrison
2023-01-20 23:06       ` [Intel-gfx] " John Harrison
2023-01-21 11:50       ` Andy Shevchenko
2023-01-21 11:50         ` [Intel-gfx] " Andy Shevchenko
2023-01-20  1:12   ` Ceraolo Spurio, Daniele
2023-01-20  1:12     ` [Intel-gfx] " Ceraolo Spurio, Daniele
2023-01-19  6:49 ` [PATCH v3 2/6] drm/i915: Fix up locking around dumping requests lists John.C.Harrison
2023-01-19  6:49   ` [Intel-gfx] " John.C.Harrison
2023-01-20  1:41   ` Ceraolo Spurio, Daniele
2023-01-19  6:49 ` [PATCH v3 3/6] drm/i915: Allow error capture without a request John.C.Harrison
2023-01-19  6:49   ` [Intel-gfx] " John.C.Harrison
2023-01-20  1:54   ` Ceraolo Spurio, Daniele
2023-01-20 17:36     ` John Harrison
2023-01-20 17:55       ` John Harrison
2023-01-19  6:49 ` [PATCH v3 4/6] drm/i915: Allow error capture of a pending request John.C.Harrison
2023-01-19  6:49   ` [Intel-gfx] " John.C.Harrison
2023-01-19  6:49 ` [PATCH v3 5/6] drm/i915/guc: Look for a guilty context when an engine reset fails John.C.Harrison
2023-01-19  6:49   ` [Intel-gfx] " John.C.Harrison
2023-01-20  2:06   ` Ceraolo Spurio, Daniele
2023-01-20  2:06     ` [Intel-gfx] " Ceraolo Spurio, Daniele
2023-01-19  6:50 ` [PATCH v3 6/6] drm/i915/guc: Add a debug print on GuC triggered reset John.C.Harrison
2023-01-19  6:50   ` [Intel-gfx] " John.C.Harrison
2023-01-19  9:30 ` [Intel-gfx] ✓ Fi.CI.BAT: success for Allow error capture without a request & fix locking issues Patchwork
2023-01-20  6:29 ` [Intel-gfx] ✓ Fi.CI.IGT: " Patchwork
