dri-devel.lists.freedesktop.org archive mirror
* [PATCH 0/2] Allow error capture without a request / on reset failure
@ 2022-11-29 21:12 John.C.Harrison
  2022-11-29 21:12 ` [PATCH 1/2] drm/i915: Allow error capture without a request John.C.Harrison
  2022-11-29 21:12 ` [PATCH 2/2] drm/i915/guc: Look for a guilty context when an engine reset fails John.C.Harrison
  0 siblings, 2 replies; 9+ messages in thread
From: John.C.Harrison @ 2022-11-29 21:12 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

It is technically possible to get a hung context without a valid
request. In such a situation, try to provide as much information in
the error capture as possible rather than just aborting and capturing
nothing.

Similarly, in the case of an engine reset failure the GuC is not able
to report the guilty context. So try a manual search instead of
reporting nothing.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>


John Harrison (2):
  drm/i915: Allow error capture without a request
  drm/i915/guc: Look for a guilty context when an engine reset fails

 .../gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++-
 drivers/gpu/drm/i915/i915_gpu_error.c         | 55 ++++++++++++++-----
 2 files changed, 54 insertions(+), 16 deletions(-)

-- 
2.37.3



* [PATCH 1/2] drm/i915: Allow error capture without a request
  2022-11-29 21:12 [PATCH 0/2] Allow error capture without a request / on reset failure John.C.Harrison
@ 2022-11-29 21:12 ` John.C.Harrison
  2022-12-13  1:52   ` [Intel-gfx] " Umesh Nerlige Ramappa
  2022-11-29 21:12 ` [PATCH 2/2] drm/i915/guc: Look for a guilty context when an engine reset fails John.C.Harrison
  1 sibling, 1 reply; 9+ messages in thread
From: John.C.Harrison @ 2022-11-29 21:12 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

There was a report of error captures occurring without any hung
context being indicated despite the capture being initiated by a 'hung
context notification' from GuC. The problem was not reproducible.
However, it can happen if the context in question has no
active requests. For example, if the hang was in the context switch
itself then the breadcrumb write would have occurred and the KMD would
see an idle context.

In the interests of attempting to provide as much information as
possible about a hang, it seems wise to include the engine info
regardless of whether a request was found or not, as opposed to just
pretending there was no hang at all.

So update the error capture code to always record engine information
if an engine is given. This means updating record_context() to take a
context instead of a request (which it only ever used to find the
context anyway), and splitting the request-agnostic parts of
intel_engine_coredump_add_request() out into a separate function.
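
As a rough model of the reworked capture_engine() decision (stand-in types
and names only, not actual i915 code), the fallback order becomes: full
request capture when a started request is found, context-only capture when
the hung context has no active request, and no capture only when neither
is available:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for intel_context / i915_request; fields are simplified. */
struct request { bool started; };
struct context { struct request *active_rq; };

enum capture_kind { CAPTURE_NONE, CAPTURE_CONTEXT_ONLY, CAPTURE_FULL };

/*
 * Model of the reworked flow: a hung context whose "active" request has
 * not actually started is treated as having no request at all, but the
 * engine/context information is still captured rather than aborting.
 */
static enum capture_kind capture_engine_model(struct context *ce)
{
	struct request *rq = ce ? ce->active_rq : NULL;

	if (rq && !rq->started)
		rq = NULL; /* hung context with no active request */

	if (rq)
		return CAPTURE_FULL;         /* intel_engine_coredump_add_request() */
	if (ce)
		return CAPTURE_CONTEXT_ONLY; /* engine_coredump_add_context() */
	return CAPTURE_NONE;
}
```

In the real code the no-context path can still locate a request through the
scheduler before giving up; that branch is omitted from this sketch.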

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/i915_gpu_error.c | 55 +++++++++++++++++++--------
 1 file changed, 40 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index 9d5d5a397b64e..2ed1c84c9fab4 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct intel_engine_coredump *ee)
 }
 
 static bool record_context(struct i915_gem_context_coredump *e,
-			   const struct i915_request *rq)
+			   struct intel_context *ce)
 {
 	struct i915_gem_context *ctx;
 	struct task_struct *task;
 	bool simulated;
 
 	rcu_read_lock();
-	ctx = rcu_dereference(rq->context->gem_context);
+	ctx = rcu_dereference(ce->gem_context);
 	if (ctx && !kref_get_unless_zero(&ctx->ref))
 		ctx = NULL;
 	rcu_read_unlock();
@@ -1396,8 +1396,8 @@ static bool record_context(struct i915_gem_context_coredump *e,
 	e->guilty = atomic_read(&ctx->guilty_count);
 	e->active = atomic_read(&ctx->active_count);
 
-	e->total_runtime = intel_context_get_total_runtime_ns(rq->context);
-	e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context);
+	e->total_runtime = intel_context_get_total_runtime_ns(ce);
+	e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
 
 	simulated = i915_gem_context_no_error_capture(ctx);
 
@@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp, u32 dump_
 	return ee;
 }
 
+static struct intel_engine_capture_vma *
+engine_coredump_add_context(struct intel_engine_coredump *ee,
+			    struct intel_context *ce,
+			    gfp_t gfp)
+{
+	struct intel_engine_capture_vma *vma = NULL;
+
+	ee->simulated |= record_context(&ee->context, ce);
+	if (ee->simulated)
+		return NULL;
+
+	/*
+	 * We need to copy these to an anonymous buffer
+	 * as the simplest method to avoid being overwritten
+	 * by userspace.
+	 */
+	vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
+	vma = capture_vma(vma, ce->state, "HW context", gfp);
+
+	return vma;
+}
+
 struct intel_engine_capture_vma *
 intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
 				  struct i915_request *rq,
 				  gfp_t gfp)
 {
-	struct intel_engine_capture_vma *vma = NULL;
+	struct intel_engine_capture_vma *vma;
 
-	ee->simulated |= record_context(&ee->context, rq);
-	if (ee->simulated)
+	vma = engine_coredump_add_context(ee, rq->context, gfp);
+	if (!vma)
 		return NULL;
 
 	/*
@@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
 	 */
 	vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch");
 	vma = capture_user(vma, rq, gfp);
-	vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
-	vma = capture_vma(vma, rq->context->state, "HW context", gfp);
 
 	ee->rq_head = rq->head;
 	ee->rq_post = rq->postfix;
@@ -1608,8 +1628,11 @@ capture_engine(struct intel_engine_cs *engine,
 	if (ce) {
 		intel_engine_clear_hung_context(engine);
 		rq = intel_context_find_active_request(ce);
-		if (!rq || !i915_request_started(rq))
-			goto no_request_capture;
+		if (rq && !i915_request_started(rq)) {
+			drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n",
+				 engine->name);
+			rq = NULL;
+		}
 	} else {
 		/*
 		 * Getting here with GuC enabled means it is a forced error capture
@@ -1625,12 +1648,14 @@ capture_engine(struct intel_engine_cs *engine,
 	if (rq)
 		rq = i915_request_get_rcu(rq);
 
-	if (!rq)
-		goto no_request_capture;
+	if (rq)
+		capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
+	else if (ce)
+		capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
 
-	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
 	if (!capture) {
-		i915_request_put(rq);
+		if (rq)
+			i915_request_put(rq);
 		goto no_request_capture;
 	}
 	if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
-- 
2.37.3



* [PATCH 2/2] drm/i915/guc: Look for a guilty context when an engine reset fails
  2022-11-29 21:12 [PATCH 0/2] Allow error capture without a request / on reset failure John.C.Harrison
  2022-11-29 21:12 ` [PATCH 1/2] drm/i915: Allow error capture without a request John.C.Harrison
@ 2022-11-29 21:12 ` John.C.Harrison
  2022-11-30  8:30   ` [Intel-gfx] " Tvrtko Ursulin
  2022-12-13  2:00   ` Umesh Nerlige Ramappa
  1 sibling, 2 replies; 9+ messages in thread
From: John.C.Harrison @ 2022-11-29 21:12 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

Engine resets are supposed to never happen. But in the case when one
does (due to unknwon reasons that normally come down to a missing
w/a), it is useful to get as much information out of the system as
possible. Given that the GuC effectively dies in such a situation, it
is not possible to get a guilty context notification back. So do a
manual search instead. Given that GuC is dead, this is safe because
GuC won't be changing the engine state asynchronously.
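
The worker's flow after this patch can be modeled as below (a toy bitmask
walk standing in for for_each_engine_masked(); the counters just record
which helpers would be invoked, and the helper names are stand-ins):

```c
#include <assert.h>

#define MAX_ENGINES 8

static int hung_searches; /* per-engine guilty-context searches */
static int gt_resets;     /* full-GT error handling invocations */

/* Stand-ins for intel_guc_find_hung_context() / intel_gt_handle_error(). */
static void find_hung_context(int engine_id) { (void)engine_id; hung_searches++; }
static void gt_handle_error(unsigned int mask) { (void)mask; gt_resets++; }

/*
 * Model of reset_fail_worker_func(): with the GuC dead-looped after a
 * failed engine reset, the KMD first scans each failed engine for a
 * guilty context, then requests a single full-GT error capture/reset
 * covering the whole mask.
 */
static void reset_fail_worker_model(unsigned int reset_fail_mask)
{
	if (!reset_fail_mask)
		return;

	for (int id = 0; id < MAX_ENGINES; id++)
		if (reset_fail_mask & (1u << id))
			find_hung_context(id);

	gt_handle_error(reset_fail_mask);
}
```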

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 0a42f1807f52c..c82730804a1c4 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4751,11 +4751,24 @@ static void reset_fail_worker_func(struct work_struct *w)
 	guc->submission_state.reset_fail_mask = 0;
 	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
 
-	if (likely(reset_fail_mask))
+	if (likely(reset_fail_mask)) {
+		struct intel_engine_cs *engine;
+		enum intel_engine_id id;
+
+		/*
+		 * GuC is toast at this point - it dead loops after sending the failed
+		 * reset notification. So need to manually determine the guilty context.
+		 * Note that it should be safe/reliable to do this here because the GuC
+		 * is toast and will not be scheduling behind the KMD's back.
+		 */
+		for_each_engine_masked(engine, gt, reset_fail_mask, id)
+			intel_guc_find_hung_context(engine);
+
 		intel_gt_handle_error(gt, reset_fail_mask,
 				      I915_ERROR_CAPTURE,
 				      "GuC failed to reset engine mask=0x%x\n",
 				      reset_fail_mask);
+	}
 }
 
 int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
-- 
2.37.3



* Re: [Intel-gfx] [PATCH 2/2] drm/i915/guc: Look for a guilty context when an engine reset fails
  2022-11-29 21:12 ` [PATCH 2/2] drm/i915/guc: Look for a guilty context when an engine reset fails John.C.Harrison
@ 2022-11-30  8:30   ` Tvrtko Ursulin
  2022-11-30 21:04     ` John Harrison
  2022-12-13  2:00   ` Umesh Nerlige Ramappa
  1 sibling, 1 reply; 9+ messages in thread
From: Tvrtko Ursulin @ 2022-11-30  8:30 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel


On 29/11/2022 21:12, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> Engine resets are supposed to never happen. But in the case when one

Engine resets or engine reset failures? Hopefully the latter.

> does (due to unknwon reasons that normally come down to a missing
> w/a), it is useful to get as much information out of the system as
> possible. Given that the GuC effectively dies on such a situation, it
> is not possible to get a guilty context notification back. So do a
> manual search instead. Given that GuC is dead, this is safe because
> GuC won't be changing the engine state asynchronously.
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++++++++++++-
>   1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 0a42f1807f52c..c82730804a1c4 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -4751,11 +4751,24 @@ static void reset_fail_worker_func(struct work_struct *w)
>   	guc->submission_state.reset_fail_mask = 0;
>   	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>   
> -	if (likely(reset_fail_mask))
> +	if (likely(reset_fail_mask)) {
> +		struct intel_engine_cs *engine;
> +		enum intel_engine_id id;
> +
> +		/*
> +		 * GuC is toast at this point - it dead loops after sending the failed
> +		 * reset notification. So need to manually determine the guilty context.
> +		 * Note that it should be safe/reliable to do this here because the GuC
> +		 * is toast and will not be scheduling behind the KMD's back.
> +		 */
> +		for_each_engine_masked(engine, gt, reset_fail_mask, id)
> +			intel_guc_find_hung_context(engine);
> +
>   		intel_gt_handle_error(gt, reset_fail_mask,
>   				      I915_ERROR_CAPTURE,
>   				      "GuC failed to reset engine mask=0x%x\n",
>   				      reset_fail_mask);

If GuC is defined by ABI contract to be dead, should the flow be 
attempting to do a full GPU reset here, or maybe it happens somewhere 
else as a consequence anyway? (In which case is the engine reset here 
even needed?)

Regards,

Tvrtko

> +	}
>   }
>   
>   int intel_guc_engine_failure_process_msg(struct intel_guc *guc,


* Re: [Intel-gfx] [PATCH 2/2] drm/i915/guc: Look for a guilty context when an engine reset fails
  2022-11-30  8:30   ` [Intel-gfx] " Tvrtko Ursulin
@ 2022-11-30 21:04     ` John Harrison
  2022-12-01 10:21       ` Tvrtko Ursulin
  0 siblings, 1 reply; 9+ messages in thread
From: John Harrison @ 2022-11-30 21:04 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-GFX; +Cc: DRI-Devel

On 11/30/2022 00:30, Tvrtko Ursulin wrote:
> On 29/11/2022 21:12, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> Engine resets are supposed to never happen. But in the case when one
>
> Engine resets or engine reset failures? Hopefully the latter.
>
Oops. Yes, that was meant to say "engine resets are never supposed to fail."

>> does (due to unknwon reasons that normally come down to a missing
unknwon -> unknown

>> w/a), it is useful to get as much information out of the system as
>> possible. Given that the GuC effectively dies on such a situation, it
>> is not possible to get a guilty context notification back. So do a
>> manual search instead. Given that GuC is dead, this is safe because
>> GuC won't be changing the engine state asynchronously.
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++++++++++++-
>>   1 file changed, 14 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index 0a42f1807f52c..c82730804a1c4 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -4751,11 +4751,24 @@ static void reset_fail_worker_func(struct 
>> work_struct *w)
>>       guc->submission_state.reset_fail_mask = 0;
>>       spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>   -    if (likely(reset_fail_mask))
>> +    if (likely(reset_fail_mask)) {
>> +        struct intel_engine_cs *engine;
>> +        enum intel_engine_id id;
>> +
>> +        /*
>> +         * GuC is toast at this point - it dead loops after sending 
>> the failed
>> +         * reset notification. So need to manually determine the 
>> guilty context.
>> +         * Note that it should be safe/reliable to do this here 
>> because the GuC
>> +         * is toast and will not be scheduling behind the KMD's back.
>> +         */
>> +        for_each_engine_masked(engine, gt, reset_fail_mask, id)
>> +            intel_guc_find_hung_context(engine);
>> +
>>           intel_gt_handle_error(gt, reset_fail_mask,
>>                         I915_ERROR_CAPTURE,
>>                         "GuC failed to reset engine mask=0x%x\n",
>>                         reset_fail_mask);
>
> If GuC is defined by ABI contract to be dead, should the flow be 
> attempting to do a full GPU reset here, or maybe it happens somewhere 
> else as a consequence anyway? (In which case is the engine reset here 
> even needed?)
This is a full GT reset. i915 is not allowed to perform an engine reset 
when using GuC submission. Those can only be done by GuC. So any forced 
reset by i915 will be escalated to full GT internally.
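
The escalation rule described here can be sketched as a toy model (the
helper name and enum are hypothetical, not i915 code):

```c
#include <assert.h>
#include <stdbool.h>

enum reset_scope { RESET_ENGINE_ONLY, RESET_FULL_GT };

/*
 * Toy model of the policy above: with GuC submission enabled, the KMD
 * may not reset individual engines itself, so any reset it forces is
 * escalated to a full-GT reset regardless of the engine mask passed in.
 */
static enum reset_scope effective_reset_scope(bool guc_submission,
					      unsigned int engine_mask)
{
	(void)engine_mask; /* ignored under GuC submission */
	return guc_submission ? RESET_FULL_GT : RESET_ENGINE_ONLY;
}
```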

John.

>
> Regards,
>
> Tvrtko
>
>> +    }
>>   }
>>     int intel_guc_engine_failure_process_msg(struct intel_guc *guc,



* Re: [Intel-gfx] [PATCH 2/2] drm/i915/guc: Look for a guilty context when an engine reset fails
  2022-11-30 21:04     ` John Harrison
@ 2022-12-01 10:21       ` Tvrtko Ursulin
  0 siblings, 0 replies; 9+ messages in thread
From: Tvrtko Ursulin @ 2022-12-01 10:21 UTC (permalink / raw)
  To: John Harrison, Intel-GFX; +Cc: DRI-Devel


On 30/11/2022 21:04, John Harrison wrote:
> On 11/30/2022 00:30, Tvrtko Ursulin wrote:
>> On 29/11/2022 21:12, John.C.Harrison@Intel.com wrote:
>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>
>>> Engine resets are supposed to never happen. But in the case when one
>>
>> Engine resets or engine reset failures? Hopefully the latter.
>>
> Oops. Yes, that was meant to say "engine resets are never supposed to 
> fail."
> 
>>> does (due to unknwon reasons that normally come down to a missing
> unknwon -> unknown
> 
>>> w/a), it is useful to get as much information out of the system as
>>> possible. Given that the GuC effectively dies on such a situation, it
>>> is not possible to get a guilty context notification back. So do a
>>> manual search instead. Given that GuC is dead, this is safe because
>>> GuC won't be changing the engine state asynchronously.
>>>
>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++++++++++++-
>>>   1 file changed, 14 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index 0a42f1807f52c..c82730804a1c4 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -4751,11 +4751,24 @@ static void reset_fail_worker_func(struct 
>>> work_struct *w)
>>>       guc->submission_state.reset_fail_mask = 0;
>>>       spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>>>   -    if (likely(reset_fail_mask))
>>> +    if (likely(reset_fail_mask)) {
>>> +        struct intel_engine_cs *engine;
>>> +        enum intel_engine_id id;
>>> +
>>> +        /*
>>> +         * GuC is toast at this point - it dead loops after sending 
>>> the failed
>>> +         * reset notification. So need to manually determine the 
>>> guilty context.
>>> +         * Note that it should be safe/reliable to do this here 
>>> because the GuC
>>> +         * is toast and will not be scheduling behind the KMD's back.
>>> +         */
>>> +        for_each_engine_masked(engine, gt, reset_fail_mask, id)
>>> +            intel_guc_find_hung_context(engine);
>>> +
>>>           intel_gt_handle_error(gt, reset_fail_mask,
>>>                         I915_ERROR_CAPTURE,
>>>                         "GuC failed to reset engine mask=0x%x\n",
>>>                         reset_fail_mask);
>>
>> If GuC is defined by ABI contract to be dead, should the flow be 
>> attempting to do a full GPU reset here, or maybe it happens somewhere 
>> else as a consequence anyway? (In which case is the engine reset here 
>> even needed?)
> This is a full GT reset. i915 is not allowed to perform an engine reset 
> when using GuC submission. Those can only be done by GuC. So any forced 
> reset by i915 will be escalated to full GT internally.

Okay, I saw the engine mask being passed in and drew the wrong conclusion.

Regards,

Tvrtko


* Re: [Intel-gfx] [PATCH 1/2] drm/i915: Allow error capture without a request
  2022-11-29 21:12 ` [PATCH 1/2] drm/i915: Allow error capture without a request John.C.Harrison
@ 2022-12-13  1:52   ` Umesh Nerlige Ramappa
  2022-12-16 21:06     ` John Harrison
  0 siblings, 1 reply; 9+ messages in thread
From: Umesh Nerlige Ramappa @ 2022-12-13  1:52 UTC (permalink / raw)
  To: John.C.Harrison; +Cc: Intel-GFX, DRI-Devel

On Tue, Nov 29, 2022 at 01:12:52PM -0800, John.C.Harrison@Intel.com wrote:
>From: John Harrison <John.C.Harrison@Intel.com>
>
>There was a report of error captures occurring without any hung
>context being indicated despite the capture being initiated by a 'hung
>context notification' from GuC. The problem was not reproducible.
>However, it is possible to happen if the context in question has no
>active requests. For example, if the hang was in the context switch
>itself then the breadcrumb write would have occurred and the KMD would
>see an idle context.
>
>In the interests of attempting to provide as much information as
>possible about a hang, it seems wise to include the engine info
>regardless of whether a request was found or not. As opposed to just
>pretending there was no hang at all.
>
>So update the error capture code to always record engine information
>if an engine is given. Which means updating record_context() to take a
>context instead of a request (which it only ever used to find the
>context anyway). And split the request agnostic parts of
>intel_engine_coredump_add_request() out into a separate function.
>
>Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>---
> drivers/gpu/drm/i915/i915_gpu_error.c | 55 +++++++++++++++++++--------
> 1 file changed, 40 insertions(+), 15 deletions(-)
>
>diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
>index 9d5d5a397b64e..2ed1c84c9fab4 100644
>--- a/drivers/gpu/drm/i915/i915_gpu_error.c
>+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>@@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct intel_engine_coredump *ee)
> }
>
> static bool record_context(struct i915_gem_context_coredump *e,
>-			   const struct i915_request *rq)
>+			   struct intel_context *ce)
> {
> 	struct i915_gem_context *ctx;
> 	struct task_struct *task;
> 	bool simulated;
>
> 	rcu_read_lock();
>-	ctx = rcu_dereference(rq->context->gem_context);
>+	ctx = rcu_dereference(ce->gem_context);
> 	if (ctx && !kref_get_unless_zero(&ctx->ref))
> 		ctx = NULL;
> 	rcu_read_unlock();
>@@ -1396,8 +1396,8 @@ static bool record_context(struct i915_gem_context_coredump *e,
> 	e->guilty = atomic_read(&ctx->guilty_count);
> 	e->active = atomic_read(&ctx->active_count);
>
>-	e->total_runtime = intel_context_get_total_runtime_ns(rq->context);
>-	e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context);
>+	e->total_runtime = intel_context_get_total_runtime_ns(ce);
>+	e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
>
> 	simulated = i915_gem_context_no_error_capture(ctx);
>
>@@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct intel_engine_cs *engine, gfp_t gfp, u32 dump_
> 	return ee;
> }
>
>+static struct intel_engine_capture_vma *
>+engine_coredump_add_context(struct intel_engine_coredump *ee,
>+			    struct intel_context *ce,
>+			    gfp_t gfp)
>+{
>+	struct intel_engine_capture_vma *vma = NULL;
>+
>+	ee->simulated |= record_context(&ee->context, ce);
>+	if (ee->simulated)
>+		return NULL;
>+
>+	/*
>+	 * We need to copy these to an anonymous buffer
>+	 * as the simplest method to avoid being overwritten
>+	 * by userspace.
>+	 */
>+	vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
>+	vma = capture_vma(vma, ce->state, "HW context", gfp);
>+
>+	return vma;
>+}
>+
> struct intel_engine_capture_vma *
> intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
> 				  struct i915_request *rq,
> 				  gfp_t gfp)
> {
>-	struct intel_engine_capture_vma *vma = NULL;
>+	struct intel_engine_capture_vma *vma;
>
>-	ee->simulated |= record_context(&ee->context, rq);
>-	if (ee->simulated)
>+	vma = engine_coredump_add_context(ee, rq->context, gfp);
>+	if (!vma)
> 		return NULL;
>
> 	/*
>@@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
> 	 */
> 	vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch");
> 	vma = capture_user(vma, rq, gfp);
>-	vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
>-	vma = capture_vma(vma, rq->context->state, "HW context", gfp);
>
> 	ee->rq_head = rq->head;
> 	ee->rq_post = rq->postfix;
>@@ -1608,8 +1628,11 @@ capture_engine(struct intel_engine_cs *engine,
> 	if (ce) {
> 		intel_engine_clear_hung_context(engine);
> 		rq = intel_context_find_active_request(ce);
>-		if (!rq || !i915_request_started(rq))
>-			goto no_request_capture;
>+		if (rq && !i915_request_started(rq)) {
>+			drm_info(&engine->gt->i915->drm, "Got hung context on %s with no active request!\n",
>+				 engine->name);
>+			rq = NULL;
>+		}
> 	} else {
> 		/*
> 		 * Getting here with GuC enabled means it is a forced error capture
>@@ -1625,12 +1648,14 @@ capture_engine(struct intel_engine_cs *engine,
> 	if (rq)
> 		rq = i915_request_get_rcu(rq);
>
>-	if (!rq)
>-		goto no_request_capture;
>+	if (rq)
>+		capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);

2 back-to-back if (rq) could merge together,


otherwise, lgtm

Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>

Umesh
>+	else if (ce)
>+		capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
>
>-	capture = intel_engine_coredump_add_request(ee, rq, ATOMIC_MAYFAIL);
> 	if (!capture) {
>-		i915_request_put(rq);
>+		if (rq)
>+			i915_request_put(rq);
> 		goto no_request_capture;
> 	}
> 	if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>-- 
>2.37.3
>


* Re: [PATCH 2/2] drm/i915/guc: Look for a guilty context when an engine reset fails
  2022-11-29 21:12 ` [PATCH 2/2] drm/i915/guc: Look for a guilty context when an engine reset fails John.C.Harrison
  2022-11-30  8:30   ` [Intel-gfx] " Tvrtko Ursulin
@ 2022-12-13  2:00   ` Umesh Nerlige Ramappa
  1 sibling, 0 replies; 9+ messages in thread
From: Umesh Nerlige Ramappa @ 2022-12-13  2:00 UTC (permalink / raw)
  To: John.C.Harrison; +Cc: Intel-GFX, DRI-Devel

On Tue, Nov 29, 2022 at 01:12:53PM -0800, John.C.Harrison@Intel.com wrote:
>From: John Harrison <John.C.Harrison@Intel.com>
>
>Engine resets are supposed to never happen. But in the case when one
>does (due to unknwon reasons that normally come down to a missing
>w/a), it is useful to get as much information out of the system as
>possible. Given that the GuC effectively dies on such a situation, it
>is not possible to get a guilty context notification back. So do a
>manual search instead. Given that GuC is dead, this is safe because
>GuC won't be changing the engine state asynchronously.
>
>Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>---
> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 15 ++++++++++++++-
> 1 file changed, 14 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>index 0a42f1807f52c..c82730804a1c4 100644
>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>@@ -4751,11 +4751,24 @@ static void reset_fail_worker_func(struct work_struct *w)
> 	guc->submission_state.reset_fail_mask = 0;
> 	spin_unlock_irqrestore(&guc->submission_state.lock, flags);
>
>-	if (likely(reset_fail_mask))
>+	if (likely(reset_fail_mask)) {
>+		struct intel_engine_cs *engine;
>+		enum intel_engine_id id;
>+
>+		/*
>+		 * GuC is toast at this point - it dead loops after sending the failed
>+		 * reset notification. So need to manually determine the guilty context.
>+		 * Note that it should be safe/reliable to do this here because the GuC
>+		 * is toast and will not be scheduling behind the KMD's back.
>+		 */

Does the KMD-GuC interface define that, following a failed reset notification, GuC 
will always dead-loop OR not schedule anything (even on other engines) until KMD takes 
some action? What action should KMD take?

Regards,
Umesh

>+		for_each_engine_masked(engine, gt, reset_fail_mask, id)
>+			intel_guc_find_hung_context(engine);
>+
> 		intel_gt_handle_error(gt, reset_fail_mask,
> 				      I915_ERROR_CAPTURE,
> 				      "GuC failed to reset engine mask=0x%x\n",
> 				      reset_fail_mask);
>+	}
> }
>
> int intel_guc_engine_failure_process_msg(struct intel_guc *guc,
>-- 
>2.37.3
>


* Re: [Intel-gfx] [PATCH 1/2] drm/i915: Allow error capture without a request
  2022-12-13  1:52   ` [Intel-gfx] " Umesh Nerlige Ramappa
@ 2022-12-16 21:06     ` John Harrison
  0 siblings, 0 replies; 9+ messages in thread
From: John Harrison @ 2022-12-16 21:06 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa; +Cc: Intel-GFX, DRI-Devel



On 12/12/2022 17:52, Umesh Nerlige Ramappa wrote:
> On Tue, Nov 29, 2022 at 01:12:52PM -0800, John.C.Harrison@Intel.com 
> wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> There was a report of error captures occurring without any hung
>> context being indicated despite the capture being initiated by a 'hung
>> context notification' from GuC. The problem was not reproducible.
>> However, it is possible to happen if the context in question has no
>> active requests. For example, if the hang was in the context switch
>> itself then the breadcrumb write would have occurred and the KMD would
>> see an idle context.
>>
>> In the interests of attempting to provide as much information as
>> possible about a hang, it seems wise to include the engine info
>> regardless of whether a request was found or not. As opposed to just
>> pretending there was no hang at all.
>>
>> So update the error capture code to always record engine information
>> if an engine is given. Which means updating record_context() to take a
>> context instead of a request (which it only ever used to find the
>> context anyway). And split the request agnostic parts of
>> intel_engine_coredump_add_request() out into a separate function.
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> ---
>> drivers/gpu/drm/i915/i915_gpu_error.c | 55 +++++++++++++++++++--------
>> 1 file changed, 40 insertions(+), 15 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c 
>> b/drivers/gpu/drm/i915/i915_gpu_error.c
>> index 9d5d5a397b64e..2ed1c84c9fab4 100644
>> --- a/drivers/gpu/drm/i915/i915_gpu_error.c
>> +++ b/drivers/gpu/drm/i915/i915_gpu_error.c
>> @@ -1370,14 +1370,14 @@ static void engine_record_execlists(struct 
>> intel_engine_coredump *ee)
>> }
>>
>> static bool record_context(struct i915_gem_context_coredump *e,
>> -               const struct i915_request *rq)
>> +               struct intel_context *ce)
>> {
>>     struct i915_gem_context *ctx;
>>     struct task_struct *task;
>>     bool simulated;
>>
>>     rcu_read_lock();
>> -    ctx = rcu_dereference(rq->context->gem_context);
>> +    ctx = rcu_dereference(ce->gem_context);
>>     if (ctx && !kref_get_unless_zero(&ctx->ref))
>>         ctx = NULL;
>>     rcu_read_unlock();
>> @@ -1396,8 +1396,8 @@ static bool record_context(struct 
>> i915_gem_context_coredump *e,
>>     e->guilty = atomic_read(&ctx->guilty_count);
>>     e->active = atomic_read(&ctx->active_count);
>>
>> -    e->total_runtime = intel_context_get_total_runtime_ns(rq->context);
>> -    e->avg_runtime = intel_context_get_avg_runtime_ns(rq->context);
>> +    e->total_runtime = intel_context_get_total_runtime_ns(ce);
>> +    e->avg_runtime = intel_context_get_avg_runtime_ns(ce);
>>
>>     simulated = i915_gem_context_no_error_capture(ctx);
>>
>> @@ -1532,15 +1532,37 @@ intel_engine_coredump_alloc(struct 
>> intel_engine_cs *engine, gfp_t gfp, u32 dump_
>>     return ee;
>> }
>>
>> +static struct intel_engine_capture_vma *
>> +engine_coredump_add_context(struct intel_engine_coredump *ee,
>> +                struct intel_context *ce,
>> +                gfp_t gfp)
>> +{
>> +    struct intel_engine_capture_vma *vma = NULL;
>> +
>> +    ee->simulated |= record_context(&ee->context, ce);
>> +    if (ee->simulated)
>> +        return NULL;
>> +
>> +    /*
>> +     * We need to copy these to an anonymous buffer
>> +     * as the simplest method to avoid being overwritten
>> +     * by userspace.
>> +     */
>> +    vma = capture_vma(vma, ce->ring->vma, "ring", gfp);
>> +    vma = capture_vma(vma, ce->state, "HW context", gfp);
>> +
>> +    return vma;
>> +}
>> +
>> struct intel_engine_capture_vma *
>> intel_engine_coredump_add_request(struct intel_engine_coredump *ee,
>>                   struct i915_request *rq,
>>                   gfp_t gfp)
>> {
>> -    struct intel_engine_capture_vma *vma = NULL;
>> +    struct intel_engine_capture_vma *vma;
>>
>> -    ee->simulated |= record_context(&ee->context, rq);
>> -    if (ee->simulated)
>> +    vma = engine_coredump_add_context(ee, rq->context, gfp);
>> +    if (!vma)
>>         return NULL;
>>
>>     /*
>> @@ -1550,8 +1572,6 @@ intel_engine_coredump_add_request(struct 
>> intel_engine_coredump *ee,
>>      */
>>     vma = capture_vma_snapshot(vma, rq->batch_res, gfp, "batch");
>>     vma = capture_user(vma, rq, gfp);
>> -    vma = capture_vma(vma, rq->ring->vma, "ring", gfp);
>> -    vma = capture_vma(vma, rq->context->state, "HW context", gfp);
>>
>>     ee->rq_head = rq->head;
>>     ee->rq_post = rq->postfix;
>> @@ -1608,8 +1628,11 @@ capture_engine(struct intel_engine_cs *engine,
>>     if (ce) {
>>         intel_engine_clear_hung_context(engine);
>>         rq = intel_context_find_active_request(ce);
>> -        if (!rq || !i915_request_started(rq))
>> -            goto no_request_capture;
>> +        if (rq && !i915_request_started(rq)) {
>> +            drm_info(&engine->gt->i915->drm, "Got hung context on %s 
>> with no active request!\n",
>> +                 engine->name);
>> +            rq = NULL;
>> +        }
>>     } else {
>>         /*
>>          * Getting here with GuC enabled means it is a forced error 
>> capture
>> @@ -1625,12 +1648,14 @@ capture_engine(struct intel_engine_cs *engine,
>>     if (rq)
>>         rq = i915_request_get_rcu(rq);
>>
>> -    if (!rq)
>> -        goto no_request_capture;
>> +    if (rq)
>> +        capture = intel_engine_coredump_add_request(ee, rq, 
>> ATOMIC_MAYFAIL);
>
> 2 back-to-back if (rq) could merge together,
>
>
> otherwise, lgtm
>
> Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>
> Umesh
There was actually an unconditional put of rq just after the block 
below. So will need to repost to fix that anyway. Not sure how it didn't 
manage to bang when testing captures with the rq forced to null!?

John.

>> +    else if (ce)
>> +        capture = engine_coredump_add_context(ee, ce, ATOMIC_MAYFAIL);
>>
>> -    capture = intel_engine_coredump_add_request(ee, rq, 
>> ATOMIC_MAYFAIL);
>>     if (!capture) {
>> -        i915_request_put(rq);
>> +        if (rq)
>> +            i915_request_put(rq);
>>         goto no_request_capture;
>>     }
>>     if (dump_flags & CORE_DUMP_FLAG_IS_GUC_CAPTURE)
>> -- 
>> 2.37.3
>>


