From: "Ceraolo Spurio, Daniele" <daniele.ceraolospurio@intel.com>
To: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>,
	<intel-gfx@lists.freedesktop.org>
Cc: <stable@vger.kernel.org>, <dri-devel@lists.freedesktop.org>,
	Matthew Brost <matthew.brost@intel.com>
Subject: Re: [Intel-gfx] [PATCH] drm/i915/guc: clear stalled request after a reset
Date: Fri, 12 Aug 2022 08:31:18 -0700
Message-ID: <c36cf67c-c32f-4883-b56e-9e5322720431@intel.com>
In-Reply-To: <bd3abbb2-f3e8-b143-a19d-2cbf9463f7b3@linux.intel.com>



On 8/12/2022 12:29 AM, Tvrtko Ursulin wrote:
>
> On 11/08/2022 22:08, Daniele Ceraolo Spurio wrote:
>> If the GuC CTs are full and we need to stall the request submission
>> while waiting for space, we save the stalled request and where the stall
>> occurred; when the CTs have space again we pick up the request submission
>> from where we left off.
>
> How serious is it? Statement always was CT buffers can never get full 
> outside the pathological IGT test cases. So I am wondering if this is 
> in the category of fix for correctness or actually the CT buffers can 
> get full in normal use so it is imperative to fix.

The CT buffers being full is indeed something that is normally only 
observed with IGTs that hammer the submission path, but it is still 
something that a user can do so IMO we do have to fix it. However, the 
bug is still extremely unlikely to happen out in the wild as it needs 2 
relatively rare things to happen:

- We need to hit the pathological case of the GuC CTs being full and the 
stall kicking in
- Something needs to go wrong and escalate to a full GT reset

The bug report that triggered my investigation into this came from what 
looks like faulty HW: the HW seems to suddenly just stop with no errors 
anywhere, which leads to the buffers filling up because the GuC is no 
longer processing them, followed by a GT reset as we try to recover the 
HW. To replicate this locally I had to add a debugfs to kill the GuC in 
the middle of the test to simulate this "HW silently dies" scenario.
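
As a rough illustration of why the saved stall state has to be dropped,
here is a toy model in plain C. It is only a sketch under stated
assumptions, not the i915 code: the toy_* names, the two-step submission
and the CT capacity of 4 are all made up for the example. Only the idea
of a saved request plus a saved stall point that must be cleared after a
reset comes from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stall points; the real driver has its own enum. */
enum toy_stall_reason { TOY_STALL_NONE, TOY_STALL_STEP1, TOY_STALL_STEP2 };

/* Hypothetical request: counts how many submission steps reached the GuC. */
struct toy_request {
	int steps_done;
};

/* Hypothetical GuC state: free CT slots plus the saved stall. */
struct toy_guc {
	int ct_space;
	struct toy_request *stalled_request;
	enum toy_stall_reason stall_reason;
};

/* Remember which request stalled and at which step. */
void toy_stall(struct toy_guc *guc, struct toy_request *rq,
	       enum toy_stall_reason reason)
{
	guc->stalled_request = rq;
	guc->stall_reason = reason;
}

/*
 * Two-step submission, one CT slot per step. Returns true when both
 * steps were sent and false when the submission stalled. A stalled
 * request is resumed from the saved step on the next call.
 */
bool toy_submit(struct toy_guc *guc, struct toy_request *rq)
{
	enum toy_stall_reason resume = TOY_STALL_NONE;

	assert(guc && rq);

	if (guc->stalled_request == rq)
		resume = guc->stall_reason;

	/* Step 1 is skipped only when we previously stalled at step 2. */
	if (resume != TOY_STALL_STEP2) {
		if (guc->ct_space == 0) {
			toy_stall(guc, rq, TOY_STALL_STEP1);
			return false;
		}
		guc->ct_space--;
		rq->steps_done = 1;
	}

	if (guc->ct_space == 0) {
		toy_stall(guc, rq, TOY_STALL_STEP2);
		return false;
	}
	guc->ct_space--;
	rq->steps_done++;

	guc->stalled_request = NULL;
	guc->stall_reason = TOY_STALL_NONE;
	return true;
}

/*
 * Model of a full reset: the request is unsubmitted and the CT space
 * is restored. clear_stall selects the fixed behaviour (true) or the
 * buggy one (false), where the stale stall state survives the reset.
 */
void toy_reset(struct toy_guc *guc, struct toy_request *rq, bool clear_stall)
{
	rq->steps_done = 0;
	guc->ct_space = 4; /* arbitrary capacity for the example */

	if (clear_stall) {
		guc->stalled_request = NULL;
		guc->stall_reason = TOY_STALL_NONE;
	}
}
```

With clear_stall set to false, a request that stalled at the second step
and then went through a reset is resumed at the second step, so the
first step is never replayed and the request ends up with steps_done == 1.
With clear_stall set to true the submission restarts from scratch and
both steps are sent.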

Daniele

>
> Regards,
>
> Tvrtko
>
>> If a full GT reset occurs, the state of all contexts is cleared and all
>> non-guilty requests are unsubmitted, therefore we need to restart the
>> stalled request submission from scratch. To make sure that we do so,
>> clear the saved request after a reset.
>>
>> Fixes note: the patch that introduced the bug is in 5.15, but no
>> officially supported platform had GuC submission enabled by default
>> in that kernel, so the backport to that particular version (and only
>> that one) can potentially be skipped.
>>
>> Fixes: 925dc1cf58ed ("drm/i915/guc: Implement GuC submission tasklet")
>> Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
>> Cc: Matthew Brost <matthew.brost@intel.com>
>> Cc: John Harrison <john.c.harrison@intel.com>
>> Cc: <stable@vger.kernel.org> # v5.15+
>> ---
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 7 +++++++
>>   1 file changed, 7 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index 0d17da77e787..0d56b615bf78 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -4002,6 +4002,13 @@ static inline void guc_init_lrc_mapping(struct intel_guc *guc)
>>       /* make sure all descriptors are clean... */
>>       xa_destroy(&guc->context_lookup);
>>
>> +    /*
>> +     * A reset might have occurred while we had a pending stalled request,
>> +     * so make sure we clean that up.
>> +     */
>> +    guc->stalled_request = NULL;
>> +    guc->submission_stall_reason = STALL_NONE;
>> +
>>       /*
>>        * Some contexts might have been pinned before we enabled GuC
>>        * submission, so we need to add them to the GuC bookeeping.


