intel-gfx.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [Intel-gfx] [PATCH v2] drm/i915/gt: Convert reset prepare failure log to trace
@ 2023-12-05  8:52 Nirmoy Das
  2023-12-05  9:10 ` John Harrison
  2023-12-05 13:10 ` [Intel-gfx] ✗ Fi.CI.BAT: failure for " Patchwork
  0 siblings, 2 replies; 6+ messages in thread
From: Nirmoy Das @ 2023-12-05  8:52 UTC (permalink / raw)
  To: intel-gfx; +Cc: Andrzej Hajda, Tvrtko Ursulin, dri-devel, Nirmoy Das

gen8_engine_reset_prepare() can fail when HW fails to set
RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal
error as driver will retry.

Convert the log to a trace log for debugging without triggering
unnecessary concerns in CI or for end-users during non-fatal scenarios.

v2: Improve commit message(Tvrtko)

Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: John Harrison <John.C.Harrison@Intel.com>
Cc: Andi Shyti <andi.shyti@linux.intel.com>
Cc: Andrzej Hajda <andrzej.hajda@intel.com>
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5591
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_reset.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index d5ed904f355d..e6fbc6202c80 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -593,10 +593,10 @@ static int gen8_engine_reset_prepare(struct intel_engine_cs *engine)
 	ret = __intel_wait_for_register_fw(uncore, reg, mask, ack,
 					   700, 0, NULL);
 	if (ret)
-		gt_err(engine->gt,
-		       "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n",
-		       engine->name, request,
-		       intel_uncore_read_fw(uncore, reg));
+		GT_TRACE(engine->gt,
+			 "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n",
+			 engine->name, request,
+			 intel_uncore_read_fw(uncore, reg));
 
 	return ret;
 }
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [Intel-gfx] [PATCH v2] drm/i915/gt: Convert reset prepare failure log to trace
  2023-12-05  8:52 [Intel-gfx] [PATCH v2] drm/i915/gt: Convert reset prepare failure log to trace Nirmoy Das
@ 2023-12-05  9:10 ` John Harrison
  2023-12-05 10:39   ` Nirmoy Das
  2023-12-05 13:10 ` [Intel-gfx] ✗ Fi.CI.BAT: failure for " Patchwork
  1 sibling, 1 reply; 6+ messages in thread
From: John Harrison @ 2023-12-05  9:10 UTC (permalink / raw)
  To: Nirmoy Das, intel-gfx; +Cc: Andrzej Hajda, dri-devel, Tvrtko Ursulin

On 12/5/2023 00:52, Nirmoy Das wrote:
> gen8_engine_reset_prepare() can fail when HW fails to set
> RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal
> error as driver will retry.
>
> Convert the log to a trace log for debugging without triggering
> unnecessary concerns in CI or for end-users during non-fatal scenarios.
I strongly disagree with this change. The hardware spec for the 
RESET_CTL and GDRST registers are that they will self clear within a 
matter of microseconds. If something is so badly wrong with the hardware 
that it can't even manage to reset then that is something that very much 
warrants more than a completely silent trace event. It most certainly 
should be flagged as a failure in CI.

Just because the driver will retry does not mean that this is not a 
serious error. And if the first attempt failed, why would a subsequent 
attempt succeed? Escalating to FLR may have more success, but that is 
not something that i915 currently does.

John.


>
> v2: Improve commit message(Tvrtko)
>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: John Harrison <John.C.Harrison@Intel.com>
> Cc: Andi Shyti <andi.shyti@linux.intel.com>
> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5591
> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
> Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_reset.c | 8 ++++----
>   1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index d5ed904f355d..e6fbc6202c80 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -593,10 +593,10 @@ static int gen8_engine_reset_prepare(struct intel_engine_cs *engine)
>   	ret = __intel_wait_for_register_fw(uncore, reg, mask, ack,
>   					   700, 0, NULL);
>   	if (ret)
> -		gt_err(engine->gt,
> -		       "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n",
> -		       engine->name, request,
> -		       intel_uncore_read_fw(uncore, reg));
> +		GT_TRACE(engine->gt,
> +			 "%s reset request timed out: {request: %08x, RESET_CTL: %08x}\n",
> +			 engine->name, request,
> +			 intel_uncore_read_fw(uncore, reg));
>   
>   	return ret;
>   }


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Intel-gfx] [PATCH v2] drm/i915/gt: Convert reset prepare failure log to trace
  2023-12-05  9:10 ` John Harrison
@ 2023-12-05 10:39   ` Nirmoy Das
  2023-12-05 19:50     ` John Harrison
  0 siblings, 1 reply; 6+ messages in thread
From: Nirmoy Das @ 2023-12-05 10:39 UTC (permalink / raw)
  To: John Harrison, Nirmoy Das, intel-gfx
  Cc: dri-devel, Andrzej Hajda, Tvrtko Ursulin

Hi John,

On 12/5/2023 10:10 AM, John Harrison wrote:
> On 12/5/2023 00:52, Nirmoy Das wrote:
>> gen8_engine_reset_prepare() can fail when HW fails to set
>> RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal
>> error as driver will retry.
>>
>> Convert the log to a trace log for debugging without triggering
>> unnecessary concerns in CI or for end-users during non-fatal scenarios.
> I strongly disagree with this change. The hardware spec for the 
> RESET_CTL and GDRST registers are that they will self clear within a 
> matter of microseconds. If something is so badly wrong with the 
> hardware that it can't even manage to reset


This message is for reset readiness  poll timeout not that the reset is 
failed which doesn't sound so serious if the subsequent attempt managed 
reset the engine.

I couldn't get enough details when this can happen that HW takes very 
long time to set the readiness bit.


> then that is something that very much warrants more than a completely 
> silent trace event. It most certainly should be flagged as a failure 
> in CI.
>
> Just because the driver will retry does not mean that this is not a 
> serious error. And if the first attempt failed, why would a subsequent 
> attempt succeed?

The patch is not ignoring the failure. If the subsequent attempt fails 
then driver load will fail or it will be wedged if that happens after 
driver load.


> Escalating to FLR may have more success, but that is not something 
> that i915 currently does.

Do we still need to do FLR if a subsequent engine reset failure ?


Regards,

Nirmoy

>
> John.
>
>
>>
>> v2: Improve commit message(Tvrtko)
>>
>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>> Cc: John Harrison <John.C.Harrison@Intel.com>
>> Cc: Andi Shyti <andi.shyti@linux.intel.com>
>> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5591
>> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
>> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
>> Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_reset.c | 8 ++++----
>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
>> b/drivers/gpu/drm/i915/gt/intel_reset.c
>> index d5ed904f355d..e6fbc6202c80 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>> @@ -593,10 +593,10 @@ static int gen8_engine_reset_prepare(struct 
>> intel_engine_cs *engine)
>>       ret = __intel_wait_for_register_fw(uncore, reg, mask, ack,
>>                          700, 0, NULL);
>>       if (ret)
>> -        gt_err(engine->gt,
>> -               "%s reset request timed out: {request: %08x, 
>> RESET_CTL: %08x}\n",
>> -               engine->name, request,
>> -               intel_uncore_read_fw(uncore, reg));
>> +        GT_TRACE(engine->gt,
>> +             "%s reset request timed out: {request: %08x, RESET_CTL: 
>> %08x}\n",
>> +             engine->name, request,
>> +             intel_uncore_read_fw(uncore, reg));
>>         return ret;
>>   }
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BAT: failure for drm/i915/gt: Convert reset prepare failure log to trace
  2023-12-05  8:52 [Intel-gfx] [PATCH v2] drm/i915/gt: Convert reset prepare failure log to trace Nirmoy Das
  2023-12-05  9:10 ` John Harrison
@ 2023-12-05 13:10 ` Patchwork
  1 sibling, 0 replies; 6+ messages in thread
From: Patchwork @ 2023-12-05 13:10 UTC (permalink / raw)
  To: Nirmoy Das; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 8044 bytes --]

== Series Details ==

Series: drm/i915/gt: Convert reset prepare failure log to trace
URL   : https://patchwork.freedesktop.org/series/127344/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_13981 -> Patchwork_127344v1
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_127344v1 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_127344v1, please notify your bug team (I915-ci-infra@lists.freedesktop.org) to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/index.html

Participating hosts (35 -> 36)
------------------------------

  Additional (2): bat-kbl-2 bat-atsm-1 
  Missing    (1): fi-snb-2520m 

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_127344v1:

### IGT changes ###

#### Possible regressions ####

  * igt@kms_pipe_crc_basic@nonblocking-crc-frame-sequence:
    - bat-adlp-9:         NOTRUN -> [SKIP][1] +3 other tests skip
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-adlp-9/igt@kms_pipe_crc_basic@nonblocking-crc-frame-sequence.html

  
Known issues
------------

  Here are the changes found in Patchwork_127344v1 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@core_hotunplug@unbind-rebind:
    - bat-kbl-2:          NOTRUN -> [ABORT][2] ([i915#9793])
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-kbl-2/igt@core_hotunplug@unbind-rebind.html
    - bat-atsm-1:         NOTRUN -> [ABORT][3] ([i915#8213])
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-atsm-1/igt@core_hotunplug@unbind-rebind.html

  * igt@fbdev@info:
    - bat-kbl-2:          NOTRUN -> [SKIP][4] ([fdo#109271] / [i915#1849])
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-kbl-2/igt@fbdev@info.html

  * igt@gem_mmap@basic:
    - bat-atsm-1:         NOTRUN -> [SKIP][5] ([i915#4083])
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-atsm-1/igt@gem_mmap@basic.html

  * igt@gem_tiled_fence_blits@basic:
    - bat-atsm-1:         NOTRUN -> [SKIP][6] ([i915#4077]) +2 other tests skip
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-atsm-1/igt@gem_tiled_fence_blits@basic.html

  * igt@gem_tiled_pread_basic:
    - bat-atsm-1:         NOTRUN -> [SKIP][7] ([i915#4079]) +1 other test skip
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-atsm-1/igt@gem_tiled_pread_basic.html

  * igt@i915_pm_rps@basic-api:
    - bat-atsm-1:         NOTRUN -> [SKIP][8] ([i915#6621])
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-atsm-1/igt@i915_pm_rps@basic-api.html

  * igt@kms_addfb_basic@size-max:
    - bat-atsm-1:         NOTRUN -> [SKIP][9] ([i915#6077]) +36 other tests skip
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-atsm-1/igt@kms_addfb_basic@size-max.html

  * igt@kms_addfb_basic@tile-pitch-mismatch:
    - bat-atsm-1:         NOTRUN -> [SKIP][10] ([i915#5608] / [i915#6077])
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-atsm-1/igt@kms_addfb_basic@tile-pitch-mismatch.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor-legacy:
    - bat-atsm-1:         NOTRUN -> [SKIP][11] ([i915#5608] / [i915#6078]) +1 other test skip
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-atsm-1/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-legacy.html

  * igt@kms_cursor_legacy@basic-flip-after-cursor-atomic:
    - bat-atsm-1:         NOTRUN -> [SKIP][12] ([i915#6078]) +19 other tests skip
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-atsm-1/igt@kms_cursor_legacy@basic-flip-after-cursor-atomic.html
    - bat-adlp-11:        [PASS][13] -> [DMESG-WARN][14] ([i915#6868])
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13981/bat-adlp-11/igt@kms_cursor_legacy@basic-flip-after-cursor-atomic.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-adlp-11/igt@kms_cursor_legacy@basic-flip-after-cursor-atomic.html

  * igt@kms_force_connector_basic@prune-stale-modes:
    - bat-atsm-1:         NOTRUN -> [SKIP][15] ([i915#6093]) +4 other tests skip
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-atsm-1/igt@kms_force_connector_basic@prune-stale-modes.html

  * igt@kms_pipe_crc_basic@compare-crc-sanitycheck-xr24:
    - bat-atsm-1:         NOTRUN -> [SKIP][16] ([i915#1836]) +6 other tests skip
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-atsm-1/igt@kms_pipe_crc_basic@compare-crc-sanitycheck-xr24.html

  * igt@kms_prop_blob@basic:
    - bat-atsm-1:         NOTRUN -> [SKIP][17] ([i915#7357])
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-atsm-1/igt@kms_prop_blob@basic.html

  * igt@kms_psr@psr_primary_mmap_gtt:
    - bat-kbl-2:          NOTRUN -> [SKIP][18] ([fdo#109271]) +35 other tests skip
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-kbl-2/igt@kms_psr@psr_primary_mmap_gtt.html

  * igt@kms_setmode@basic-clone-single-crtc:
    - bat-atsm-1:         NOTRUN -> [SKIP][19] ([i915#6094])
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-atsm-1/igt@kms_setmode@basic-clone-single-crtc.html

  * igt@prime_vgem@basic-fence-flip:
    - bat-atsm-1:         NOTRUN -> [SKIP][20] ([fdo#109295] / [i915#6078])
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-atsm-1/igt@prime_vgem@basic-fence-flip.html

  * igt@prime_vgem@basic-fence-mmap:
    - bat-atsm-1:         NOTRUN -> [SKIP][21] ([fdo#109295] / [i915#4077]) +1 other test skip
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-atsm-1/igt@prime_vgem@basic-fence-mmap.html

  * igt@prime_vgem@basic-write:
    - bat-atsm-1:         NOTRUN -> [SKIP][22] ([fdo#109295]) +2 other tests skip
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/bat-atsm-1/igt@prime_vgem@basic-write.html

  
  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109295]: https://bugs.freedesktop.org/show_bug.cgi?id=109295
  [i915#1836]: https://gitlab.freedesktop.org/drm/intel/issues/1836
  [i915#1849]: https://gitlab.freedesktop.org/drm/intel/issues/1849
  [i915#4077]: https://gitlab.freedesktop.org/drm/intel/issues/4077
  [i915#4079]: https://gitlab.freedesktop.org/drm/intel/issues/4079
  [i915#4083]: https://gitlab.freedesktop.org/drm/intel/issues/4083
  [i915#5608]: https://gitlab.freedesktop.org/drm/intel/issues/5608
  [i915#6077]: https://gitlab.freedesktop.org/drm/intel/issues/6077
  [i915#6078]: https://gitlab.freedesktop.org/drm/intel/issues/6078
  [i915#6093]: https://gitlab.freedesktop.org/drm/intel/issues/6093
  [i915#6094]: https://gitlab.freedesktop.org/drm/intel/issues/6094
  [i915#6621]: https://gitlab.freedesktop.org/drm/intel/issues/6621
  [i915#6868]: https://gitlab.freedesktop.org/drm/intel/issues/6868
  [i915#7357]: https://gitlab.freedesktop.org/drm/intel/issues/7357
  [i915#8213]: https://gitlab.freedesktop.org/drm/intel/issues/8213
  [i915#9793]: https://gitlab.freedesktop.org/drm/intel/issues/9793


Build changes
-------------

  * Linux: CI_DRM_13981 -> Patchwork_127344v1

  CI-20190529: 20190529
  CI_DRM_13981: aaf3a2f69283b9783afb92c0aa5463f7176d20dd @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_7620: 6714b814e7f82743abf45c33361fbe057a22880a @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_127344v1: aaf3a2f69283b9783afb92c0aa5463f7176d20dd @ git://anongit.freedesktop.org/gfx-ci/linux


### Linux commits

fb6238475efd drm/i915/gt: Convert reset prepare failure log to trace

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_127344v1/index.html

[-- Attachment #2: Type: text/html, Size: 9712 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Intel-gfx] [PATCH v2] drm/i915/gt: Convert reset prepare failure log to trace
  2023-12-05 10:39   ` Nirmoy Das
@ 2023-12-05 19:50     ` John Harrison
  2023-12-06 10:35       ` Nirmoy Das
  0 siblings, 1 reply; 6+ messages in thread
From: John Harrison @ 2023-12-05 19:50 UTC (permalink / raw)
  To: Nirmoy Das, Nirmoy Das, intel-gfx
  Cc: dri-devel, Andrzej Hajda, Tvrtko Ursulin

On 12/5/2023 02:39, Nirmoy Das wrote:
> Hi John,
>
> On 12/5/2023 10:10 AM, John Harrison wrote:
>> On 12/5/2023 00:52, Nirmoy Das wrote:
>>> gen8_engine_reset_prepare() can fail when HW fails to set
>>> RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal
>>> error as driver will retry.
>>>
>>> Convert the log to a trace log for debugging without triggering
>>> unnecessary concerns in CI or for end-users during non-fatal scenarios.
>> I strongly disagree with this change. The hardware spec for the 
>> RESET_CTL and GDRST registers are that they will self clear within a 
>> matter of microseconds. If something is so badly wrong with the 
>> hardware that it can't even manage to reset
>
>
> This message is for reset readiness  poll timeout not that the reset 
> is failed which doesn't sound so serious if the subsequent attempt 
> managed reset the engine.
Not sure what the distinction is. The reset procedure is poke RESET_CTL 
wait for it to clear, poke GDRST and wait for it to clear. Just because 
step one is failing rather than step 2 does not mean that the reset as a 
whole has not failed.

Note that the purpose of RESET_CTL is to pause a bunch of stuff like the 
command streamers to prevent them from issuing new memory requests while 
the reset is in progress. If it fails, it likely means that a CS is 
refusing to stop. Most probably because it can't reach a stopping point 
because it is stuck waiting on a lost memory request or similar. And the 
point of stopping further memory requests during reset is that if the 
memory channel gets out of sync (because only the GT side is reset 
during a GT reset) then that can result in total system failure. As in 
potentially even the CPU can no longer get to memory if it is an 
integrated platform. So yes, it can be quite a serious failure indeed.


>
> I couldn't get enough details when this can happen that HW takes very 
> long time to set the readiness bit.
Is it simply 'taking a long time' or is never clearing at all? If it is 
just that the timeout is too short then the proper fix would be to 
increase the timeout. But if it is taking seconds or longer or just 
never succeeding at all, then something is very bad.

>
>
>> then that is something that very much warrants more than a completely 
>> silent trace event. It most certainly should be flagged as a failure 
>> in CI.
>>
>> Just because the driver will retry does not mean that this is not a 
>> serious error. And if the first attempt failed, why would a 
>> subsequent attempt succeed?
>
> The patch is not ignoring the failure. If the subsequent attempt fails 
> then driver load will fail or it will be wedged if that happens after 
> driver load.
One thing I really hate about our driver is the total lack of 
information when something goes wrong during load. The driver wedges in 
total silence. There are many error paths that have no reporting at all. 
Which means you are left with a totally useless bug report.


>
>
>> Escalating to FLR may have more success, but that is not something 
>> that i915 currently does.
>
> Do we still need to do FLR if a subsequent engine reset failure ?
Assuming that we are talking about modern(ish) platforms, an engine 
reset failure would be hit by GuC rather than i915, but that would be 
escalated to an i915 based full GT reset. Generally speaking though, if 
the engine reset fails the GT reset isn't going to do much better. It 
would fix a dead GuC problem but it can't help with memory related 
issues. If the full GT reset fails then we are out of escalation routes 
as there is no FLR path at present (I think we have that at driver 
unload on MTL but not for general reset?). The FLR resets a lot more 
than just the GT, so it does have a chance to fix some issues that a GT 
reset can't. After driver-level FLR, there is PCI level FLR. Not sure if 
that involves a full power down and restart, but if not then that would 
be the last escalation possible. A power cycle really should fix any 
issues, if it doesn't then it's time to return the system as being 
totally dead!

My recollection is that the vast majority of engine reset failures I've 
looked at have been completely catastrophic and the system only 
recovered after a reboot. I.e. after the card was power cycled. Such 
issues were generally caused by bad memory. Once the path to memory has 
died, there really is not much of the GPU that can do anything at all 
and there isn't much that can be done to recover it.

John.


>
>
> Regards,
>
> Nirmoy
>
>>
>> John.
>>
>>
>>>
>>> v2: Improve commit message(Tvrtko)
>>>
>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>> Cc: John Harrison <John.C.Harrison@Intel.com>
>>> Cc: Andi Shyti <andi.shyti@linux.intel.com>
>>> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5591
>>> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
>>> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
>>> Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/gt/intel_reset.c | 8 ++++----
>>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
>>> b/drivers/gpu/drm/i915/gt/intel_reset.c
>>> index d5ed904f355d..e6fbc6202c80 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>>> @@ -593,10 +593,10 @@ static int gen8_engine_reset_prepare(struct 
>>> intel_engine_cs *engine)
>>>       ret = __intel_wait_for_register_fw(uncore, reg, mask, ack,
>>>                          700, 0, NULL);
>>>       if (ret)
>>> -        gt_err(engine->gt,
>>> -               "%s reset request timed out: {request: %08x, 
>>> RESET_CTL: %08x}\n",
>>> -               engine->name, request,
>>> -               intel_uncore_read_fw(uncore, reg));
>>> +        GT_TRACE(engine->gt,
>>> +             "%s reset request timed out: {request: %08x, 
>>> RESET_CTL: %08x}\n",
>>> +             engine->name, request,
>>> +             intel_uncore_read_fw(uncore, reg));
>>>         return ret;
>>>   }
>>


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Intel-gfx] [PATCH v2] drm/i915/gt: Convert reset prepare failure log to trace
  2023-12-05 19:50     ` John Harrison
@ 2023-12-06 10:35       ` Nirmoy Das
  0 siblings, 0 replies; 6+ messages in thread
From: Nirmoy Das @ 2023-12-06 10:35 UTC (permalink / raw)
  To: John Harrison, Nirmoy Das, intel-gfx
  Cc: dri-devel, Andrzej Hajda, Tvrtko Ursulin

Hi John,

On 12/5/2023 8:50 PM, John Harrison wrote:
> On 12/5/2023 02:39, Nirmoy Das wrote:
>> Hi John,
>>
>> On 12/5/2023 10:10 AM, John Harrison wrote:
>>> On 12/5/2023 00:52, Nirmoy Das wrote:
>>>> gen8_engine_reset_prepare() can fail when HW fails to set
>>>> RESET_CTL_READY_TO_RESET bit. In some cases this is not fatal
>>>> error as driver will retry.
>>>>
>>>> Convert the log to a trace log for debugging without triggering
>>>> unnecessary concerns in CI or for end-users during non-fatal 
>>>> scenarios.
>>> I strongly disagree with this change. The hardware spec for the 
>>> RESET_CTL and GDRST registers are that they will self clear within a 
>>> matter of microseconds. If something is so badly wrong with the 
>>> hardware that it can't even manage to reset
>>
>>
>> This message is for reset readiness  poll timeout not that the reset 
>> is failed which doesn't sound so serious if the subsequent attempt 
>> managed reset the engine.
> Not sure what the distinction is. The reset procedure is poke 
> RESET_CTL wait for it to clear, poke GDRST and wait for it to clear. 
> Just because step one is failing rather than step 2 does not mean that 
> the reset as a whole has not failed.
>
> Note that the purpose of RESET_CTL is to pause a bunch of stuff like 
> the command streamers to prevent them from issuing new memory requests 
> while the reset is in progress. If it fails, it likely means that a CS 
> is refusing to stop. Most probably because it can't reach a stopping 
> point because it is stuck waiting on a lost memory request or similar. 
> And the point of stopping further memory requests during reset is that 
> if the memory channel gets out of sync (because only the GT side is 
> reset during a GT reset) then that can result in total system failure. 
> As in potentially even the CPU can no longer get to memory if it is an 
> integrated platform. So yes, it can be quite a serious failure indeed.
>

Thanks bspec didn't explain those details. My intention was to 
acknowledge that engine reset is a complicated process which why the 
driver retries  and don't spook CI/user if subsequent reset works but I 
get your objection on this.

>>
>> I couldn't get enough details when this can happen that HW takes very 
>> long time to set the readiness bit.
> Is it simply 'taking a long time' or is never clearing at all? If it 
> is just that the timeout is too short then the proper fix would be to 
> increase the timeout. But if it is taking seconds or longer or just 
> never succeeding at all, then something is very bad.

I tried with 10x timeout without any help so I think the CS is stuck 
though re-try works. I will try to get more details from HW team on this 
issue.

>
>>
>>
>>> then that is something that very much warrants more than a 
>>> completely silent trace event. It most certainly should be flagged 
>>> as a failure in CI.
>>>
>>> Just because the driver will retry does not mean that this is not a 
>>> serious error. And if the first attempt failed, why would a 
>>> subsequent attempt succeed?
>>
>> The patch is not ignoring the failure. If the subsequent attempt 
>> fails then driver load will fail or it will be wedged if that happens 
>> after driver load.
> One thing I really hate about our driver is the total lack of 
> information when something goes wrong during load. The driver wedges 
> in total silence. There are many error paths that have no reporting at 
> all. Which means you are left with a totally useless bug report.
>
>
>>
>>
>>> Escalating to FLR may have more success, but that is not something 
>>> that i915 currently does.
>>
>> Do we still need to do FLR if a subsequent engine reset failure ?
> Assuming that we are talking about modern(ish) platforms, an engine 
> reset failure would be hit by GuC rather than i915, but that would be 
> escalated to an i915 based full GT reset. Generally speaking though, 
> if the engine reset fails the GT reset isn't going to do much better. 
> It would fix a dead GuC problem but it can't help with memory related 
> issues. If the full GT reset fails then we are out of escalation 
> routes as there is no FLR path at present (I think we have that at 
> driver unload on MTL but not for general reset?). The FLR resets a lot 
> more than just the GT, so it does have a chance to fix some issues 
> that a GT reset can't. After driver-level FLR, there is PCI level FLR. 
> Not sure if that involves a full power down and restart, but if not 
> then that would be the last escalation possible. A power cycle really 
> should fix any issues, if it doesn't then it's time to return the 
> system as being totally dead!
>
> My recollection is that the vast majority of engine reset failures 
> I've looked at have been completely catastrophic and the system only 
> recovered after a reboot. I.e. after the card was power cycled. Such 
> issues were generally caused by bad memory. Once the path to memory 
> has died, there really is not much of the GPU that can do anything at 
> all and there isn't much that can be done to recover it.


Thanks,

Nirmoy

>
> John.
>
>
>>
>>
>> Regards,
>>
>> Nirmoy
>>
>>>
>>> John.
>>>
>>>
>>>>
>>>> v2: Improve commit message(Tvrtko)
>>>>
>>>> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>>>> Cc: John Harrison <John.C.Harrison@Intel.com>
>>>> Cc: Andi Shyti <andi.shyti@linux.intel.com>
>>>> Cc: Andrzej Hajda <andrzej.hajda@intel.com>
>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5591
>>>> Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
>>>> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
>>>> Reviewed-by: Andrzej Hajda <andrzej.hajda@intel.com>
>>>> ---
>>>>   drivers/gpu/drm/i915/gt/intel_reset.c | 8 ++++----
>>>>   1 file changed, 4 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
>>>> b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>> index d5ed904f355d..e6fbc6202c80 100644
>>>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>>>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>> @@ -593,10 +593,10 @@ static int gen8_engine_reset_prepare(struct 
>>>> intel_engine_cs *engine)
>>>>       ret = __intel_wait_for_register_fw(uncore, reg, mask, ack,
>>>>                          700, 0, NULL);
>>>>       if (ret)
>>>> -        gt_err(engine->gt,
>>>> -               "%s reset request timed out: {request: %08x, 
>>>> RESET_CTL: %08x}\n",
>>>> -               engine->name, request,
>>>> -               intel_uncore_read_fw(uncore, reg));
>>>> +        GT_TRACE(engine->gt,
>>>> +             "%s reset request timed out: {request: %08x, 
>>>> RESET_CTL: %08x}\n",
>>>> +             engine->name, request,
>>>> +             intel_uncore_read_fw(uncore, reg));
>>>>         return ret;
>>>>   }
>>>
>

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-12-06 10:35 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-12-05  8:52 [Intel-gfx] [PATCH v2] drm/i915/gt: Convert reset prepare failure log to trace Nirmoy Das
2023-12-05  9:10 ` John Harrison
2023-12-05 10:39   ` Nirmoy Das
2023-12-05 19:50     ` John Harrison
2023-12-06 10:35       ` Nirmoy Das
2023-12-05 13:10 ` [Intel-gfx] ✗ Fi.CI.BAT: failure for " Patchwork

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).