* [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context
@ 2022-09-26 21:54 Andrzej Hajda
2022-09-26 22:44 ` Andi Shyti
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Andrzej Hajda @ 2022-09-26 21:54 UTC (permalink / raw)
To: intel-gfx; +Cc: chris, Andrzej Hajda, Matthew Auld
Capturing error state is time consuming (up to 350ms on DG2), so it should
be avoided if possible. Context reset triggered by context removal is a
good example.
With this patch, multiple IGT tests will no longer time out and should run faster.
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
---
drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 22ba66e48a9b01..cb58029208afe1 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4425,7 +4425,8 @@ static void guc_handle_context_reset(struct intel_guc *guc,
 	trace_intel_context_reset(ce);
 
 	if (likely(!intel_context_is_banned(ce))) {
-		capture_error_state(guc, ce);
+		if (!intel_context_is_exiting(ce))
+			capture_error_state(guc, ce);
 		guc_context_replay(ce);
 	} else {
 		drm_info(&guc_to_gt(guc)->i915->drm,
--
2.34.1
* Re: [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context
2022-09-26 21:54 [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context Andrzej Hajda
@ 2022-09-26 22:44 ` Andi Shyti
2022-09-26 23:34 ` Ceraolo Spurio, Daniele
2022-09-27 2:07 ` [Intel-gfx] ✓ Fi.CI.BAT: success for " Patchwork
2022-09-27 13:50 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
2 siblings, 1 reply; 18+ messages in thread
From: Andi Shyti @ 2022-09-26 22:44 UTC (permalink / raw)
To: Andrzej Hajda; +Cc: intel-gfx, chris, Matthew Auld
Hi Andrzej,
On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
> Capturing error state is time consuming (up to 350ms on DG2), so it should
> be avoided if possible. Context reset triggered by context removal is a
> good example.
> With this patch multiple igt tests will not timeout and should run faster.
>
> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
Fine by me:
Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
Just to be on the safe side, can we also have the ack from any of
the GuC folks? Daniele, John?
Andi
> ---
> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 22ba66e48a9b01..cb58029208afe1 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -4425,7 +4425,8 @@ static void guc_handle_context_reset(struct intel_guc *guc,
> trace_intel_context_reset(ce);
>
> if (likely(!intel_context_is_banned(ce))) {
> - capture_error_state(guc, ce);
> + if (!intel_context_is_exiting(ce))
> + capture_error_state(guc, ce);
> guc_context_replay(ce);
> } else {
> drm_info(&guc_to_gt(guc)->i915->drm,
> --
> 2.34.1
* Re: [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context
2022-09-26 22:44 ` Andi Shyti
@ 2022-09-26 23:34 ` Ceraolo Spurio, Daniele
2022-09-27 6:49 ` Andrzej Hajda
2022-09-27 10:14 ` Andrzej Hajda
0 siblings, 2 replies; 18+ messages in thread
From: Ceraolo Spurio, Daniele @ 2022-09-26 23:34 UTC (permalink / raw)
To: Andi Shyti, Andrzej Hajda, Tvrtko Ursulin; +Cc: intel-gfx, Matthew Auld, chris
On 9/26/2022 3:44 PM, Andi Shyti wrote:
> Hi Andrzej,
>
> On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
>> Capturing error state is time consuming (up to 350ms on DG2), so it should
>> be avoided if possible. Context reset triggered by context removal is a
>> good example.
>> With this patch multiple igt tests will not timeout and should run faster.
>>
>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
> fine for me:
>
> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
>
> Just to be on the safe side, can we also have the ack from any of
> the GuC folks? Daniele, John?
>
> Andi
>
>
>> ---
>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> index 22ba66e48a9b01..cb58029208afe1 100644
>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>> @@ -4425,7 +4425,8 @@ static void guc_handle_context_reset(struct intel_guc *guc,
>> trace_intel_context_reset(ce);
>>
>> if (likely(!intel_context_is_banned(ce))) {
>> - capture_error_state(guc, ce);
>> + if (!intel_context_is_exiting(ce))
>> + capture_error_state(guc, ce);
>> guc_context_replay(ce);
You definitely don't want to replay requests of a context that is going
away.
This seems at least in part due to
https://patchwork.freedesktop.org/patch/487531/, where we replaced the
"context_ban" with "context_exiting". There are several places where we
skipped operations if the context was banned (here included) which are
now not covered anymore for exiting contexts. Maybe we need a new
checker function to check both flags in places where we don't care why
the context is being removed (ban vs exiting), just that it is?
Daniele
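The checker suggested above could look roughly like the following user-space sketch. It is not the i915 code: `struct model_context`, the `MODEL_CTX_*` bits and `model_context_is_going_away()` are hypothetical names chosen for illustration, and the real driver keeps these flags in `ce->flags` behind its own accessors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; the real driver tracks these in ce->flags. */
enum {
	MODEL_CTX_BANNED  = 1 << 0,	/* context was banned after misbehaving */
	MODEL_CTX_EXITING = 1 << 1,	/* context is being closed by its owner */
};

/* Hypothetical stand-in for struct intel_context. */
struct model_context {
	unsigned int flags;
};

/*
 * True when the context is being removed, whatever the reason.
 * Callers that only need "is it going away?" test one predicate
 * instead of checking the ban and exit flags separately.
 */
static inline bool model_context_is_going_away(const struct model_context *ce)
{
	assert(ce != NULL);
	return (ce->flags & (MODEL_CTX_BANNED | MODEL_CTX_EXITING)) != 0;
}
```

A call site that used to test only the ban flag would then test this single predicate, so exiting contexts are covered as well.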
>> } else {
>> drm_info(&guc_to_gt(guc)->i915->drm,
>> --
>> 2.34.1
* [Intel-gfx] ✓ Fi.CI.BAT: success for drm/i915/guc: do not capture error state on exiting context
2022-09-26 21:54 [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context Andrzej Hajda
2022-09-26 22:44 ` Andi Shyti
@ 2022-09-27 2:07 ` Patchwork
2022-09-27 13:50 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
2 siblings, 0 replies; 18+ messages in thread
From: Patchwork @ 2022-09-27 2:07 UTC (permalink / raw)
To: Andrzej Hajda; +Cc: intel-gfx
== Series Details ==
Series: drm/i915/guc: do not capture error state on exiting context
URL : https://patchwork.freedesktop.org/series/109087/
State : success
== Summary ==
CI Bug Log - changes from CI_DRM_12185 -> Patchwork_109087v1
====================================================
Summary
-------
**SUCCESS**
No regressions found.
External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/index.html
Participating hosts (46 -> 43)
------------------------------
Additional (1): fi-rkl-11600
Missing (4): fi-icl-u2 fi-tgl-mst fi-bdw-samus fi-pnv-d510
Known issues
------------
Here are the changes found in Patchwork_109087v1 that come from known issues:
### IGT changes ###
#### Issues hit ####
* igt@gem_huc_copy@huc-copy:
- fi-rkl-11600: NOTRUN -> [SKIP][1] ([i915#2190])
[1]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/fi-rkl-11600/igt@gem_huc_copy@huc-copy.html
* igt@gem_lmem_swapping@basic:
- fi-rkl-11600: NOTRUN -> [SKIP][2] ([i915#4613]) +3 similar issues
[2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/fi-rkl-11600/igt@gem_lmem_swapping@basic.html
* igt@gem_tiled_pread_basic:
- fi-rkl-11600: NOTRUN -> [SKIP][3] ([i915#3282])
[3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/fi-rkl-11600/igt@gem_tiled_pread_basic.html
* igt@i915_pm_backlight@basic-brightness:
- fi-rkl-11600: NOTRUN -> [SKIP][4] ([i915#3012])
[4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/fi-rkl-11600/igt@i915_pm_backlight@basic-brightness.html
* igt@i915_selftest@live@gt_heartbeat:
- fi-glk-j4005: [PASS][5] -> [DMESG-FAIL][6] ([i915#5334])
[5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/fi-glk-j4005/igt@i915_selftest@live@gt_heartbeat.html
[6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/fi-glk-j4005/igt@i915_selftest@live@gt_heartbeat.html
* igt@i915_selftest@live@hangcheck:
- fi-snb-2600: [PASS][7] -> [INCOMPLETE][8] ([i915#3921])
[7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/fi-snb-2600/igt@i915_selftest@live@hangcheck.html
[8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/fi-snb-2600/igt@i915_selftest@live@hangcheck.html
* igt@i915_suspend@basic-s3-without-i915:
- fi-rkl-11600: NOTRUN -> [INCOMPLETE][9] ([i915#5982])
[9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/fi-rkl-11600/igt@i915_suspend@basic-s3-without-i915.html
* igt@kms_chamelium@hdmi-edid-read:
- fi-rkl-11600: NOTRUN -> [SKIP][10] ([fdo#111827]) +7 similar issues
[10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/fi-rkl-11600/igt@kms_chamelium@hdmi-edid-read.html
* igt@kms_cursor_legacy@basic-busy-flip-before-cursor:
- fi-rkl-11600: NOTRUN -> [SKIP][11] ([i915#4103])
[11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/fi-rkl-11600/igt@kms_cursor_legacy@basic-busy-flip-before-cursor.html
* igt@kms_force_connector_basic@force-load-detect:
- fi-rkl-11600: NOTRUN -> [SKIP][12] ([fdo#109285] / [i915#4098])
[12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/fi-rkl-11600/igt@kms_force_connector_basic@force-load-detect.html
* igt@kms_psr@primary_page_flip:
- fi-rkl-11600: NOTRUN -> [SKIP][13] ([i915#1072]) +3 similar issues
[13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/fi-rkl-11600/igt@kms_psr@primary_page_flip.html
* igt@kms_setmode@basic-clone-single-crtc:
- fi-rkl-11600: NOTRUN -> [SKIP][14] ([i915#3555] / [i915#4098])
[14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/fi-rkl-11600/igt@kms_setmode@basic-clone-single-crtc.html
* igt@prime_vgem@basic-read:
- fi-rkl-11600: NOTRUN -> [SKIP][15] ([fdo#109295] / [i915#3291] / [i915#3708]) +2 similar issues
[15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/fi-rkl-11600/igt@prime_vgem@basic-read.html
* igt@prime_vgem@basic-userptr:
- fi-rkl-11600: NOTRUN -> [SKIP][16] ([fdo#109295] / [i915#3301] / [i915#3708])
[16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/fi-rkl-11600/igt@prime_vgem@basic-userptr.html
#### Possible fixes ####
* igt@gem_ringfill@basic-all:
- {bat-dg2-9}: [FAIL][17] ([i915#5886]) -> [PASS][18]
[17]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/bat-dg2-9/igt@gem_ringfill@basic-all.html
[18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/bat-dg2-9/igt@gem_ringfill@basic-all.html
* igt@i915_selftest@live@late_gt_pm:
- fi-cfl-8109u: [DMESG-WARN][19] ([i915#5904]) -> [PASS][20] +30 similar issues
[19]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/fi-cfl-8109u/igt@i915_selftest@live@late_gt_pm.html
[20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/fi-cfl-8109u/igt@i915_selftest@live@late_gt_pm.html
* igt@i915_selftest@live@requests:
- {bat-rpls-1}: [INCOMPLETE][21] ([i915#4983] / [i915#6257] / [i915#6380]) -> [PASS][22]
[21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/bat-rpls-1/igt@i915_selftest@live@requests.html
[22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/bat-rpls-1/igt@i915_selftest@live@requests.html
* igt@i915_selftest@live@slpc:
- {bat-rplp-1}: [DMESG-FAIL][23] -> [PASS][24]
[23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/bat-rplp-1/igt@i915_selftest@live@slpc.html
[24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/bat-rplp-1/igt@i915_selftest@live@slpc.html
* igt@i915_suspend@basic-s2idle-without-i915:
- fi-cfl-8109u: [DMESG-WARN][25] ([i915#5904] / [i915#62]) -> [PASS][26]
[25]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/fi-cfl-8109u/igt@i915_suspend@basic-s2idle-without-i915.html
[26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/fi-cfl-8109u/igt@i915_suspend@basic-s2idle-without-i915.html
* igt@kms_frontbuffer_tracking@basic:
- fi-cfl-8109u: [DMESG-FAIL][27] ([i915#62]) -> [PASS][28] +1 similar issue
[27]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/fi-cfl-8109u/igt@kms_frontbuffer_tracking@basic.html
[28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/fi-cfl-8109u/igt@kms_frontbuffer_tracking@basic.html
* igt@kms_pipe_crc_basic@nonblocking-crc-frame-sequence@pipe-c-dp-1:
- fi-cfl-8109u: [DMESG-WARN][29] ([i915#62]) -> [PASS][30] +10 similar issues
[29]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/fi-cfl-8109u/igt@kms_pipe_crc_basic@nonblocking-crc-frame-sequence@pipe-c-dp-1.html
[30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/fi-cfl-8109u/igt@kms_pipe_crc_basic@nonblocking-crc-frame-sequence@pipe-c-dp-1.html
* igt@kms_pipe_crc_basic@suspend-read-crc@pipe-d-dp-2:
- {bat-dg2-11}: [FAIL][31] ([i915#6818]) -> [PASS][32]
[31]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/bat-dg2-11/igt@kms_pipe_crc_basic@suspend-read-crc@pipe-d-dp-2.html
[32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/bat-dg2-11/igt@kms_pipe_crc_basic@suspend-read-crc@pipe-d-dp-2.html
{name}: This element is suppressed. This means it is ignored when computing
the status of the difference (SUCCESS, WARNING, or FAILURE).
[fdo#109285]: https://bugs.freedesktop.org/show_bug.cgi?id=109285
[fdo#109295]: https://bugs.freedesktop.org/show_bug.cgi?id=109295
[fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
[i915#1072]: https://gitlab.freedesktop.org/drm/intel/issues/1072
[i915#2190]: https://gitlab.freedesktop.org/drm/intel/issues/2190
[i915#2867]: https://gitlab.freedesktop.org/drm/intel/issues/2867
[i915#3012]: https://gitlab.freedesktop.org/drm/intel/issues/3012
[i915#3282]: https://gitlab.freedesktop.org/drm/intel/issues/3282
[i915#3291]: https://gitlab.freedesktop.org/drm/intel/issues/3291
[i915#3301]: https://gitlab.freedesktop.org/drm/intel/issues/3301
[i915#3555]: https://gitlab.freedesktop.org/drm/intel/issues/3555
[i915#3708]: https://gitlab.freedesktop.org/drm/intel/issues/3708
[i915#3921]: https://gitlab.freedesktop.org/drm/intel/issues/3921
[i915#4098]: https://gitlab.freedesktop.org/drm/intel/issues/4098
[i915#4103]: https://gitlab.freedesktop.org/drm/intel/issues/4103
[i915#4613]: https://gitlab.freedesktop.org/drm/intel/issues/4613
[i915#4983]: https://gitlab.freedesktop.org/drm/intel/issues/4983
[i915#5334]: https://gitlab.freedesktop.org/drm/intel/issues/5334
[i915#5886]: https://gitlab.freedesktop.org/drm/intel/issues/5886
[i915#5904]: https://gitlab.freedesktop.org/drm/intel/issues/5904
[i915#5982]: https://gitlab.freedesktop.org/drm/intel/issues/5982
[i915#62]: https://gitlab.freedesktop.org/drm/intel/issues/62
[i915#6257]: https://gitlab.freedesktop.org/drm/intel/issues/6257
[i915#6367]: https://gitlab.freedesktop.org/drm/intel/issues/6367
[i915#6380]: https://gitlab.freedesktop.org/drm/intel/issues/6380
[i915#6816]: https://gitlab.freedesktop.org/drm/intel/issues/6816
[i915#6818]: https://gitlab.freedesktop.org/drm/intel/issues/6818
Build changes
-------------
* Linux: CI_DRM_12185 -> Patchwork_109087v1
CI-20190529: 20190529
CI_DRM_12185: ae6a4bb62f9524823ef5b00552e27231f7936da3 @ git://anongit.freedesktop.org/gfx-ci/linux
IGT_6663: 5e232c77cd762147e0882c337a984121fabb1c75 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
Patchwork_109087v1: ae6a4bb62f9524823ef5b00552e27231f7936da3 @ git://anongit.freedesktop.org/gfx-ci/linux
### Linux commits
29d9905bbe15 drm/i915/guc: do not capture error state on exiting context
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/index.html
* Re: [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context
2022-09-26 23:34 ` Ceraolo Spurio, Daniele
@ 2022-09-27 6:49 ` Andrzej Hajda
2022-09-27 7:45 ` Tvrtko Ursulin
2022-09-27 10:14 ` Andrzej Hajda
1 sibling, 1 reply; 18+ messages in thread
From: Andrzej Hajda @ 2022-09-27 6:49 UTC (permalink / raw)
To: Ceraolo Spurio, Daniele, Andi Shyti, Tvrtko Ursulin
Cc: intel-gfx, Matthew Auld, chris
On 27.09.2022 01:34, Ceraolo Spurio, Daniele wrote:
>
>
> On 9/26/2022 3:44 PM, Andi Shyti wrote:
>> Hi Andrzej,
>>
>> On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
>>> Capturing error state is time consuming (up to 350ms on DG2), so it
>>> should
>>> be avoided if possible. Context reset triggered by context removal is a
>>> good example.
>>> With this patch multiple igt tests will not timeout and should run
>>> faster.
>>>
>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
>>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
>> fine for me:
>>
>> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
>>
>> Just to be on the safe side, can we also have the ack from any of
>> the GuC folks? Daniele, John?
>>
>> Andi
>>
>>
>>> ---
>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index 22ba66e48a9b01..cb58029208afe1 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -4425,7 +4425,8 @@ static void guc_handle_context_reset(struct
>>> intel_guc *guc,
>>> trace_intel_context_reset(ce);
>>> if (likely(!intel_context_is_banned(ce))) {
>>> - capture_error_state(guc, ce);
>>> + if (!intel_context_is_exiting(ce))
>>> + capture_error_state(guc, ce);
>>> guc_context_replay(ce);
>
> You definitely don't want to replay requests of a context that is
> going away.
My intention was just to avoid the error capture, but that's even better;
only the condition needs to change:
- if (likely(!intel_context_is_banned(ce))) {
+ if (likely(intel_context_is_schedulable(ce))) {
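The effect of that condition change can be modelled in user space as below. This is only a sketch: `struct model_context`, its fields and the `model_*` functions are hypothetical stand-ins, and the counters merely record where the driver would capture error state, replay requests, or log the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the context state the handler looks at. */
struct model_context {
	bool banned;
	bool exiting;
	int captures;	/* times error state would have been captured */
	int replays;	/* times requests would have been replayed */
	int notices;	/* times the "skipped" message would have been logged */
};

/* Schedulable means neither banned nor exiting. */
static bool model_is_schedulable(const struct model_context *ce)
{
	return !ce->banned && !ce->exiting;
}

/* Model of the reset handler with the proposed single condition. */
static void model_handle_context_reset(struct model_context *ce)
{
	assert(ce != NULL);

	if (model_is_schedulable(ce)) {
		ce->captures++;	/* stands in for capture_error_state() */
		ce->replays++;	/* stands in for guc_context_replay() */
	} else {
		ce->notices++;	/* stands in for the drm_info() message */
	}
}
```

With this condition an exiting context takes the same branch as a banned one, so it is neither captured nor replayed.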
>
> This seems at least in part due to
> https://patchwork.freedesktop.org/patch/487531/, where we replaced the
> "context_ban" with "context_exiting". There are several places where
> we skipped operations if the context was banned (here included) which
> are now not covered anymore for exiting contexts. Maybe we need a new
> checker function to check both flags in places where we don't care why
> the context is being removed (ban vs exiting), just that it is?
>
> Daniele
>
>>> } else {
>>> drm_info(&guc_to_gt(guc)->i915->drm,
And maybe downgrade the above to drm_dbg, to avoid spamming dmesg?
Regards
Andrzej
>>> --
>>> 2.34.1
>
* Re: [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context
2022-09-27 6:49 ` Andrzej Hajda
@ 2022-09-27 7:45 ` Tvrtko Ursulin
2022-09-27 8:16 ` Andrzej Hajda
2022-09-27 21:36 ` Ceraolo Spurio, Daniele
0 siblings, 2 replies; 18+ messages in thread
From: Tvrtko Ursulin @ 2022-09-27 7:45 UTC (permalink / raw)
To: Andrzej Hajda, Ceraolo Spurio, Daniele, Andi Shyti
Cc: intel-gfx, Matthew Auld, chris
On 27/09/2022 07:49, Andrzej Hajda wrote:
>
>
> On 27.09.2022 01:34, Ceraolo Spurio, Daniele wrote:
>>
>>
>> On 9/26/2022 3:44 PM, Andi Shyti wrote:
>>> Hi Andrzej,
>>>
>>> On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
>>>> Capturing error state is time consuming (up to 350ms on DG2), so it
>>>> should
>>>> be avoided if possible. Context reset triggered by context removal is a
>>>> good example.
>>>> With this patch multiple igt tests will not timeout and should run
>>>> faster.
>>>>
>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
>>>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
>>> fine for me:
>>>
>>> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
>>>
>>> Just to be on the safe side, can we also have the ack from any of
>>> the GuC folks? Daniele, John?
>>>
>>> Andi
>>>
>>>
>>>> ---
>>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> index 22ba66e48a9b01..cb58029208afe1 100644
>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> @@ -4425,7 +4425,8 @@ static void guc_handle_context_reset(struct
>>>> intel_guc *guc,
>>>> trace_intel_context_reset(ce);
>>>> if (likely(!intel_context_is_banned(ce))) {
>>>> - capture_error_state(guc, ce);
>>>> + if (!intel_context_is_exiting(ce))
>>>> + capture_error_state(guc, ce);
I am not sure here - if we have a persistent context which caused a GPU
hang I'd expect we'd still want error capture.
What causes the reset in the affected IGTs? Always preemption timeout?
>>>> guc_context_replay(ce);
>>
>> You definitely don't want to replay requests of a context that is
>> going away.
>
> My intention was to just avoid error capture, but that's even better,
> only condition change:
> - if (likely(!intel_context_is_banned(ce))) {
> + if (likely(intel_context_is_schedulable(ce))) {
Yes, that helper was intended to be used for contexts which should not be
scheduled post exit or ban.
Daniele - you say there are some misses in the GuC backend. Should most,
or even all, of the checks in intel_guc_submission.c be converted to use
intel_context_is_schedulable? My idea was indeed that "ban" should be a
level up from the backends. A backend should only distinguish between
"should I run this or not", and not the reason.
Regards,
Tvrtko
>
>>
>> This seems at least in part due to
>> https://patchwork.freedesktop.org/patch/487531/, where we replaced the
>> "context_ban" with "context_exiting". There are several places where
>> we skipped operations if the context was banned (here included) which
>> are now not covered anymore for exiting contexts. Maybe we need a new
>> checker function to check both flags in places where we don't care why
>> the context is being removed (ban vs exiting), just that it is?
>>
>> Daniele
>>
>>>> } else {
>>>> drm_info(&guc_to_gt(guc)->i915->drm,
>
> And maybe degrade above to drm_dbg, to avoid spamming dmesg?
>
> Regards
> Andrzej
>
>
>>>> --
>>>> 2.34.1
>>
>
* Re: [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context
2022-09-27 7:45 ` Tvrtko Ursulin
@ 2022-09-27 8:16 ` Andrzej Hajda
2022-09-27 21:36 ` Ceraolo Spurio, Daniele
1 sibling, 0 replies; 18+ messages in thread
From: Andrzej Hajda @ 2022-09-27 8:16 UTC (permalink / raw)
To: Tvrtko Ursulin, Ceraolo Spurio, Daniele, Andi Shyti
Cc: intel-gfx, Matthew Auld, chris
On 27.09.2022 09:45, Tvrtko Ursulin wrote:
>
> On 27/09/2022 07:49, Andrzej Hajda wrote:
>>
>>
>> On 27.09.2022 01:34, Ceraolo Spurio, Daniele wrote:
>>>
>>>
>>> On 9/26/2022 3:44 PM, Andi Shyti wrote:
>>>> Hi Andrzej,
>>>>
>>>> On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
>>>>> Capturing error state is time consuming (up to 350ms on DG2), so
>>>>> it should
>>>>> be avoided if possible. Context reset triggered by context removal
>>>>> is a
>>>>> good example.
>>>>> With this patch multiple igt tests will not timeout and should run
>>>>> faster.
>>>>>
>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
>>>>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
>>>> fine for me:
>>>>
>>>> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
>>>>
>>>> Just to be on the safe side, can we also have the ack from any of
>>>> the GuC folks? Daniele, John?
>>>>
>>>> Andi
>>>>
>>>>
>>>>> ---
>>>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
>>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> index 22ba66e48a9b01..cb58029208afe1 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> @@ -4425,7 +4425,8 @@ static void guc_handle_context_reset(struct
>>>>> intel_guc *guc,
>>>>> trace_intel_context_reset(ce);
>>>>> if (likely(!intel_context_is_banned(ce))) {
>>>>> - capture_error_state(guc, ce);
>>>>> + if (!intel_context_is_exiting(ce))
>>>>> + capture_error_state(guc, ce);
>
> I am not sure here - if we have a persistent context which caused a
> GPU hang I'd expect we'd still want error capture.
>
> What causes the reset in the affected IGTs? Always preemption timeout?
The affected tests always perform a context destroy with a batch buffer
that has IGT_SPIN_NO_PREEMPTION set, and with "preempt_timeout_ms" set to 50.
So I guess yes.
Regards
Andrzej
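A rough back-of-the-envelope model of why capture dominates in such tests, assuming the costs simply add up per reset. The 50 ms preemption timeout and the 350 ms worst-case capture time are the figures quoted in this thread; the function name and the linear cost model are assumptions made for illustration, not measurements.

```c
#include <assert.h>

/* Figures quoted in this thread. */
#define PREEMPT_TIMEOUT_MS	50	/* preempt_timeout_ms used by the tests */
#define CAPTURE_WORST_CASE_MS	350	/* worst-case error capture on DG2 */

/*
 * Estimated time spent on resets for n_destroys context destroys,
 * assuming each destroy hits the preemption timeout once and the
 * per-reset costs add linearly (an assumption, not a measurement).
 */
static long model_reset_cost_ms(long n_destroys, int with_capture)
{
	long per_reset = PREEMPT_TIMEOUT_MS;

	assert(n_destroys >= 0);
	if (with_capture)
		per_reset += CAPTURE_WORST_CASE_MS;

	return n_destroys * per_reset;
}
```

For example, a hypothetical run of 20 destroys comes to 8000 ms with capture and 1000 ms without.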
>
>>>>> guc_context_replay(ce);
>>>
>>> You definitely don't want to replay requests of a context that is
>>> going away.
>>
>> My intention was to just avoid error capture, but that's even better,
>> only condition change:
>> - if (likely(!intel_context_is_banned(ce))) {
>> + if (likely(intel_context_is_schedulable(ce))) {
>
> Yes that helper was intended to be used for contexts which should not
> be scheduled post exit or ban.
>
> Daniele - you say there are some misses in the GuC backend. Should
> most, or even all in intel_guc_submission.c be converted to use
> intel_context_is_schedulable? My idea indeed was that "ban" should be
> a level up from the backends. Backend should only distinguish between
> "should I run this or not", and not the reason.
>
> Regards,
>
> Tvrtko
>
>>
>>>
>>> This seems at least in part due to
>>> https://patchwork.freedesktop.org/patch/487531/, where we replaced
>>> the "context_ban" with "context_exiting". There are several places
>>> where we skipped operations if the context was banned (here
>>> included) which are now not covered anymore for exiting contexts.
>>> Maybe we need a new checker function to check both flags in places
>>> where we don't care why the context is being removed (ban vs
>>> exiting), just that it is?
>>>
>>> Daniele
>>>
>>>>> } else {
>>>>> drm_info(&guc_to_gt(guc)->i915->drm,
>>
>> And maybe degrade above to drm_dbg, to avoid spamming dmesg?
>>
>> Regards
>> Andrzej
>>
>>
>>>>> --
>>>>> 2.34.1
>>>
>>
* Re: [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context
2022-09-26 23:34 ` Ceraolo Spurio, Daniele
2022-09-27 6:49 ` Andrzej Hajda
@ 2022-09-27 10:14 ` Andrzej Hajda
2022-09-27 21:33 ` Ceraolo Spurio, Daniele
1 sibling, 1 reply; 18+ messages in thread
From: Andrzej Hajda @ 2022-09-27 10:14 UTC (permalink / raw)
To: Ceraolo Spurio, Daniele, Andi Shyti, Tvrtko Ursulin
Cc: intel-gfx, Matthew Auld, chris
On 27.09.2022 01:34, Ceraolo Spurio, Daniele wrote:
>
>
> On 9/26/2022 3:44 PM, Andi Shyti wrote:
>> Hi Andrzej,
>>
>> On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
>>> Capturing error state is time consuming (up to 350ms on DG2), so it
>>> should
>>> be avoided if possible. Context reset triggered by context removal is a
>>> good example.
>>> With this patch multiple igt tests will not timeout and should run
>>> faster.
>>>
>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
>>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
>> fine for me:
>>
>> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
>>
>> Just to be on the safe side, can we also have the ack from any of
>> the GuC folks? Daniele, John?
>>
>> Andi
>>
>>
>>> ---
>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index 22ba66e48a9b01..cb58029208afe1 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -4425,7 +4425,8 @@ static void guc_handle_context_reset(struct
>>> intel_guc *guc,
>>> trace_intel_context_reset(ce);
>>> if (likely(!intel_context_is_banned(ce))) {
>>> - capture_error_state(guc, ce);
>>> + if (!intel_context_is_exiting(ce))
>>> + capture_error_state(guc, ce);
>>> guc_context_replay(ce);
>
> You definitely don't want to replay requests of a context that is going
> away.
Without guc_context_replay I see timeouts, probably because
guc_context_replay calls __guc_reset_context. I am not sure whether there
is a need to dig deeper, stay with my initial proposal, or do something like:
	if (likely(!intel_context_is_banned(ce))) {
		if (!intel_context_is_exiting(ce)) {
			capture_error_state(guc, ce);
			guc_context_replay(ce);
		} else {
			__guc_reset_context(ce, ce->engine->mask);
		}
	} else {
The latter also works.
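The three outcomes of that variant can be modelled in user space as follows. This is a sketch under assumptions: the `model_*` names are hypothetical, and each enumerator stands in for the driver calls named in its comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the handler decides to do for a context that was reset. */
enum model_reset_action {
	MODEL_CAPTURE_AND_REPLAY,	/* capture_error_state() + guc_context_replay() */
	MODEL_RESET_ONLY,		/* __guc_reset_context() without capture */
	MODEL_LOG_ONLY,			/* banned: only the drm_info() message */
};

/* Hypothetical stand-in for the two context flags involved. */
struct model_context {
	bool banned;
	bool exiting;
};

/* Mirrors the nesting of the snippet: the ban check comes first. */
static enum model_reset_action
model_pick_reset_action(const struct model_context *ce)
{
	assert(ce != NULL);

	if (ce->banned)
		return MODEL_LOG_ONLY;
	if (ce->exiting)
		return MODEL_RESET_ONLY;
	return MODEL_CAPTURE_AND_REPLAY;
}
```

Because the ban check is outermost, a context that is both banned and exiting still takes the log-only branch.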
Regards
Andrzej
>
> This seems at least in part due to
> https://patchwork.freedesktop.org/patch/487531/, where we replaced the
> "context_ban" with "context_exiting". There are several places where we
> skipped operations if the context was banned (here included) which are
> now not covered anymore for exiting contexts. Maybe we need a new
> checker function to check both flags in places where we don't care why
> the context is being removed (ban vs exiting), just that it is?
>
> Daniele
>
>>> } else {
>>> drm_info(&guc_to_gt(guc)->i915->drm,
>>> --
>>> 2.34.1
>
* [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/guc: do not capture error state on exiting context
2022-09-26 21:54 [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context Andrzej Hajda
2022-09-26 22:44 ` Andi Shyti
2022-09-27 2:07 ` [Intel-gfx] ✓ Fi.CI.BAT: success for " Patchwork
@ 2022-09-27 13:50 ` Patchwork
2 siblings, 0 replies; 18+ messages in thread
From: Patchwork @ 2022-09-27 13:50 UTC (permalink / raw)
To: Andrzej Hajda; +Cc: intel-gfx
== Series Details ==
Series: drm/i915/guc: do not capture error state on exiting context
URL : https://patchwork.freedesktop.org/series/109087/
State : failure
== Summary ==
CI Bug Log - changes from CI_DRM_12185_full -> Patchwork_109087v1_full
====================================================
Summary
-------
**FAILURE**
Serious unknown changes coming with Patchwork_109087v1_full absolutely need to be
verified manually.
If you think the reported changes have nothing to do with the changes
introduced in Patchwork_109087v1_full, please notify your bug team to allow them
to document this new failure mode, which will reduce false positives in CI.
Participating hosts (12 -> 11)
------------------------------
Missing (1): shard-dg1
Possible new issues
-------------------
Here are the unknown changes that may have been introduced in Patchwork_109087v1_full:
### IGT changes ###
#### Possible regressions ####
* igt@kms_plane_scaling@plane-scaler-with-clipping-clamping-modifiers@pipe-d-edp-1:
- shard-tglb: [PASS][1] -> [INCOMPLETE][2]
[1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-tglb3/igt@kms_plane_scaling@plane-scaler-with-clipping-clamping-modifiers@pipe-d-edp-1.html
[2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-tglb3/igt@kms_plane_scaling@plane-scaler-with-clipping-clamping-modifiers@pipe-d-edp-1.html
Known issues
------------
Here are the changes found in Patchwork_109087v1_full that come from known issues:
### IGT changes ###
#### Issues hit ####
* igt@feature_discovery@display-3x:
- shard-iclb: NOTRUN -> [SKIP][3] ([i915#1839])
[3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@feature_discovery@display-3x.html
* igt@gem_eio@in-flight-contexts-10ms:
- shard-snb: [PASS][4] -> [FAIL][5] ([i915#4409])
[4]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-snb2/igt@gem_eio@in-flight-contexts-10ms.html
[5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-snb7/igt@gem_eio@in-flight-contexts-10ms.html
* igt@gem_exec_balancer@parallel-bb-first:
- shard-iclb: [PASS][6] -> [SKIP][7] ([i915#4525]) +1 similar issue
[6]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-iclb1/igt@gem_exec_balancer@parallel-bb-first.html
[7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb3/igt@gem_exec_balancer@parallel-bb-first.html
* igt@gem_exec_capture@capture-recoverable:
- shard-iclb: NOTRUN -> [SKIP][8] ([i915#6344])
[8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@gem_exec_capture@capture-recoverable.html
* igt@gem_exec_fair@basic-none@vcs1:
- shard-iclb: NOTRUN -> [FAIL][9] ([i915#2842])
[9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb1/igt@gem_exec_fair@basic-none@vcs1.html
* igt@gem_exec_whisper@basic-fds-priority-all:
- shard-glk: [PASS][10] -> [DMESG-WARN][11] ([i915#118])
[10]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-glk3/igt@gem_exec_whisper@basic-fds-priority-all.html
[11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-glk2/igt@gem_exec_whisper@basic-fds-priority-all.html
* igt@gem_huc_copy@huc-copy:
- shard-iclb: NOTRUN -> [SKIP][12] ([i915#2190])
[12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@gem_huc_copy@huc-copy.html
* igt@gem_lmem_swapping@verify-ccs:
- shard-apl: NOTRUN -> [SKIP][13] ([fdo#109271] / [i915#4613])
[13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-apl7/igt@gem_lmem_swapping@verify-ccs.html
* igt@gem_pxp@create-regular-context-2:
- shard-apl: NOTRUN -> [SKIP][14] ([fdo#109271]) +11 similar issues
[14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-apl7/igt@gem_pxp@create-regular-context-2.html
* igt@gem_pxp@reject-modify-context-protection-on:
- shard-iclb: NOTRUN -> [SKIP][15] ([i915#4270])
[15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@gem_pxp@reject-modify-context-protection-on.html
* igt@gem_render_copy@yf-tiled-to-vebox-linear:
- shard-iclb: NOTRUN -> [SKIP][16] ([i915#768])
[16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@gem_render_copy@yf-tiled-to-vebox-linear.html
* igt@gen3_render_tiledy_blits:
- shard-iclb: NOTRUN -> [SKIP][17] ([fdo#109289])
[17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@gen3_render_tiledy_blits.html
* igt@gen9_exec_parse@bb-start-out:
- shard-iclb: NOTRUN -> [SKIP][18] ([i915#2856])
[18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@gen9_exec_parse@bb-start-out.html
* igt@i915_pm_dc@dc3co-vpb-simulation:
- shard-apl: NOTRUN -> [SKIP][19] ([fdo#109271] / [i915#658])
[19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-apl7/igt@i915_pm_dc@dc3co-vpb-simulation.html
* igt@i915_pm_rps@engine-order:
- shard-apl: [PASS][20] -> [FAIL][21] ([i915#6537])
[20]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-apl6/igt@i915_pm_rps@engine-order.html
[21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-apl7/igt@i915_pm_rps@engine-order.html
* igt@kms_big_fb@4-tiled-max-hw-stride-32bpp-rotate-0-async-flip:
- shard-iclb: NOTRUN -> [SKIP][22] ([i915#5286])
[22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@kms_big_fb@4-tiled-max-hw-stride-32bpp-rotate-0-async-flip.html
* igt@kms_big_fb@yf-tiled-max-hw-stride-64bpp-rotate-0:
- shard-iclb: NOTRUN -> [SKIP][23] ([fdo#110723])
[23]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@kms_big_fb@yf-tiled-max-hw-stride-64bpp-rotate-0.html
* igt@kms_ccs@pipe-b-crc-primary-basic-y_tiled_gen12_mc_ccs:
- shard-iclb: NOTRUN -> [SKIP][24] ([fdo#109278] / [i915#3886]) +4 similar issues
[24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@kms_ccs@pipe-b-crc-primary-basic-y_tiled_gen12_mc_ccs.html
* igt@kms_ccs@pipe-d-crc-primary-basic-4_tiled_dg2_rc_ccs:
- shard-iclb: NOTRUN -> [SKIP][25] ([fdo#109278]) +5 similar issues
[25]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@kms_ccs@pipe-d-crc-primary-basic-4_tiled_dg2_rc_ccs.html
* igt@kms_color_chamelium@ctm-negative:
- shard-iclb: NOTRUN -> [SKIP][26] ([fdo#109284] / [fdo#111827]) +2 similar issues
[26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@kms_color_chamelium@ctm-negative.html
* igt@kms_cursor_crc@cursor-offscreen-512x170:
- shard-iclb: NOTRUN -> [SKIP][27] ([fdo#109279] / [i915#3359])
[27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@kms_cursor_crc@cursor-offscreen-512x170.html
* igt@kms_cursor_crc@cursor-sliding-512x170:
- shard-iclb: NOTRUN -> [SKIP][28] ([i915#3359])
[28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@kms_cursor_crc@cursor-sliding-512x170.html
* igt@kms_cursor_crc@cursor-suspend@pipe-c-dp-1:
- shard-apl: [PASS][29] -> [DMESG-WARN][30] ([i915#180]) +4 similar issues
[29]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-apl2/igt@kms_cursor_crc@cursor-suspend@pipe-c-dp-1.html
[30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-apl3/igt@kms_cursor_crc@cursor-suspend@pipe-c-dp-1.html
* igt@kms_cursor_legacy@2x-long-flip-vs-cursor-legacy:
- shard-glk: [PASS][31] -> [FAIL][32] ([i915#72])
[31]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-glk1/igt@kms_cursor_legacy@2x-long-flip-vs-cursor-legacy.html
[32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-glk1/igt@kms_cursor_legacy@2x-long-flip-vs-cursor-legacy.html
* igt@kms_cursor_legacy@flip-vs-cursor@atomic-transitions-varying-size:
- shard-glk: [PASS][33] -> [FAIL][34] ([i915#2346])
[33]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-glk9/igt@kms_cursor_legacy@flip-vs-cursor@atomic-transitions-varying-size.html
[34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-glk3/igt@kms_cursor_legacy@flip-vs-cursor@atomic-transitions-varying-size.html
* igt@kms_fbcon_fbt@fbc-suspend:
- shard-apl: [PASS][35] -> [INCOMPLETE][36] ([i915#180] / [i915#1982] / [i915#4939])
[35]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-apl7/igt@kms_fbcon_fbt@fbc-suspend.html
[36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-apl2/igt@kms_fbcon_fbt@fbc-suspend.html
* igt@kms_flip@2x-busy-flip:
- shard-iclb: NOTRUN -> [SKIP][37] ([fdo#109274])
[37]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@kms_flip@2x-busy-flip.html
* igt@kms_flip@dpms-off-confusion@a-dp1:
- shard-apl: [PASS][38] -> [DMESG-WARN][39] ([i915#1982] / [i915#62])
[38]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-apl1/igt@kms_flip@dpms-off-confusion@a-dp1.html
[39]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-apl7/igt@kms_flip@dpms-off-confusion@a-dp1.html
* igt@kms_flip@dpms-off-confusion@c-dp1:
- shard-apl: [PASS][40] -> [DMESG-WARN][41] ([i915#62]) +25 similar issues
[40]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-apl1/igt@kms_flip@dpms-off-confusion@c-dp1.html
[41]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-apl7/igt@kms_flip@dpms-off-confusion@c-dp1.html
* igt@kms_flip_scaled_crc@flip-32bpp-4tile-to-32bpp-4tiledg2rcccs-downscaling@pipe-a-valid-mode:
- shard-iclb: NOTRUN -> [SKIP][42] ([i915#2587] / [i915#2672]) +7 similar issues
[42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb5/igt@kms_flip_scaled_crc@flip-32bpp-4tile-to-32bpp-4tiledg2rcccs-downscaling@pipe-a-valid-mode.html
* igt@kms_flip_scaled_crc@flip-32bpp-linear-to-64bpp-linear-downscaling@pipe-a-default-mode:
- shard-iclb: NOTRUN -> [SKIP][43] ([i915#3555])
[43]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb2/igt@kms_flip_scaled_crc@flip-32bpp-linear-to-64bpp-linear-downscaling@pipe-a-default-mode.html
* igt@kms_flip_scaled_crc@flip-32bpp-yftile-to-32bpp-yftileccs-downscaling@pipe-a-default-mode:
- shard-iclb: NOTRUN -> [SKIP][44] ([i915#6375])
[44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb2/igt@kms_flip_scaled_crc@flip-32bpp-yftile-to-32bpp-yftileccs-downscaling@pipe-a-default-mode.html
* igt@kms_flip_scaled_crc@flip-64bpp-4tile-to-16bpp-4tile-upscaling@pipe-a-default-mode:
- shard-iclb: NOTRUN -> [SKIP][45] ([i915#2672]) +5 similar issues
[45]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb2/igt@kms_flip_scaled_crc@flip-64bpp-4tile-to-16bpp-4tile-upscaling@pipe-a-default-mode.html
* igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-cur-indfb-draw-blt:
- shard-iclb: NOTRUN -> [SKIP][46] ([fdo#109280]) +8 similar issues
[46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@kms_frontbuffer_tracking@fbc-2p-scndscrn-cur-indfb-draw-blt.html
* igt@kms_plane_scaling@plane-downscale-with-rotation-factor-0-25@pipe-c-edp-1:
- shard-iclb: NOTRUN -> [SKIP][47] ([i915#5176]) +2 similar issues
[47]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@kms_plane_scaling@plane-downscale-with-rotation-factor-0-25@pipe-c-edp-1.html
* igt@kms_plane_scaling@planes-upscale-factor-0-25-downscale-factor-0-25@pipe-b-edp-1:
- shard-iclb: NOTRUN -> [SKIP][48] ([i915#5235]) +2 similar issues
[48]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@kms_plane_scaling@planes-upscale-factor-0-25-downscale-factor-0-25@pipe-b-edp-1.html
* igt@kms_psr2_su@page_flip-xrgb8888:
- shard-iclb: [PASS][49] -> [SKIP][50] ([fdo#109642] / [fdo#111068] / [i915#658])
[49]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-iclb2/igt@kms_psr2_su@page_flip-xrgb8888.html
[50]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb5/igt@kms_psr2_su@page_flip-xrgb8888.html
* igt@kms_psr@psr2_cursor_mmap_cpu:
- shard-iclb: [PASS][51] -> [SKIP][52] ([fdo#109441])
[51]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-iclb2/igt@kms_psr@psr2_cursor_mmap_cpu.html
[52]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb6/igt@kms_psr@psr2_cursor_mmap_cpu.html
* igt@kms_psr_stress_test@flip-primary-invalidate-overlay:
- shard-tglb: [PASS][53] -> [SKIP][54] ([i915#5519])
[53]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-tglb3/igt@kms_psr_stress_test@flip-primary-invalidate-overlay.html
[54]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-tglb1/igt@kms_psr_stress_test@flip-primary-invalidate-overlay.html
* igt@perf@stress-open-close:
- shard-glk: [PASS][55] -> [INCOMPLETE][56] ([i915#5213])
[55]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-glk7/igt@perf@stress-open-close.html
[56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-glk7/igt@perf@stress-open-close.html
* igt@perf_pmu@event-wait@rcs0:
- shard-iclb: NOTRUN -> [SKIP][57] ([fdo#112283])
[57]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@perf_pmu@event-wait@rcs0.html
#### Possible fixes ####
* igt@gem_ctx_exec@basic-nohangcheck:
- shard-tglb: [FAIL][58] ([i915#6268]) -> [PASS][59]
[58]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-tglb7/igt@gem_ctx_exec@basic-nohangcheck.html
[59]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-tglb7/igt@gem_ctx_exec@basic-nohangcheck.html
* igt@gem_exec_balancer@parallel-keep-in-fence:
- shard-iclb: [SKIP][60] ([i915#4525]) -> [PASS][61] +2 similar issues
[60]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-iclb6/igt@gem_exec_balancer@parallel-keep-in-fence.html
[61]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@gem_exec_balancer@parallel-keep-in-fence.html
* igt@gem_exec_fair@basic-flow@rcs0:
- shard-tglb: [FAIL][62] ([i915#2842]) -> [PASS][63]
[62]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-tglb7/igt@gem_exec_fair@basic-flow@rcs0.html
[63]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-tglb3/igt@gem_exec_fair@basic-flow@rcs0.html
* igt@gem_exec_suspend@basic-s3@smem:
- shard-apl: [DMESG-WARN][64] ([i915#180]) -> [PASS][65]
[64]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-apl1/igt@gem_exec_suspend@basic-s3@smem.html
[65]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-apl7/igt@gem_exec_suspend@basic-s3@smem.html
* igt@gem_exec_whisper@basic-contexts-forked:
- shard-iclb: [INCOMPLETE][66] ([i915#6453]) -> [PASS][67]
[66]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-iclb7/igt@gem_exec_whisper@basic-contexts-forked.html
[67]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb4/igt@gem_exec_whisper@basic-contexts-forked.html
* igt@gem_huc_copy@huc-copy:
- shard-tglb: [SKIP][68] ([i915#2190]) -> [PASS][69]
[68]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-tglb7/igt@gem_huc_copy@huc-copy.html
[69]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-tglb1/igt@gem_huc_copy@huc-copy.html
* igt@i915_pm_dc@dc9-dpms:
- shard-iclb: [SKIP][70] ([i915#4281]) -> [PASS][71]
[70]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-iclb3/igt@i915_pm_dc@dc9-dpms.html
[71]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb6/igt@i915_pm_dc@dc9-dpms.html
* igt@kms_plane_scaling@plane-downscale-with-pixel-format-factor-0-5@pipe-b-edp-1:
- shard-iclb: [SKIP][72] ([i915#5176]) -> [PASS][73] +2 similar issues
[72]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-iclb2/igt@kms_plane_scaling@plane-downscale-with-pixel-format-factor-0-5@pipe-b-edp-1.html
[73]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb5/igt@kms_plane_scaling@plane-downscale-with-pixel-format-factor-0-5@pipe-b-edp-1.html
* igt@kms_plane_scaling@planes-downscale-factor-0-5@pipe-a-edp-1:
- shard-iclb: [SKIP][74] ([i915#5235]) -> [PASS][75] +2 similar issues
[74]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-iclb2/igt@kms_plane_scaling@planes-downscale-factor-0-5@pipe-a-edp-1.html
[75]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb6/igt@kms_plane_scaling@planes-downscale-factor-0-5@pipe-a-edp-1.html
* igt@kms_psr@psr2_sprite_mmap_gtt:
- shard-iclb: [SKIP][76] ([fdo#109441]) -> [PASS][77] +2 similar issues
[76]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-iclb7/igt@kms_psr@psr2_sprite_mmap_gtt.html
[77]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb2/igt@kms_psr@psr2_sprite_mmap_gtt.html
#### Warnings ####
* igt@gem_exec_balancer@parallel-ordering:
- shard-iclb: [FAIL][78] ([i915#6117]) -> [SKIP][79] ([i915#4525])
[78]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-iclb4/igt@gem_exec_balancer@parallel-ordering.html
[79]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb3/igt@gem_exec_balancer@parallel-ordering.html
* igt@i915_pm_dc@dc3co-vpb-simulation:
- shard-iclb: [SKIP][80] ([i915#588]) -> [SKIP][81] ([i915#658])
[80]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-iclb2/igt@i915_pm_dc@dc3co-vpb-simulation.html
[81]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb6/igt@i915_pm_dc@dc3co-vpb-simulation.html
* igt@kms_content_protection@legacy:
- shard-apl: [TIMEOUT][82] ([i915#1319]) -> [FAIL][83] ([fdo#110321] / [fdo#110336])
[82]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-apl8/igt@kms_content_protection@legacy.html
[83]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-apl8/igt@kms_content_protection@legacy.html
* igt@kms_plane_alpha_blend@pipe-a-alpha-opaque-fb:
- shard-apl: [FAIL][84] ([fdo#108145] / [i915#265]) -> [DMESG-FAIL][85] ([fdo#108145] / [i915#62])
[84]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-apl1/igt@kms_plane_alpha_blend@pipe-a-alpha-opaque-fb.html
[85]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-apl7/igt@kms_plane_alpha_blend@pipe-a-alpha-opaque-fb.html
* igt@kms_psr2_sf@overlay-plane-move-continuous-sf:
- shard-iclb: [SKIP][86] ([i915#2920]) -> [SKIP][87] ([i915#658])
[86]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-iclb2/igt@kms_psr2_sf@overlay-plane-move-continuous-sf.html
[87]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb5/igt@kms_psr2_sf@overlay-plane-move-continuous-sf.html
* igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area:
- shard-iclb: [SKIP][88] ([fdo#111068] / [i915#658]) -> [SKIP][89] ([i915#2920]) +1 similar issue
[88]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-iclb7/igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area.html
[89]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb2/igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area.html
* igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area:
- shard-iclb: [SKIP][90] ([i915#2920]) -> [SKIP][91] ([fdo#111068] / [i915#658]) +1 similar issue
[90]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-iclb2/igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area.html
[91]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb7/igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area.html
* igt@kms_psr2_su@page_flip-p010:
- shard-iclb: [FAIL][92] ([i915#5939]) -> [SKIP][93] ([fdo#109642] / [fdo#111068] / [i915#658])
[92]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-iclb2/igt@kms_psr2_su@page_flip-p010.html
[93]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-iclb7/igt@kms_psr2_su@page_flip-p010.html
* igt@kms_writeback@writeback-fb-id:
- shard-glk: [SKIP][94] ([fdo#109271]) -> [SKIP][95] ([fdo#109271] / [i915#2437]) +3 similar issues
[94]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-glk3/igt@kms_writeback@writeback-fb-id.html
[95]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-glk2/igt@kms_writeback@writeback-fb-id.html
* igt@kms_writeback@writeback-pixel-formats:
- shard-apl: [SKIP][96] ([fdo#109271]) -> [SKIP][97] ([fdo#109271] / [i915#2437]) +3 similar issues
[96]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12185/shard-apl8/igt@kms_writeback@writeback-pixel-formats.html
[97]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/shard-apl1/igt@kms_writeback@writeback-pixel-formats.html
[fdo#108145]: https://bugs.freedesktop.org/show_bug.cgi?id=108145
[fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
[fdo#109274]: https://bugs.freedesktop.org/show_bug.cgi?id=109274
[fdo#109278]: https://bugs.freedesktop.org/show_bug.cgi?id=109278
[fdo#109279]: https://bugs.freedesktop.org/show_bug.cgi?id=109279
[fdo#109280]: https://bugs.freedesktop.org/show_bug.cgi?id=109280
[fdo#109284]: https://bugs.freedesktop.org/show_bug.cgi?id=109284
[fdo#109289]: https://bugs.freedesktop.org/show_bug.cgi?id=109289
[fdo#109441]: https://bugs.freedesktop.org/show_bug.cgi?id=109441
[fdo#109642]: https://bugs.freedesktop.org/show_bug.cgi?id=109642
[fdo#110321]: https://bugs.freedesktop.org/show_bug.cgi?id=110321
[fdo#110336]: https://bugs.freedesktop.org/show_bug.cgi?id=110336
[fdo#110723]: https://bugs.freedesktop.org/show_bug.cgi?id=110723
[fdo#111068]: https://bugs.freedesktop.org/show_bug.cgi?id=111068
[fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
[fdo#112283]: https://bugs.freedesktop.org/show_bug.cgi?id=112283
[i915#118]: https://gitlab.freedesktop.org/drm/intel/issues/118
[i915#1319]: https://gitlab.freedesktop.org/drm/intel/issues/1319
[i915#180]: https://gitlab.freedesktop.org/drm/intel/issues/180
[i915#1839]: https://gitlab.freedesktop.org/drm/intel/issues/1839
[i915#1982]: https://gitlab.freedesktop.org/drm/intel/issues/1982
[i915#2190]: https://gitlab.freedesktop.org/drm/intel/issues/2190
[i915#2346]: https://gitlab.freedesktop.org/drm/intel/issues/2346
[i915#2437]: https://gitlab.freedesktop.org/drm/intel/issues/2437
[i915#2587]: https://gitlab.freedesktop.org/drm/intel/issues/2587
[i915#265]: https://gitlab.freedesktop.org/drm/intel/issues/265
[i915#2672]: https://gitlab.freedesktop.org/drm/intel/issues/2672
[i915#2842]: https://gitlab.freedesktop.org/drm/intel/issues/2842
[i915#2856]: https://gitlab.freedesktop.org/drm/intel/issues/2856
[i915#2920]: https://gitlab.freedesktop.org/drm/intel/issues/2920
[i915#3359]: https://gitlab.freedesktop.org/drm/intel/issues/3359
[i915#3555]: https://gitlab.freedesktop.org/drm/intel/issues/3555
[i915#3886]: https://gitlab.freedesktop.org/drm/intel/issues/3886
[i915#4270]: https://gitlab.freedesktop.org/drm/intel/issues/4270
[i915#4281]: https://gitlab.freedesktop.org/drm/intel/issues/4281
[i915#4409]: https://gitlab.freedesktop.org/drm/intel/issues/4409
[i915#4525]: https://gitlab.freedesktop.org/drm/intel/issues/4525
[i915#4613]: https://gitlab.freedesktop.org/drm/intel/issues/4613
[i915#4939]: https://gitlab.freedesktop.org/drm/intel/issues/4939
[i915#5176]: https://gitlab.freedesktop.org/drm/intel/issues/5176
[i915#5213]: https://gitlab.freedesktop.org/drm/intel/issues/5213
[i915#5235]: https://gitlab.freedesktop.org/drm/intel/issues/5235
[i915#5286]: https://gitlab.freedesktop.org/drm/intel/issues/5286
[i915#5519]: https://gitlab.freedesktop.org/drm/intel/issues/5519
[i915#588]: https://gitlab.freedesktop.org/drm/intel/issues/588
[i915#5939]: https://gitlab.freedesktop.org/drm/intel/issues/5939
[i915#6117]: https://gitlab.freedesktop.org/drm/intel/issues/6117
[i915#62]: https://gitlab.freedesktop.org/drm/intel/issues/62
[i915#6268]: https://gitlab.freedesktop.org/drm/intel/issues/6268
[i915#6344]: https://gitlab.freedesktop.org/drm/intel/issues/6344
[i915#6375]: https://gitlab.freedesktop.org/drm/intel/issues/6375
[i915#6453]: https://gitlab.freedesktop.org/drm/intel/issues/6453
[i915#6537]: https://gitlab.freedesktop.org/drm/intel/issues/6537
[i915#658]: https://gitlab.freedesktop.org/drm/intel/issues/658
[i915#72]: https://gitlab.freedesktop.org/drm/intel/issues/72
[i915#768]: https://gitlab.freedesktop.org/drm/intel/issues/768
Build changes
-------------
* Linux: CI_DRM_12185 -> Patchwork_109087v1
CI-20190529: 20190529
CI_DRM_12185: ae6a4bb62f9524823ef5b00552e27231f7936da3 @ git://anongit.freedesktop.org/gfx-ci/linux
IGT_6663: 5e232c77cd762147e0882c337a984121fabb1c75 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
Patchwork_109087v1: ae6a4bb62f9524823ef5b00552e27231f7936da3 @ git://anongit.freedesktop.org/gfx-ci/linux
piglit_4509: fdc5a4ca11124ab8413c7988896eec4c97336694 @ git://anongit.freedesktop.org/piglit
== Logs ==
For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_109087v1/index.html
* Re: [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context
2022-09-27 10:14 ` Andrzej Hajda
@ 2022-09-27 21:33 ` Ceraolo Spurio, Daniele
0 siblings, 0 replies; 18+ messages in thread
From: Ceraolo Spurio, Daniele @ 2022-09-27 21:33 UTC (permalink / raw)
To: Andrzej Hajda, Andi Shyti, Tvrtko Ursulin; +Cc: intel-gfx, Matthew Auld, chris
On 9/27/2022 3:14 AM, Andrzej Hajda wrote:
> On 27.09.2022 01:34, Ceraolo Spurio, Daniele wrote:
>>
>>
>> On 9/26/2022 3:44 PM, Andi Shyti wrote:
>>> Hi Andrzej,
>>>
>>> On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
>>>> Capturing error state is time consuming (up to 350ms on DG2), so it
>>>> should
>>>> be avoided if possible. Context reset triggered by context removal
>>>> is a
>>>> good example.
>>>> With this patch multiple igt tests will not timeout and should run
>>>> faster.
>>>>
>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
>>>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
>>> fine for me:
>>>
>>> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
>>>
>>> Just to be on the safe side, can we also have the ack from any of
>>> the GuC folks? Daniele, John?
>>>
>>> Andi
>>>
>>>
>>>> ---
>>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> index 22ba66e48a9b01..cb58029208afe1 100644
>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>> @@ -4425,7 +4425,8 @@ static void guc_handle_context_reset(struct
>>>> intel_guc *guc,
>>>> trace_intel_context_reset(ce);
>>>> if (likely(!intel_context_is_banned(ce))) {
>>>> - capture_error_state(guc, ce);
>>>> + if (!intel_context_is_exiting(ce))
>>>> + capture_error_state(guc, ce);
>>>> guc_context_replay(ce);
>>
>> You definitely don't want to replay requests of a context that is
>> going away.
>
> Without guc_context_replay I see timeouts. Probably because
> guc_context_replay calls __guc_reset_context. I am not sure if there
> is need to dig deeper, stay with my initial proposition, or sth like:
>
> if (likely(!intel_context_is_banned(ce))) {
> if (!intel_context_is_exiting(ce)) {
> capture_error_state(guc, ce);
> guc_context_replay(ce);
> } else {
> __guc_reset_context(ce, ce->engine->mask);
> }
> } else {
>
> The latter is also working.
This seems to be an issue with the context close path when hangcheck is
disabled. In that case we don't call the revoke() helper, so we're not
clearing the context state in the guc backend and therefore we require
__guc_reset_context() in the reset handler to do so. I'd argue that the
proper solution would be to ban the context on close in the hangcheck
disabled scenario and not just rely on the pulse, which btw I'm not sure
works with GuC submission with a preemptable context because the GuC
will just schedule the context back in unless we send an H2G to
explicitly disable it. Not sure why we're not banning right now though,
so I'd prefer if someone knowledgeable could chime in in case there is a
good reason for it.
Daniele
>
> Regards
> Andrzej
>
>
>>
>> This seems at least in part due to
>> https://patchwork.freedesktop.org/patch/487531/, where we replaced
>> the "context_ban" with "context_exiting". There are several places
>> where we skipped operations if the context was banned (here included)
>> which are now not covered anymore for exiting contexts. Maybe we need
>> a new checker function to check both flags in places where we don't
>> care why the context is being removed (ban vs exiting), just that it is?
>>
>> Daniele
>>
>>>> } else {
>>>> drm_info(&guc_to_gt(guc)->i915->drm,
>>>> --
>>>> 2.34.1
>>
>
* Re: [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context
2022-09-27 7:45 ` Tvrtko Ursulin
2022-09-27 8:16 ` Andrzej Hajda
@ 2022-09-27 21:36 ` Ceraolo Spurio, Daniele
2022-09-28 7:19 ` Tvrtko Ursulin
1 sibling, 1 reply; 18+ messages in thread
From: Ceraolo Spurio, Daniele @ 2022-09-27 21:36 UTC (permalink / raw)
To: Tvrtko Ursulin, Andrzej Hajda, Andi Shyti; +Cc: intel-gfx, Matthew Auld, chris
On 9/27/2022 12:45 AM, Tvrtko Ursulin wrote:
>
> On 27/09/2022 07:49, Andrzej Hajda wrote:
>>
>>
>> On 27.09.2022 01:34, Ceraolo Spurio, Daniele wrote:
>>>
>>>
>>> On 9/26/2022 3:44 PM, Andi Shyti wrote:
>>>> Hi Andrzej,
>>>>
>>>> On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
>>>>> Capturing error state is time consuming (up to 350ms on DG2), so
>>>>> it should
>>>>> be avoided if possible. Context reset triggered by context removal
>>>>> is a
>>>>> good example.
>>>>> With this patch multiple igt tests will not timeout and should run
>>>>> faster.
>>>>>
>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
>>>>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
>>>> fine for me:
>>>>
>>>> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
>>>>
>>>> Just to be on the safe side, can we also have the ack from any of
>>>> the GuC folks? Daniele, John?
>>>>
>>>> Andi
>>>>
>>>>
>>>>> ---
>>>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
>>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> index 22ba66e48a9b01..cb58029208afe1 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> @@ -4425,7 +4425,8 @@ static void guc_handle_context_reset(struct
>>>>> intel_guc *guc,
>>>>> trace_intel_context_reset(ce);
>>>>> if (likely(!intel_context_is_banned(ce))) {
>>>>> - capture_error_state(guc, ce);
>>>>> + if (!intel_context_is_exiting(ce))
>>>>> + capture_error_state(guc, ce);
>
> I am not sure here - if we have a persistent context which caused a
> GPU hang I'd expect we'd still want error capture.
>
> What causes the reset in the affected IGTs? Always preemption timeout?
>
>>>>> guc_context_replay(ce);
>>>
>>> You definitely don't want to replay requests of a context that is
>>> going away.
>>
>> My intention was to just avoid error capture, but that's even better,
>> only condition change:
>> - if (likely(!intel_context_is_banned(ce))) {
>> + if (likely(intel_context_is_schedulable(ce))) {
>
> Yes that helper was intended to be used for contexts which should not
> be scheduled post exit or ban.
>
> Daniele - you say there are some misses in the GuC backend. Should
> most, or even all in intel_guc_submission.c be converted to use
> intel_context_is_schedulable? My idea indeed was that "ban" should be
> a level up from the backends. Backend should only distinguish between
> "should I run this or not", and not the reason.
I think that all of them should be updated, but I'd like Matt B to
confirm as he's more familiar with the code than me.
Daniele
>
> Regards,
>
> Tvrtko
>
>>
>>>
>>> This seems at least in part due to
>>> https://patchwork.freedesktop.org/patch/487531/, where we replaced
>>> the "context_ban" with "context_exiting". There are several places
>>> where we skipped operations if the context was banned (here
>>> included) which are now not covered anymore for exiting contexts.
>>> Maybe we need a new checker function to check both flags in places
>>> where we don't care why the context is being removed (ban vs
>>> exiting), just that it is?
>>>
>>> Daniele
>>>
>>>>> } else {
>>>>> drm_info(&guc_to_gt(guc)->i915->drm,
>>
>> And maybe degrade above to drm_dbg, to avoid spamming dmesg?
>>
>> Regards
>> Andrzej
>>
>>
>>>>> --
>>>>> 2.34.1
>>>
>>
* Re: [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context
2022-09-27 21:36 ` Ceraolo Spurio, Daniele
@ 2022-09-28 7:19 ` Tvrtko Ursulin
2022-09-28 18:27 ` John Harrison
0 siblings, 1 reply; 18+ messages in thread
From: Tvrtko Ursulin @ 2022-09-28 7:19 UTC (permalink / raw)
To: Ceraolo Spurio, Daniele, Andrzej Hajda, Andi Shyti
Cc: intel-gfx, Matthew Auld, chris
On 27/09/2022 22:36, Ceraolo Spurio, Daniele wrote:
>
>
> On 9/27/2022 12:45 AM, Tvrtko Ursulin wrote:
>>
>> On 27/09/2022 07:49, Andrzej Hajda wrote:
>>>
>>>
>>> On 27.09.2022 01:34, Ceraolo Spurio, Daniele wrote:
>>>>
>>>>
>>>> On 9/26/2022 3:44 PM, Andi Shyti wrote:
>>>>> Hi Andrzej,
>>>>>
>>>>> On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
>>>>>> Capturing error state is time consuming (up to 350ms on DG2), so
>>>>>> it should
>>>>>> be avoided if possible. Context reset triggered by context removal
>>>>>> is a
>>>>>> good example.
>>>>>> With this patch multiple igt tests will not timeout and should run
>>>>>> faster.
>>>>>>
>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
>>>>>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
>>>>> fine for me:
>>>>>
>>>>> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
>>>>>
>>>>> Just to be on the safe side, can we also have the ack from any of
>>>>> the GuC folks? Daniele, John?
>>>>>
>>>>> Andi
>>>>>
>>>>>
>>>>>> ---
>>>>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
>>>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>> index 22ba66e48a9b01..cb58029208afe1 100644
>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>> @@ -4425,7 +4425,8 @@ static void guc_handle_context_reset(struct
>>>>>> intel_guc *guc,
>>>>>> trace_intel_context_reset(ce);
>>>>>> if (likely(!intel_context_is_banned(ce))) {
>>>>>> - capture_error_state(guc, ce);
>>>>>> + if (!intel_context_is_exiting(ce))
>>>>>> + capture_error_state(guc, ce);
>>
>> I am not sure here - if we have a persistent context which caused a
>> GPU hang I'd expect we'd still want error capture.
>>
>> What causes the reset in the affected IGTs? Always preemption timeout?
>>
>>>>>> guc_context_replay(ce);
>>>>
>>>> You definitely don't want to replay requests of a context that is
>>>> going away.
>>>
>>> My intention was to just avoid error capture, but that's even better,
>>> only condition change:
>>> - if (likely(!intel_context_is_banned(ce))) {
>>> + if (likely(intel_context_is_schedulable(ce))) {
>>
>> Yes that helper was intended to be used for contexts which should not
>> be scheduled post exit or ban.
>>
>> Daniele - you say there are some misses in the GuC backend. Should
>> most, or even all in intel_guc_submission.c be converted to use
>> intel_context_is_schedulable? My idea indeed was that "ban" should be
>> a level up from the backends. Backend should only distinguish between
>> "should I run this or not", and not the reason.
>
> I think that all of them should be updated, but I'd like Matt B to
> confirm as he's more familiar with the code than me.
Right, that sounds plausible to me as well.

One thing I forgot to mention - the only place where the backend needs to
care about the difference between "schedulable" and "banned" is when it
picks the preempt timeout for non-schedulable contexts. This is to only
apply the strict 1ms to banned (so bad or naughty) contexts, while the ones
which are exiting cleanly get the full preempt timeout as otherwise
configured. This solves the ugly user experience quirk where GPU
resets/errors were logged upon exit/Ctrl-C of a well-behaved application
(using non-persistent contexts). Hopefully GuC can match that behaviour so
customers stay happy.
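
A minimal sketch of the timeout selection described above, in plain C
rather than driver code. The struct, its fields and the sketch_* helpers
are made up for illustration and are not the i915 API; only the 1ms value
and the banned vs exiting split come from this discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-context state; not the real struct intel_context. */
struct sketch_context {
	bool banned;   /* misbehaving context that is being forcibly removed */
	bool exiting;  /* context closed by its owner, e.g. on exit or Ctrl-C */
};

/* Strict timeout applied to banned contexts (the 1ms mentioned above). */
#define SKETCH_BANNED_PREEMPT_TIMEOUT_MS 1u

/* A context that is banned or exiting must not be scheduled again. */
static bool sketch_is_schedulable(const struct sketch_context *ctx)
{
	assert(ctx != NULL);
	return !ctx->banned && !ctx->exiting;
}

/*
 * Pick the preempt timeout for a context: banned contexts get the strict
 * value, everything else (including cleanly exiting contexts) keeps the
 * timeout that is otherwise configured.
 */
static unsigned int sketch_preempt_timeout_ms(const struct sketch_context *ctx,
					      unsigned int configured_ms)
{
	assert(ctx != NULL);
	if (ctx->banned)
		return SKETCH_BANNED_PREEMPT_TIMEOUT_MS;
	return configured_ms;
}
```

The point of the split is that "should I run this?" and "how long do I
wait before resetting?" are answered by different flags: both banned and
exiting contexts are non-schedulable, but only the banned ones get the
short timeout.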
Regards,
Tvrtko
* Re: [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context
2022-09-28 7:19 ` Tvrtko Ursulin
@ 2022-09-28 18:27 ` John Harrison
2022-09-29 8:22 ` Tvrtko Ursulin
0 siblings, 1 reply; 18+ messages in thread
From: John Harrison @ 2022-09-28 18:27 UTC (permalink / raw)
To: Tvrtko Ursulin, Ceraolo Spurio, Daniele, Andrzej Hajda, Andi Shyti
Cc: intel-gfx, Matthew Auld, chris
On 9/28/2022 00:19, Tvrtko Ursulin wrote:
> On 27/09/2022 22:36, Ceraolo Spurio, Daniele wrote:
>> On 9/27/2022 12:45 AM, Tvrtko Ursulin wrote:
>>> On 27/09/2022 07:49, Andrzej Hajda wrote:
>>>> On 27.09.2022 01:34, Ceraolo Spurio, Daniele wrote:
>>>>> On 9/26/2022 3:44 PM, Andi Shyti wrote:
>>>>>> Hi Andrzej,
>>>>>>
>>>>>> On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
>>>>>>> Capturing error state is time consuming (up to 350ms on DG2), so
>>>>>>> it should
>>>>>>> be avoided if possible. Context reset triggered by context
>>>>>>> removal is a
>>>>>>> good example.
>>>>>>> With this patch multiple igt tests will not timeout and should
>>>>>>> run faster.
>>>>>>>
>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
>>>>>>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
>>>>>> fine for me:
>>>>>>
>>>>>> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
>>>>>>
>>>>>> Just to be on the safe side, can we also have the ack from any of
>>>>>> the GuC folks? Daniele, John?
>>>>>>
>>>>>> Andi
>>>>>>
>>>>>>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
>>>>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>> index 22ba66e48a9b01..cb58029208afe1 100644
>>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>> @@ -4425,7 +4425,8 @@ static void
>>>>>>> guc_handle_context_reset(struct intel_guc *guc,
>>>>>>> trace_intel_context_reset(ce);
>>>>>>> if (likely(!intel_context_is_banned(ce))) {
>>>>>>> - capture_error_state(guc, ce);
>>>>>>> + if (!intel_context_is_exiting(ce))
>>>>>>> + capture_error_state(guc, ce);
>>>
>>> I am not sure here - if we have a persistent context which caused a
>>> GPU hang I'd expect we'd still want error capture.
>>>
>>> What causes the reset in the affected IGTs? Always preemption timeout?
>>>
>>>>>>> guc_context_replay(ce);
>>>>>
>>>>> You definitely don't want to replay requests of a context that is
>>>>> going away.
>>>>
>>>> My intention was to just avoid error capture, but that's even
>>>> better, only condition change:
>>>> - if (likely(!intel_context_is_banned(ce))) {
>>>> + if (likely(intel_context_is_schedulable(ce))) {
>>>
>>> Yes that helper was intended to be used for contexts which should
>>> not be scheduled post exit or ban.
>>>
>>> Daniele - you say there are some misses in the GuC backend. Should
>>> most, or even all in intel_guc_submission.c be converted to use
>>> intel_context_is_schedulable? My idea indeed was that "ban" should
>>> be a level up from the backends. Backend should only distinguish
>>> between "should I run this or not", and not the reason.
>>
>> I think that all of them should be updated, but I'd like Matt B to
>> confirm as he's more familiar with the code than me.
>
> Right, that sounds plausible to me as well.
>
> One thing I forgot to mention - the only place where backend can care
> between "schedulable" and "banned" is when it picks the preempt
> timeout for non-schedulable contexts. This is to only apply the strict
> 1ms to banned (so bad or naughty contexts), while the ones which are
> exiting cleanly get the full preempt timeout as otherwise configured.
> This solves the ugly user experience quirk where GPU resets/errors
> were logged upon exit/Ctrl-C of a well behaving application (using
> non-persistent contexts). Hopefully GuC can match that behaviour so
> customers stay happy.
>
> Regards,
>
> Tvrtko

The whole revoke vs ban thing seems broken to me.

First of all, if the user hits Ctrl+C we need to kill the context off
immediately. That is a fundamental customer requirement. Render and
compute engines have a 7.5s pre-emption timeout. The user should not
have to wait 7.5s for a context to be removed from the system when they
have explicitly killed it themselves. Even the regular timeout of 640ms
is borderline a long time to wait. And note that there is an ongoing
request/requirement to increase that to 1900ms.

Under what circumstances would a user expect anything sensible to happen
after a Ctrl+C in terms of things finishing their rendering and displaying
nice pretty images? They killed the app. They want it dead. We should be
getting it off the hardware as quickly as possible. If you are really
concerned about resets causing collateral damage then maybe bump the
termination timeout from 1ms up to 10ms, maybe at most 100ms. If an app
is 'well behaved' then it should cleanly exit within 10ms. But if it is
bad (which is almost certainly the case if the user is manually and
explicitly killing it) then it needs to be killed because it is not
going to gracefully exit.

Secondly, the whole persistence thing is a total mess, completely broken
and intended to be massively simplified. See the internal task for it.
In short, the plan is that all contexts will be immediately killed when
the last DRM file handle is closed. Persistence is only valid between
the time the per context file handle is closed and the time the master
DRM handle is closed. Whereas, non-persistent contexts get killed as
soon as the per context handle is closed. There is absolutely no
connection to heartbeats or other irrelevant operations.
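
As a rough sketch of the lifetime rule in that plan (plain C, with a
made-up struct and helper name that do not exist in the driver), the
decision depends only on which handles have been closed and on whether
the context is persistent:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of a context's lifetime state; names are illustrative only. */
struct sketch_ctx_lifetime {
	bool persistent;            /* context was created as persistent */
	bool ctx_handle_closed;     /* the per-context handle has been closed */
	bool last_drm_file_closed;  /* the last (master) DRM file handle has been closed */
};

/* Return true when, under the plan described above, the context should be killed. */
static bool sketch_should_kill(const struct sketch_ctx_lifetime *s)
{
	assert(s != NULL);

	/* Closing the last DRM file handle kills every context. */
	if (s->last_drm_file_closed)
		return true;

	/*
	 * Otherwise only non-persistent contexts die when their own handle is
	 * closed; persistent ones survive until the DRM handle goes away.
	 */
	return s->ctx_handle_closed && !s->persistent;
}
```

Nothing in this rule looks at heartbeats or timeouts, which is the
simplification being argued for.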

So in my view, the best option is to revert the ban vs revoke patch. It
is creating bugs. It is making persistence more complex not simpler. It
harms the user experience.

If the original problem was simply that error captures were being done
on Ctrl+C then the fix is simple. Don't capture for a banned context.
There is no need for all the rest of the revoke patch.

John.
* Re: [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context
2022-09-28 18:27 ` John Harrison
@ 2022-09-29 8:22 ` Tvrtko Ursulin
2022-09-29 9:49 ` Andrzej Hajda
2022-09-29 16:49 ` John Harrison
0 siblings, 2 replies; 18+ messages in thread
From: Tvrtko Ursulin @ 2022-09-29 8:22 UTC (permalink / raw)
To: John Harrison, Ceraolo Spurio, Daniele, Andrzej Hajda, Andi Shyti
Cc: intel-gfx, Matthew Auld, chris
On 28/09/2022 19:27, John Harrison wrote:
> On 9/28/2022 00:19, Tvrtko Ursulin wrote:
>> On 27/09/2022 22:36, Ceraolo Spurio, Daniele wrote:
>>> On 9/27/2022 12:45 AM, Tvrtko Ursulin wrote:
>>>> On 27/09/2022 07:49, Andrzej Hajda wrote:
>>>>> On 27.09.2022 01:34, Ceraolo Spurio, Daniele wrote:
>>>>>> On 9/26/2022 3:44 PM, Andi Shyti wrote:
>>>>>>> Hi Andrzej,
>>>>>>>
>>>>>>> On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
>>>>>>>> Capturing error state is time consuming (up to 350ms on DG2), so
>>>>>>>> it should
>>>>>>>> be avoided if possible. Context reset triggered by context
>>>>>>>> removal is a
>>>>>>>> good example.
>>>>>>>> With this patch multiple igt tests will not timeout and should
>>>>>>>> run faster.
>>>>>>>>
>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
>>>>>>>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
>>>>>>> fine for me:
>>>>>>>
>>>>>>> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
>>>>>>>
>>>>>>> Just to be on the safe side, can we also have the ack from any of
>>>>>>> the GuC folks? Daniele, John?
>>>>>>>
>>>>>>> Andi
>>>>>>>
>>>>>>>
>>>>>>>> ---
>>>>>>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
>>>>>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>> index 22ba66e48a9b01..cb58029208afe1 100644
>>>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>> @@ -4425,7 +4425,8 @@ static void
>>>>>>>> guc_handle_context_reset(struct intel_guc *guc,
>>>>>>>> trace_intel_context_reset(ce);
>>>>>>>> if (likely(!intel_context_is_banned(ce))) {
>>>>>>>> - capture_error_state(guc, ce);
>>>>>>>> + if (!intel_context_is_exiting(ce))
>>>>>>>> + capture_error_state(guc, ce);
>>>>
>>>> I am not sure here - if we have a persistent context which caused a
>>>> GPU hang I'd expect we'd still want error capture.
>>>>
>>>> What causes the reset in the affected IGTs? Always preemption timeout?
>>>>
>>>>>>>> guc_context_replay(ce);
>>>>>>
>>>>>> You definitely don't want to replay requests of a context that is
>>>>>> going away.
>>>>>
>>>>> My intention was to just avoid error capture, but that's even
>>>>> better, only condition change:
>>>>> - if (likely(!intel_context_is_banned(ce))) {
>>>>> + if (likely(intel_context_is_schedulable(ce))) {
>>>>
>>>> Yes that helper was intended to be used for contexts which should
>>>> not be scheduled post exit or ban.
>>>>
>>>> Daniele - you say there are some misses in the GuC backend. Should
>>>> most, or even all in intel_guc_submission.c be converted to use
>>>> intel_context_is_schedulable? My idea indeed was that "ban" should
>>>> be a level up from the backends. Backend should only distinguish
>>>> between "should I run this or not", and not the reason.
>>>
>>> I think that all of them should be updated, but I'd like Matt B to
>>> confirm as he's more familiar with the code than me.
>>
>> Right, that sounds plausible to me as well.
>>
>> One thing I forgot to mention - the only place where backend can care
>> between "schedulable" and "banned" is when it picks the preempt
>> timeout for non-schedulable contexts. This is to only apply the strict
>> 1ms to banned (so bad or naughty contexts), while the ones which are
>> exiting cleanly get the full preempt timeout as otherwise configured.
>> This solves the ugly user experience quirk where GPU resets/errors
>> were logged upon exit/Ctrl-C of a well behaving application (using
>> non-persistent contexts). Hopefully GuC can match that behaviour so
>> customers stay happy.
>>
>> Regards,
>>
>> Tvrtko
>
> The whole revoke vs ban thing seems broken to me.
>
> First of all, if the user hits Ctrl+C we need to kill the context off
> immediately. That is a fundamental customer requirement. Render and
> compute engines have a 7.5s pre-emption timeout. The user should not
> have to wait 7.5s for a context to be removed from the system when they
> have explicitly killed it themselves. Even the regular timeout of 640ms
> is borderline a long time to wait. And note that there is an ongoing
> request/requirement to increase that to 1900ms.
>
> Under what circumstances would a user expect anything sensible to happen
> after a Ctrl+C in terms of things finishing their rendering and display
> nice pretty images? They killed the app. They want it dead. We should be
> getting it off the hardware as quickly as possible. If you are really
> concerned about resets causing collateral damage then maybe bump the
> termination timeout from 1ms up to 10ms, maybe at most 100ms. If an app
> is 'well behaved' then it should cleanly exit within 10ms. But if it is
> bad (which is almost certainly the case if the user is manually and
> explicitly killing it) then it needs to be killed because it is not
> going to gracefully exit.

Right.. I had it like that initially (lower timeout - I think 20ms or
so, patch history on the mailing list would know for sure), but then
simplified it after review feedback to avoid adding another timeout value.

So it's not at all about any expectation that something should actually
finish to any sort of completion/success. It is primarily about not
logging an error message when there is no error. Thing to keep in mind
is that error messages are a big deal in some cultures. In addition to
that, avoiding needless engine resets is a good thing as well.

Previously the execlists backend was overeager and only allowed 1ms
for such contexts to exit. If the context was banned, sure - that means
it was a bad context which was causing many hangs already. But if the
context was a clean one I argue there is no point in doing an engine reset.

So if you want, I think it is okay to re-introduce a secondary timeout.

Or if you have an idea on how to avoid the error messages / GPU resets
when "friendly" contexts exit in some other way, that is also something
to discuss.

> Secondly, the whole persistence thing is a total mess, completely broken
> and intended to be massively simplified. See the internal task for it.
> In short, the plan is that all contexts will be immediately killed when
> the last DRM file handle is closed. Persistence is only valid between
> the time the per context file handle is closed and the time the master
> DRM handle is closed. Whereas, non-persistent contexts get killed as
> soon as the per context handle is closed. There is absolutely no
> connection to heartbeats or other irrelevant operations.

The change we are discussing is not about persistence. As for
persistence itself - I am not sure it is completely broken, nor if, or
when, the internal task will result in anything being attempted. In
the meantime we had unhappy customers for more than a year. So do we
tell them "please wait for a few years more until some internal task
with no clear timeline or anyone assigned maybe gets looked at"?

> So in my view, the best option is to revert the ban vs revoke patch. It
> is creating bugs. It is making persistence more complex not simpler. It
> harms the user experience.

I am not aware of the bugs, even less so that it is harming user
experience!?

Are the bugs limited to the GuC backend, or general? My CI runs were clean
so maybe test cases are lacking. Is it just a case of
s/intel_context_is_banned/intel_context_is_schedulable/ in there to fix it?

Again, the change was not about persistence. It is the opposite -
allowing non-persistent contexts to exit cleanly.

> If the original problem was simply that error captures were being done
> on Ctrl+C then the fix is simple. Don't capture for a banned context.
> There is no need for all the rest of the revoke patch.

Error capture was not part of the original story, so what we are
discussing in this thread may be a completely orthogonal topic.

Regards,

Tvrtko
* Re: [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context
2022-09-29 8:22 ` Tvrtko Ursulin
@ 2022-09-29 9:49 ` Andrzej Hajda
2022-09-29 10:40 ` Tvrtko Ursulin
2022-09-29 16:49 ` John Harrison
1 sibling, 1 reply; 18+ messages in thread
From: Andrzej Hajda @ 2022-09-29 9:49 UTC (permalink / raw)
To: Tvrtko Ursulin, John Harrison, Ceraolo Spurio, Daniele, Andi Shyti
Cc: intel-gfx, Matthew Auld, chris
On 29.09.2022 10:22, Tvrtko Ursulin wrote:
>
> On 28/09/2022 19:27, John Harrison wrote:
>> On 9/28/2022 00:19, Tvrtko Ursulin wrote:
>>> On 27/09/2022 22:36, Ceraolo Spurio, Daniele wrote:
>>>> On 9/27/2022 12:45 AM, Tvrtko Ursulin wrote:
>>>>> On 27/09/2022 07:49, Andrzej Hajda wrote:
>>>>>> On 27.09.2022 01:34, Ceraolo Spurio, Daniele wrote:
>>>>>>> On 9/26/2022 3:44 PM, Andi Shyti wrote:
>>>>>>>> Hi Andrzej,
>>>>>>>>
>>>>>>>> On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
>>>>>>>>> Capturing error state is time consuming (up to 350ms on DG2),
>>>>>>>>> so it should
>>>>>>>>> be avoided if possible. Context reset triggered by context
>>>>>>>>> removal is a
>>>>>>>>> good example.
>>>>>>>>> With this patch multiple igt tests will not timeout and should
>>>>>>>>> run faster.
>>>>>>>>>
>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
>>>>>>>>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
>>>>>>>> fine for me:
>>>>>>>>
>>>>>>>> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
>>>>>>>>
>>>>>>>> Just to be on the safe side, can we also have the ack from any of
>>>>>>>> the GuC folks? Daniele, John?
>>>>>>>>
>>>>>>>> Andi
>>>>>>>>
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
>>>>>>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>> index 22ba66e48a9b01..cb58029208afe1 100644
>>>>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>> @@ -4425,7 +4425,8 @@ static void
>>>>>>>>> guc_handle_context_reset(struct intel_guc *guc,
>>>>>>>>> trace_intel_context_reset(ce);
>>>>>>>>> if (likely(!intel_context_is_banned(ce))) {
>>>>>>>>> - capture_error_state(guc, ce);
>>>>>>>>> + if (!intel_context_is_exiting(ce))
>>>>>>>>> + capture_error_state(guc, ce);
>>>>>
>>>>> I am not sure here - if we have a persistent context which caused a
>>>>> GPU hang I'd expect we'd still want error capture.
>>>>>
>>>>> What causes the reset in the affected IGTs? Always preemption timeout?
>>>>>
>>>>>>>>> guc_context_replay(ce);
>>>>>>>
>>>>>>> You definitely don't want to replay requests of a context that is
>>>>>>> going away.
>>>>>>
>>>>>> My intention was to just avoid error capture, but that's even
>>>>>> better, only condition change:
>>>>>> - if (likely(!intel_context_is_banned(ce))) {
>>>>>> + if (likely(intel_context_is_schedulable(ce))) {
>>>>>
>>>>> Yes that helper was intended to be used for contexts which should
>>>>> not be scheduled post exit or ban.
>>>>>
>>>>> Daniele - you say there are some misses in the GuC backend. Should
>>>>> most, or even all in intel_guc_submission.c be converted to use
>>>>> intel_context_is_schedulable? My idea indeed was that "ban" should
>>>>> be a level up from the backends. Backend should only distinguish
>>>>> between "should I run this or not", and not the reason.
>>>>
>>>> I think that all of them should be updated, but I'd like Matt B to
>>>> confirm as he's more familiar with the code than me.
>>>
>>> Right, that sounds plausible to me as well.
>>>
>>> One thing I forgot to mention - the only place where backend can care
>>> between "schedulable" and "banned" is when it picks the preempt
>>> timeout for non-schedulable contexts. This is to only apply the
>>> strict 1ms to banned (so bad or naughty contexts), while the ones
>>> which are exiting cleanly get the full preempt timeout as otherwise
>>> configured. This solves the ugly user experience quirk where GPU
>>> resets/errors were logged upon exit/Ctrl-C of a well behaving
>>> application (using non-persistent contexts). Hopefully GuC can match
>>> that behaviour so customers stay happy.
>>>
>>> Regards,
>>>
>>> Tvrtko
>>
>> The whole revoke vs ban thing seems broken to me.
>>
>> First of all, if the user hits Ctrl+C we need to kill the context off
>> immediately. That is a fundamental customer requirement. Render and
>> compute engines have a 7.5s pre-emption timeout. The user should not
>> have to wait 7.5s for a context to be removed from the system when
>> they have explicitly killed it themselves. Even the regular timeout of
>> 640ms is borderline a long time to wait. And note that there is an
>> ongoing request/requirement to increase that to 1900ms.
>>
>> Under what circumstances would a user expect anything sensible to
>> happen after a Ctrl+C in terms of things finishing their rendering and
>> display nice pretty images? They killed the app. They want it dead. We
>> should be getting it off the hardware as quickly as possible. If you
>> are really concerned about resets causing collateral damage then maybe
>> bump the termination timeout from 1ms up to 10ms, maybe at most 100ms.
>> If an app is 'well behaved' then it should cleanly exit within 10ms.
>> But if it is bad (which is almost certainly the case if the user is
>> manually and explicitly killing it) then it needs to be killed because
>> it is not going to gracefully exit.
>
> Right.. I had it like that initially (lower timeout - I think 20ms or
> so, patch history on the mailing list would know for sure), but then
> simplified it after review feedback to avoid adding another timeout value.
>
> So it's not at all about any expectation that something should actually
> finish to any sort of completion/success. It is primarily about not
> logging an error message when there is no error. Thing to keep in mind
> is that error messages are a big deal in some cultures. In addition to
> that, avoiding needless engine resets is a good thing as well.
>
> Previously the execlists backend was over eager and only allowed for 1ms
> for such contexts to exit. If the context was banned sure - that means
> it was a bad context which was causing many hangs already. But if the
> context was a clean one I argue there is no point in doing an engine reset.
>
> So if you want, I think it is okay to re-introduce a secondary timeout.
>
> Or if you have an idea on how to avoid the error messages / GPU resets
> when "friendly" contexts exit in some other way, that is also something
> to discuss.
>
>> Secondly, the whole persistence thing is a total mess, completely
>> broken and intended to be massively simplified. See the internal task
>> for it. In short, the plan is that all contexts will be immediately
>> killed when the last DRM file handle is closed. Persistence is only
>> valid between the time the per context file handle is closed and the
>> time the master DRM handle is closed. Whereas, non-persistent contexts
>> get killed as soon as the per context handle is closed. There is
>> absolutely no connection to heartbeats or other irrelevant operations.
>
> The change we are discussing is not about persistence, but for the
> persistence itself - I am not sure it is completely broken and if, or
> when, the internal task will result with anything being attempted. In
> the meantime we had unhappy customers for more than a year. So do we
> tell them "please wait for a few years more until some internal task
> with no clear timeline or anyone assigned maybe gets looked at"?
>
>> So in my view, the best option is to revert the ban vs revoke patch.
>> It is creating bugs. It is making persistence more complex not
>> simpler. It harms the user experience.
>
> I am not aware of the bugs, even less so that it is harming user
> experience!?
>
> Bugs are limited to the GuC backend or in general? My CI runs were clean
> so maybe test cases are lacking. Is it just a case of
> s/intel_context_is_banned/intel_context_is_schedulable/ in there to fix it?
>
> Again, the change was not about persistence. It is the opposite -
> allowing non-persistent contexts to exit cleanly.
>
>> If the original problem was simply that error captures were being done
>> on Ctrl+C then the fix is simple. Don't capture for a banned context.
>> There is no need for all the rest of the revoke patch.
>
> Error capture was not part of the original story so it may be a
> completely orthogonal topic that we are discussing it in this thread.

Wouldn't it be good then to separate these two issues:
banned/exiting/schedulable handling, and error capture for exiting contexts?
This patch handles only the latter, and as I understand it there is no big
controversy that we do not need to capture errors for exiting contexts.
If so, can we ack/merge this patch to make CI happy, and continue the
discussion on the former?
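
To keep the two issues apart, here is a small standalone C sketch of what
the reset handler decides in each variant. It only models the two
conditions quoted earlier in the thread; the struct and the sketch_*
names are invented for illustration and this is not the driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two context flags discussed in this thread. */
struct sketch_context {
	bool banned;
	bool exiting;
};

/* What the reset handler would do for a given context. */
struct sketch_reset_actions {
	bool capture_error_state;
	bool replay_requests;
};

/* Variant 1: this patch as posted. Only the error capture is skipped on exit. */
static struct sketch_reset_actions sketch_reset_as_posted(const struct sketch_context *ce)
{
	struct sketch_reset_actions act = { false, false };

	assert(ce != NULL);
	if (!ce->banned) {
		act.capture_error_state = !ce->exiting;
		act.replay_requests = true;
	}
	return act;
}

/* Variant 2: gate the whole branch on "schedulable" (neither banned nor exiting). */
static struct sketch_reset_actions sketch_reset_if_schedulable(const struct sketch_context *ce)
{
	struct sketch_reset_actions act = { false, false };

	assert(ce != NULL);
	if (!ce->banned && !ce->exiting) {
		act.capture_error_state = true;
		act.replay_requests = true;
	}
	return act;
}
```

Both variants skip the capture for an exiting context; they differ only
in whether its requests are replayed, which belongs to the
banned/exiting/schedulable discussion.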
Regards
Andrzej
>
> Regards,
>
> Tvrtko
* Re: [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context
2022-09-29 9:49 ` Andrzej Hajda
@ 2022-09-29 10:40 ` Tvrtko Ursulin
2022-09-29 14:28 ` Ceraolo Spurio, Daniele
0 siblings, 1 reply; 18+ messages in thread
From: Tvrtko Ursulin @ 2022-09-29 10:40 UTC (permalink / raw)
To: Andrzej Hajda, John Harrison, Ceraolo Spurio, Daniele, Andi Shyti
Cc: intel-gfx, Matthew Auld, chris
On 29/09/2022 10:49, Andrzej Hajda wrote:
> On 29.09.2022 10:22, Tvrtko Ursulin wrote:
>> On 28/09/2022 19:27, John Harrison wrote:
>>> On 9/28/2022 00:19, Tvrtko Ursulin wrote:
>>>> On 27/09/2022 22:36, Ceraolo Spurio, Daniele wrote:
>>>>> On 9/27/2022 12:45 AM, Tvrtko Ursulin wrote:
>>>>>> On 27/09/2022 07:49, Andrzej Hajda wrote:
>>>>>>> On 27.09.2022 01:34, Ceraolo Spurio, Daniele wrote:
>>>>>>>> On 9/26/2022 3:44 PM, Andi Shyti wrote:
>>>>>>>>> Hi Andrzej,
>>>>>>>>>
>>>>>>>>> On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
>>>>>>>>>> Capturing error state is time consuming (up to 350ms on DG2),
>>>>>>>>>> so it should
>>>>>>>>>> be avoided if possible. Context reset triggered by context
>>>>>>>>>> removal is a
>>>>>>>>>> good example.
>>>>>>>>>> With this patch multiple igt tests will not timeout and should
>>>>>>>>>> run faster.
>>>>>>>>>>
>>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
>>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
>>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
>>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
>>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
>>>>>>>>>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
>>>>>>>>> fine for me:
>>>>>>>>>
>>>>>>>>> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
>>>>>>>>>
>>>>>>>>> Just to be on the safe side, can we also have the ack from any of
>>>>>>>>> the GuC folks? Daniele, John?
>>>>>>>>>
>>>>>>>>> Andi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> ---
>>>>>>>>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
>>>>>>>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>>> index 22ba66e48a9b01..cb58029208afe1 100644
>>>>>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>>> @@ -4425,7 +4425,8 @@ static void
>>>>>>>>>> guc_handle_context_reset(struct intel_guc *guc,
>>>>>>>>>> trace_intel_context_reset(ce);
>>>>>>>>>> if (likely(!intel_context_is_banned(ce))) {
>>>>>>>>>> - capture_error_state(guc, ce);
>>>>>>>>>> + if (!intel_context_is_exiting(ce))
>>>>>>>>>> + capture_error_state(guc, ce);
>>>>>>
>>>>>> I am not sure here - if we have a persistent context which caused
>>>>>> a GPU hang I'd expect we'd still want error capture.
>>>>>>
>>>>>> What causes the reset in the affected IGTs? Always preemption
>>>>>> timeout?
>>>>>>
>>>>>>>>>> guc_context_replay(ce);
>>>>>>>>
>>>>>>>> You definitely don't want to replay requests of a context that
>>>>>>>> is going away.
>>>>>>>
>>>>>>> My intention was to just avoid error capture, but that's even
>>>>>>> better, only condition change:
>>>>>>> - if (likely(!intel_context_is_banned(ce))) {
>>>>>>> + if (likely(intel_context_is_schedulable(ce))) {
>>>>>>
>>>>>> Yes that helper was intended to be used for contexts which should
>>>>>> not be scheduled post exit or ban.
>>>>>>
>>>>>> Daniele - you say there are some misses in the GuC backend. Should
>>>>>> most, or even all in intel_guc_submission.c be converted to use
>>>>>> intel_context_is_schedulable? My idea indeed was that "ban" should
>>>>>> be a level up from the backends. Backend should only distinguish
>>>>>> between "should I run this or not", and not the reason.
>>>>>
>>>>> I think that all of them should be updated, but I'd like Matt B to
>>>>> confirm as he's more familiar with the code than me.
>>>>
>>>> Right, that sounds plausible to me as well.
>>>>
>>>> One thing I forgot to mention - the only place where backend can
>>>> care between "schedulable" and "banned" is when it picks the preempt
>>>> timeout for non-schedulable contexts. This is to only apply the
>>>> strict 1ms to banned (so bad or naughty contexts), while the ones
>>>> which are exiting cleanly get the full preempt timeout as otherwise
>>>> configured. This solves the ugly user experience quirk where GPU
>>>> resets/errors were logged upon exit/Ctrl-C of a well behaving
>>>> application (using non-persistent contexts). Hopefully GuC can match
>>>> that behaviour so customers stay happy.
>>>>
>>>> Regards,
>>>>
>>>> Tvrtko
>>>
>>> The whole revoke vs ban thing seems broken to me.
>>>
>>> First of all, if the user hits Ctrl+C we need to kill the context off
>>> immediately. That is a fundamental customer requirement. Render and
>>> compute engines have a 7.5s pre-emption timeout. The user should not
>>> have to wait 7.5s for a context to be removed from the system when
>>> they have explicitly killed it themselves. Even the regular timeout
>>> of 640ms is borderline a long time to wait. And note that there is an
>>> ongoing request/requirement to increase that to 1900ms.
>>>
>>> Under what circumstances would a user expect anything sensible to
>>> happen after a Ctrl+C in terms of things finishing their rendering
>>> and display nice pretty images? They killed the app. They want it
>>> dead. We should be getting it off the hardware as quickly as
>>> possible. If you are really concerned about resets causing collateral
>>> damage then maybe bump the termination timeout from 1ms up to 10ms,
>>> maybe at most 100ms. If an app is 'well behaved' then it should
>>> cleanly exit within 10ms. But if it is bad (which is almost certainly
>>> the case if the user is manually and explicitly killing it) then it
>>> needs to be killed because it is not going to gracefully exit.
>>
>> Right.. I had it like that initially (lower timeout - I think 20ms or
>> so, patch history on the mailing list would know for sure), but then
>> simplified it after review feedback to avoid adding another timeout
>> value.
>>
>> So it's not at all about any expectation that something should
>> actually finish to any sort of completion/success. It is primarily
>> about not logging an error message when there is no error. Thing to
>> keep in mind is that error messages are a big deal in some cultures.
>> In addition to that, avoiding needless engine resets is a good thing
>> as well.
>>
>> Previously the execlists backend was over eager and only allowed for
>> 1ms for such contexts to exit. If the context was banned sure - that
>> means it was a bad context which was causing many hangs already. But
>> if the context was a clean one I argue there is no point in doing an
>> engine reset.
>>
>> So if you want, I think it is okay to re-introduce a secondary timeout.
>>
>> Or if you have an idea on how to avoid the error messages / GPU resets
>> when "friendly" contexts exit in some other way, that is also
>> something to discuss.
>>
>>> Secondly, the whole persistence thing is a total mess, completely
>>> broken and intended to be massively simplified. See the internal task
>>> for it. In short, the plan is that all contexts will be immediately
>>> killed when the last DRM file handle is closed. Persistence is only
>>> valid between the time the per context file handle is closed and the
>>> time the master DRM handle is closed. Whereas, non-persistent
>>> contexts get killed as soon as the per context handle is closed.
>>> There is absolutely no connection to heartbeats or other irrelevant
>>> operations.
>>
>> The change we are discussing is not about persistence, but for the
>> persistence itself - I am not sure it is completely broken and if, or
>> when, the internal task will result with anything being attempted. In
>> the meantime we had unhappy customers for more than a year. So do we
>> tell them "please wait for a few years more until some internal task
>> with no clear timeline or anyone assigned maybe gets looked at"?
>>
>>> So in my view, the best option is to revert the ban vs revoke patch.
>>> It is creating bugs. It is making persistence more complex not
>>> simpler. It harms the user experience.
>>
>> I am not aware of the bugs, even less so that it is harming user
>> experience!?
>>
>> Bugs are limited to the GuC backend or in general? My CI runs were
>> clean so maybe test cases are lacking. Is it just a case of
>> s/intel_context_is_banned/intel_context_is_schedulable/ in there to
>> fix it?
>>
>> Again, the change was not about persistence. It is the opposite -
>> allowing non-persistent contexts to exit cleanly.
>>
>>> If the original problem was simply that error captures were being
>>> done on Ctrl+C then the fix is simple. Don't capture for a banned
>>> context. There is no need for all the rest of the revoke patch.
>>
>> Error capture was not part of the original story so it may be a
>> completely orthogonal topic that we are discussing it in this thread.
>
> Wouldn't it be good then to separate these two issues:
> banned/exiting/schedulable handling and error capturing of exiting contexts?
> This patch handles only the latter, and as I understand there is no big
> controversy that we do not need to capture errors for exiting contexts.
> If yes, can we ack/merge this patch, to make CI happy and continue
> discussion on the former.
Right, question is if the code in guc_handle_context_reset shouldn't be changed to:
	if (likely(!intel_context_is_exiting(ce))) {
		capture_error_state(guc, ce);
		guc_context_replay(ce);
	} else {
And if that should be part of patch which changes a few more instances of that same check.
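As a rough illustration of the check being discussed, here is a minimal userspace sketch. It is not driver code: the `reset_`-prefixed names are hypothetical, the flags are a simplified stand-in for the real context state, and it assumes the "schedulable" predicate mentioned earlier in the thread means "neither banned nor exiting".

```c
#include <stdbool.h>

/* Hypothetical userspace model; these are not the i915 structures. */
enum reset_ctx_flag {
	RESET_CTX_BANNED  = 1u << 0,
	RESET_CTX_EXITING = 1u << 1,
};

struct reset_ctx {
	unsigned int flags;
	int captures;	/* number of error captures taken */
	int replays;	/* number of request replays */
};

/* Assumption: schedulable means neither banned nor exiting. */
static bool reset_ctx_is_schedulable(const struct reset_ctx *ce)
{
	return !(ce->flags & (RESET_CTX_BANNED | RESET_CTX_EXITING));
}

/* Capture and replay only for contexts that are expected to run again. */
static void reset_ctx_handle_reset(struct reset_ctx *ce)
{
	if (reset_ctx_is_schedulable(ce)) {
		ce->captures++;
		ce->replays++;
	}
	/* Otherwise the context is going away: skip both steps. */
}
```

With this shape, one predicate covers both the banned and the exiting case, so the handler does not need to know why a context must not run.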
But you wrote that doesn't work? And then Daniele said he thinks it is because revoke is not called when hangcheck is disabled and GuC backend gets confused? If I got the conversation right..
I wonder if that means equivalent of execlists:
	if (unlikely(intel_context_is_closed(ce) &&
		     !intel_engine_has_heartbeat(engine)))
		intel_context_set_exiting(ce);
Is needed somewhere in the GuC backend. Which with execlists skips over the context which is no longer schedulable.
But I don't understand why testing did not pick up that miss, or the miss with guc_context_replay on an exiting context. Or where exactly to put the extra handling in the GuC backend. Perhaps it isn't possible in which case we could have an ugly solution where for GuC we do something special in kill_engines() if hangcheck is disabled. Maybe add and call a new helper like:
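	static bool intel_context_exit_nohangcheck(struct intel_context *ce)
	{
		bool ret = intel_context_set_exiting(ce);

		/* Not already exiting and on a GuC engine: fall back to a ban. */
		if (!ret && intel_engine_uses_guc(ce->engine))
			intel_context_ban(ce, NULL);

		return ret;
	}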
Too ugly?
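To make the intended behaviour of such a helper concrete, here is a small userspace sketch. Everything in it is hypothetical (the `exit_`-prefixed names, the boolean fields, the `uses_guc` parameter), and it assumes that setting the exiting state returns the previous value, test-and-set style.

```c
#include <stdbool.h>

/* Hypothetical userspace model of a context's exit/ban state. */
struct exit_ctx {
	bool exiting;
	bool banned;
};

/* Assumed test-and-set semantics: returns whether it was already exiting. */
static bool exit_ctx_set_exiting(struct exit_ctx *ce)
{
	bool was_exiting = ce->exiting;

	ce->exiting = true;
	return was_exiting;
}

/*
 * Model of the proposed no-hangcheck helper: mark the context as exiting
 * and, only on the first transition and only for a GuC-style backend,
 * also ban it so the backend is explicitly told to drop it.
 */
static bool exit_ctx_nohangcheck(struct exit_ctx *ce, bool uses_guc)
{
	bool ret = exit_ctx_set_exiting(ce);

	if (!ret && uses_guc)
		ce->banned = true;

	return ret;
}
```

In this model a second call on the same context returns true and changes nothing, and a non-GuC backend is never banned through this path.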
Regards,
Tvrtko
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context
2022-09-29 10:40 ` Tvrtko Ursulin
@ 2022-09-29 14:28 ` Ceraolo Spurio, Daniele
0 siblings, 0 replies; 18+ messages in thread
From: Ceraolo Spurio, Daniele @ 2022-09-29 14:28 UTC (permalink / raw)
To: Tvrtko Ursulin, Andrzej Hajda, John Harrison, Andi Shyti
Cc: intel-gfx, Matthew Auld, chris
On 9/29/2022 3:40 AM, Tvrtko Ursulin wrote:
>
> On 29/09/2022 10:49, Andrzej Hajda wrote:
>> On 29.09.2022 10:22, Tvrtko Ursulin wrote:
>>> On 28/09/2022 19:27, John Harrison wrote:
>>>> On 9/28/2022 00:19, Tvrtko Ursulin wrote:
>>>>> On 27/09/2022 22:36, Ceraolo Spurio, Daniele wrote:
>>>>>> On 9/27/2022 12:45 AM, Tvrtko Ursulin wrote:
>>>>>>> On 27/09/2022 07:49, Andrzej Hajda wrote:
>>>>>>>> On 27.09.2022 01:34, Ceraolo Spurio, Daniele wrote:
>>>>>>>>> On 9/26/2022 3:44 PM, Andi Shyti wrote:
>>>>>>>>>> Hi Andrzej,
>>>>>>>>>>
>>>>>>>>>> On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
>>>>>>>>>>> Capturing error state is time consuming (up to 350ms on
>>>>>>>>>>> DG2), so it should
>>>>>>>>>>> be avoided if possible. Context reset triggered by context
>>>>>>>>>>> removal is a
>>>>>>>>>>> good example.
>>>>>>>>>>> With this patch multiple igt tests will not timeout and
>>>>>>>>>>> should run faster.
>>>>>>>>>>>
>>>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
>>>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
>>>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
>>>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
>>>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
>>>>>>>>>>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
>>>>>>>>>> fine for me:
>>>>>>>>>>
>>>>>>>>>> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
>>>>>>>>>>
>>>>>>>>>> Just to be on the safe side, can we also have the ack from
>>>>>>>>>> any of
>>>>>>>>>> the GuC folks? Daniele, John?
>>>>>>>>>>
>>>>>>>>>> Andi
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> ---
>>>>>>>>>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
>>>>>>>>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>>>>>>
>>>>>>>>>>> diff --git
>>>>>>>>>>> a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>>>> index 22ba66e48a9b01..cb58029208afe1 100644
>>>>>>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>>>> @@ -4425,7 +4425,8 @@ static void
>>>>>>>>>>> guc_handle_context_reset(struct intel_guc *guc,
>>>>>>>>>>> trace_intel_context_reset(ce);
>>>>>>>>>>> if (likely(!intel_context_is_banned(ce))) {
>>>>>>>>>>> - capture_error_state(guc, ce);
>>>>>>>>>>> + if (!intel_context_is_exiting(ce))
>>>>>>>>>>> + capture_error_state(guc, ce);
>>>>>>>
>>>>>>> I am not sure here - if we have a persistent context which
>>>>>>> caused a GPU hang I'd expect we'd still want error capture.
>>>>>>>
>>>>>>> What causes the reset in the affected IGTs? Always preemption
>>>>>>> timeout?
>>>>>>>
>>>>>>>>>>> guc_context_replay(ce);
>>>>>>>>>
>>>>>>>>> You definitely don't want to replay requests of a context that
>>>>>>>>> is going away.
>>>>>>>>
>>>>>>>> My intention was to just avoid error capture, but that's even
>>>>>>>> better, only condition change:
>>>>>>>> - if (likely(!intel_context_is_banned(ce))) {
>>>>>>>> + if (likely(intel_context_is_schedulable(ce))) {
>>>>>>>
>>>>>>> Yes that helper was intended to be used for contexts which
>>>>>>> should not be scheduled post exit or ban.
>>>>>>>
>>>>>>> Daniele - you say there are some misses in the GuC backend.
>>>>>>> Should most, or even all in intel_guc_submission.c be converted
>>>>>>> to use intel_context_is_schedulable? My idea indeed was that
>>>>>>> "ban" should be a level up from the backends. Backend should
>>>>>>> only distinguish between "should I run this or not", and not the
>>>>>>> reason.
>>>>>>
>>>>>> I think that all of them should be updated, but I'd like Matt B
>>>>>> to confirm as he's more familiar with the code than me.
>>>>>
>>>>> Right, that sounds plausible to me as well.
>>>>>
>>>>> One thing I forgot to mention - the only place where backend can
>>>>> care between "schedulable" and "banned" is when it picks the
>>>>> preempt timeout for non-schedulable contexts. This is to only
>>>>> apply the strict 1ms to banned (so bad or naughty contexts), while
>>>>> the ones which are exiting cleanly get the full preempt timeout as
>>>>> otherwise configured. This solves the ugly user experience quirk
>>>>> where GPU resets/errors were logged upon exit/Ctrl-C of a well
>>>>> behaving application (using non-persistent contexts). Hopefully
>>>>> GuC can match that behaviour so customers stay happy.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Tvrtko
>>>>
>>>> The whole revoke vs ban thing seems broken to me.
>>>>
>>>> First of all, if the user hits Ctrl+C we need to kill the context
>>>> off immediately. That is a fundamental customer requirement. Render
>>>> and compute engines have a 7.5s pre-emption timeout. The user
>>>> should not have to wait 7.5s for a context to be removed from the
>>>> system when they have explicitly killed it themselves. Even the
>>>> regular timeout of 640ms is borderline a long time to wait. And
>>>> note that there is an ongoing request/requirement to increase that
>>>> to 1900ms.
>>>>
>>>> Under what circumstances would a user expect anything sensible to
>>>> happen after a Ctrl+C in terms of things finishing their rendering
>>>> and display nice pretty images? They killed the app. They want it
>>>> dead. We should be getting it off the hardware as quickly as
>>>> possible. If you are really concerned about resets causing
>>>> collateral damage then maybe bump the termination timeout from 1ms
>>>> up to 10ms, maybe at most 100ms. If an app is 'well behaved' then
>>>> it should cleanly exit within 10ms. But if it is bad (which is
>>>> almost certainly the case if the user is manually and explicitly
>>>> killing it) then it needs to be killed because it is not going to
>>>> gracefully exit.
>>>
>>> Right.. I had it like that initially (lower timeout - I think 20ms
>>> or so, patch history on the mailing list would know for sure), but
>>> then simplified it after review feedback to avoid adding another
>>> timeout value.
>>>
>>> So it's not at all about any expectation that something should
>>> actually finish to any sort of completion/success. It is primarily
>>> about not logging an error message when there is no error. Thing to
>>> keep in mind is that error messages are a big deal in some cultures.
>>> In addition to that, avoiding needless engine resets is a good thing
>>> as well.
>>>
>>> Previously the execlists backend was over eager and only allowed for
>>> 1ms for such contexts to exit. If the context was banned sure - that
>>> means it was a bad context which was causing many hangs already. But
>>> if the context was a clean one I argue there is no point in doing an
>>> engine reset.
>>>
>>> So if you want, I think it is okay to re-introduce a secondary timeout.
>>>
>>> Or if you have an idea on how to avoid the error messages / GPU
>>> resets when "friendly" contexts exit in some other way, that is also
>>> something to discuss.
>>>
>>>> Secondly, the whole persistence thing is a total mess, completely
>>>> broken and intended to be massively simplified. See the internal
>>>> task for it. In short, the plan is that all contexts will be
>>>> immediately killed when the last DRM file handle is closed.
>>>> Persistence is only valid between the time the per context file
>>>> handle is closed and the time the master DRM handle is closed.
>>>> Whereas, non-persistent contexts get killed as soon as the per
>>>> context handle is closed. There is absolutely no connection to
>>>> heartbeats or other irrelevant operations.
>>>
>>> The change we are discussing is not about persistence, but for the
>>> persistence itself - I am not sure it is completely broken and if,
>>> or when, the internal task will result with anything being
>>> attempted. In the meantime we had unhappy customers for more than a
>>> year. So do we tell them "please wait for a few years more until
>>> some internal task with no clear timeline or anyone assigned maybe
>>> gets looked at"?
>>>
>>>> So in my view, the best option is to revert the ban vs revoke
>>>> patch. It is creating bugs. It is making persistence more complex
>>>> not simpler. It harms the user experience.
>>>
>>> I am not aware of the bugs, even less so that it is harming user
>>> experience!?
>>>
>>> Bugs are limited to the GuC backend or in general? My CI runs were
>>> clean so maybe test cases are lacking. Is it just a case of
>>> s/intel_context_is_banned/intel_context_is_schedulable/ in there to
>>> fix it?
>>>
>>> Again, the change was not about persistence. It is the opposite -
>>> allowing non-persistent contexts to exit cleanly.
>>>
>>>> If the original problem was simply that error captures were being
>>>> done on Ctrl+C then the fix is simple. Don't capture for a banned
>>>> context. There is no need for all the rest of the revoke patch.
>>>
>>> Error capture was not part of the original story so it may be a
>>> completely orthogonal topic that we are discussing it in this thread.
>>
>> Wouldn't it be good then to separate these two issues:
>> banned/exiting/schedulable handling and error capturing of exiting
>> contexts?
>> This patch handles only the latter, and as I understand there is no
>> big controversy that we do not need to capture errors for exiting contexts.
>> If yes, can we ack/merge this patch, to make CI happy and continue
>> discussion on the former.
>
> Right, question is if the code in guc_handle_context_reset shouldn't
> be changed to:
>
> if (likely(!intel_context_is_exiting(ce))) {
> capture_error_state(guc, ce);
> guc_context_replay(ce);
> } else {
>
> And if that should be part of patch which changes a few more instances
> of that same check.
>
> But you wrote that doesn't work? And then Daniele said he thinks it is
> because revoke is not called when hangcheck is disabled and GuC
> backend gets confused? If I got the conversation right..
>
> I wonder if that means equivalent of execlists:
>
> if (unlikely(intel_context_is_closed(ce) &&
> !intel_engine_has_heartbeat(engine)))
> intel_context_set_exiting(ce);
>
> Is needed somewhere in the GuC backend. Which with execlists skips
> over the context which is no longer schedulable.
There is nowhere we can put that in the GuC back-end if the context has
already been handed over to the GuC, because at that point it is out of
our hands. We need to tell the GuC if we want the context to be dropped.
>
> But I don't understand why testing did not pick up that miss, or the
> miss with guc_context_replay on an exiting context. Or where exactly
> to put the extra handling in the GuC backend.
My worry here is that some of the bugs seem to pre-date your patch
(which might be why they weren't flagged in the CI run), so there might
be something else going on that we're missing.
> Perhaps it isn't possible in which case we could have an ugly solution
> where for GuC we do something special in kill_engines() if hangcheck
> is disabled. Maybe add and call a new helper like:
>
> intel_context_exit_nohangcheck()
> {
> bool ret = intel_context_set_exiting(ce);
>
> if (!ret && intel_engine_uses_guc(ce->engine))
> intel_context_ban(ce, NULL);
>
> return ret;
> }
>
> Too ugly?
This works for me if it fixes the issues. The no hangcheck case is not
common and the user should be careful of what they're running if they
select it, so IMO we don't need a super pretty or super efficient
solution, just something that works.
Daniele
>
> Regards,
>
> Tvrtko
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context
2022-09-29 8:22 ` Tvrtko Ursulin
2022-09-29 9:49 ` Andrzej Hajda
@ 2022-09-29 16:49 ` John Harrison
1 sibling, 0 replies; 18+ messages in thread
From: John Harrison @ 2022-09-29 16:49 UTC (permalink / raw)
To: Tvrtko Ursulin, Ceraolo Spurio, Daniele, Andrzej Hajda, Andi Shyti
Cc: intel-gfx, Matthew Auld, chris
On 9/29/2022 01:22, Tvrtko Ursulin wrote:
> On 28/09/2022 19:27, John Harrison wrote:
>> On 9/28/2022 00:19, Tvrtko Ursulin wrote:
>>> On 27/09/2022 22:36, Ceraolo Spurio, Daniele wrote:
>>>> On 9/27/2022 12:45 AM, Tvrtko Ursulin wrote:
>>>>> On 27/09/2022 07:49, Andrzej Hajda wrote:
>>>>>> On 27.09.2022 01:34, Ceraolo Spurio, Daniele wrote:
>>>>>>> On 9/26/2022 3:44 PM, Andi Shyti wrote:
>>>>>>>> Hi Andrzej,
>>>>>>>>
>>>>>>>> On Mon, Sep 26, 2022 at 11:54:09PM +0200, Andrzej Hajda wrote:
>>>>>>>>> Capturing error state is time consuming (up to 350ms on DG2),
>>>>>>>>> so it should
>>>>>>>>> be avoided if possible. Context reset triggered by context
>>>>>>>>> removal is a
>>>>>>>>> good example.
>>>>>>>>> With this patch multiple igt tests will not timeout and should
>>>>>>>>> run faster.
>>>>>>>>>
>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/1551
>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/3952
>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/5891
>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6268
>>>>>>>>> Closes: https://gitlab.freedesktop.org/drm/intel/-/issues/6281
>>>>>>>>> Signed-off-by: Andrzej Hajda <andrzej.hajda@intel.com>
>>>>>>>> fine for me:
>>>>>>>>
>>>>>>>> Reviewed-by: Andi Shyti <andi.shyti@linux.intel.com>
>>>>>>>>
>>>>>>>> Just to be on the safe side, can we also have the ack from any of
>>>>>>>> the GuC folks? Daniele, John?
>>>>>>>>
>>>>>>>> Andi
>>>>>>>>
>>>>>>>>
>>>>>>>>> ---
>>>>>>>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 ++-
>>>>>>>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>>>>>>>
>>>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>> index 22ba66e48a9b01..cb58029208afe1 100644
>>>>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>>>> @@ -4425,7 +4425,8 @@ static void
>>>>>>>>> guc_handle_context_reset(struct intel_guc *guc,
>>>>>>>>> trace_intel_context_reset(ce);
>>>>>>>>> if (likely(!intel_context_is_banned(ce))) {
>>>>>>>>> - capture_error_state(guc, ce);
>>>>>>>>> + if (!intel_context_is_exiting(ce))
>>>>>>>>> + capture_error_state(guc, ce);
>>>>>
>>>>> I am not sure here - if we have a persistent context which caused
>>>>> a GPU hang I'd expect we'd still want error capture.
>>>>>
>>>>> What causes the reset in the affected IGTs? Always preemption
>>>>> timeout?
>>>>>
>>>>>>>>> guc_context_replay(ce);
>>>>>>>
>>>>>>> You definitely don't want to replay requests of a context that
>>>>>>> is going away.
>>>>>>
>>>>>> My intention was to just avoid error capture, but that's even
>>>>>> better, only condition change:
>>>>>> - if (likely(!intel_context_is_banned(ce))) {
>>>>>> + if (likely(intel_context_is_schedulable(ce))) {
>>>>>
>>>>> Yes that helper was intended to be used for contexts which should
>>>>> not be scheduled post exit or ban.
>>>>>
>>>>> Daniele - you say there are some misses in the GuC backend. Should
>>>>> most, or even all in intel_guc_submission.c be converted to use
>>>>> intel_context_is_schedulable? My idea indeed was that "ban" should
>>>>> be a level up from the backends. Backend should only distinguish
>>>>> between "should I run this or not", and not the reason.
>>>>
>>>> I think that all of them should be updated, but I'd like Matt B to
>>>> confirm as he's more familiar with the code than me.
>>>
>>> Right, that sounds plausible to me as well.
>>>
>>> One thing I forgot to mention - the only place where backend can
>>> care between "schedulable" and "banned" is when it picks the preempt
>>> timeout for non-schedulable contexts. This is to only apply the
>>> strict 1ms to banned (so bad or naughty contexts), while the ones
>>> which are exiting cleanly get the full preempt timeout as otherwise
>>> configured. This solves the ugly user experience quirk where GPU
>>> resets/errors were logged upon exit/Ctrl-C of a well behaving
>>> application (using non-persistent contexts). Hopefully GuC can match
>>> that behaviour so customers stay happy.
>>>
>>> Regards,
>>>
>>> Tvrtko
>>
>> The whole revoke vs ban thing seems broken to me.
>>
>> First of all, if the user hits Ctrl+C we need to kill the context off
>> immediately. That is a fundamental customer requirement. Render and
>> compute engines have a 7.5s pre-emption timeout. The user should not
>> have to wait 7.5s for a context to be removed from the system when
>> they have explicitly killed it themselves. Even the regular timeout
>> of 640ms is borderline a long time to wait. And note that there is an
>> ongoing request/requirement to increase that to 1900ms.
>>
>> Under what circumstances would a user expect anything sensible to
>> happen after a Ctrl+C in terms of things finishing their rendering
>> and display nice pretty images? They killed the app. They want it
>> dead. We should be getting it off the hardware as quickly as
>> possible. If you are really concerned about resets causing collateral
>> damage then maybe bump the termination timeout from 1ms up to 10ms,
>> maybe at most 100ms. If an app is 'well behaved' then it should
>> cleanly exit within 10ms. But if it is bad (which is almost certainly
>> the case if the user is manually and explicitly killing it) then it
>> needs to be killed because it is not going to gracefully exit.
>
> Right.. I had it like that initially (lower timeout - I think 20ms or
> so, patch history on the mailing list would know for sure), but then
> simplified it after review feedback to avoid adding another timeout
> value.
>
> So it's not at all about any expectation that something should
> actually finish to any sort of completion/success. It is primarily
> about not logging an error message when there is no error. Thing to
> keep in mind is that error messages are a big deal in some cultures.
> In addition to that, avoiding needless engine resets is a good thing
> as well.
But not calling the error capture code on a banned context is a trivial
change. I don't see why it is so complicated to just suppress that part
of the clean up.
>
> Previously the execlists backend was over eager and only allowed for
> 1ms for such contexts to exit. If the context was banned sure - that
> means it was a bad context which was causing many hangs already. But
> if the context was a clean one I argue there is no point in doing an
> engine reset.
>
> So if you want, I think it is okay to re-introduce a secondary timeout.
>
> Or if you have an idea on how to avoid the error messages / GPU resets
> when "friendly" contexts exit in some other way, that is also
> something to discuss.
Well, yes. Just don't call the error capture code for a banned context.
That's the only bit that prints out any GPU hang error messages. If you
don't call that, the user won't know that anything has happened.
>
>> Secondly, the whole persistence thing is a total mess, completely
>> broken and intended to be massively simplified. See the internal task
>> for it. In short, the plan is that all contexts will be immediately
>> killed when the last DRM file handle is closed. Persistence is only
>> valid between the time the per context file handle is closed and the
>> time the master DRM handle is closed. Whereas, non-persistent
>> contexts get killed as soon as the per context handle is closed.
>> There is absolutely no connection to heartbeats or other irrelevant
>> operations.
>
> The change we are discussing is not about persistence, but for the
> persistence itself - I am not sure it is completely broken and if, or
> when, the internal task will result with anything being attempted. In
> the meantime we had unhappy customers for more than a year. So do we
> tell them "please wait for a few years more until some internal task
> with no clear timeline or anyone assigned maybe gets looked at"?
Persistence is totally broken for any post-execlists platform. It
fundamentally relies upon code deep within the execlists backend that
cannot be done with any other backend - GuC, DRM, anything that comes in
the future, ... Pretty much any IGT with 'persistence' (or
'no-hangcheck') in the name is failing for GuC because of this.
Daniel Vetter's view is that any connection to a submission backend,
heartbeat, or indeed anything other than file handle closure is a
horrendous overcomplication and must be removed.
The task is theoretically at the top of my todo list. But I keep getting
large high priority interrupts and never manage to work on it :(. If you
are feeling bored, then please pick it up. You would massively improve
our DG2 pass rates...
>
>> So in my view, the best option is to revert the ban vs revoke patch.
>> It is creating bugs. It is making persistence more complex not
>> simpler. It harms the user experience.
>
> I am not aware of the bugs, even less so that it is harming user
> experience!?
This whole thread is because there are bugs. E.g. the fact that the GuC
backend did not get properly updated to cope with the new distinction of
ban vs revoke. The fact that compute contexts now take 7.5s to kill via
Ctrl+C. And if the user has disabled the pre-emption timeout completely
then Ctrl+C just won't work at all.
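The timeout trade-off debated here can be sketched as a small function. This is a hypothetical userspace model, not driver code: `tmo_kill_timeout_ms`, `TMO_BANNED_MS` and `TMO_DISABLED` are made-up names. The 1ms and 7.5s figures come from this thread, and treating 0 as "timeout disabled" is an assumption of the sketch.

```c
#include <stdbool.h>

#define TMO_BANNED_MS	1u	/* strict timeout applied to banned contexts */
#define TMO_DISABLED	0u	/* assumption: 0 models a disabled timeout */

/*
 * Model of how long a context that must leave the hardware is given
 * before it is forced off. Banned contexts get the strict value;
 * contexts that are merely exiting keep the configured timeout.
 */
static unsigned int tmo_kill_timeout_ms(bool banned, unsigned int configured_ms)
{
	if (banned)
		return TMO_BANNED_MS;

	return configured_ms;
}
```

Under this model an exiting but not banned compute context with a 7500ms configured timeout may stay on the hardware for the full 7.5s, and with the timeout disabled it is never forced off, which is the behaviour being objected to above.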
>
> Bugs are limited to the GuC backend or in general? My CI runs were
> clean so maybe test cases are lacking. Is it just a case of
> s/intel_context_is_banned/intel_context_is_schedulable/ in there to
> fix it?
>
> Again, the change was not about persistence. It is the opposite -
> allowing non-persistent contexts to exit cleanly.
If the code being added says 'if(persistent) X; else Y;' then it is
about persistence and it is making the whole persistence problem worse.
>
>> If the original problem was simply that error captures were being
>> done on Ctrl+C then the fix is simple. Don't capture for a banned
>> context. There is no need for all the rest of the revoke patch.
>
> Error capture was not part of the original story so it may be a
> completely orthogonal topic that we are discussing it in this thread.
Then I'm lost. What was the purpose of the original change? According to
the commit message, the whole point of introducing revoke was to
suppress the error capture on a Ctrl+C, wasn't it? - "logging engine
resets during normal operation not desirable".
John
>
> Regards,
>
> Tvrtko
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2022-09-29 16:50 UTC | newest]
Thread overview: 18+ messages
2022-09-26 21:54 [Intel-gfx] [PATCH] drm/i915/guc: do not capture error state on exiting context Andrzej Hajda
2022-09-26 22:44 ` Andi Shyti
2022-09-26 23:34 ` Ceraolo Spurio, Daniele
2022-09-27 6:49 ` Andrzej Hajda
2022-09-27 7:45 ` Tvrtko Ursulin
2022-09-27 8:16 ` Andrzej Hajda
2022-09-27 21:36 ` Ceraolo Spurio, Daniele
2022-09-28 7:19 ` Tvrtko Ursulin
2022-09-28 18:27 ` John Harrison
2022-09-29 8:22 ` Tvrtko Ursulin
2022-09-29 9:49 ` Andrzej Hajda
2022-09-29 10:40 ` Tvrtko Ursulin
2022-09-29 14:28 ` Ceraolo Spurio, Daniele
2022-09-29 16:49 ` John Harrison
2022-09-27 10:14 ` Andrzej Hajda
2022-09-27 21:33 ` Ceraolo Spurio, Daniele
2022-09-27 2:07 ` [Intel-gfx] ✓ Fi.CI.BAT: success for " Patchwork
2022-09-27 13:50 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork