* [Intel-gfx] [PATCH] drm/i915/pmu: Fix synchronization of PMU callback with reset
@ 2021-11-03 22:47 Umesh Nerlige Ramappa
  2021-11-03 23:47 ` [Intel-gfx] ✓ Fi.CI.BAT: success for " Patchwork
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Umesh Nerlige Ramappa @ 2021-11-03 22:47 UTC (permalink / raw)
  To: intel-gfx, Tvrtko Ursulin

Since the PMU callback runs in irq context, it synchronizes with gt
reset using the reset count. However, the PMU callback may read the
reset count before it is updated, which can corrupt the busyness
stats.

In addition to checking the reset count, check whether the reset bit
is set before capturing busyness.

Also, save the previous stats only if they are about to be updated.

Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 5cc49c0b3889..d83ade77ca07 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1183,6 +1183,7 @@ static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
 	u64 total, gt_stamp_saved;
 	unsigned long flags;
 	u32 reset_count;
+	bool in_reset;
 
 	spin_lock_irqsave(&guc->timestamp.lock, flags);
 
@@ -1191,7 +1192,9 @@ static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
 	 * engine busyness from GuC, so we just use the driver stored
 	 * copy of busyness. Synchronize with gt reset using reset_count.
 	 */
-	reset_count = i915_reset_count(gpu_error);
+	rcu_read_lock();
+	in_reset = test_bit(I915_RESET_BACKOFF, &gt->reset.flags);
+	rcu_read_unlock();
 
 	*now = ktime_get();
 
@@ -1201,9 +1204,10 @@ static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
 	 * start_gt_clk is derived from GuC state. To get a consistent
 	 * view of activity, we query the GuC state only if gt is awake.
 	 */
-	stats_saved = *stats;
-	gt_stamp_saved = guc->timestamp.gt_stamp;
-	if (intel_gt_pm_get_if_awake(gt)) {
+	if (intel_gt_pm_get_if_awake(gt) && !in_reset) {
+		stats_saved = *stats;
+		gt_stamp_saved = guc->timestamp.gt_stamp;
+		reset_count = i915_reset_count(gpu_error);
 		guc_update_engine_gt_clks(engine);
 		guc_update_pm_timestamp(guc, engine, now);
 		intel_gt_pm_put_async(gt);
-- 
2.20.1



* [Intel-gfx] ✓ Fi.CI.BAT: success for drm/i915/pmu: Fix synchronization of PMU callback with reset
  2021-11-03 22:47 [Intel-gfx] [PATCH] drm/i915/pmu: Fix synchronization of PMU callback with reset Umesh Nerlige Ramappa
@ 2021-11-03 23:47 ` Patchwork
  2021-11-04  0:55 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 13+ messages in thread
From: Patchwork @ 2021-11-03 23:47 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa; +Cc: intel-gfx


== Series Details ==

Series: drm/i915/pmu: Fix synchronization of PMU callback with reset
URL   : https://patchwork.freedesktop.org/series/96543/
State : success

== Summary ==

CI Bug Log - changes from CI_DRM_10834 -> Patchwork_21513
=========================================================

Summary
-------

  **SUCCESS**

  No regressions found.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/index.html

Participating hosts (37 -> 33)
------------------------------

  Additional (2): fi-glk-dsi fi-tgl-1115g4 
  Missing    (6): fi-kbl-soraka bat-dg1-6 fi-tgl-u2 fi-bsw-cyan fi-icl-u2 bat-adlp-4 

Known issues
------------

  Here are the changes found in Patchwork_21513 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@amdgpu/amd_basic@query-info:
    - fi-bsw-kefka:       NOTRUN -> [SKIP][1] ([fdo#109271]) +17 similar issues
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/fi-bsw-kefka/igt@amdgpu/amd_basic@query-info.html
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][2] ([fdo#109315])
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/fi-tgl-1115g4/igt@amdgpu/amd_basic@query-info.html

  * igt@amdgpu/amd_cs_nop@nop-gfx0:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][3] ([fdo#109315] / [i915#2575]) +16 similar issues
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/fi-tgl-1115g4/igt@amdgpu/amd_cs_nop@nop-gfx0.html

  * igt@gem_huc_copy@huc-copy:
    - fi-glk-dsi:         NOTRUN -> [SKIP][4] ([fdo#109271] / [i915#2190])
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/fi-glk-dsi/igt@gem_huc_copy@huc-copy.html
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][5] ([i915#2190])
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/fi-tgl-1115g4/igt@gem_huc_copy@huc-copy.html

  * igt@i915_pm_backlight@basic-brightness:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][6] ([i915#1155])
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/fi-tgl-1115g4/igt@i915_pm_backlight@basic-brightness.html

  * igt@kms_chamelium@common-hpd-after-suspend:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][7] ([fdo#111827]) +8 similar issues
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/fi-tgl-1115g4/igt@kms_chamelium@common-hpd-after-suspend.html

  * igt@kms_chamelium@hdmi-hpd-fast:
    - fi-glk-dsi:         NOTRUN -> [SKIP][8] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/fi-glk-dsi/igt@kms_chamelium@hdmi-hpd-fast.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][9] ([i915#4103]) +1 similar issue
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/fi-tgl-1115g4/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic.html

  * igt@kms_force_connector_basic@force-load-detect:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][10] ([fdo#109285])
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/fi-tgl-1115g4/igt@kms_force_connector_basic@force-load-detect.html

  * igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-d:
    - fi-glk-dsi:         NOTRUN -> [SKIP][11] ([fdo#109271] / [i915#533])
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/fi-glk-dsi/igt@kms_pipe_crc_basic@compare-crc-sanitycheck-pipe-d.html

  * igt@kms_psr@primary_mmap_gtt:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][12] ([i915#1072]) +3 similar issues
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/fi-tgl-1115g4/igt@kms_psr@primary_mmap_gtt.html

  * igt@kms_psr@primary_page_flip:
    - fi-glk-dsi:         NOTRUN -> [SKIP][13] ([fdo#109271]) +30 similar issues
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/fi-glk-dsi/igt@kms_psr@primary_page_flip.html

  * igt@prime_vgem@basic-userptr:
    - fi-tgl-1115g4:      NOTRUN -> [SKIP][14] ([i915#3301])
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/fi-tgl-1115g4/igt@prime_vgem@basic-userptr.html

  * igt@runner@aborted:
    - fi-bdw-5557u:       NOTRUN -> [FAIL][15] ([i915#1602] / [i915#2426] / [i915#4312])
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/fi-bdw-5557u/igt@runner@aborted.html

  
#### Possible fixes ####

  * igt@i915_selftest@live@execlists:
    - fi-bsw-kefka:       [INCOMPLETE][16] ([i915#2940]) -> [PASS][17]
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/fi-bsw-kefka/igt@i915_selftest@live@execlists.html
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/fi-bsw-kefka/igt@i915_selftest@live@execlists.html

  * igt@i915_selftest@live@gt_heartbeat:
    - fi-bxt-dsi:         [DMESG-FAIL][18] ([i915#541]) -> [PASS][19]
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/fi-bxt-dsi/igt@i915_selftest@live@gt_heartbeat.html
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/fi-bxt-dsi/igt@i915_selftest@live@gt_heartbeat.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109285]: https://bugs.freedesktop.org/show_bug.cgi?id=109285
  [fdo#109315]: https://bugs.freedesktop.org/show_bug.cgi?id=109315
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [i915#1072]: https://gitlab.freedesktop.org/drm/intel/issues/1072
  [i915#1155]: https://gitlab.freedesktop.org/drm/intel/issues/1155
  [i915#1602]: https://gitlab.freedesktop.org/drm/intel/issues/1602
  [i915#2190]: https://gitlab.freedesktop.org/drm/intel/issues/2190
  [i915#2426]: https://gitlab.freedesktop.org/drm/intel/issues/2426
  [i915#2575]: https://gitlab.freedesktop.org/drm/intel/issues/2575
  [i915#2940]: https://gitlab.freedesktop.org/drm/intel/issues/2940
  [i915#3301]: https://gitlab.freedesktop.org/drm/intel/issues/3301
  [i915#3303]: https://gitlab.freedesktop.org/drm/intel/issues/3303
  [i915#4103]: https://gitlab.freedesktop.org/drm/intel/issues/4103
  [i915#4312]: https://gitlab.freedesktop.org/drm/intel/issues/4312
  [i915#533]: https://gitlab.freedesktop.org/drm/intel/issues/533
  [i915#541]: https://gitlab.freedesktop.org/drm/intel/issues/541


Build changes
-------------

  * Linux: CI_DRM_10834 -> Patchwork_21513

  CI-20190529: 20190529
  CI_DRM_10834: a8a5c5eeb0e76534b999100eabad8b673d2ed310 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_6269: 0dfc3834f0e07badf5b6149c634807ddae119c88 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_21513: 82ff013d48e85781e5f5e8ad523200f9e8c2a6f2 @ git://anongit.freedesktop.org/gfx-ci/linux


== Linux commits ==

82ff013d48e8 drm/i915/pmu: Fix synchronization of PMU callback with reset

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/index.html



* [Intel-gfx] ✗ Fi.CI.IGT: failure for drm/i915/pmu: Fix synchronization of PMU callback with reset
  2021-11-03 22:47 [Intel-gfx] [PATCH] drm/i915/pmu: Fix synchronization of PMU callback with reset Umesh Nerlige Ramappa
  2021-11-03 23:47 ` [Intel-gfx] ✓ Fi.CI.BAT: success for " Patchwork
@ 2021-11-04  0:55 ` Patchwork
  2021-11-04 15:57 ` [Intel-gfx] [PATCH] " Matthew Brost
  2021-11-04 17:37 ` Tvrtko Ursulin
  3 siblings, 0 replies; 13+ messages in thread
From: Patchwork @ 2021-11-04  0:55 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa; +Cc: intel-gfx


== Series Details ==

Series: drm/i915/pmu: Fix synchronization of PMU callback with reset
URL   : https://patchwork.freedesktop.org/series/96543/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_10834_full -> Patchwork_21513_full
===================================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_21513_full absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_21513_full, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  

Participating hosts (10 -> 10)
------------------------------

  No changes in participating hosts

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_21513_full:

### IGT changes ###

#### Possible regressions ####

  * igt@kms_cursor_crc@pipe-c-cursor-suspend:
    - shard-kbl:          [PASS][1] -> [INCOMPLETE][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-kbl1/igt@kms_cursor_crc@pipe-c-cursor-suspend.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-kbl4/igt@kms_cursor_crc@pipe-c-cursor-suspend.html

  
Known issues
------------

  Here are the changes found in Patchwork_21513_full that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@gem_exec_fair@basic-deadline:
    - shard-skl:          NOTRUN -> [FAIL][3] ([i915#2846])
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl4/igt@gem_exec_fair@basic-deadline.html

  * igt@gem_exec_fair@basic-none@vcs0:
    - shard-apl:          [PASS][4] -> [FAIL][5] ([i915#2842])
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-apl8/igt@gem_exec_fair@basic-none@vcs0.html
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-apl7/igt@gem_exec_fair@basic-none@vcs0.html
    - shard-tglb:         NOTRUN -> [FAIL][6] ([i915#2842]) +5 similar issues
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb2/igt@gem_exec_fair@basic-none@vcs0.html

  * igt@gem_exec_fair@basic-pace-share@rcs0:
    - shard-tglb:         [PASS][7] -> [FAIL][8] ([i915#2842])
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-tglb2/igt@gem_exec_fair@basic-pace-share@rcs0.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb8/igt@gem_exec_fair@basic-pace-share@rcs0.html

  * igt@gem_exec_fair@basic-pace-solo@rcs0:
    - shard-glk:          [PASS][9] -> [FAIL][10] ([i915#2842]) +1 similar issue
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-glk7/igt@gem_exec_fair@basic-pace-solo@rcs0.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-glk1/igt@gem_exec_fair@basic-pace-solo@rcs0.html

  * igt@gem_exec_fair@basic-pace@vecs0:
    - shard-kbl:          [PASS][11] -> [FAIL][12] ([i915#2842]) +2 similar issues
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-kbl4/igt@gem_exec_fair@basic-pace@vecs0.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-kbl4/igt@gem_exec_fair@basic-pace@vecs0.html

  * igt@gem_exec_whisper@basic-contexts-priority:
    - shard-glk:          [PASS][13] -> [DMESG-WARN][14] ([i915#118])
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-glk1/igt@gem_exec_whisper@basic-contexts-priority.html
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-glk2/igt@gem_exec_whisper@basic-contexts-priority.html

  * igt@gem_pread@exhaustion:
    - shard-skl:          NOTRUN -> [WARN][15] ([i915#2658])
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl6/igt@gem_pread@exhaustion.html

  * igt@gem_pxp@protected-raw-src-copy-not-readible:
    - shard-skl:          NOTRUN -> ([SKIP][16], [SKIP][17]) ([fdo#109271]) +6 similar issues
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl4/igt@gem_pxp@protected-raw-src-copy-not-readible.html
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl3/igt@gem_pxp@protected-raw-src-copy-not-readible.html

  * igt@gem_pxp@regular-baseline-src-copy-readible:
    - shard-kbl:          NOTRUN -> [SKIP][18] ([fdo#109271]) +95 similar issues
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-kbl3/igt@gem_pxp@regular-baseline-src-copy-readible.html

  * igt@gem_userptr_blits@coherency-sync:
    - shard-tglb:         NOTRUN -> [SKIP][19] ([fdo#110542])
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb6/igt@gem_userptr_blits@coherency-sync.html

  * igt@gem_userptr_blits@input-checking:
    - shard-skl:          NOTRUN -> ([DMESG-WARN][20], [DMESG-WARN][21]) ([i915#3002])
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl3/igt@gem_userptr_blits@input-checking.html
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl6/igt@gem_userptr_blits@input-checking.html

  * igt@gem_userptr_blits@unsync-unmap-after-close:
    - shard-tglb:         NOTRUN -> [SKIP][22] ([i915#3297])
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb3/igt@gem_userptr_blits@unsync-unmap-after-close.html

  * igt@gen9_exec_parse@allowed-single:
    - shard-skl:          [PASS][23] -> [DMESG-WARN][24] ([i915#1436] / [i915#716])
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-skl5/igt@gen9_exec_parse@allowed-single.html
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl4/igt@gen9_exec_parse@allowed-single.html

  * igt@gen9_exec_parse@bb-oversize:
    - shard-tglb:         NOTRUN -> [SKIP][25] ([i915#2856])
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb3/igt@gen9_exec_parse@bb-oversize.html

  * igt@i915_pm_lpsp@kms-lpsp@kms-lpsp-dp:
    - shard-apl:          NOTRUN -> [SKIP][26] ([fdo#109271] / [i915#1937])
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-apl3/igt@i915_pm_lpsp@kms-lpsp@kms-lpsp-dp.html

  * igt@i915_pm_lpsp@screens-disabled:
    - shard-tglb:         NOTRUN -> [SKIP][27] ([i915#1902])
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb3/igt@i915_pm_lpsp@screens-disabled.html

  * igt@kms_big_fb@x-tiled-8bpp-rotate-270:
    - shard-tglb:         NOTRUN -> [SKIP][28] ([fdo#111614])
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb3/igt@kms_big_fb@x-tiled-8bpp-rotate-270.html

  * igt@kms_big_fb@x-tiled-max-hw-stride-64bpp-rotate-180-async-flip:
    - shard-skl:          NOTRUN -> [FAIL][29] ([i915#3743]) +1 similar issue
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl6/igt@kms_big_fb@x-tiled-max-hw-stride-64bpp-rotate-180-async-flip.html

  * igt@kms_big_fb@y-tiled-max-hw-stride-64bpp-rotate-180-async-flip:
    - shard-skl:          NOTRUN -> [FAIL][30] ([i915#3763])
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl3/igt@kms_big_fb@y-tiled-max-hw-stride-64bpp-rotate-180-async-flip.html

  * igt@kms_big_fb@yf-tiled-max-hw-stride-32bpp-rotate-0-hflip:
    - shard-skl:          NOTRUN -> [SKIP][31] ([fdo#109271] / [i915#3777]) +2 similar issues
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl1/igt@kms_big_fb@yf-tiled-max-hw-stride-32bpp-rotate-0-hflip.html

  * igt@kms_big_fb@yf-tiled-max-hw-stride-32bpp-rotate-180-hflip:
    - shard-kbl:          NOTRUN -> [SKIP][32] ([fdo#109271] / [i915#3777])
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-kbl7/igt@kms_big_fb@yf-tiled-max-hw-stride-32bpp-rotate-180-hflip.html

  * igt@kms_ccs@pipe-a-ccs-on-another-bo-y_tiled_gen12_rc_ccs_cc:
    - shard-skl:          NOTRUN -> [SKIP][33] ([fdo#109271] / [i915#3886]) +14 similar issues
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl9/igt@kms_ccs@pipe-a-ccs-on-another-bo-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_ccs@pipe-b-bad-aux-stride-y_tiled_gen12_rc_ccs_cc:
    - shard-kbl:          NOTRUN -> [SKIP][34] ([fdo#109271] / [i915#3886]) +3 similar issues
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-kbl2/igt@kms_ccs@pipe-b-bad-aux-stride-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_ccs@pipe-b-crc-primary-basic-y_tiled_gen12_mc_ccs:
    - shard-apl:          NOTRUN -> [SKIP][35] ([fdo#109271] / [i915#3886])
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-apl1/igt@kms_ccs@pipe-b-crc-primary-basic-y_tiled_gen12_mc_ccs.html

  * igt@kms_ccs@pipe-b-random-ccs-data-y_tiled_gen12_rc_ccs_cc:
    - shard-skl:          NOTRUN -> ([SKIP][36], [SKIP][37]) ([fdo#109271] / [i915#3886])
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl3/igt@kms_ccs@pipe-b-random-ccs-data-y_tiled_gen12_rc_ccs_cc.html
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl1/igt@kms_ccs@pipe-b-random-ccs-data-y_tiled_gen12_rc_ccs_cc.html

  * igt@kms_ccs@pipe-c-crc-primary-rotation-180-yf_tiled_ccs:
    - shard-tglb:         NOTRUN -> [SKIP][38] ([i915#3689]) +4 similar issues
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb2/igt@kms_ccs@pipe-c-crc-primary-rotation-180-yf_tiled_ccs.html

  * igt@kms_chamelium@hdmi-hpd-storm-disable:
    - shard-skl:          NOTRUN -> [SKIP][39] ([fdo#109271] / [fdo#111827]) +29 similar issues
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl6/igt@kms_chamelium@hdmi-hpd-storm-disable.html

  * igt@kms_color@pipe-b-ctm-0-25:
    - shard-skl:          [PASS][40] -> [DMESG-WARN][41] ([i915#1982]) +1 similar issue
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-skl10/igt@kms_color@pipe-b-ctm-0-25.html
   [41]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl7/igt@kms_color@pipe-b-ctm-0-25.html

  * igt@kms_color_chamelium@pipe-c-ctm-0-25:
    - shard-kbl:          NOTRUN -> [SKIP][42] ([fdo#109271] / [fdo#111827]) +7 similar issues
   [42]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-kbl2/igt@kms_color_chamelium@pipe-c-ctm-0-25.html

  * igt@kms_color_chamelium@pipe-c-ctm-max:
    - shard-apl:          NOTRUN -> [SKIP][43] ([fdo#109271] / [fdo#111827]) +4 similar issues
   [43]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-apl3/igt@kms_color_chamelium@pipe-c-ctm-max.html

  * igt@kms_color_chamelium@pipe-d-ctm-red-to-blue:
    - shard-tglb:         NOTRUN -> [SKIP][44] ([fdo#109284] / [fdo#111827]) +5 similar issues
   [44]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb2/igt@kms_color_chamelium@pipe-d-ctm-red-to-blue.html

  * igt@kms_content_protection@dp-mst-type-1:
    - shard-tglb:         NOTRUN -> [SKIP][45] ([i915#3116])
   [45]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb3/igt@kms_content_protection@dp-mst-type-1.html

  * igt@kms_cursor_crc@pipe-b-cursor-32x10-sliding:
    - shard-tglb:         NOTRUN -> [SKIP][46] ([i915#3359]) +5 similar issues
   [46]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb2/igt@kms_cursor_crc@pipe-b-cursor-32x10-sliding.html

  * igt@kms_cursor_crc@pipe-d-cursor-32x32-sliding:
    - shard-tglb:         NOTRUN -> [SKIP][47] ([i915#3319]) +1 similar issue
   [47]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb3/igt@kms_cursor_crc@pipe-d-cursor-32x32-sliding.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor-legacy:
    - shard-tglb:         NOTRUN -> [SKIP][48] ([i915#4103])
   [48]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb3/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-legacy.html

  * igt@kms_cursor_legacy@cursora-vs-flipb-atomic:
    - shard-tglb:         NOTRUN -> [SKIP][49] ([fdo#111825]) +17 similar issues
   [49]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb3/igt@kms_cursor_legacy@cursora-vs-flipb-atomic.html

  * igt@kms_dither@fb-8bpc-vs-panel-8bpc@edp-1-pipe-a:
    - shard-tglb:         NOTRUN -> [SKIP][50] ([i915#3788])
   [50]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb3/igt@kms_dither@fb-8bpc-vs-panel-8bpc@edp-1-pipe-a.html

  * igt@kms_fbcon_fbt@fbc-suspend:
    - shard-kbl:          [PASS][51] -> [INCOMPLETE][52] ([i915#180] / [i915#636])
   [51]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-kbl6/igt@kms_fbcon_fbt@fbc-suspend.html
   [52]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-kbl7/igt@kms_fbcon_fbt@fbc-suspend.html

  * igt@kms_flip@flip-vs-suspend@b-dp1:
    - shard-apl:          [PASS][53] -> [DMESG-WARN][54] ([i915#180]) +1 similar issue
   [53]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-apl4/igt@kms_flip@flip-vs-suspend@b-dp1.html
   [54]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-apl6/igt@kms_flip@flip-vs-suspend@b-dp1.html

  * igt@kms_flip@flip-vs-suspend@b-edp1:
    - shard-tglb:         [PASS][55] -> [DMESG-WARN][56] ([i915#2411] / [i915#2867])
   [55]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-tglb2/igt@kms_flip@flip-vs-suspend@b-edp1.html
   [56]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb6/igt@kms_flip@flip-vs-suspend@b-edp1.html

  * igt@kms_flip@plain-flip-fb-recreate-interruptible@c-edp1:
    - shard-skl:          [PASS][57] -> [FAIL][58] ([i915#2122])
   [57]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-skl9/igt@kms_flip@plain-flip-fb-recreate-interruptible@c-edp1.html
   [58]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl5/igt@kms_flip@plain-flip-fb-recreate-interruptible@c-edp1.html

  * igt@kms_flip@plain-flip-ts-check@c-edp1:
    - shard-skl:          NOTRUN -> [FAIL][59] ([i915#2122])
   [59]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl6/igt@kms_flip@plain-flip-ts-check@c-edp1.html

  * igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs:
    - shard-skl:          NOTRUN -> [INCOMPLETE][60] ([i915#3699])
   [60]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl5/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytileccs.html

  * igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytilegen12rcccs:
    - shard-kbl:          NOTRUN -> [SKIP][61] ([fdo#109271] / [i915#2672])
   [61]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-kbl3/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytilegen12rcccs.html
    - shard-skl:          NOTRUN -> [SKIP][62] ([fdo#109271] / [i915#2672])
   [62]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl10/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-32bpp-ytilegen12rcccs.html

  * igt@kms_frontbuffer_tracking@fbc-1p-shrfb-fliptrack-mmap-gtt:
    - shard-skl:          NOTRUN -> [SKIP][63] ([fdo#109271]) +353 similar issues
   [63]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl6/igt@kms_frontbuffer_tracking@fbc-1p-shrfb-fliptrack-mmap-gtt.html

  * igt@kms_hdr@bpc-switch-dpms:
    - shard-skl:          [PASS][64] -> [FAIL][65] ([i915#1188])
   [64]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-skl6/igt@kms_hdr@bpc-switch-dpms.html
   [65]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl4/igt@kms_hdr@bpc-switch-dpms.html

  * igt@kms_hdr@bpc-switch-suspend:
    - shard-kbl:          [PASS][66] -> [DMESG-WARN][67] ([i915#180]) +6 similar issues
   [66]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-kbl2/igt@kms_hdr@bpc-switch-suspend.html
   [67]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-kbl1/igt@kms_hdr@bpc-switch-suspend.html

  * igt@kms_plane_alpha_blend@pipe-a-alpha-transparent-fb:
    - shard-skl:          NOTRUN -> [FAIL][68] ([i915#265])
   [68]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl9/igt@kms_plane_alpha_blend@pipe-a-alpha-transparent-fb.html

  * igt@kms_plane_alpha_blend@pipe-c-alpha-transparent-fb:
    - shard-kbl:          NOTRUN -> [FAIL][69] ([i915#265])
   [69]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-kbl3/igt@kms_plane_alpha_blend@pipe-c-alpha-transparent-fb.html
    - shard-skl:          NOTRUN -> ([FAIL][70], [FAIL][71]) ([i915#265])
   [70]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl10/igt@kms_plane_alpha_blend@pipe-c-alpha-transparent-fb.html
   [71]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl3/igt@kms_plane_alpha_blend@pipe-c-alpha-transparent-fb.html

  * igt@kms_plane_alpha_blend@pipe-c-constant-alpha-max:
    - shard-apl:          NOTRUN -> [FAIL][72] ([fdo#108145] / [i915#265])
   [72]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-apl1/igt@kms_plane_alpha_blend@pipe-c-constant-alpha-max.html

  * igt@kms_plane_alpha_blend@pipe-c-coverage-7efc:
    - shard-skl:          NOTRUN -> [FAIL][73] ([fdo#108145] / [i915#265]) +4 similar issues
   [73]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl8/igt@kms_plane_alpha_blend@pipe-c-coverage-7efc.html

  * igt@kms_plane_multiple@atomic-pipe-a-tiling-yf:
    - shard-tglb:         NOTRUN -> [SKIP][74] ([fdo#112054])
   [74]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb3/igt@kms_plane_multiple@atomic-pipe-a-tiling-yf.html

  * igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area-1:
    - shard-kbl:          NOTRUN -> [SKIP][75] ([fdo#109271] / [i915#658]) +2 similar issues
   [75]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-kbl2/igt@kms_psr2_sf@overlay-plane-update-sf-dmg-area-1.html

  * igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area-3:
    - shard-skl:          NOTRUN -> [SKIP][76] ([fdo#109271] / [i915#658]) +4 similar issues
   [76]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl9/igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area-3.html

  * igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-1:
    - shard-tglb:         NOTRUN -> [SKIP][77] ([i915#2920]) +1 similar issue
   [77]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb2/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-1.html

  * igt@kms_psr@psr2_primary_mmap_cpu:
    - shard-iclb:         [PASS][78] -> [SKIP][79] ([fdo#109441]) +1 similar issue
   [78]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-iclb2/igt@kms_psr@psr2_primary_mmap_cpu.html
   [79]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-iclb8/igt@kms_psr@psr2_primary_mmap_cpu.html

  * igt@kms_psr@psr2_sprite_blt:
    - shard-tglb:         NOTRUN -> [FAIL][80] ([i915#132] / [i915#3467])
   [80]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb2/igt@kms_psr@psr2_sprite_blt.html

  * igt@kms_rotation_crc@primary-yf-tiled-reflect-x-180:
    - shard-tglb:         NOTRUN -> [SKIP][81] ([fdo#111615]) +1 similar issue
   [81]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb2/igt@kms_rotation_crc@primary-yf-tiled-reflect-x-180.html

  * igt@kms_setmode@basic:
    - shard-apl:          [PASS][82] -> [FAIL][83] ([i915#31])
   [82]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-apl3/igt@kms_setmode@basic.html
   [83]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-apl2/igt@kms_setmode@basic.html
    - shard-glk:          [PASS][84] -> [FAIL][85] ([i915#31])
   [84]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-glk9/igt@kms_setmode@basic.html
   [85]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-glk2/igt@kms_setmode@basic.html

  * igt@kms_vblank@pipe-d-wait-idle:
    - shard-skl:          NOTRUN -> [SKIP][86] ([fdo#109271] / [i915#533]) +3 similar issues
   [86]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl6/igt@kms_vblank@pipe-d-wait-idle.html

  * igt@kms_writeback@writeback-fb-id:
    - shard-kbl:          NOTRUN -> [SKIP][87] ([fdo#109271] / [i915#2437])
   [87]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-kbl2/igt@kms_writeback@writeback-fb-id.html

  * igt@nouveau_crc@pipe-c-ctx-flip-skip-current-frame:
    - shard-tglb:         NOTRUN -> [SKIP][88] ([i915#2530]) +2 similar issues
   [88]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb2/igt@nouveau_crc@pipe-c-ctx-flip-skip-current-frame.html

  * igt@perf@polling-parameterized:
    - shard-glk:          [PASS][89] -> [FAIL][90] ([i915#1542])
   [89]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-glk1/igt@perf@polling-parameterized.html
   [90]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-glk2/igt@perf@polling-parameterized.html
    - shard-skl:          [PASS][91] -> [FAIL][92] ([i915#1542])
   [91]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-skl6/igt@perf@polling-parameterized.html
   [92]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl1/igt@perf@polling-parameterized.html

  * igt@prime_nv_api@i915_nv_reimport_twice_check_flink_name:
    - shard-apl:          NOTRUN -> [SKIP][93] ([fdo#109271]) +27 similar issues
   [93]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-apl3/igt@prime_nv_api@i915_nv_reimport_twice_check_flink_name.html
    - shard-tglb:         NOTRUN -> [SKIP][94] ([fdo#109291]) +2 similar issues
   [94]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb6/igt@prime_nv_api@i915_nv_reimport_twice_check_flink_name.html

  * igt@sysfs_clients@busy:
    - shard-tglb:         NOTRUN -> [SKIP][95] ([i915#2994]) +1 similar issue
   [95]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb3/igt@sysfs_clients@busy.html
    - shard-skl:          NOTRUN -> [SKIP][96] ([fdo#109271] / [i915#2994]) +5 similar issues
   [96]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl9/igt@sysfs_clients@busy.html

  * igt@sysfs_clients@pidname:
    - shard-kbl:          NOTRUN -> [SKIP][97] ([fdo#109271] / [i915#2994])
   [97]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-kbl2/igt@sysfs_clients@pidname.html

  
#### Possible fixes ####

  * igt@fbdev@nullptr:
    - shard-skl:          [DMESG-WARN][98] ([i915#1982]) -> [PASS][99]
   [98]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-skl6/igt@fbdev@nullptr.html
   [99]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl1/igt@fbdev@nullptr.html

  * igt@gem_ctx_isolation@preservation-s3@bcs0:
    - shard-tglb:         [INCOMPLETE][100] ([i915#456]) -> [PASS][101] +1 similar issue
   [100]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-tglb7/igt@gem_ctx_isolation@preservation-s3@bcs0.html
   [101]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-tglb3/igt@gem_ctx_isolation@preservation-s3@bcs0.html
    - shard-kbl:          [DMESG-WARN][102] ([i915#180]) -> [PASS][103] +5 similar issues
   [102]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-kbl3/igt@gem_ctx_isolation@preservation-s3@bcs0.html
   [103]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-kbl4/igt@gem_ctx_isolation@preservation-s3@bcs0.html

  * igt@gem_exec_fair@basic-deadline:
    - shard-kbl:          [FAIL][104] ([i915#2846]) -> [PASS][105]
   [104]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-kbl4/igt@gem_exec_fair@basic-deadline.html
   [105]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-kbl7/igt@gem_exec_fair@basic-deadline.html
    - shard-glk:          [FAIL][106] ([i915#2846]) -> [PASS][107]
   [106]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-glk6/igt@gem_exec_fair@basic-deadline.html
   [107]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-glk8/igt@gem_exec_fair@basic-deadline.html

  * igt@gem_exec_fair@basic-pace-share@rcs0:
    - shard-glk:          [FAIL][108] ([i915#2842]) -> [PASS][109] +2 similar issues
   [108]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-glk2/igt@gem_exec_fair@basic-pace-share@rcs0.html
   [109]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-glk8/igt@gem_exec_fair@basic-pace-share@rcs0.html

  * igt@gem_exec_fair@basic-pace@rcs0:
    - shard-kbl:          [SKIP][110] ([fdo#109271]) -> [PASS][111]
   [110]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-kbl4/igt@gem_exec_fair@basic-pace@rcs0.html
   [111]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-kbl4/igt@gem_exec_fair@basic-pace@rcs0.html

  * igt@gem_workarounds@suspend-resume-context:
    - shard-skl:          [INCOMPLETE][112] ([i915#198]) -> [PASS][113]
   [112]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-skl6/igt@gem_workarounds@suspend-resume-context.html
   [113]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl8/igt@gem_workarounds@suspend-resume-context.html

  * igt@kms_big_fb@linear-32bpp-rotate-180:
    - shard-glk:          [DMESG-WARN][114] ([i915#118]) -> [PASS][115] +2 similar issues
   [114]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-glk9/igt@kms_big_fb@linear-32bpp-rotate-180.html
   [115]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-glk7/igt@kms_big_fb@linear-32bpp-rotate-180.html

  * igt@kms_flip@plain-flip-ts-check-interruptible@a-edp1:
    - shard-skl:          [FAIL][116] ([i915#2122]) -> [PASS][117]
   [116]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-skl6/igt@kms_flip@plain-flip-ts-check-interruptible@a-edp1.html
   [117]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl4/igt@kms_flip@plain-flip-ts-check-interruptible@a-edp1.html

  * igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-64bpp-ytile:
    - shard-iclb:         [SKIP][118] ([i915#3701]) -> [PASS][119] +1 similar issue
   [118]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-iclb2/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-64bpp-ytile.html
   [119]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-iclb1/igt@kms_flip_scaled_crc@flip-32bpp-ytile-to-64bpp-ytile.html

  * igt@kms_frontbuffer_tracking@fbc-suspend:
    - shard-apl:          [DMESG-WARN][120] ([i915#180]) -> [PASS][121] +2 similar issues
   [120]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-apl4/igt@kms_frontbuffer_tracking@fbc-suspend.html
   [121]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-apl3/igt@kms_frontbuffer_tracking@fbc-suspend.html

  * igt@kms_psr@psr2_suspend:
    - shard-iclb:         [SKIP][122] ([fdo#109441]) -> [PASS][123] +2 similar issues
   [122]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-iclb1/igt@kms_psr@psr2_suspend.html
   [123]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-iclb2/igt@kms_psr@psr2_suspend.html

  * igt@kms_rotation_crc@primary-rotation-270:
    - shard-glk:          [FAIL][124] ([i915#1888] / [i915#65]) -> [PASS][125]
   [124]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-glk3/igt@kms_rotation_crc@primary-rotation-270.html
   [125]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-glk9/igt@kms_rotation_crc@primary-rotation-270.html

  * igt@perf@blocking:
    - shard-skl:          [FAIL][126] ([i915#1542]) -> [PASS][127]
   [126]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-skl6/igt@perf@blocking.html
   [127]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl4/igt@perf@blocking.html

  * igt@perf_pmu@module-unload:
    - shard-skl:          [DMESG-WARN][128] ([i915#1982] / [i915#262]) -> [PASS][129]
   [128]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-skl8/igt@perf_pmu@module-unload.html
   [129]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-skl9/igt@perf_pmu@module-unload.html

  
#### Warnings ####

  * igt@gem_exec_fair@basic-none-rrul@rcs0:
    - shard-iclb:         [FAIL][130] ([i915#2842]) -> [FAIL][131] ([i915#2852])
   [130]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-iclb5/igt@gem_exec_fair@basic-none-rrul@rcs0.html
   [131]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-iclb7/igt@gem_exec_fair@basic-none-rrul@rcs0.html

  * igt@i915_pm_dc@dc9-dpms:
    - shard-iclb:         [FAIL][132] ([i915#4275]) -> [SKIP][133] ([i915#4281])
   [132]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-iclb7/igt@i915_pm_dc@dc9-dpms.html
   [133]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-iclb3/igt@i915_pm_dc@dc9-dpms.html

  * igt@i915_pm_rc6_residency@rc6-fence:
    - shard-iclb:         [WARN][134] ([i915#1804] / [i915#2684]) -> [WARN][135] ([i915#2684])
   [134]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-iclb4/igt@i915_pm_rc6_residency@rc6-fence.html
   [135]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-iclb5/igt@i915_pm_rc6_residency@rc6-fence.html

  * igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area-4:
    - shard-iclb:         [SKIP][136] ([i915#2920]) -> [SKIP][137] ([i915#658]) +1 similar issue
   [136]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-iclb2/igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area-4.html
   [137]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-iclb1/igt@kms_psr2_sf@overlay-primary-update-sf-dmg-area-4.html

  * igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-2:
    - shard-iclb:         [SKIP][138] ([i915#658]) -> [SKIP][139] ([i915#2920]) +2 similar issues
   [138]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-iclb3/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-2.html
   [139]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-iclb2/igt@kms_psr2_sf@primary-plane-update-sf-dmg-area-2.html

  * igt@kms_psr2_su@page_flip:
    - shard-iclb:         [FAIL][140] ([i915#4148]) -> [SKIP][141] ([fdo#109642] / [fdo#111068] / [i915#658])
   [140]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_10834/shard-iclb2/igt@kms_psr2_su@page_flip.html
   [141]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/shard-iclb8/igt@kms_psr2_su@page_flip.html

  * igt@runner@aborted:
    - shard-kbl:          ([FAIL][142], [FAIL][1

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_21513/index.html

[-- Attachment #2: Type: text/html, Size: 33475 bytes --]

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Intel-gfx] [PATCH] drm/i915/pmu: Fix synchronization of PMU callback with reset
  2021-11-03 22:47 [Intel-gfx] [PATCH] drm/i915/pmu: Fix synchronization of PMU callback with reset Umesh Nerlige Ramappa
  2021-11-03 23:47 ` [Intel-gfx] ✓ Fi.CI.BAT: success for " Patchwork
  2021-11-04  0:55 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
@ 2021-11-04 15:57 ` Matthew Brost
  2021-11-04 17:37 ` Tvrtko Ursulin
  3 siblings, 0 replies; 13+ messages in thread
From: Matthew Brost @ 2021-11-04 15:57 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa; +Cc: intel-gfx

On Wed, Nov 03, 2021 at 03:47:08PM -0700, Umesh Nerlige Ramappa wrote:
> Since the PMU callback runs in irq context, it synchronizes with gt
> reset using the reset count. We could run into a case where the PMU
> callback could read the reset count before it is updated. This has a
> potential of corrupting the busyness stats.
> 
> In addition to the reset count, check if the reset bit is set before
> capturing busyness.
> 
> In addition save the previous stats only if you intend to update them.
> 
> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>

Reviewed-by: Matthew Brost <matthew.brost@intel.com>

> ---
>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 5cc49c0b3889..d83ade77ca07 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1183,6 +1183,7 @@ static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>  	u64 total, gt_stamp_saved;
>  	unsigned long flags;
>  	u32 reset_count;
> +	bool in_reset;
>  
>  	spin_lock_irqsave(&guc->timestamp.lock, flags);
>  
> @@ -1191,7 +1192,9 @@ static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>  	 * engine busyness from GuC, so we just use the driver stored
>  	 * copy of busyness. Synchronize with gt reset using reset_count.
>  	 */
> -	reset_count = i915_reset_count(gpu_error);
> +	rcu_read_lock();
> +	in_reset = test_bit(I915_RESET_BACKOFF, &gt->reset.flags);
> +	rcu_read_unlock();
>  
>  	*now = ktime_get();
>  
> @@ -1201,9 +1204,10 @@ static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>  	 * start_gt_clk is derived from GuC state. To get a consistent
>  	 * view of activity, we query the GuC state only if gt is awake.
>  	 */
> -	stats_saved = *stats;
> -	gt_stamp_saved = guc->timestamp.gt_stamp;
> -	if (intel_gt_pm_get_if_awake(gt)) {
> +	if (intel_gt_pm_get_if_awake(gt) && !in_reset) {
> +		stats_saved = *stats;
> +		gt_stamp_saved = guc->timestamp.gt_stamp;
> +		reset_count = i915_reset_count(gpu_error);
>  		guc_update_engine_gt_clks(engine);
>  		guc_update_pm_timestamp(guc, engine, now);
>  		intel_gt_pm_put_async(gt);
> -- 
> 2.20.1
> 


* Re: [Intel-gfx] [PATCH] drm/i915/pmu: Fix synchronization of PMU callback with reset
  2021-11-03 22:47 [Intel-gfx] [PATCH] drm/i915/pmu: Fix synchronization of PMU callback with reset Umesh Nerlige Ramappa
                   ` (2 preceding siblings ...)
  2021-11-04 15:57 ` [Intel-gfx] [PATCH] " Matthew Brost
@ 2021-11-04 17:37 ` Tvrtko Ursulin
  2021-11-04 22:04   ` Umesh Nerlige Ramappa
  3 siblings, 1 reply; 13+ messages in thread
From: Tvrtko Ursulin @ 2021-11-04 17:37 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa, intel-gfx


On 03/11/2021 22:47, Umesh Nerlige Ramappa wrote:
> Since the PMU callback runs in irq context, it synchronizes with gt
> reset using the reset count. We could run into a case where the PMU
> callback could read the reset count before it is updated. This has a
> potential of corrupting the busyness stats.
> 
> In addition to the reset count, check if the reset bit is set before
> capturing busyness.
> 
> In addition save the previous stats only if you intend to update them.
> 
> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
> ---
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 12 ++++++++----
>   1 file changed, 8 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 5cc49c0b3889..d83ade77ca07 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1183,6 +1183,7 @@ static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>   	u64 total, gt_stamp_saved;
>   	unsigned long flags;
>   	u32 reset_count;
> +	bool in_reset;
>   
>   	spin_lock_irqsave(&guc->timestamp.lock, flags);
>   
> @@ -1191,7 +1192,9 @@ static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>   	 * engine busyness from GuC, so we just use the driver stored
>   	 * copy of busyness. Synchronize with gt reset using reset_count.
>   	 */
> -	reset_count = i915_reset_count(gpu_error);
> +	rcu_read_lock();
> +	in_reset = test_bit(I915_RESET_BACKOFF, &gt->reset.flags);
> +	rcu_read_unlock();

I don't really understand the point of rcu_read_lock over test_bit but I 
guess you copied it from the trylock loop.

>   
>   	*now = ktime_get();
>   
> @@ -1201,9 +1204,10 @@ static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>   	 * start_gt_clk is derived from GuC state. To get a consistent
>   	 * view of activity, we query the GuC state only if gt is awake.
>   	 */
> -	stats_saved = *stats;
> -	gt_stamp_saved = guc->timestamp.gt_stamp;
> -	if (intel_gt_pm_get_if_awake(gt)) {
> +	if (intel_gt_pm_get_if_awake(gt) && !in_reset) {

What is the point of looking at the old value of in_reset here?  Gut 
feeling says if there is a race this does not fix it.

I could not figure out from the commit message what "could read the 
reset count before it is updated" means. I thought the point of reading 
the reset count twice was to be sure there was no reset while in here, 
in which case it is safe to update the software copy. I don't easily 
see what test_bit adds on top.

Regards,

Tvrtko

> +		stats_saved = *stats;
> +		gt_stamp_saved = guc->timestamp.gt_stamp;
> +		reset_count = i915_reset_count(gpu_error);
>   		guc_update_engine_gt_clks(engine);
>   		guc_update_pm_timestamp(guc, engine, now);
>   		intel_gt_pm_put_async(gt);
> 


* Re: [Intel-gfx] [PATCH] drm/i915/pmu: Fix synchronization of PMU callback with reset
  2021-11-04 17:37 ` Tvrtko Ursulin
@ 2021-11-04 22:04   ` Umesh Nerlige Ramappa
  2021-11-11 14:37     ` Tvrtko Ursulin
  0 siblings, 1 reply; 13+ messages in thread
From: Umesh Nerlige Ramappa @ 2021-11-04 22:04 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Thu, Nov 04, 2021 at 05:37:37PM +0000, Tvrtko Ursulin wrote:
>
>On 03/11/2021 22:47, Umesh Nerlige Ramappa wrote:
>>Since the PMU callback runs in irq context, it synchronizes with gt
>>reset using the reset count. We could run into a case where the PMU
>>callback could read the reset count before it is updated. This has a
>>potential of corrupting the busyness stats.
>>
>>In addition to the reset count, check if the reset bit is set before
>>capturing busyness.
>>
>>In addition save the previous stats only if you intend to update them.
>>
>>Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>>---
>>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 12 ++++++++----
>>  1 file changed, 8 insertions(+), 4 deletions(-)
>>
>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>index 5cc49c0b3889..d83ade77ca07 100644
>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>@@ -1183,6 +1183,7 @@ static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>>  	u64 total, gt_stamp_saved;
>>  	unsigned long flags;
>>  	u32 reset_count;
>>+	bool in_reset;
>>  	spin_lock_irqsave(&guc->timestamp.lock, flags);
>>@@ -1191,7 +1192,9 @@ static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>>  	 * engine busyness from GuC, so we just use the driver stored
>>  	 * copy of busyness. Synchronize with gt reset using reset_count.
>>  	 */
>>-	reset_count = i915_reset_count(gpu_error);
>>+	rcu_read_lock();
>>+	in_reset = test_bit(I915_RESET_BACKOFF, &gt->reset.flags);
>>+	rcu_read_unlock();
>
>I don't really understand the point of rcu_read_lock over test_bit but 
>I guess you copied it from the trylock loop.

Yes, I don't see other parts of the code using the lock though. I can 
drop it.

>
>>  	*now = ktime_get();
>>@@ -1201,9 +1204,10 @@ static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
>>  	 * start_gt_clk is derived from GuC state. To get a consistent
>>  	 * view of activity, we query the GuC state only if gt is awake.
>>  	 */
>>-	stats_saved = *stats;
>>-	gt_stamp_saved = guc->timestamp.gt_stamp;
>>-	if (intel_gt_pm_get_if_awake(gt)) {
>>+	if (intel_gt_pm_get_if_awake(gt) && !in_reset) {
>
>What is the point of looking at the old value of in_reset here?  Gut 
>feeling says if there is a race this does not fix it.
>
>I did not figure out from the commit message what does "could read the 
>reset count before it is updated" mean?
>I thought the point of reading 

>the reset count twice was that you are sure there was no reset while 
>in here, in which case it is safe to update the software copy. I don't 
>easily see what test_bit does on top.

This is what I see in the reset flow
---------------

R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
R2) atomic_inc(&gt->i915->gpu_error.reset_count)
R3) reset prepare
R4) do the HW reset

The reset count is updated only once above and that's before an actual 
HW reset happens.

PMU callback flow before this patch
---------------

P1) read reset count
P2) update stats
P3) read reset count
P4) if reset count changed, use old stats. if not use updated stats.

I am concerned that the PMU flow could run right after step (R2). We 
would then wrongly conclude that the count stayed the same and that no 
HW reset happened.

PMU callback flow with this patch
---------------
This would rely on the reset_count only if a reset is not in progress.

P0) test_bit for I915_RESET_BACKOFF
P1) read reset count if not in reset. if in reset, use old stats
P2) update stats
P3) read reset count
P4) if reset count changed, use old stats. if not use updated stats.

Now that I think about it more, I do see one sequence that still needs 
fixing though - P0, R1, R2, P1 - P4. For that, I think I need to re-read 
the BACKOFF bit after reading the reset_count for the first time. 

Modified PMU callback sequence would be:
----------

M0) test_bit for I915_RESET_BACKOFF
M1) read reset count if not in reset, if in reset, use old stats

M1.1) test_bit for I915_RESET_BACKOFF. if set, use old stats. if not, 
use reset_count to synchronize

M2) update stats
M3) read reset count
M4) if reset count changed, use old stats. if not use updated stats.

Thanks,
Umesh

>
>Regards,
>
>Tvrtko
>
>>+		stats_saved = *stats;
>>+		gt_stamp_saved = guc->timestamp.gt_stamp;
>>+		reset_count = i915_reset_count(gpu_error);
>>  		guc_update_engine_gt_clks(engine);
>>  		guc_update_pm_timestamp(guc, engine, now);
>>  		intel_gt_pm_put_async(gt);
>>


* Re: [Intel-gfx] [PATCH] drm/i915/pmu: Fix synchronization of PMU callback with reset
  2021-11-04 22:04   ` Umesh Nerlige Ramappa
@ 2021-11-11 14:37     ` Tvrtko Ursulin
  2021-11-11 16:48       ` Umesh Nerlige Ramappa
  0 siblings, 1 reply; 13+ messages in thread
From: Tvrtko Ursulin @ 2021-11-11 14:37 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa; +Cc: intel-gfx


On 04/11/2021 22:04, Umesh Nerlige Ramappa wrote:
> On Thu, Nov 04, 2021 at 05:37:37PM +0000, Tvrtko Ursulin wrote:
>>
>> On 03/11/2021 22:47, Umesh Nerlige Ramappa wrote:
>>> Since the PMU callback runs in irq context, it synchronizes with gt
>>> reset using the reset count. We could run into a case where the PMU
>>> callback could read the reset count before it is updated. This has a
>>> potential of corrupting the busyness stats.
>>>
>>> In addition to the reset count, check if the reset bit is set before
>>> capturing busyness.
>>>
>>> In addition save the previous stats only if you intend to update them.
>>>
>>> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>>> ---
>>>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 12 ++++++++----
>>>  1 file changed, 8 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> index 5cc49c0b3889..d83ade77ca07 100644
>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>> @@ -1183,6 +1183,7 @@ static ktime_t guc_engine_busyness(struct 
>>> intel_engine_cs *engine, ktime_t *now)
>>>      u64 total, gt_stamp_saved;
>>>      unsigned long flags;
>>>      u32 reset_count;
>>> +    bool in_reset;
>>>      spin_lock_irqsave(&guc->timestamp.lock, flags);
>>> @@ -1191,7 +1192,9 @@ static ktime_t guc_engine_busyness(struct 
>>> intel_engine_cs *engine, ktime_t *now)
>>>       * engine busyness from GuC, so we just use the driver stored
>>>       * copy of busyness. Synchronize with gt reset using reset_count.
>>>       */
>>> -    reset_count = i915_reset_count(gpu_error);
>>> +    rcu_read_lock();
>>> +    in_reset = test_bit(I915_RESET_BACKOFF, &gt->reset.flags);
>>> +    rcu_read_unlock();
>>
>> I don't really understand the point of rcu_read_lock over test_bit but 
>> I guess you copied it from the trylock loop.
> 
> Yes, I don't see other parts of code using the lock though. I can drop it.
> 
>>
>>>      *now = ktime_get();
>>> @@ -1201,9 +1204,10 @@ static ktime_t guc_engine_busyness(struct 
>>> intel_engine_cs *engine, ktime_t *now)
>>>       * start_gt_clk is derived from GuC state. To get a consistent
>>>       * view of activity, we query the GuC state only if gt is awake.
>>>       */
>>> -    stats_saved = *stats;
>>> -    gt_stamp_saved = guc->timestamp.gt_stamp;
>>> -    if (intel_gt_pm_get_if_awake(gt)) {
>>> +    if (intel_gt_pm_get_if_awake(gt) && !in_reset) {
>>
>> What is the point of looking at the old value of in_reset here?  Gut 
>> feeling says if there is a race this does not fix it.
>>
>> I did not figure out from the commit message what does "could read the 
>> reset count before it is updated" mean?
>> I thought the point of reading 
> 
>> the reset count twice was that you are sure there was no reset while 
>> in here, in which case it is safe to update the software copy. I don't 
>> easily see what test_bit does on top.
> 
> This is what I see in the reset flow
> ---------------
> 
> R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
> R2) atomic_inc(&gt->i915->gpu_error.reset_count)
> R3) reset prepare
> R4) do the HW reset
> 
> The reset count is updated only once above and that's before an actual 
> HW reset happens.
> 
> PMU callback flow before this patch
> ---------------
> 
> P1) read reset count
> P2) update stats
> P3) read reset count
> P4) if reset count changed, use old stats. if not use updated stats.
> 
> I am concerned that the PMU flow could run after step (R2). Then we 
> wrongly conclude that the count stayed the same and no HW reset happened.
> 
> PMU callback flow with this patch
> ---------------
> This would rely on the reset_count only if a reset is not in progress.
> 
> P0) test_bit for I915_RESET_BACKOFF
> P1) read reset count if not in reset. if in reset, use old stats
> P2) update stats
> P3) read reset count
> P4) if reset count changed, use old stats. if not use updated stats.
> 
> Now that I think about it more, I do see one sequence that still needs 
> fixing though - P0, R1, R2, P1 - P4. For that, I think I need to re-read 
> the BACKOFF bit after reading the reset_count for the first time.
> Modified PMU callback sequence would be:
> ----------
> 
> M0) test_bit for I915_RESET_BACKOFF
> M1) read reset count if not in reset, if in reset, use old stats
> 
> M1.1) test_bit for I915_RESET_BACKOFF. if set, use old stats. if not, 
> use reset_count to synchronize
> 
> M2) update stats
> M3) read reset count
> M4) if reset count changed, use old stats. if not use updated stats.

You did not end up implementing this flow? Have you since changed your 
mind about whether it is required? Or maybe I am not looking at the 
latest patch.

Is the below the latest?

"""
v2:
- The 2 reset counts captured in the PMU callback can end up being the
   same if they were captured right after the count is incremented in the
   reset flow. This can lead to a bad busyness state. Ensure that reset
   is not in progress when the initial reset count is captured.
"""

Is the key now that you rely on the ordering of atomic_inc and set_bit 
in the reset path? Frankly, I still don't understand why you can get 
away with using a stale in_reset in v2. If you acknowledge it can 
change between sampling and checking, then what is the point of having 
it at all? You still rely solely on the reset count in that case, no?


Regards,

Tvrtko

> 
> Thanks,
> Umesh
> 
>>
>> Regards,
>>
>> Tvrtko
>>
>>> +        stats_saved = *stats;
>>> +        gt_stamp_saved = guc->timestamp.gt_stamp;
>>> +        reset_count = i915_reset_count(gpu_error);
>>>          guc_update_engine_gt_clks(engine);
>>>          guc_update_pm_timestamp(guc, engine, now);
>>>          intel_gt_pm_put_async(gt);
>>>


* Re: [Intel-gfx] [PATCH] drm/i915/pmu: Fix synchronization of PMU callback with reset
  2021-11-11 14:37     ` Tvrtko Ursulin
@ 2021-11-11 16:48       ` Umesh Nerlige Ramappa
  2021-11-20  0:25         ` Umesh Nerlige Ramappa
  2021-11-22 15:44         ` Tvrtko Ursulin
  0 siblings, 2 replies; 13+ messages in thread
From: Umesh Nerlige Ramappa @ 2021-11-11 16:48 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Thu, Nov 11, 2021 at 02:37:43PM +0000, Tvrtko Ursulin wrote:
>
>On 04/11/2021 22:04, Umesh Nerlige Ramappa wrote:
>>On Thu, Nov 04, 2021 at 05:37:37PM +0000, Tvrtko Ursulin wrote:
>>>
>>>On 03/11/2021 22:47, Umesh Nerlige Ramappa wrote:
>>>>Since the PMU callback runs in irq context, it synchronizes with gt
>>>>reset using the reset count. We could run into a case where the PMU
>>>>callback could read the reset count before it is updated. This has a
>>>>potential of corrupting the busyness stats.
>>>>
>>>>In addition to the reset count, check if the reset bit is set before
>>>>capturing busyness.
>>>>
>>>>In addition save the previous stats only if you intend to update them.
>>>>
>>>>Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>>>>---
>>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 12 ++++++++----
>>>> 1 file changed, 8 insertions(+), 4 deletions(-)
>>>>
>>>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>>>>b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>index 5cc49c0b3889..d83ade77ca07 100644
>>>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>@@ -1183,6 +1183,7 @@ static ktime_t guc_engine_busyness(struct 
>>>>intel_engine_cs *engine, ktime_t *now)
>>>>     u64 total, gt_stamp_saved;
>>>>     unsigned long flags;
>>>>     u32 reset_count;
>>>>+    bool in_reset;
>>>>     spin_lock_irqsave(&guc->timestamp.lock, flags);
>>>>@@ -1191,7 +1192,9 @@ static ktime_t guc_engine_busyness(struct 
>>>>intel_engine_cs *engine, ktime_t *now)
>>>>      * engine busyness from GuC, so we just use the driver stored
>>>>      * copy of busyness. Synchronize with gt reset using reset_count.
>>>>      */
>>>>-    reset_count = i915_reset_count(gpu_error);
>>>>+    rcu_read_lock();
>>>>+    in_reset = test_bit(I915_RESET_BACKOFF, &gt->reset.flags);
>>>>+    rcu_read_unlock();
>>>
>>>I don't really understand the point of rcu_read_lock over test_bit 
>>>but I guess you copied it from the trylock loop.
>>
>>Yes, I don't see other parts of code using the lock though. I can drop it.
>>
>>>
>>>>     *now = ktime_get();
>>>>@@ -1201,9 +1204,10 @@ static ktime_t guc_engine_busyness(struct 
>>>>intel_engine_cs *engine, ktime_t *now)
>>>>      * start_gt_clk is derived from GuC state. To get a consistent
>>>>      * view of activity, we query the GuC state only if gt is awake.
>>>>      */
>>>>-    stats_saved = *stats;
>>>>-    gt_stamp_saved = guc->timestamp.gt_stamp;
>>>>-    if (intel_gt_pm_get_if_awake(gt)) {
>>>>+    if (intel_gt_pm_get_if_awake(gt) && !in_reset) {
>>>
>>>What is the point of looking at the old value of in_reset here?  
>>>Gut feeling says if there is a race this does not fix it.
>>>
>>>I did not figure out from the commit message what does "could read 
>>>the reset count before it is updated" mean?
>>>I thought the point of reading
>>
>>>the reset count twice was that you are sure there was no reset 
>>>while in here, in which case it is safe to update the software 
>>>copy. I don't easily see what test_bit does on top.
>>
>>This is what I see in the reset flow
>>---------------
>>
>>R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
>>R2) atomic_inc(&gt->i915->gpu_error.reset_count)
>>R3) reset prepare
>>R4) do the HW reset
>>
>>The reset count is updated only once above and that's before an 
>>actual HW reset happens.
>>
>>PMU callback flow before this patch
>>---------------
>>
>>P1) read reset count
>>P2) update stats
>>P3) read reset count
>>P4) if reset count changed, use old stats. if not use updated stats.
>>
>>I am concerned that the PMU flow could run after step (R2). Then we 
>>wrongly conclude that the count stayed the same and no HW reset 
>>happened.

Here is the problematic sequence: Threads R and P.
------------
R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
R2) atomic_inc(&gt->i915->gpu_error.reset_count)
	P1) read reset count
	P2) update stats
	P3) read reset count
	P4) if reset count changed, use old stats. if not use updated 
stats.
R3) reset prepare
R4) do the HW reset

Do you agree that this is racy? In thread P we don't know whether the 
reset flag was set when we captured the reset count in P1.

>>
>>PMU callback flow with this patch
>>---------------
>>This would rely on the reset_count only if a reset is not in progress.
>>
>>P0) test_bit for I915_RESET_BACKOFF
>>P1) read reset count if not in reset. if in reset, use old stats
>>P2) update stats
>>P3) read reset count
>>P4) if reset count changed, use old stats. if not use updated stats.
>>
>>Now that I think about it more, I do see one sequence that still 
>>needs fixing though - P0, R1, R2, P1 - P4. For that, I think I need 
>>to re-read the BACKOFF bit after reading the reset_count for the 
>>first time.
>>Modified PMU callback sequence would be:
>>----------
>>
>>M0) test_bit for I915_RESET_BACKOFF
>>M1) read reset count if not in reset, if in reset, use old stats
>>
>>M1.1) test_bit for I915_RESET_BACKOFF. if set, use old stats. if 
>>not, use reset_count to synchronize
>>
>>M2) update stats
>>M3) read reset count
>>M4) if reset count changed, use old stats. if not use updated stats.
>
>You did not end up implementing this flow? Have you later changed your 
>mind about whether it is required or not? Or maybe I am not looking at 
>the latest patch.
>
>Is the below the latest?
>
>"""
>v2:
>- The 2 reset counts captured in the PMU callback can end up being the
>  same if they were captured right after the count is incremented in the
>  reset flow. This can lead to a bad busyness state. Ensure that reset
>  is not in progress when the initial reset count is captured.
>"""

Yes, v2 is the latest (maybe the CI results re-ordered the patches). 
Instead of sampling the BACKOFF flag before and after the reset count 
(as in the modified sequence), I just sample it after. The order is 
critical: first sample the reset count and then the reset flag.
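As a sketch of that read order (again a toy model with C11 atomics 
standing in for the kernel's atomic_inc/test_and_set_bit, which also 
order the two writes; not the actual driver code):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of the v2 sampling order, not the i915 code. The reset path
 * writes flag (R1) then count (R2); the sampler reads count first and
 * flag second. If the flag is still clear at the second read, the
 * count sample cannot belong to a reset whose flag was already set,
 * so the later count comparison is meaningful. */
static unsigned int v2_sample(atomic_uint *count, atomic_bool *backoff,
			      bool *in_reset)
{
	unsigned int c = atomic_load(count); /* first: reset count */
	*in_reset = atomic_load(backoff);    /* second: reset flag */
	return c;
}
```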

>
>Is the key now that you rely on ordering of atomic_inc and set_bit in 
>the reset path?

Yes

>Frankly I still don't understand why you can get away with using stale 
>in_reset in v2. If you acknowledge it can change between sampling and 
>checking, then what is the point in having it altogether? You still 
>solely rely on reset count in that case, no?

Correct, but now I know for sure that the first sample of reset_count 
was captured while the reset flag was not set (since I am relying on 
the order of sampling).

About solely using the reset_count, I have listed the problematic 
sequence above to highlight what the issue is.

Thanks,
Umesh

>
>
>Regards,
>
>Tvrtko
>
>>
>>Thanks,
>>Umesh
>>
>>>
>>>Regards,
>>>
>>>Tvrtko
>>>
>>>>+        stats_saved = *stats;
>>>>+        gt_stamp_saved = guc->timestamp.gt_stamp;
>>>>+        reset_count = i915_reset_count(gpu_error);
>>>>         guc_update_engine_gt_clks(engine);
>>>>         guc_update_pm_timestamp(guc, engine, now);
>>>>         intel_gt_pm_put_async(gt);
>>>>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [Intel-gfx] [PATCH] drm/i915/pmu: Fix synchronization of PMU callback with reset
  2021-11-11 16:48       ` Umesh Nerlige Ramappa
@ 2021-11-20  0:25         ` Umesh Nerlige Ramappa
  2021-11-22 15:44         ` Tvrtko Ursulin
  1 sibling, 0 replies; 13+ messages in thread
From: Umesh Nerlige Ramappa @ 2021-11-20  0:25 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

Hi Tvrtko,

Any inputs on this one?

Thanks,
Umesh

On Thu, Nov 11, 2021 at 08:48:10AM -0800, Umesh Nerlige Ramappa wrote:
>On Thu, Nov 11, 2021 at 02:37:43PM +0000, Tvrtko Ursulin wrote:
>>
>>On 04/11/2021 22:04, Umesh Nerlige Ramappa wrote:
>>>On Thu, Nov 04, 2021 at 05:37:37PM +0000, Tvrtko Ursulin wrote:
>>>>
>>>>On 03/11/2021 22:47, Umesh Nerlige Ramappa wrote:
>>>>>Since the PMU callback runs in irq context, it synchronizes with gt
>>>>>reset using the reset count. We could run into a case where the PMU
>>>>>callback could read the reset count before it is updated. This has a
>>>>>potential of corrupting the busyness stats.
>>>>>
>>>>>In addition to the reset count, check if the reset bit is set before
>>>>>capturing busyness.
>>>>>
>>>>>In addition save the previous stats only if you intend to update them.
>>>>>
>>>>>Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>>>>>---
>>>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 12 ++++++++----
>>>>> 1 file changed, 8 insertions(+), 4 deletions(-)
>>>>>
>>>>>diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>>>>>b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>index 5cc49c0b3889..d83ade77ca07 100644
>>>>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>@@ -1183,6 +1183,7 @@ static ktime_t 
>>>>>guc_engine_busyness(struct intel_engine_cs *engine, ktime_t 
>>>>>*now)
>>>>>     u64 total, gt_stamp_saved;
>>>>>     unsigned long flags;
>>>>>     u32 reset_count;
>>>>>+    bool in_reset;
>>>>>     spin_lock_irqsave(&guc->timestamp.lock, flags);
>>>>>@@ -1191,7 +1192,9 @@ static ktime_t 
>>>>>guc_engine_busyness(struct intel_engine_cs *engine, ktime_t 
>>>>>*now)
>>>>>      * engine busyness from GuC, so we just use the driver stored
>>>>>      * copy of busyness. Synchronize with gt reset using reset_count.
>>>>>      */
>>>>>-    reset_count = i915_reset_count(gpu_error);
>>>>>+    rcu_read_lock();
>>>>>+    in_reset = test_bit(I915_RESET_BACKOFF, &gt->reset.flags);
>>>>>+    rcu_read_unlock();
>>>>
>>>>I don't really understand the point of rcu_read_lock over 
>>>>test_bit but I guess you copied it from the trylock loop.
>>>
>>>Yes, I don't see other parts of code using the lock though. I can drop it.
>>>
>>>>
>>>>>     *now = ktime_get();
>>>>>@@ -1201,9 +1204,10 @@ static ktime_t 
>>>>>guc_engine_busyness(struct intel_engine_cs *engine, ktime_t 
>>>>>*now)
>>>>>      * start_gt_clk is derived from GuC state. To get a consistent
>>>>>      * view of activity, we query the GuC state only if gt is awake.
>>>>>      */
>>>>>-    stats_saved = *stats;
>>>>>-    gt_stamp_saved = guc->timestamp.gt_stamp;
>>>>>-    if (intel_gt_pm_get_if_awake(gt)) {
>>>>>+    if (intel_gt_pm_get_if_awake(gt) && !in_reset) {
>>>>
>>>>What is the point of looking at the old value of in_reset here?  
>>>>Gut feeling says if there is a race this does not fix it.
>>>>
>>>>I did not figure out from the commit message what does "could 
>>>>read the reset count before it is updated" mean?
>>>>I thought the point of reading
>>>
>>>>the reset count twice was that you are sure there was no reset 
>>>>while in here, in which case it is safe to update the software 
>>>>copy. I don't easily see what test_bit does on top.
>>>
>>>This is what I see in the reset flow
>>>---------------
>>>
>>>R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
>>>R2) atomic_inc(&gt->i915->gpu_error.reset_count)
>>>R3) reset prepare
>>>R4) do the HW reset
>>>
>>>The reset count is updated only once above and that's before an 
>>>actual HW reset happens.
>>>
>>>PMU callback flow before this patch
>>>---------------
>>>
>>>P1) read reset count
>>>P2) update stats
>>>P3) read reset count
>>>P4) if reset count changed, use old stats. if not use updated stats.
>>>
>>>I am concerned that the PMU flow could run after step (R2). Then 
>>>we wrongly conclude that the count stayed the same and no HW reset 
>>>happened.
>
>Here is the problematic sequence: Threads R and P.
>------------
>R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
>R2) atomic_inc(&gt->i915->gpu_error.reset_count)
>	P1) read reset count
>	P2) update stats
>	P3) read reset count
>	P4) if reset count changed, use old stats. if not use updated stats.
>R3) reset prepare
>R4) do the HW reset
>
>Do you agree that this is racy? In thread P we don't know if the 
>reset flag was set or not when we captured the reset count in P1.
>
>>>
>>>PMU callback flow with this patch
>>>---------------
>>>This would rely on the reset_count only if a reset is not in progress.
>>>
>>>P0) test_bit for I915_RESET_BACKOFF
>>>P1) read reset count if not in reset. if in reset, use old stats
>>>P2) update stats
>>>P3) read reset count
>>>P4) if reset count changed, use old stats. if not use updated stats.
>>>
>>>Now that I think about it more, I do see one sequence that still 
>>>needs fixing though - P0, R1, R2, P1 - P4. For that, I think I 
>>>need to re-read the BACKOFF bit after reading the reset_count for 
>>>the first time.
>>>Modified PMU callback sequence would be:
>>>----------
>>>
>>>M0) test_bit for I915_RESET_BACKOFF
>>>M1) read reset count if not in reset, if in reset, use old stats
>>>
>>>M1.1) test_bit for I915_RESET_BACKOFF. if set, use old stats. if 
>>>not, use reset_count to synchronize
>>>
>>>M2) update stats
>>>M3) read reset count
>>>M4) if reset count changed, use old stats. if not use updated stats.
>>
>>You did not end up implementing this flow? Have you later changed 
>>your mind whether it is required or not? Or maybe I am looking at 
>>not the latest patch.
>>
>>Is the below the latest?
>>
>>"""
>>v2:
>>- The 2 reset counts captured in the PMU callback can end up being the
>> same if they were captured right after the count is incremented in the
>> reset flow. This can lead to a bad busyness state. Ensure that reset
>> is not in progress when the initial reset count is captured.
>>"""
>
>Yes, v2 is the latest (maybe CI results re-ordered the patches). 
>Instead of sampling the BACKOFF flag before and after the reset count 
>(as in the modified sequence), I just sample it after. The order is 
>critical - first sample reset count and then the reset flag.
>
>>
>>Is the key now that you rely on ordering of atomic_inc and set_bit 
>>in the reset path?
>
>Yes
>
>>Frankly I still don't understand why you can get away with using 
>>stale in_reset in v2. If you acknowledge it can change between 
>>sampling and checking, then what is the point in having it 
>>altogether? You still solely rely on reset count in that case, no?
>
>Correct, but now I know for sure that the first sample of reset_count 
>was captured when reset flag was not set (since I am relying on the 
>order of sampling).
>
>About solely using the reset_count, I have listed the problematic 
>sequence above to highlight what the issue is.
>
>Thanks,
>Umesh
>
>>
>>
>>Regards,
>>
>>Tvrtko
>>
>>>
>>>Thanks,
>>>Umesh
>>>
>>>>
>>>>Regards,
>>>>
>>>>Tvrtko
>>>>
>>>>>+        stats_saved = *stats;
>>>>>+        gt_stamp_saved = guc->timestamp.gt_stamp;
>>>>>+        reset_count = i915_reset_count(gpu_error);
>>>>>         guc_update_engine_gt_clks(engine);
>>>>>         guc_update_pm_timestamp(guc, engine, now);
>>>>>         intel_gt_pm_put_async(gt);
>>>>>


* Re: [Intel-gfx] [PATCH] drm/i915/pmu: Fix synchronization of PMU callback with reset
  2021-11-11 16:48       ` Umesh Nerlige Ramappa
  2021-11-20  0:25         ` Umesh Nerlige Ramappa
@ 2021-11-22 15:44         ` Tvrtko Ursulin
  2021-11-22 23:39           ` Umesh Nerlige Ramappa
  1 sibling, 1 reply; 13+ messages in thread
From: Tvrtko Ursulin @ 2021-11-22 15:44 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa; +Cc: intel-gfx


On 11/11/2021 16:48, Umesh Nerlige Ramappa wrote:
> On Thu, Nov 11, 2021 at 02:37:43PM +0000, Tvrtko Ursulin wrote:
>>
>> On 04/11/2021 22:04, Umesh Nerlige Ramappa wrote:
>>> On Thu, Nov 04, 2021 at 05:37:37PM +0000, Tvrtko Ursulin wrote:
>>>>
>>>> On 03/11/2021 22:47, Umesh Nerlige Ramappa wrote:
>>>>> Since the PMU callback runs in irq context, it synchronizes with gt
>>>>> reset using the reset count. We could run into a case where the PMU
>>>>> callback could read the reset count before it is updated. This has a
>>>>> potential of corrupting the busyness stats.
>>>>>
>>>>> In addition to the reset count, check if the reset bit is set before
>>>>> capturing busyness.
>>>>>
>>>>> In addition save the previous stats only if you intend to update them.
>>>>>
>>>>> Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>>>>> ---
>>>>>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 12 ++++++++----
>>>>>  1 file changed, 8 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> index 5cc49c0b3889..d83ade77ca07 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>> @@ -1183,6 +1183,7 @@ static ktime_t guc_engine_busyness(struct 
>>>>> intel_engine_cs *engine, ktime_t *now)
>>>>>      u64 total, gt_stamp_saved;
>>>>>      unsigned long flags;
>>>>>      u32 reset_count;
>>>>> +    bool in_reset;
>>>>>      spin_lock_irqsave(&guc->timestamp.lock, flags);
>>>>> @@ -1191,7 +1192,9 @@ static ktime_t guc_engine_busyness(struct 
>>>>> intel_engine_cs *engine, ktime_t *now)
>>>>>       * engine busyness from GuC, so we just use the driver stored
>>>>>       * copy of busyness. Synchronize with gt reset using reset_count.
>>>>>       */
>>>>> -    reset_count = i915_reset_count(gpu_error);
>>>>> +    rcu_read_lock();
>>>>> +    in_reset = test_bit(I915_RESET_BACKOFF, &gt->reset.flags);
>>>>> +    rcu_read_unlock();
>>>>
>>>> I don't really understand the point of rcu_read_lock over test_bit 
>>>> but I guess you copied it from the trylock loop.
>>>
>>> Yes, I don't see other parts of code using the lock though. I can 
>>> drop it.
>>>
>>>>
>>>>>      *now = ktime_get();
>>>>> @@ -1201,9 +1204,10 @@ static ktime_t guc_engine_busyness(struct 
>>>>> intel_engine_cs *engine, ktime_t *now)
>>>>>       * start_gt_clk is derived from GuC state. To get a consistent
>>>>>       * view of activity, we query the GuC state only if gt is awake.
>>>>>       */
>>>>> -    stats_saved = *stats;
>>>>> -    gt_stamp_saved = guc->timestamp.gt_stamp;
>>>>> -    if (intel_gt_pm_get_if_awake(gt)) {
>>>>> +    if (intel_gt_pm_get_if_awake(gt) && !in_reset) {
>>>>
>>>> What is the point of looking at the old value of in_reset here? Gut 
>>>> feeling says if there is a race this does not fix it.
>>>>
>>>> I did not figure out from the commit message what does "could read 
>>>> the reset count before it is updated" mean?
>>>> I thought the point of reading
>>>
>>>> the reset count twice was that you are sure there was no reset while 
>>>> in here, in which case it is safe to update the software copy. I 
>>>> don't easily see what test_bit does on top.
>>>
>>> This is what I see in the reset flow
>>> ---------------
>>>
>>> R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
>>> R2) atomic_inc(&gt->i915->gpu_error.reset_count)
>>> R3) reset prepare
>>> R4) do the HW reset
>>>
>>> The reset count is updated only once above and that's before an 
>>> actual HW reset happens.
>>>
>>> PMU callback flow before this patch
>>> ---------------
>>>
>>> P1) read reset count
>>> P2) update stats
>>> P3) read reset count
>>> P4) if reset count changed, use old stats. if not use updated stats.
>>>
>>> I am concerned that the PMU flow could run after step (R2). Then we 
>>> wrongly conclude that the count stayed the same and no HW reset 
>>> happened.
> 
> Here is the problematic sequence: Threads R and P.
> ------------
> R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
> R2) atomic_inc(&gt->i915->gpu_error.reset_count)
>      P1) read reset count
>      P2) update stats
>      P3) read reset count
>      P4) if reset count changed, use old stats. if not use updated stats.
> R3) reset prepare
> R4) do the HW reset
> 
> Do you agree that this is racy? In thread P we don't know if the 
> reset flag was set or not when we captured the reset count in P1.
> 
>>>
>>> PMU callback flow with this patch
>>> ---------------
>>> This would rely on the reset_count only if a reset is not in progress.
>>>
>>> P0) test_bit for I915_RESET_BACKOFF
>>> P1) read reset count if not in reset. if in reset, use old stats
>>> P2) update stats
>>> P3) read reset count
>>> P4) if reset count changed, use old stats. if not use updated stats.
>>>
>>> Now that I think about it more, I do see one sequence that still 
>>> needs fixing though - P0, R1, R2, P1 - P4. For that, I think I need 
>>> to re-read the BACKOFF bit after reading the reset_count for the 
>>> first time.
>>> Modified PMU callback sequence would be:
>>> ----------
>>>
>>> M0) test_bit for I915_RESET_BACKOFF
>>> M1) read reset count if not in reset, if in reset, use old stats
>>>
>>> M1.1) test_bit for I915_RESET_BACKOFF. if set, use old stats. if not, 
>>> use reset_count to synchronize
>>>
>>> M2) update stats
>>> M3) read reset count
>>> M4) if reset count changed, use old stats. if not use updated stats.
>>
>> You did not end up implementing this flow? Have you later changed your 
>> mind whether it is required or not? Or maybe I am looking at not the 
>> latest patch.
>>
>> Is the below the latest?
>>
>> """
>> v2:
>> - The 2 reset counts captured in the PMU callback can end up being the
>>  same if they were captured right after the count is incremented in the
>>  reset flow. This can lead to a bad busyness state. Ensure that reset
>>  is not in progress when the initial reset count is captured.
>> """
> 
> Yes, v2 is the latest (maybe CI results re-ordered the patches). Instead 
> of sampling the BACKOFF flag before and after the reset count (as in the 
> modified sequence), I just sample it after. The order is critical - 
> first sample reset count and then the reset flag.
> 
>>
>> Is the key now that you rely on ordering of atomic_inc and set_bit in 
>> the reset path?
> 
> Yes
> 
>> Frankly I still don't understand why you can get away with using 
>> stale in_reset in v2. If you acknowledge it can change between 
>> sampling and checking, then what is the point in having it 
>> altogether? You still solely rely on reset count in that case, no?
> 
> Correct, but now I know for sure that the first sample of reset_count 
> was captured when reset flag was not set (since I am relying on the 
> order of sampling).
> 
> About solely using the reset_count, I have listed the problematic 
> sequence above to highlight what the issue is.

It was this:

"""
R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
R2) atomic_inc(&gt->i915->gpu_error.reset_count)
      P1) read reset count
      P2) update stats
      P3) read reset count
      P4) if reset count changed, use old stats. if not use updated stats.
R3) reset prepare
R4) do the HW reset

Do you agree that this is racy? In thread P we don't know if the reset flag was set or not when we captured the reset count in P1.
"""

Why does it matter whether the reset flag was set or not? Let's see how things are after this patch:

After this patch it ends like this:

      P1) Read and store reset bit
R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
      P2) If reset bit was not set:
            P2.1) read reset count
R2) atomic_inc(&gt->i915->gpu_error.reset_count)
            P2.2) update stats
            P2.3) read reset count
            P2.4) if reset count changed, use old stats. if not use updated stats.
R3) reset prepare
R4) do the HW reset

So the reset bit got set between P1 and P2. How is that then not the same as not looking at the reset bit at all?

Regards,

Tvrtko


* Re: [Intel-gfx] [PATCH] drm/i915/pmu: Fix synchronization of PMU callback with reset
  2021-11-22 15:44         ` Tvrtko Ursulin
@ 2021-11-22 23:39           ` Umesh Nerlige Ramappa
  2021-11-23  9:15             ` Tvrtko Ursulin
  0 siblings, 1 reply; 13+ messages in thread
From: Umesh Nerlige Ramappa @ 2021-11-22 23:39 UTC (permalink / raw)
  To: Tvrtko Ursulin; +Cc: intel-gfx

On Mon, Nov 22, 2021 at 03:44:29PM +0000, Tvrtko Ursulin wrote:
>
>On 11/11/2021 16:48, Umesh Nerlige Ramappa wrote:
>>On Thu, Nov 11, 2021 at 02:37:43PM +0000, Tvrtko Ursulin wrote:
>>>
>>>On 04/11/2021 22:04, Umesh Nerlige Ramappa wrote:
>>>>On Thu, Nov 04, 2021 at 05:37:37PM +0000, Tvrtko Ursulin wrote:
>>>>>
>>>>>On 03/11/2021 22:47, Umesh Nerlige Ramappa wrote:
>>>>>>Since the PMU callback runs in irq context, it synchronizes with gt
>>>>>>reset using the reset count. We could run into a case where the PMU
>>>>>>callback could read the reset count before it is updated. This has a
>>>>>>potential of corrupting the busyness stats.
>>>>>>
>>>>>>In addition to the reset count, check if the reset bit is set before
>>>>>>capturing busyness.
>>>>>>
>>>>>>In addition save the previous stats only if you intend to update them.
>>>>>>
>>>>>>Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
>>>>>>---
>>>>>> drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 12 ++++++++----
>>>>>> 1 file changed, 8 insertions(+), 4 deletions(-)
>>>>>>
>>>>>>diff --git 
>>>>>>a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>>>>>>b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>index 5cc49c0b3889..d83ade77ca07 100644
>>>>>>--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>@@ -1183,6 +1183,7 @@ static ktime_t 
>>>>>>guc_engine_busyness(struct intel_engine_cs *engine, ktime_t 
>>>>>>*now)
>>>>>>     u64 total, gt_stamp_saved;
>>>>>>     unsigned long flags;
>>>>>>     u32 reset_count;
>>>>>>+    bool in_reset;
>>>>>>     spin_lock_irqsave(&guc->timestamp.lock, flags);
>>>>>>@@ -1191,7 +1192,9 @@ static ktime_t 
>>>>>>guc_engine_busyness(struct intel_engine_cs *engine, ktime_t 
>>>>>>*now)
>>>>>>      * engine busyness from GuC, so we just use the driver stored
>>>>>>      * copy of busyness. Synchronize with gt reset using reset_count.
>>>>>>      */
>>>>>>-    reset_count = i915_reset_count(gpu_error);
>>>>>>+    rcu_read_lock();
>>>>>>+    in_reset = test_bit(I915_RESET_BACKOFF, &gt->reset.flags);
>>>>>>+    rcu_read_unlock();
>>>>>
>>>>>I don't really understand the point of rcu_read_lock over 
>>>>>test_bit but I guess you copied it from the trylock loop.
>>>>
>>>>Yes, I don't see other parts of code using the lock though. I 
>>>>can drop it.
>>>>
>>>>>
>>>>>>     *now = ktime_get();
>>>>>>@@ -1201,9 +1204,10 @@ static ktime_t 
>>>>>>guc_engine_busyness(struct intel_engine_cs *engine, ktime_t 
>>>>>>*now)
>>>>>>      * start_gt_clk is derived from GuC state. To get a consistent
>>>>>>      * view of activity, we query the GuC state only if gt is awake.
>>>>>>      */
>>>>>>-    stats_saved = *stats;
>>>>>>-    gt_stamp_saved = guc->timestamp.gt_stamp;
>>>>>>-    if (intel_gt_pm_get_if_awake(gt)) {
>>>>>>+    if (intel_gt_pm_get_if_awake(gt) && !in_reset) {
>>>>>
>>>>>What is the point of looking at the old value of in_reset 
>>>>>here? Gut feeling says if there is a race this does not fix 
>>>>>it.
>>>>>
>>>>>I did not figure out from the commit message what does "could 
>>>>>read the reset count before it is updated" mean?
>>>>>I thought the point of reading
>>>>
>>>>>the reset count twice was that you are sure there was no reset 
>>>>>while in here, in which case it is safe to update the software 
>>>>>copy. I don't easily see what test_bit does on top.
>>>>
>>>>This is what I see in the reset flow
>>>>---------------
>>>>
>>>>R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
>>>>R2) atomic_inc(&gt->i915->gpu_error.reset_count)
>>>>R3) reset prepare
>>>>R4) do the HW reset
>>>>
>>>>The reset count is updated only once above and that's before an 
>>>>actual HW reset happens.
>>>>
>>>>PMU callback flow before this patch
>>>>---------------
>>>>
>>>>P1) read reset count
>>>>P2) update stats
>>>>P3) read reset count
>>>>P4) if reset count changed, use old stats. if not use updated stats.
>>>>
>>>>I am concerned that the PMU flow could run after step (R2). Then 
>>>>we wrongly conclude that the count stayed the same and no HW 
>>>>reset happened.
>>
>>Here is the problematic sequence: Threads R and P.
>>------------
>>R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
>>R2) atomic_inc(&gt->i915->gpu_error.reset_count)
>>     P1) read reset count
>>     P2) update stats
>>     P3) read reset count
>>     P4) if reset count changed, use old stats. if not use updated stats.
>>R3) reset prepare
>>R4) do the HW reset
>>
>>Do you agree that this is racy? In thread P we don't know if the 
>>reset flag was set or not when we captured the reset count in P1.
>>
>>>>
>>>>PMU callback flow with this patch
>>>>---------------
>>>>This would rely on the reset_count only if a reset is not in progress.
>>>>
>>>>P0) test_bit for I915_RESET_BACKOFF
>>>>P1) read reset count if not in reset. if in reset, use old stats
>>>>P2) update stats
>>>>P3) read reset count
>>>>P4) if reset count changed, use old stats. if not use updated stats.
>>>>
>>>>Now that I think about it more, I do see one sequence that still 
>>>>needs fixing though - P0, R1, R2, P1 - P4. For that, I think I 
>>>>need to re-read the BACKOFF bit after reading the reset_count 
>>>>for the first time.
>>>>Modified PMU callback sequence would be:
>>>>----------
>>>>
>>>>M0) test_bit for I915_RESET_BACKOFF
>>>>M1) read reset count if not in reset, if in reset, use old stats
>>>>
>>>>M1.1) test_bit for I915_RESET_BACKOFF. if set, use old stats. if 
>>>>not, use reset_count to synchronize
>>>>
>>>>M2) update stats
>>>>M3) read reset count
>>>>M4) if reset count changed, use old stats. if not use updated stats.
>>>
>>>You did not end up implementing this flow? Have you later changed 
>>>your mind whether it is required or not? Or maybe I am looking at 
>>>not the latest patch.
>>>
>>>Is the below the latest?
>>>
>>>"""
>>>v2:
>>>- The 2 reset counts captured in the PMU callback can end up being the
>>> same if they were captured right after the count is incremented in the
>>> reset flow. This can lead to a bad busyness state. Ensure that reset
>>> is not in progress when the initial reset count is captured.
>>>"""
>>
>>Yes, v2 is the latest (maybe CI results re-ordered the patches). 
>>Instead of sampling the BACKOFF flag before and after the reset 
>>count (as in the modified sequence), I just sample it after. The 
>>order is critical - first sample reset count and then the reset 
>>flag.
>>
>>>
>>>Is the key now that you rely on ordering of atomic_inc and set_bit 
>>>in the reset path?
>>
>>Yes
>>
>>>Frankly I still don't understand why you can get away with using 
>>>stale in_reset in v2. If you acknowledge it can change between 
>>>sampling and checking, then what is the point in having it 
>>>altogether? You still solely rely on reset count in that case, no?
>>
>>Correct, but now I know for sure that the first sample of 
>>reset_count was captured when reset flag was not set (since I am 
>>relying on the order of sampling).
>>
>>About solely using the reset_count, I have listed the problematic 
>>sequence above to highlight what the issue is.
>
>It was this:
>
>"""
>R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
>R2) atomic_inc(&gt->i915->gpu_error.reset_count)
>     P1) read reset count
>     P2) update stats
>     P3) read reset count
>     P4) if reset count changed, use old stats. if not use updated stats.
>R3) reset prepare
>R4) do the HW reset
>
>Do you agree that this is racy? In thread P we don't know if the reset flag was set or not when we captured the reset count in P1.
>"""
>
>Why does it matter whether the reset flag was set or not? Let's see how things are after this patch:
>
>After this patch it ends like this:
>
>     P1) Read and store reset bit
>R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
>     P2) If reset bit was not set:
>           P2.1) read reset count
>R2) atomic_inc(&gt->i915->gpu_error.reset_count)
>           P2.2) update stats
>           P2.3) read reset count
>           P2.4) if reset count changed, use old stats. if not use updated stats.
>R3) reset prepare
>R4) do the HW reset
>
>So the reset bit got set between P1 and P2. How is that then not the same as not looking at the reset bit at all?

But the new sequence in this patch is this:

     P0) read reset count
     P1) Read and store reset bit
R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
     P2) If reset bit was not set:
R2) atomic_inc(&gt->i915->gpu_error.reset_count)
           P2.2) update stats
           P2.3) read reset count
           P2.4) if reset count changed, use old stats. if not use updated stats.
R3) reset prepare
R4) do the HW reset

P2.1 moved to P0 when compared to the sequence you shared above.
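For what it's worth, the whole decision can be modeled in plain C 
(illustrative names and C11 atomics in place of the kernel primitives, 
not the actual driver code):

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of the patched P0-P2.4 decision, not the i915 code:
 * sample the count (P0), then the flag (P1). If the flag was set,
 * fall back to the old stats immediately; otherwise trust the
 * updated stats only if the count is unchanged afterwards. Because
 * the reset path sets the flag before bumping the count, a clear
 * flag at P1 means the P0 count predates any in-progress reset. */
static bool v2_use_updated_stats(atomic_uint *count, atomic_bool *backoff)
{
	unsigned int before = atomic_load(count); /* P0 */
	bool in_reset = atomic_load(backoff);     /* P1 */

	if (in_reset)
		return false;                     /* use old stats */
	/* P2.2: stats would be updated here */
	return atomic_load(count) == before;      /* P2.3, P2.4 */
}
```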

Thanks,
Umesh

>
>Regards,
>
>Tvrtko


* Re: [Intel-gfx] [PATCH] drm/i915/pmu: Fix synchronization of PMU callback with reset
  2021-11-22 23:39           ` Umesh Nerlige Ramappa
@ 2021-11-23  9:15             ` Tvrtko Ursulin
  0 siblings, 0 replies; 13+ messages in thread
From: Tvrtko Ursulin @ 2021-11-23  9:15 UTC (permalink / raw)
  To: Umesh Nerlige Ramappa; +Cc: intel-gfx


On 22/11/2021 23:39, Umesh Nerlige Ramappa wrote:
> On Mon, Nov 22, 2021 at 03:44:29PM +0000, Tvrtko Ursulin wrote:
>>
>> On 11/11/2021 16:48, Umesh Nerlige Ramappa wrote:
>>> On Thu, Nov 11, 2021 at 02:37:43PM +0000, Tvrtko Ursulin wrote:
>>>>
>>>> On 04/11/2021 22:04, Umesh Nerlige Ramappa wrote:
>>>>> On Thu, Nov 04, 2021 at 05:37:37PM +0000, Tvrtko Ursulin wrote:
>>>>>>
>>>>>> On 03/11/2021 22:47, Umesh Nerlige Ramappa wrote:
>>>>>>> Since the PMU callback runs in irq context, it synchronizes with gt
>>>>>>> reset using the reset count. We could run into a case where the PMU
>>>>>>> callback could read the reset count before it is updated. This has a
>>>>>>> potential of corrupting the busyness stats.
>>>>>>>
>>>>>>> In addition to the reset count, check if the reset bit is set before
>>>>>>> capturing busyness.
>>>>>>>
>>>>>>> In addition save the previous stats only if you intend to update 
>>>>>>> them.
>>>>>>>
>>>>>>> Signed-off-by: Umesh Nerlige Ramappa 
>>>>>>> <umesh.nerlige.ramappa@intel.com>
>>>>>>> ---
>>>>>>>  drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 12 ++++++++----
>>>>>>>  1 file changed, 8 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c 
>>>>>>> b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>> index 5cc49c0b3889..d83ade77ca07 100644
>>>>>>> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
>>>>>>> @@ -1183,6 +1183,7 @@ static ktime_t guc_engine_busyness(struct 
>>>>>>> intel_engine_cs *engine, ktime_t *now)
>>>>>>>      u64 total, gt_stamp_saved;
>>>>>>>      unsigned long flags;
>>>>>>>      u32 reset_count;
>>>>>>> +    bool in_reset;
>>>>>>>      spin_lock_irqsave(&guc->timestamp.lock, flags);
>>>>>>> @@ -1191,7 +1192,9 @@ static ktime_t guc_engine_busyness(struct 
>>>>>>> intel_engine_cs *engine, ktime_t *now)
>>>>>>>       * engine busyness from GuC, so we just use the driver stored
>>>>>>>       * copy of busyness. Synchronize with gt reset using 
>>>>>>> reset_count.
>>>>>>>       */
>>>>>>> -    reset_count = i915_reset_count(gpu_error);
>>>>>>> +    rcu_read_lock();
>>>>>>> +    in_reset = test_bit(I915_RESET_BACKOFF, &gt->reset.flags);
>>>>>>> +    rcu_read_unlock();
>>>>>>
>>>>>> I don't really understand the point of rcu_read_lock over test_bit 
>>>>>> but I guess you copied it from the trylock loop.
>>>>>
>>>>> Yes, I don't see other parts of code using the lock though. I can 
>>>>> drop it.
>>>>>
>>>>>>
>>>>>>>      *now = ktime_get();
>>>>>>> @@ -1201,9 +1204,10 @@ static ktime_t guc_engine_busyness(struct 
>>>>>>> intel_engine_cs *engine, ktime_t *now)
>>>>>>>       * start_gt_clk is derived from GuC state. To get a consistent
>>>>>>>       * view of activity, we query the GuC state only if gt is 
>>>>>>> awake.
>>>>>>>       */
>>>>>>> -    stats_saved = *stats;
>>>>>>> -    gt_stamp_saved = guc->timestamp.gt_stamp;
>>>>>>> -    if (intel_gt_pm_get_if_awake(gt)) {
>>>>>>> +    if (intel_gt_pm_get_if_awake(gt) && !in_reset) {
>>>>>>
>>>>>> What is the point of looking at the old value of in_reset here? 
>>>>>> Gut feeling says if there is a race this does not fix it.
>>>>>>
>>>>>> I did not figure out from the commit message what "could read 
>>>>>> the reset count before it is updated" means. I thought the point 
>>>>>> of reading the reset count twice was that you are sure there was 
>>>>>> no reset while in here, in which case it is safe to update the 
>>>>>> software copy. I don't easily see what test_bit does on top.
>>>>>
>>>>> This is what I see in the reset flow
>>>>> ---------------
>>>>>
>>>>> R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
>>>>> R2) atomic_inc(&gt->i915->gpu_error.reset_count)
>>>>> R3) reset prepare
>>>>> R4) do the HW reset
>>>>>
>>>>> The reset count is updated only once above and that's before an 
>>>>> actual HW reset happens.
>>>>>
>>>>> PMU callback flow before this patch
>>>>> ---------------
>>>>>
>>>>> P1) read reset count
>>>>> P2) update stats
>>>>> P3) read reset count
>>>>> P4) if reset count changed, use old stats. if not use updated stats.
>>>>>
>>>>> I am concerned that the PMU flow could run after step (R2). Then we 
>>>>> wrongly conclude that the count stayed the same and no HW reset 
>>>>> happened.
>>>
>>> Here is the problematic sequence: Threads R and P.
>>> ------------
>>> R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
>>> R2) atomic_inc(&gt->i915->gpu_error.reset_count)
>>>     P1) read reset count
>>>     P2) update stats
>>>     P3) read reset count
>>>     P4) if reset count changed, use old stats. if not use updated stats.
>>> R3) reset prepare
>>> R4) do the HW reset
>>>
>>> Do you agree that this is racy? In thread P we don't know if the 
>>> reset flag was set or not when we captured the reset count in P1.
>>>
>>>>>
>>>>> PMU callback flow with this patch
>>>>> ---------------
>>>>> This would rely on the reset_count only if a reset is not in progress.
>>>>>
>>>>> P0) test_bit for I915_RESET_BACKOFF
>>>>> P1) read reset count if not in reset. if in reset, use old stats
>>>>> P2) update stats
>>>>> P3) read reset count
>>>>> P4) if reset count changed, use old stats. if not use updated stats.
>>>>>
>>>>> Now that I think about it more, I do see one sequence that still 
>>>>> needs fixing though - P0, R1, R2, P1 - P4. For that, I think I need 
>>>>> to re-read the BACKOFF bit after reading the reset_count for the 
>>>>> first time.
>>>>> Modified PMU callback sequence would be:
>>>>> ----------
>>>>>
>>>>> M0) test_bit for I915_RESET_BACKOFF
>>>>> M1) read reset count if not in reset, if in reset, use old stats
>>>>>
>>>>> M1.1) test_bit for I915_RESET_BACKOFF. if set, use old stats. if 
>>>>> not, use reset_count to synchronize
>>>>>
>>>>> M2) update stats
>>>>> M3) read reset count
>>>>> M4) if reset count changed, use old stats. if not use updated stats.
>>>>
>>>> You did not end up implementing this flow? Have you later changed 
>>>> your mind about whether it is required or not? Or maybe I am not 
>>>> looking at the latest patch.
>>>>
>>>> Is the below the latest?
>>>>
>>>> """
>>>> v2:
>>>> - The 2 reset counts captured in the PMU callback can end up being the
>>>>  same if they were captured right after the count is incremented in the
>>>>  reset flow. This can lead to a bad busyness state. Ensure that reset
>>>>  is not in progress when the initial reset count is captured.
>>>> """
>>>
>>> Yes, v2 is the latest (maybe CI results re-ordered the patches). 
>>> Instead of sampling the BACKOFF flag before and after the reset count 
>>> (as in the modified sequence), I just sample it after. The order is 
>>> critical: first sample the reset count and then the reset flag.
>>>
>>>>
>>>> Is the key now that you rely on ordering of atomic_inc and set_bit 
>>>> in the reset path?
>>>
>>> Yes
>>>
>>>> Frankly I still don't understand why you can get away with using 
>>>> stale in_reset in v2. If you acknowledge it can change between 
>>>> sampling and checking, then what is the point in having it 
>>>> altogether? You still solely rely on reset count in that case, no?
>>>
>>> Correct, but now I know for sure that the first sample of reset_count 
>>> was captured when reset flag was not set (since I am relying on the 
>>> order of sampling).
>>>
>>> About solely using the reset_count, I have listed the problematic 
>>> sequence above to highlight what the issue is.
>>
>> It was this:
>>
>> """
>> R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
>> R2) atomic_inc(&gt->i915->gpu_error.reset_count)
>>     P1) read reset count
>>     P2) update stats
>>     P3) read reset count
>>     P4) if reset count changed, use old stats. if not use updated stats.
>> R3) reset prepare
>> R4) do the HW reset
>>
>> Do you agree that this is racy? In thread P we don't know if the 
>> reset flag was set or not when we captured the reset count in P1.
>> """
>>
>> Why does it matter if the reset flag was set or not? Let's see how 
>> things are after this patch:
>>
>> After this patch it ends like this:
>>
>>     P1) Read and store reset bit
>> R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
>>     P2) If reset bit was not set:
>>           P2.1) read reset count
>> R2) atomic_inc(&gt->i915->gpu_error.reset_count)
>>           P2.2) update stats
>>           P2.3) read reset count
>>           P2.4) if reset count changed, use old stats. if not use 
>> updated stats.
>> R3) reset prepare
>> R4) do the HW reset
>>
>> So the reset bit got set between P1 and P2. How is that then not the 
>> same as not looking at the reset bit at all?
> 
> But the new sequence in this patch is this:

Oops, I was looking at v1. Okay, I can't come up with any further 
races, acked.

Regards,

Tvrtko

> 
>      P0) read reset count
>      P1) Read and store reset bit
> R1) test_and_set_bit(I915_RESET_BACKOFF, &gt->reset.flags)
>      P2) If reset bit was not set:
> R2) atomic_inc(&gt->i915->gpu_error.reset_count)
>         P2.2) update stats
>         P2.3) read reset count
>         P2.4) if reset count changed, use old stats. if not use updated 
> stats.
> R3) reset prepare
> R4) do the HW reset
> 
> P2.1 moved to P0 when compared to the sequence you shared above.
> 
> Thanks,
> Umesh
> 
>>
>> Regards,
>>
>> Tvrtko


* [Intel-gfx] [PATCH] drm/i915/pmu: Fix synchronization of PMU callback with reset
@ 2021-11-08 21:10 Umesh Nerlige Ramappa
  0 siblings, 0 replies; 13+ messages in thread
From: Umesh Nerlige Ramappa @ 2021-11-08 21:10 UTC (permalink / raw)
  To: intel-gfx, dri-devel

Since the PMU callback runs in irq context, it synchronizes with gt
reset using the reset count. We could run into a case where the PMU
callback reads the reset count before it is updated. This has the
potential of corrupting the busyness stats.

In addition to the reset count, check if the reset bit is set before
capturing busyness.

In addition, save the previous stats only if we intend to update them.

v2:
- The 2 reset counts captured in the PMU callback can end up being the
  same if they were captured right after the count is incremented in the
  reset flow. This can lead to a bad busyness state. Ensure that reset
  is not in progress when the initial reset count is captured.

Signed-off-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
---
 .../gpu/drm/i915/gt/uc/intel_guc_submission.c   | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 5cc49c0b3889..0dfc6032cd6b 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1183,15 +1183,20 @@ static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
 	u64 total, gt_stamp_saved;
 	unsigned long flags;
 	u32 reset_count;
+	bool in_reset;
 
 	spin_lock_irqsave(&guc->timestamp.lock, flags);
 
 	/*
-	 * If a reset happened, we risk reading partially updated
-	 * engine busyness from GuC, so we just use the driver stored
-	 * copy of busyness. Synchronize with gt reset using reset_count.
+	 * If a reset happened, we risk reading partially updated engine
+	 * busyness from GuC, so we just use the driver stored copy of busyness.
+	 * Synchronize with gt reset using reset_count and the
+	 * I915_RESET_BACKOFF flag. Note that reset flow updates the reset_count
+	 * after I915_RESET_BACKOFF flag, so ensure that the reset_count is
+	 * usable by checking the flag afterwards.
 	 */
 	reset_count = i915_reset_count(gpu_error);
+	in_reset = test_bit(I915_RESET_BACKOFF, &gt->reset.flags);
 
 	*now = ktime_get();
 
@@ -1201,9 +1206,9 @@ static ktime_t guc_engine_busyness(struct intel_engine_cs *engine, ktime_t *now)
 	 * start_gt_clk is derived from GuC state. To get a consistent
 	 * view of activity, we query the GuC state only if gt is awake.
 	 */
-	stats_saved = *stats;
-	gt_stamp_saved = guc->timestamp.gt_stamp;
-	if (intel_gt_pm_get_if_awake(gt)) {
+	if (intel_gt_pm_get_if_awake(gt) && !in_reset) {
+		stats_saved = *stats;
+		gt_stamp_saved = guc->timestamp.gt_stamp;
 		guc_update_engine_gt_clks(engine);
 		guc_update_pm_timestamp(guc, engine, now);
 		intel_gt_pm_put_async(gt);
-- 
2.20.1



Thread overview: 13+ messages
2021-11-03 22:47 [Intel-gfx] [PATCH] drm/i915/pmu: Fix synchronization of PMU callback with reset Umesh Nerlige Ramappa
2021-11-03 23:47 ` [Intel-gfx] ✓ Fi.CI.BAT: success for " Patchwork
2021-11-04  0:55 ` [Intel-gfx] ✗ Fi.CI.IGT: failure " Patchwork
2021-11-04 15:57 ` [Intel-gfx] [PATCH] " Matthew Brost
2021-11-04 17:37 ` Tvrtko Ursulin
2021-11-04 22:04   ` Umesh Nerlige Ramappa
2021-11-11 14:37     ` Tvrtko Ursulin
2021-11-11 16:48       ` Umesh Nerlige Ramappa
2021-11-20  0:25         ` Umesh Nerlige Ramappa
2021-11-22 15:44         ` Tvrtko Ursulin
2021-11-22 23:39           ` Umesh Nerlige Ramappa
2021-11-23  9:15             ` Tvrtko Ursulin
2021-11-08 21:10 Umesh Nerlige Ramappa
