[Intel-gfx] [RFC PATCH] drm/i915/gt: Force mcr lock takeover if hardware forgot to release it

intel-gfx.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed

* [Intel-gfx] [RFC PATCH] drm/i915/gt: Force mcr lock takeover if hardware forgot to release it
@ 2023-09-28 13:22 Andi Shyti
  2023-09-28 13:25 ` Andi Shyti
                   ` (3 more replies)
  0 siblings, 4 replies; 6+ messages in thread
From: Andi Shyti @ 2023-09-28 13:22 UTC (permalink / raw)
  To: intel-gfx; +Cc: Matt Roper, Nirmoy Das

While discussing with Nirmoy offline about this other way for
fixing lock contention, he was a bit sceptical about it.

But why not? We know that if we fall into this case it's because
some hardware component has forgotten to release the lock within
100ms. So that we have two possibilities, either bail out or
force the unlock.

Forcing the unlock might not be respectful to the environment,
but, at the end, i915 should have the highest priority.

Nirmoy's solution here[*] is to force the unlock during gt
resume, but what happens if meantime the hardware takes the lock
and doesn't release it?

Open for opinions or profligate rejections :-)

I'm also curious to see what CI has to say about.

[*] https://patchwork.freedesktop.org/series/124397/

Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
Cc: Nirmoy Das <nirmoy.das@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
---
 drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 46 ++++++++++++++++----------
 1 file changed, 28 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
index bf4a933de03a..e3eb3c2ace68 100644
--- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
+++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
@@ -371,14 +371,34 @@ void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags)
 
 	lockdep_assert_not_held(&gt->uncore->lock);
 
-	/*
-	 * Starting with MTL, we need to coordinate not only with other
-	 * driver threads, but also with hardware/firmware agents.  A dedicated
-	 * locking register is used.
-	 */
-	if (GRAPHICS_VER_FULL(gt->i915) >= IP_VER(12, 70))
-		err = wait_for(intel_uncore_read_fw(gt->uncore,
-						    MTL_STEER_SEMAPHORE) == 0x1, 100);
+	do {
+		/*
+		 * Starting with MTL, we need to coordinate not only with other
+		 * driver threads, but also with hardware/firmware agents.  A
+		 * dedicated locking register is used.
+		 */
+		if (GRAPHICS_VER_FULL(gt->i915) >= IP_VER(12, 70))
+			err = wait_for(intel_uncore_read_fw(gt->uncore,
+					      MTL_STEER_SEMAPHORE) == 0x1, 100);
+		else
+			break;
+
+		/*
+		 * In theory we should never fail to acquire the HW semaphore;
+		 * this would indicate some hardware/firmware is misbehaving and
+		 * not releasing it properly.
+		 */
+		if (err == -ETIMEDOUT) {
+			gt_warn(gt,
+				"hardware MCR steering semaphore timed out "
+				"forcing lock takeover\n");
+			/*
+			 * Force lock takeover
+			 */
+			intel_uncore_write_fw(gt->uncore,
+					      MTL_STEER_SEMAPHORE, 0x1);
+		}
+	} while (err != -ETIMEDOUT);
 
 	/*
 	 * Even on platforms with a hardware lock, we'll continue to grab
@@ -389,16 +409,6 @@ void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags)
 	spin_lock_irqsave(&gt->mcr_lock, __flags);
 
 	*flags = __flags;
-
-	/*
-	 * In theory we should never fail to acquire the HW semaphore; this
-	 * would indicate some hardware/firmware is misbehaving and not
-	 * releasing it properly.
-	 */
-	if (err == -ETIMEDOUT) {
-		gt_err_ratelimited(gt, "hardware MCR steering semaphore timed out");
-		add_taint_for_CI(gt->i915, TAINT_WARN);  /* CI is now unreliable */
-	}
 }
 
 /**
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] drm/i915/gt: Force mcr lock takeover if hardware forgot to release it
  2023-09-28 13:22 [Intel-gfx] [RFC PATCH] drm/i915/gt: Force mcr lock takeover if hardware forgot to release it Andi Shyti
@ 2023-09-28 13:25 ` Andi Shyti
  2023-09-28 14:18 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 6+ messages in thread
From: Andi Shyti @ 2023-09-28 13:25 UTC (permalink / raw)
  To: Andi Shyti; +Cc: intel-gfx, Matt Roper, Nirmoy Das

Hi,

> +	do {
> +		/*
> +		 * Starting with MTL, we need to coordinate not only with other
> +		 * driver threads, but also with hardware/firmware agents.  A
> +		 * dedicated locking register is used.
> +		 */
> +		if (GRAPHICS_VER_FULL(gt->i915) >= IP_VER(12, 70))
> +			err = wait_for(intel_uncore_read_fw(gt->uncore,
> +					      MTL_STEER_SEMAPHORE) == 0x1, 100);
> +		else
> +			break;
> +
> +		/*
> +		 * In theory we should never fail to acquire the HW semaphore;
> +		 * this would indicate some hardware/firmware is misbehaving and
> +		 * not releasing it properly.
> +		 */
> +		if (err == -ETIMEDOUT) {
> +			gt_warn(gt,
> +				"hardware MCR steering semaphore timed out "
> +				"forcing lock takeover\n");
> +			/*
> +			 * Force lock takeover
> +			 */
> +			intel_uncore_write_fw(gt->uncore,
> +					      MTL_STEER_SEMAPHORE, 0x1);
> +		}
> +	} while (err != -ETIMEDOUT);

this is '==' of course... now I missed the CI tests.

Andi

^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for drm/i915/gt: Force mcr lock takeover if hardware forgot to release it
  2023-09-28 13:22 [Intel-gfx] [RFC PATCH] drm/i915/gt: Force mcr lock takeover if hardware forgot to release it Andi Shyti
  2023-09-28 13:25 ` Andi Shyti
@ 2023-09-28 14:18 ` Patchwork
  2023-09-28 14:35 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
  2023-09-28 16:25 ` [Intel-gfx] [RFC PATCH] " Matt Roper
  3 siblings, 0 replies; 6+ messages in thread
From: Patchwork @ 2023-09-28 14:18 UTC (permalink / raw)
  To: Andi Shyti; +Cc: intel-gfx

== Series Details ==

Series: drm/i915/gt: Force mcr lock takeover if hardware forgot to release it
URL   : https://patchwork.freedesktop.org/series/124402/
State : warning

== Summary ==

Error: dim checkpatch failed
ae8280e8edc1 drm/i915/gt: Force mcr lock takeover if hardware forgot to release it
-:56: CHECK:PARENTHESIS_ALIGNMENT: Alignment should match open parenthesis
#56: FILE: drivers/gpu/drm/i915/gt/intel_gt_mcr.c:382:
+			err = wait_for(intel_uncore_read_fw(gt->uncore,
+					      MTL_STEER_SEMAPHORE) == 0x1, 100);

total: 0 errors, 0 warnings, 1 checks, 58 lines checked



^ permalink raw reply	[flat|nested] 6+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BAT: failure for drm/i915/gt: Force mcr lock takeover if hardware forgot to release it
  2023-09-28 13:22 [Intel-gfx] [RFC PATCH] drm/i915/gt: Force mcr lock takeover if hardware forgot to release it Andi Shyti
  2023-09-28 13:25 ` Andi Shyti
  2023-09-28 14:18 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
@ 2023-09-28 14:35 ` Patchwork
  2023-09-28 16:25 ` [Intel-gfx] [RFC PATCH] " Matt Roper
  3 siblings, 0 replies; 6+ messages in thread
From: Patchwork @ 2023-09-28 14:35 UTC (permalink / raw)
  To: Andi Shyti; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 11318 bytes --]

== Series Details ==

Series: drm/i915/gt: Force mcr lock takeover if hardware forgot to release it
URL   : https://patchwork.freedesktop.org/series/124402/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_13689 -> Patchwork_124402v1
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_124402v1 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_124402v1, please notify your bug team (lgci.bug.filing@intel.com) to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/index.html

Participating hosts (34 -> 39)
------------------------------

  Additional (7): fi-kbl-soraka bat-kbl-2 fi-cfl-8700k fi-apl-guc fi-kbl-guc fi-ivb-3770 fi-skl-6600u 
  Missing    (2): bat-dg2-9 fi-snb-2520m 

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_124402v1:

### IGT changes ###

#### Possible regressions ####

  * igt@gem_exec_parallel@engines@fds:
    - bat-mtlp-6:         [PASS][1] -> [TIMEOUT][2]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13689/bat-mtlp-6/igt@gem_exec_parallel@engines@fds.html
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/bat-mtlp-6/igt@gem_exec_parallel@engines@fds.html

  * igt@gem_ringfill@basic-all:
    - bat-mtlp-8:         [PASS][3] -> [TIMEOUT][4] +1 other test timeout
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13689/bat-mtlp-8/igt@gem_ringfill@basic-all.html
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/bat-mtlp-8/igt@gem_ringfill@basic-all.html

  * igt@i915_selftest@live@gem_migrate:
    - bat-mtlp-8:         [PASS][5] -> [DMESG-WARN][6] +132 other tests dmesg-warn
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13689/bat-mtlp-8/igt@i915_selftest@live@gem_migrate.html
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/bat-mtlp-8/igt@i915_selftest@live@gem_migrate.html

  * igt@i915_selftest@live@slpc:
    - bat-mtlp-6:         [PASS][7] -> [DMESG-WARN][8] +88 other tests dmesg-warn
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13689/bat-mtlp-6/igt@i915_selftest@live@slpc.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/bat-mtlp-6/igt@i915_selftest@live@slpc.html

  
#### Warnings ####

  * igt@gem_exec_suspend@basic-s0@smem:
    - bat-mtlp-6:         [ABORT][9] ([i915#9262]) -> [TIMEOUT][10]
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13689/bat-mtlp-6/igt@gem_exec_suspend@basic-s0@smem.html
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/bat-mtlp-6/igt@gem_exec_suspend@basic-s0@smem.html

  
#### Suppressed ####

  The following results come from untrusted machines, tests, or statuses.
  They do not affect the overall result.

  * {igt@kms_pm_backlight@basic-brightness@edp-1}:
    - bat-mtlp-8:         [PASS][11] -> [DMESG-WARN][12]
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13689/bat-mtlp-8/igt@kms_pm_backlight@basic-brightness@edp-1.html
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/bat-mtlp-8/igt@kms_pm_backlight@basic-brightness@edp-1.html

  
Known issues
------------

  Here are the changes found in Patchwork_124402v1 that come from known issues:

### IGT changes ###

#### Issues hit ####

  * igt@fbdev@info:
    - bat-kbl-2:          NOTRUN -> [SKIP][13] ([fdo#109271] / [i915#1849])
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/bat-kbl-2/igt@fbdev@info.html
    - fi-kbl-guc:         NOTRUN -> [SKIP][14] ([fdo#109271] / [i915#1849])
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-kbl-guc/igt@fbdev@info.html

  * igt@gem_huc_copy@huc-copy:
    - fi-cfl-8700k:       NOTRUN -> [SKIP][15] ([fdo#109271] / [i915#2190])
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-cfl-8700k/igt@gem_huc_copy@huc-copy.html
    - fi-skl-6600u:       NOTRUN -> [SKIP][16] ([fdo#109271] / [i915#2190])
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-skl-6600u/igt@gem_huc_copy@huc-copy.html
    - fi-kbl-soraka:      NOTRUN -> [SKIP][17] ([fdo#109271] / [i915#2190])
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-kbl-soraka/igt@gem_huc_copy@huc-copy.html

  * igt@gem_lmem_swapping@basic:
    - fi-apl-guc:         NOTRUN -> [SKIP][18] ([fdo#109271] / [i915#4613]) +3 other tests skip
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-apl-guc/igt@gem_lmem_swapping@basic.html
    - fi-kbl-soraka:      NOTRUN -> [SKIP][19] ([fdo#109271] / [i915#4613]) +3 other tests skip
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-kbl-soraka/igt@gem_lmem_swapping@basic.html
    - fi-cfl-8700k:       NOTRUN -> [SKIP][20] ([fdo#109271] / [i915#4613]) +3 other tests skip
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-cfl-8700k/igt@gem_lmem_swapping@basic.html
    - fi-kbl-guc:         NOTRUN -> [SKIP][21] ([fdo#109271] / [i915#4613]) +3 other tests skip
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-kbl-guc/igt@gem_lmem_swapping@basic.html

  * igt@gem_lmem_swapping@parallel-random-engines:
    - bat-kbl-2:          NOTRUN -> [SKIP][22] ([fdo#109271]) +39 other tests skip
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/bat-kbl-2/igt@gem_lmem_swapping@parallel-random-engines.html

  * igt@gem_lmem_swapping@random-engines:
    - fi-skl-6600u:       NOTRUN -> [SKIP][23] ([fdo#109271] / [i915#4613]) +3 other tests skip
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-skl-6600u/igt@gem_lmem_swapping@random-engines.html

  * igt@i915_module_load@load:
    - fi-bsw-n3050:       [PASS][24] -> [DMESG-WARN][25] ([i915#1982])
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13689/fi-bsw-n3050/igt@i915_module_load@load.html
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-bsw-n3050/igt@i915_module_load@load.html

  * igt@i915_selftest@live@gt_heartbeat:
    - bat-mtlp-6:         [PASS][26] -> [DMESG-WARN][27] ([i915#7699]) +1 other test dmesg-warn
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13689/bat-mtlp-6/igt@i915_selftest@live@gt_heartbeat.html
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/bat-mtlp-6/igt@i915_selftest@live@gt_heartbeat.html
    - bat-mtlp-8:         [PASS][28] -> [DMESG-WARN][29] ([i915#7699]) +1 other test dmesg-warn
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_13689/bat-mtlp-8/igt@i915_selftest@live@gt_heartbeat.html
   [29]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/bat-mtlp-8/igt@i915_selftest@live@gt_heartbeat.html

  * igt@i915_selftest@live@gt_pm:
    - fi-kbl-soraka:      NOTRUN -> [DMESG-FAIL][30] ([i915#1886] / [i915#7913])
   [30]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-kbl-soraka/igt@i915_selftest@live@gt_pm.html

  * igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][31] ([fdo#109271]) +9 other tests skip
   [31]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-kbl-soraka/igt@kms_cursor_legacy@basic-busy-flip-before-cursor-atomic.html

  * igt@kms_cursor_legacy@basic-flip-before-cursor-legacy:
    - fi-kbl-guc:         NOTRUN -> [SKIP][32] ([fdo#109271] / [i915#1845]) +8 other tests skip
   [32]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-kbl-guc/igt@kms_cursor_legacy@basic-flip-before-cursor-legacy.html

  * igt@kms_dsc@dsc-basic:
    - fi-skl-6600u:       NOTRUN -> [SKIP][33] ([fdo#109271]) +8 other tests skip
   [33]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-skl-6600u/igt@kms_dsc@dsc-basic.html

  * igt@kms_force_connector_basic@force-load-detect:
    - fi-cfl-8700k:       NOTRUN -> [SKIP][34] ([fdo#109271]) +10 other tests skip
   [34]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-cfl-8700k/igt@kms_force_connector_basic@force-load-detect.html

  * igt@kms_hdmi_inject@inject-audio:
    - fi-apl-guc:         NOTRUN -> [SKIP][35] ([fdo#109271]) +16 other tests skip
   [35]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-apl-guc/igt@kms_hdmi_inject@inject-audio.html
    - fi-kbl-guc:         NOTRUN -> [FAIL][36] ([IGT#3])
   [36]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-kbl-guc/igt@kms_hdmi_inject@inject-audio.html

  * igt@kms_pipe_crc_basic@read-crc-frame-sequence:
    - bat-adlp-9:         NOTRUN -> [SKIP][37] ([i915#3546]) +2 other tests skip
   [37]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/bat-adlp-9/igt@kms_pipe_crc_basic@read-crc-frame-sequence.html

  * igt@kms_pipe_crc_basic@suspend-read-crc@pipe-c-vga-1:
    - fi-ivb-3770:        NOTRUN -> [DMESG-WARN][38] ([i915#8841]) +6 other tests dmesg-warn
   [38]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-ivb-3770/igt@kms_pipe_crc_basic@suspend-read-crc@pipe-c-vga-1.html

  * igt@kms_psr@cursor_plane_move:
    - fi-ivb-3770:        NOTRUN -> [SKIP][39] ([fdo#109271]) +21 other tests skip
   [39]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-ivb-3770/igt@kms_psr@cursor_plane_move.html
    - fi-kbl-guc:         NOTRUN -> [SKIP][40] ([fdo#109271]) +25 other tests skip
   [40]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/fi-kbl-guc/igt@kms_psr@cursor_plane_move.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [IGT#3]: https://gitlab.freedesktop.org/drm/igt-gpu-tools/issues/3
  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [i915#1845]: https://gitlab.freedesktop.org/drm/intel/issues/1845
  [i915#1849]: https://gitlab.freedesktop.org/drm/intel/issues/1849
  [i915#1886]: https://gitlab.freedesktop.org/drm/intel/issues/1886
  [i915#1982]: https://gitlab.freedesktop.org/drm/intel/issues/1982
  [i915#2190]: https://gitlab.freedesktop.org/drm/intel/issues/2190
  [i915#3546]: https://gitlab.freedesktop.org/drm/intel/issues/3546
  [i915#4613]: https://gitlab.freedesktop.org/drm/intel/issues/4613
  [i915#7699]: https://gitlab.freedesktop.org/drm/intel/issues/7699
  [i915#7913]: https://gitlab.freedesktop.org/drm/intel/issues/7913
  [i915#8841]: https://gitlab.freedesktop.org/drm/intel/issues/8841
  [i915#9262]: https://gitlab.freedesktop.org/drm/intel/issues/9262


Build changes
-------------

  * Linux: CI_DRM_13689 -> Patchwork_124402v1

  CI-20190529: 20190529
  CI_DRM_13689: 5933eb0a0717a28e668d33e01a707311d31cebbb @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_7506: 4fdf544bd0a38c5a100ef43c30171827e1c8c442 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_124402v1: 5933eb0a0717a28e668d33e01a707311d31cebbb @ git://anongit.freedesktop.org/gfx-ci/linux


### Linux commits

2e19260e8a12 drm/i915/gt: Force mcr lock takeover if hardware forgot to release it

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_124402v1/index.html

[-- Attachment #2: Type: text/html, Size: 14330 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] drm/i915/gt: Force mcr lock takeover if hardware forgot to release it
  2023-09-28 13:22 [Intel-gfx] [RFC PATCH] drm/i915/gt: Force mcr lock takeover if hardware forgot to release it Andi Shyti
                   ` (2 preceding siblings ...)
  2023-09-28 14:35 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
@ 2023-09-28 16:25 ` Matt Roper
  2023-09-28 16:47   ` Andi Shyti
  3 siblings, 1 reply; 6+ messages in thread
From: Matt Roper @ 2023-09-28 16:25 UTC (permalink / raw)
  To: Andi Shyti; +Cc: intel-gfx, Nirmoy Das

On Thu, Sep 28, 2023 at 03:22:35PM +0200, Andi Shyti wrote:
> While discussing with Nirmoy offline about this other way for
> fixing lock contention, he was a bit sceptical about it.
> 
> But why not? We know that if we fall into this case it's because
> some hardware component has forgotten to release the lock within
> 100ms. So that we have two possibilities, either bail out or
> force the unlock.
> 
> Forcing the unlock might not be respectful to the environment,
> but, at the end, i915 should have the highest priority.
> 
> Nirmoy's solution here[*] is to force the unlock during gt
> resume, but what happens if meantime the hardware takes the lock
> and doesn't release it?
> 
> Open for opinions or profligate rejections :-)
> 
> I'm also curious to see what CI has to say about.
> 
> [*] https://patchwork.freedesktop.org/series/124397/
> 

As far as I can tell, this patch doesn't really do anything beneficial
that I can see.  We already unlock and proceed today if we hit a lock
timeout:

 - intel_gt_mcr_lock
   - attempt to get lock
   - timeout, warn, add CI taint
 - perform MCR register access even if the lock failed
 - intel_gt_mcr_unlock
   - lock is released regardless of whether we obtained it successfully
     at the beginning, or whether someone else was still holding it

With your patch, it looks like you're just adding an extra
unlock/reacquire step before we move on which I don't think accomplishes
anything.  If someone else forgot to release the lock, then we're still
protected from other agents, and we'll take care of releasing it
ourselves once we're done.  If the other agent actually is still using
the lock and they're just going slower than we expected, then when they
finally finish they're just going to blindly unlock; if we're in the
middle of our critical section at that point, they'll release our lock
the same way we released theirs.  The main change here is that when we
hit a timeout, your patch is giving other outside agents a chance to sneak in
and re-grab the lock, further delaying our KMD acquisition.

The real-world IFWI problems we saw, which Nirmoy's series is working
around, is that some boot-time agent simply forgot to ever release the
lock, leaving it locked "forever" so it makes sense to sanitize it
initially.  Load/resume is the only time when it's actually "safe" to
reset/sanitize the lock like that.  If we're getting MCR timeouts during
regular driver operation (i.e., not during the beginning of driver load
or resume), then it either means our timeout values are too quick (i.e.,
we're not giving external agents sufficient time to run their critical
sections), or some piece of system firmware (e.g., pcode) has completely
died in the middle of its critical section.  In the former case, we
probably need to adjust our timeout amount (and possibly work with those
firmware teams to see if they can reduce the size of their critical
sections).  In the latter case, the system is going to be so badly
broken that it doesn't really matter what we do; we're just not going to
have a functioning system anymore at that point and it's not something
the graphics driver has a way of recovering from.

Matt

> Signed-off-by: Andi Shyti <andi.shyti@linux.intel.com>
> Cc: Nirmoy Das <nirmoy.das@intel.com>
> Cc: Matt Roper <matthew.d.roper@intel.com>
> ---
>  drivers/gpu/drm/i915/gt/intel_gt_mcr.c | 46 ++++++++++++++++----------
>  1 file changed, 28 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> index bf4a933de03a..e3eb3c2ace68 100644
> --- a/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> +++ b/drivers/gpu/drm/i915/gt/intel_gt_mcr.c
> @@ -371,14 +371,34 @@ void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags)
>  
>  	lockdep_assert_not_held(&gt->uncore->lock);
>  
> -	/*
> -	 * Starting with MTL, we need to coordinate not only with other
> -	 * driver threads, but also with hardware/firmware agents.  A dedicated
> -	 * locking register is used.
> -	 */
> -	if (GRAPHICS_VER_FULL(gt->i915) >= IP_VER(12, 70))
> -		err = wait_for(intel_uncore_read_fw(gt->uncore,
> -						    MTL_STEER_SEMAPHORE) == 0x1, 100);
> +	do {
> +		/*
> +		 * Starting with MTL, we need to coordinate not only with other
> +		 * driver threads, but also with hardware/firmware agents.  A
> +		 * dedicated locking register is used.
> +		 */
> +		if (GRAPHICS_VER_FULL(gt->i915) >= IP_VER(12, 70))
> +			err = wait_for(intel_uncore_read_fw(gt->uncore,
> +					      MTL_STEER_SEMAPHORE) == 0x1, 100);
> +		else
> +			break;
> +
> +		/*
> +		 * In theory we should never fail to acquire the HW semaphore;
> +		 * this would indicate some hardware/firmware is misbehaving and
> +		 * not releasing it properly.
> +		 */
> +		if (err == -ETIMEDOUT) {
> +			gt_warn(gt,
> +				"hardware MCR steering semaphore timed out "
> +				"forcing lock takeover\n");
> +			/*
> +			 * Force lock takeover
> +			 */
> +			intel_uncore_write_fw(gt->uncore,
> +					      MTL_STEER_SEMAPHORE, 0x1);
> +		}
> +	} while (err != -ETIMEDOUT);
>  
>  	/*
>  	 * Even on platforms with a hardware lock, we'll continue to grab
> @@ -389,16 +409,6 @@ void intel_gt_mcr_lock(struct intel_gt *gt, unsigned long *flags)
>  	spin_lock_irqsave(&gt->mcr_lock, __flags);
>  
>  	*flags = __flags;
> -
> -	/*
> -	 * In theory we should never fail to acquire the HW semaphore; this
> -	 * would indicate some hardware/firmware is misbehaving and not
> -	 * releasing it properly.
> -	 */
> -	if (err == -ETIMEDOUT) {
> -		gt_err_ratelimited(gt, "hardware MCR steering semaphore timed out");
> -		add_taint_for_CI(gt->i915, TAINT_WARN);  /* CI is now unreliable */
> -	}
>  }
>  
>  /**
> -- 
> 2.40.1
> 

-- 
Matt Roper
Graphics Software Engineer
Linux GPU Platform Enablement
Intel Corporation

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Intel-gfx] [RFC PATCH] drm/i915/gt: Force mcr lock takeover if hardware forgot to release it
  2023-09-28 16:25 ` [Intel-gfx] [RFC PATCH] " Matt Roper
@ 2023-09-28 16:47   ` Andi Shyti
  0 siblings, 0 replies; 6+ messages in thread
From: Andi Shyti @ 2023-09-28 16:47 UTC (permalink / raw)
  To: Matt Roper; +Cc: intel-gfx, Nirmoy Das

Hi Matt,

> > While discussing with Nirmoy offline about this other way for
> > fixing lock contention, he was a bit sceptical about it.
> > 
> > But why not? We know that if we fall into this case it's because
> > some hardware component has forgotten to release the lock within
> > 100ms. So that we have two possibilities, either bail out or
> > force the unlock.
> > 
> > Forcing the unlock might not be respectful to the environment,
> > but, at the end, i915 should have the highest priority.
> > 
> > Nirmoy's solution here[*] is to force the unlock during gt
> > resume, but what happens if meantime the hardware takes the lock
> > and doesn't release it?
> > 
> > Open for opinions or profligate rejections :-)
> > 
> > I'm also curious to see what CI has to say about.
> > 
> > [*] https://patchwork.freedesktop.org/series/124397/
> > 
> 
> As far as I can tell, this patch doesn't really do anything beneficial
> that I can see.  We already unlock and proceed today if we hit a lock
> timeout:
> 
>  - intel_gt_mcr_lock
>    - attempt to get lock
>    - timeout, warn, add CI taint
>  - perform MCR register access even if the lock failed
>  - intel_gt_mcr_unlock
>    - lock is released regardless of whether we obtained it successfully
>      at the beginning, or whether someone else was still holding it
> 
> With your patch, it looks like you're just adding an extra
> unlock/reacquire step before we move on which I don't think accomplishes
> anything.  If someone else forgot to release the lock, then we're still
> protected from other agents, and we'll take care of releasing it
> ourselves once we're done.  If the other agent actually is still using
> the lock and they're just going slower than we expected, then when they
> finally finish they're just going to blindly unlock; if we're in the
> middle of our critical section at that point, they'll release our lock
> the same way we released theirs.  The main change here is that when we
> hit a timeout, your patch is giving other outside agents a chance to sneak in
> and re-grab the lock, further delaying our KMD acquisition.
> 
> The real-world IFWI problems we saw, which Nirmoy's series is working
> around, is that some boot-time agent simply forgot to ever release the
> lock, leaving it locked "forever" so it makes sense to sanitize it
> initially.  Load/resume is the only time when it's actually "safe" to
> reset/sanitize the lock like that.  If we're getting MCR timeouts during
> regular driver operation (i.e., not during the beginning of driver load
> or resume), then it either means our timeout values are too quick (i.e.,
> we're not giving external agents sufficient time to run their critical
> sections), or some piece of system firmware (e.g., pcode) has completely
> died in the middle of its critical section.  In the former case, we
> probably need to adjust our timeout amount (and possibly work with those
> firmware teams to see if they can reduce the size of their critical
> sections).  In the latter case, the system is going to be so badly
> broken that it doesn't really matter what we do; we're just not going to
> have a functioning system anymore at that point and it's not something
> the graphics driver has a way of recovering from.

Makes sense... thanks!

Andi

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2023-09-28 16:47 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-28 13:22 [Intel-gfx] [RFC PATCH] drm/i915/gt: Force mcr lock takeover if hardware forgot to release it Andi Shyti
2023-09-28 13:25 ` Andi Shyti
2023-09-28 14:18 ` [Intel-gfx] ✗ Fi.CI.CHECKPATCH: warning for " Patchwork
2023-09-28 14:35 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork
2023-09-28 16:25 ` [Intel-gfx] [RFC PATCH] " Matt Roper
2023-09-28 16:47   ` Andi Shyti

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).