All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/2] Fix for two GuC issues
@ 2022-10-28 19:46 ` John.C.Harrison
  0 siblings, 0 replies; 14+ messages in thread
From: John.C.Harrison @ 2022-10-28 19:46 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

Fix for a deadlock issue between the GuC busyness stats worker and GT
resets. Also fix kernel contexts not getting the correct scheduling
priority at start of day.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>


John Harrison (2):
  drm/i915/guc: Properly initialise kernel contexts
  drm/i915/guc: Don't deadlock busyness stats vs reset

 drivers/gpu/drm/i915/gt/intel_reset.c             | 15 ++++++++++++++-
 drivers/gpu/drm/i915/gt/intel_reset.h             |  1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c |  9 +++++++--
 3 files changed, 22 insertions(+), 3 deletions(-)

-- 
2.37.3


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Intel-gfx] [PATCH 0/2] Fix for two GuC issues
@ 2022-10-28 19:46 ` John.C.Harrison
  0 siblings, 0 replies; 14+ messages in thread
From: John.C.Harrison @ 2022-10-28 19:46 UTC (permalink / raw)
  To: Intel-GFX; +Cc: DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

Fix for a deadlock issue between the GuC busyness stats worker and GT
resets. Also fix kernel contexts not getting the correct scheduling
priority at start of day.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>


John Harrison (2):
  drm/i915/guc: Properly initialise kernel contexts
  drm/i915/guc: Don't deadlock busyness stats vs reset

 drivers/gpu/drm/i915/gt/intel_reset.c             | 15 ++++++++++++++-
 drivers/gpu/drm/i915/gt/intel_reset.h             |  1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c |  9 +++++++--
 3 files changed, 22 insertions(+), 3 deletions(-)

-- 
2.37.3


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH 1/2] drm/i915/guc: Properly initialise kernel contexts
  2022-10-28 19:46 ` [Intel-gfx] " John.C.Harrison
@ 2022-10-28 19:46   ` John.C.Harrison
  -1 siblings, 0 replies; 14+ messages in thread
From: John.C.Harrison @ 2022-10-28 19:46 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

If a context has already been registered prior to first submission
then context init code was not being called. The noticeable effect of
that was the scheduling priority was left at zero (meaning super high
priority) instead of being set to normal. This would occur with
kernel contexts at start of day as they are manually pinned up front
rather than on first submission. So add a call to initialise those
when they are pinned.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 4ccb29f9ac55c..941613be3b9dd 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4111,6 +4111,9 @@ static inline void guc_kernel_context_pin(struct intel_guc *guc,
 	if (context_guc_id_invalid(ce))
 		pin_guc_id(guc, ce);
 
+	if (!test_bit(CONTEXT_GUC_INIT, &ce->flags))
+		guc_context_init(ce);
+
 	try_context_registration(ce, true);
 }
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [Intel-gfx] [PATCH 1/2] drm/i915/guc: Properly initialise kernel contexts
@ 2022-10-28 19:46   ` John.C.Harrison
  0 siblings, 0 replies; 14+ messages in thread
From: John.C.Harrison @ 2022-10-28 19:46 UTC (permalink / raw)
  To: Intel-GFX; +Cc: DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

If a context has already been registered prior to first submission
then context init code was not being called. The noticeable effect of
that was the scheduling priority was left at zero (meaning super high
priority) instead of being set to normal. This would occur with
kernel contexts at start of day as they are manually pinned up front
rather than on first submission. So add a call to initialise those
when they are pinned.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 4ccb29f9ac55c..941613be3b9dd 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -4111,6 +4111,9 @@ static inline void guc_kernel_context_pin(struct intel_guc *guc,
 	if (context_guc_id_invalid(ce))
 		pin_guc_id(guc, ce);
 
+	if (!test_bit(CONTEXT_GUC_INIT, &ce->flags))
+		guc_context_init(ce);
+
 	try_context_registration(ce, true);
 }
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH 2/2] drm/i915/guc: Don't deadlock busyness stats vs reset
  2022-10-28 19:46 ` [Intel-gfx] " John.C.Harrison
@ 2022-10-28 19:46   ` John.C.Harrison
  -1 siblings, 0 replies; 14+ messages in thread
From: John.C.Harrison @ 2022-10-28 19:46 UTC (permalink / raw)
  To: Intel-GFX; +Cc: John Harrison, DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

The engine busyness stats has a worker function to do things like
64bit extend the 32bit hardware counters. The GuC's reset prepare
function flushes out this worker function to ensure no corruption
happens during the reset. Unforunately, the worker function has an
infinite wait for active resets to finish before doing its work. Thus
a deadlock would occur if the worker function had actually started
just as the reset starts.

Update the worker to abort if a reset is in progress rather than
waiting for it to complete. It will still acquire the reset lock in
the case where a reset was not already in progress. So the processing
is still safe from corruption, but the deadlock can no longer occur.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/intel_reset.c             | 15 ++++++++++++++-
 drivers/gpu/drm/i915/gt/intel_reset.h             |  1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c |  6 ++++--
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index 3159df6cdd492..2f48c6e4420ea 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -1407,7 +1407,7 @@ void intel_gt_handle_error(struct intel_gt *gt,
 	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
 }
 
-int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
+static int _intel_gt_reset_trylock(struct intel_gt *gt, int *srcu, bool retry)
 {
 	might_lock(&gt->reset.backoff_srcu);
 	might_sleep();
@@ -1416,6 +1416,9 @@ int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
 	while (test_bit(I915_RESET_BACKOFF, &gt->reset.flags)) {
 		rcu_read_unlock();
 
+		if (!retry)
+			return -EBUSY;
+
 		if (wait_event_interruptible(gt->reset.queue,
 					     !test_bit(I915_RESET_BACKOFF,
 						       &gt->reset.flags)))
@@ -1429,6 +1432,16 @@ int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
 	return 0;
 }
 
+int intel_gt_reset_trylock_noretry(struct intel_gt *gt, int *srcu)
+{
+	return _intel_gt_reset_trylock(gt, srcu, false);
+}
+
+int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
+{
+	return _intel_gt_reset_trylock(gt, srcu, true);
+}
+
 void intel_gt_reset_unlock(struct intel_gt *gt, int tag)
 __releases(&gt->reset.backoff_srcu)
 {
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.h b/drivers/gpu/drm/i915/gt/intel_reset.h
index adc734e673870..7f863726eb6a2 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.h
+++ b/drivers/gpu/drm/i915/gt/intel_reset.h
@@ -38,6 +38,7 @@ int __intel_engine_reset_bh(struct intel_engine_cs *engine,
 
 void __i915_request_reset(struct i915_request *rq, bool guilty);
 
+int __must_check intel_gt_reset_trylock_noretry(struct intel_gt *gt, int *srcu);
 int __must_check intel_gt_reset_trylock(struct intel_gt *gt, int *srcu);
 void intel_gt_reset_unlock(struct intel_gt *gt, int tag);
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 941613be3b9dd..1fa1bc7dde3df 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1401,9 +1401,11 @@ static void guc_timestamp_ping(struct work_struct *wrk)
 
 	/*
 	 * Synchronize with gt reset to make sure the worker does not
-	 * corrupt the engine/guc stats.
+	 * corrupt the engine/guc stats. NB: can't actually block waiting
+	 * for a reset to complete as the reset requires flushing out
+	 * any running worker thread. So waiting would deadlock.
 	 */
-	ret = intel_gt_reset_trylock(gt, &srcu);
+	ret = intel_gt_reset_trylock_noretry(gt, &srcu);
 	if (ret)
 		return;
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [Intel-gfx] [PATCH 2/2] drm/i915/guc: Don't deadlock busyness stats vs reset
@ 2022-10-28 19:46   ` John.C.Harrison
  0 siblings, 0 replies; 14+ messages in thread
From: John.C.Harrison @ 2022-10-28 19:46 UTC (permalink / raw)
  To: Intel-GFX; +Cc: DRI-Devel

From: John Harrison <John.C.Harrison@Intel.com>

The engine busyness stats has a worker function to do things like
64bit extend the 32bit hardware counters. The GuC's reset prepare
function flushes out this worker function to ensure no corruption
happens during the reset. Unforunately, the worker function has an
infinite wait for active resets to finish before doing its work. Thus
a deadlock would occur if the worker function had actually started
just as the reset starts.

Update the worker to abort if a reset is in progress rather than
waiting for it to complete. It will still acquire the reset lock in
the case where a reset was not already in progress. So the processing
is still safe from corruption, but the deadlock can no longer occur.

Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
---
 drivers/gpu/drm/i915/gt/intel_reset.c             | 15 ++++++++++++++-
 drivers/gpu/drm/i915/gt/intel_reset.h             |  1 +
 drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c |  6 ++++--
 3 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
index 3159df6cdd492..2f48c6e4420ea 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.c
+++ b/drivers/gpu/drm/i915/gt/intel_reset.c
@@ -1407,7 +1407,7 @@ void intel_gt_handle_error(struct intel_gt *gt,
 	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
 }
 
-int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
+static int _intel_gt_reset_trylock(struct intel_gt *gt, int *srcu, bool retry)
 {
 	might_lock(&gt->reset.backoff_srcu);
 	might_sleep();
@@ -1416,6 +1416,9 @@ int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
 	while (test_bit(I915_RESET_BACKOFF, &gt->reset.flags)) {
 		rcu_read_unlock();
 
+		if (!retry)
+			return -EBUSY;
+
 		if (wait_event_interruptible(gt->reset.queue,
 					     !test_bit(I915_RESET_BACKOFF,
 						       &gt->reset.flags)))
@@ -1429,6 +1432,16 @@ int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
 	return 0;
 }
 
+int intel_gt_reset_trylock_noretry(struct intel_gt *gt, int *srcu)
+{
+	return _intel_gt_reset_trylock(gt, srcu, false);
+}
+
+int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
+{
+	return _intel_gt_reset_trylock(gt, srcu, true);
+}
+
 void intel_gt_reset_unlock(struct intel_gt *gt, int tag)
 __releases(&gt->reset.backoff_srcu)
 {
diff --git a/drivers/gpu/drm/i915/gt/intel_reset.h b/drivers/gpu/drm/i915/gt/intel_reset.h
index adc734e673870..7f863726eb6a2 100644
--- a/drivers/gpu/drm/i915/gt/intel_reset.h
+++ b/drivers/gpu/drm/i915/gt/intel_reset.h
@@ -38,6 +38,7 @@ int __intel_engine_reset_bh(struct intel_engine_cs *engine,
 
 void __i915_request_reset(struct i915_request *rq, bool guilty);
 
+int __must_check intel_gt_reset_trylock_noretry(struct intel_gt *gt, int *srcu);
 int __must_check intel_gt_reset_trylock(struct intel_gt *gt, int *srcu);
 void intel_gt_reset_unlock(struct intel_gt *gt, int tag);
 
diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
index 941613be3b9dd..1fa1bc7dde3df 100644
--- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
+++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
@@ -1401,9 +1401,11 @@ static void guc_timestamp_ping(struct work_struct *wrk)
 
 	/*
 	 * Synchronize with gt reset to make sure the worker does not
-	 * corrupt the engine/guc stats.
+	 * corrupt the engine/guc stats. NB: can't actually block waiting
+	 * for a reset to complete as the reset requires flushing out
+	 * any running worker thread. So waiting would deadlock.
 	 */
-	ret = intel_gt_reset_trylock(gt, &srcu);
+	ret = intel_gt_reset_trylock_noretry(gt, &srcu);
 	if (ret)
 		return;
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Fix for two GuC issues
  2022-10-28 19:46 ` [Intel-gfx] " John.C.Harrison
                   ` (2 preceding siblings ...)
  (?)
@ 2022-10-28 20:58 ` Patchwork
  -1 siblings, 0 replies; 14+ messages in thread
From: Patchwork @ 2022-10-28 20:58 UTC (permalink / raw)
  To: john.c.harrison; +Cc: intel-gfx

== Series Details ==

Series: Fix for two GuC issues
URL   : https://patchwork.freedesktop.org/series/110269/
State : warning

== Summary ==

Error: dim sparse failed
Sparse version: v0.6.2
Fast mode used, each commit won't be checked separately.



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [Intel-gfx] ✗ Fi.CI.BAT: failure for Fix for two GuC issues
  2022-10-28 19:46 ` [Intel-gfx] " John.C.Harrison
                   ` (3 preceding siblings ...)
  (?)
@ 2022-10-29  0:07 ` Patchwork
  -1 siblings, 0 replies; 14+ messages in thread
From: Patchwork @ 2022-10-29  0:07 UTC (permalink / raw)
  To: john.c.harrison; +Cc: intel-gfx

[-- Attachment #1: Type: text/plain, Size: 9367 bytes --]

== Series Details ==

Series: Fix for two GuC issues
URL   : https://patchwork.freedesktop.org/series/110269/
State : failure

== Summary ==

CI Bug Log - changes from CI_DRM_12317 -> Patchwork_110269v1
====================================================

Summary
-------

  **FAILURE**

  Serious unknown changes coming with Patchwork_110269v1 absolutely need to be
  verified manually.
  
  If you think the reported changes have nothing to do with the changes
  introduced in Patchwork_110269v1, please notify your bug team to allow them
  to document this new failure mode, which will reduce false positives in CI.

  External URL: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/index.html

Participating hosts (40 -> 39)
------------------------------

  Additional (3): fi-kbl-soraka fi-tgl-dsi fi-pnv-d510 
  Missing    (4): fi-ctg-p8600 fi-rkl-11600 fi-icl-u2 fi-cfl-8700k 

Possible new issues
-------------------

  Here are the unknown changes that may have been introduced in Patchwork_110269v1:

### IGT changes ###

#### Possible regressions ####

  * igt@i915_selftest@live@gem_contexts:
    - fi-kbl-soraka:      NOTRUN -> [INCOMPLETE][1]
   [1]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/fi-kbl-soraka/igt@i915_selftest@live@gem_contexts.html

  
Known issues
------------

  Here are the changes found in Patchwork_110269v1 that come from known issues:

### CI changes ###

#### Possible fixes ####

  * boot:
    - fi-ilk-650:         [FAIL][2] ([i915#7350]) -> [PASS][3]
   [2]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12317/fi-ilk-650/boot.html
   [3]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/fi-ilk-650/boot.html

  

### IGT changes ###

#### Issues hit ####

  * igt@core_hotunplug@unbind-rebind:
    - fi-apl-guc:         [PASS][4] -> [INCOMPLETE][5] ([i915#7073])
   [4]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12317/fi-apl-guc/igt@core_hotunplug@unbind-rebind.html
   [5]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/fi-apl-guc/igt@core_hotunplug@unbind-rebind.html

  * igt@gem_exec_gttfill@basic:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][6] ([fdo#109271]) +9 similar issues
   [6]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/fi-kbl-soraka/igt@gem_exec_gttfill@basic.html

  * igt@gem_exec_suspend@basic-s3@smem:
    - fi-bdw-5557u:       [PASS][7] -> [INCOMPLETE][8] ([i915#146])
   [7]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12317/fi-bdw-5557u/igt@gem_exec_suspend@basic-s3@smem.html
   [8]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/fi-bdw-5557u/igt@gem_exec_suspend@basic-s3@smem.html

  * igt@gem_huc_copy@huc-copy:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][9] ([fdo#109271] / [i915#2190])
   [9]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/fi-kbl-soraka/igt@gem_huc_copy@huc-copy.html

  * igt@gem_lmem_swapping@basic:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][10] ([fdo#109271] / [i915#4613]) +3 similar issues
   [10]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/fi-kbl-soraka/igt@gem_lmem_swapping@basic.html

  * igt@i915_pm_rpm@basic-pci-d3-state:
    - fi-ilk-650:         NOTRUN -> [SKIP][11] ([fdo#109271]) +19 similar issues
   [11]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/fi-ilk-650/igt@i915_pm_rpm@basic-pci-d3-state.html

  * igt@i915_selftest@live@gt_heartbeat:
    - fi-glk-j4005:       [PASS][12] -> [DMESG-FAIL][13] ([i915#5334])
   [12]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12317/fi-glk-j4005/igt@i915_selftest@live@gt_heartbeat.html
   [13]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/fi-glk-j4005/igt@i915_selftest@live@gt_heartbeat.html

  * igt@i915_selftest@live@gt_pm:
    - fi-kbl-soraka:      NOTRUN -> [DMESG-FAIL][14] ([i915#1886])
   [14]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/fi-kbl-soraka/igt@i915_selftest@live@gt_pm.html

  * igt@i915_selftest@live@hangcheck:
    - fi-hsw-4770:        [PASS][15] -> [INCOMPLETE][16] ([i915#4785])
   [15]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12317/fi-hsw-4770/igt@i915_selftest@live@hangcheck.html
   [16]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/fi-hsw-4770/igt@i915_selftest@live@hangcheck.html

  * igt@kms_chamelium@dp-hpd-fast:
    - fi-ilk-650:         NOTRUN -> [SKIP][17] ([fdo#109271] / [fdo#111827]) +8 similar issues
   [17]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/fi-ilk-650/igt@kms_chamelium@dp-hpd-fast.html

  * igt@kms_chamelium@hdmi-hpd-fast:
    - fi-kbl-soraka:      NOTRUN -> [SKIP][18] ([fdo#109271] / [fdo#111827]) +7 similar issues
   [18]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/fi-kbl-soraka/igt@kms_chamelium@hdmi-hpd-fast.html

  * igt@kms_psr@primary_page_flip:
    - fi-pnv-d510:        NOTRUN -> [SKIP][19] ([fdo#109271]) +43 similar issues
   [19]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/fi-pnv-d510/igt@kms_psr@primary_page_flip.html

  * igt@runner@aborted:
    - fi-hsw-4770:        NOTRUN -> [FAIL][20] ([fdo#109271] / [i915#4312] / [i915#5594])
   [20]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/fi-hsw-4770/igt@runner@aborted.html

  
#### Possible fixes ####

  * igt@gem_exec_suspend@basic-s0@smem:
    - {bat-rpls-2}:       [DMESG-WARN][21] ([i915#6434]) -> [PASS][22]
   [21]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12317/bat-rpls-2/igt@gem_exec_suspend@basic-s0@smem.html
   [22]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/bat-rpls-2/igt@gem_exec_suspend@basic-s0@smem.html

  * igt@gem_exec_suspend@basic-s3@smem:
    - {bat-rplp-1}:       [DMESG-WARN][23] ([i915#2867]) -> [PASS][24]
   [23]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12317/bat-rplp-1/igt@gem_exec_suspend@basic-s3@smem.html
   [24]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/bat-rplp-1/igt@gem_exec_suspend@basic-s3@smem.html
    - {bat-rpls-1}:       [DMESG-WARN][25] ([i915#6687]) -> [PASS][26]
   [25]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12317/bat-rpls-1/igt@gem_exec_suspend@basic-s3@smem.html
   [26]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/bat-rpls-1/igt@gem_exec_suspend@basic-s3@smem.html

  * igt@i915_selftest@live@migrate:
    - {bat-dg2-11}:       [DMESG-WARN][27] -> [PASS][28]
   [27]: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_12317/bat-dg2-11/igt@i915_selftest@live@migrate.html
   [28]: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/bat-dg2-11/igt@i915_selftest@live@migrate.html

  
  {name}: This element is suppressed. This means it is ignored when computing
          the status of the difference (SUCCESS, WARNING, or FAILURE).

  [fdo#109271]: https://bugs.freedesktop.org/show_bug.cgi?id=109271
  [fdo#109284]: https://bugs.freedesktop.org/show_bug.cgi?id=109284
  [fdo#109285]: https://bugs.freedesktop.org/show_bug.cgi?id=109285
  [fdo#109295]: https://bugs.freedesktop.org/show_bug.cgi?id=109295
  [fdo#110189]: https://bugs.freedesktop.org/show_bug.cgi?id=110189
  [fdo#111827]: https://bugs.freedesktop.org/show_bug.cgi?id=111827
  [i915#146]: https://gitlab.freedesktop.org/drm/intel/issues/146
  [i915#1845]: https://gitlab.freedesktop.org/drm/intel/issues/1845
  [i915#1886]: https://gitlab.freedesktop.org/drm/intel/issues/1886
  [i915#2190]: https://gitlab.freedesktop.org/drm/intel/issues/2190
  [i915#2582]: https://gitlab.freedesktop.org/drm/intel/issues/2582
  [i915#2867]: https://gitlab.freedesktop.org/drm/intel/issues/2867
  [i915#3301]: https://gitlab.freedesktop.org/drm/intel/issues/3301
  [i915#3555]: https://gitlab.freedesktop.org/drm/intel/issues/3555
  [i915#4103]: https://gitlab.freedesktop.org/drm/intel/issues/4103
  [i915#4258]: https://gitlab.freedesktop.org/drm/intel/issues/4258
  [i915#4312]: https://gitlab.freedesktop.org/drm/intel/issues/4312
  [i915#4613]: https://gitlab.freedesktop.org/drm/intel/issues/4613
  [i915#4785]: https://gitlab.freedesktop.org/drm/intel/issues/4785
  [i915#5334]: https://gitlab.freedesktop.org/drm/intel/issues/5334
  [i915#5594]: https://gitlab.freedesktop.org/drm/intel/issues/5594
  [i915#6367]: https://gitlab.freedesktop.org/drm/intel/issues/6367
  [i915#6434]: https://gitlab.freedesktop.org/drm/intel/issues/6434
  [i915#6687]: https://gitlab.freedesktop.org/drm/intel/issues/6687
  [i915#6856]: https://gitlab.freedesktop.org/drm/intel/issues/6856
  [i915#6997]: https://gitlab.freedesktop.org/drm/intel/issues/6997
  [i915#7073]: https://gitlab.freedesktop.org/drm/intel/issues/7073
  [i915#7125]: https://gitlab.freedesktop.org/drm/intel/issues/7125
  [i915#7350]: https://gitlab.freedesktop.org/drm/intel/issues/7350


Build changes
-------------

  * Linux: CI_DRM_12317 -> Patchwork_110269v1

  CI-20190529: 20190529
  CI_DRM_12317: 0f9cbc285f7d3d1860f7f903663933f33ff9ae25 @ git://anongit.freedesktop.org/gfx-ci/linux
  IGT_7029: c32cb1e614017f14274d335ac571383799e6c786 @ https://gitlab.freedesktop.org/drm/igt-gpu-tools.git
  Patchwork_110269v1: 0f9cbc285f7d3d1860f7f903663933f33ff9ae25 @ git://anongit.freedesktop.org/gfx-ci/linux


### Linux commits

995a78c4923a drm/i915/guc: Don't deadlock busyness stats vs reset
4ee3675c1688 drm/i915/guc: Properly initialise kernel contexts

== Logs ==

For more details see: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_110269v1/index.html

[-- Attachment #2: Type: text/html, Size: 10043 bytes --]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Intel-gfx] [PATCH 2/2] drm/i915/guc: Don't deadlock busyness stats vs reset
  2022-10-28 19:46   ` [Intel-gfx] " John.C.Harrison
  (?)
@ 2022-10-31 10:09   ` Tvrtko Ursulin
  2022-10-31 12:51     ` Tvrtko Ursulin
  -1 siblings, 1 reply; 14+ messages in thread
From: Tvrtko Ursulin @ 2022-10-31 10:09 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel


On 28/10/2022 20:46, John.C.Harrison@Intel.com wrote:
> From: John Harrison <John.C.Harrison@Intel.com>
> 
> The engine busyness stats has a worker function to do things like
> 64bit extend the 32bit hardware counters. The GuC's reset prepare
> function flushes out this worker function to ensure no corruption
> happens during the reset. Unforunately, the worker function has an
> infinite wait for active resets to finish before doing its work. Thus
> a deadlock would occur if the worker function had actually started
> just as the reset starts.
> 
> Update the worker to abort if a reset is in progress rather than
> waiting for it to complete. It will still acquire the reset lock in
> the case where a reset was not already in progress. So the processing
> is still safe from corruption, but the deadlock can no longer occur.
> 
> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
> ---
>   drivers/gpu/drm/i915/gt/intel_reset.c             | 15 ++++++++++++++-
>   drivers/gpu/drm/i915/gt/intel_reset.h             |  1 +
>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c |  6 ++++--
>   3 files changed, 19 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c b/drivers/gpu/drm/i915/gt/intel_reset.c
> index 3159df6cdd492..2f48c6e4420ea 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
> @@ -1407,7 +1407,7 @@ void intel_gt_handle_error(struct intel_gt *gt,
>   	intel_runtime_pm_put(gt->uncore->rpm, wakeref);
>   }
>   
> -int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
> +static int _intel_gt_reset_trylock(struct intel_gt *gt, int *srcu, bool retry)
>   {
>   	might_lock(&gt->reset.backoff_srcu);
>   	might_sleep();
> @@ -1416,6 +1416,9 @@ int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
>   	while (test_bit(I915_RESET_BACKOFF, &gt->reset.flags)) {
>   		rcu_read_unlock();
>   
> +		if (!retry)
> +			return -EBUSY;
> +
>   		if (wait_event_interruptible(gt->reset.queue,
>   					     !test_bit(I915_RESET_BACKOFF,
>   						       &gt->reset.flags)))

Would it be more obvious to rename the existing semantics to 
intel_gt_reset_interruptible(), while the flavour you add in this patch 
truly is trylock? I am not sure, since it's all a bit special, but 
trylock sure feels confusing if it can sleep forever...

Regards,

Tvrtko

> @@ -1429,6 +1432,16 @@ int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
>   	return 0;
>   }
>   
> +int intel_gt_reset_trylock_noretry(struct intel_gt *gt, int *srcu)
> +{
> +	return _intel_gt_reset_trylock(gt, srcu, false);
> +}
> +
> +int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
> +{
> +	return _intel_gt_reset_trylock(gt, srcu, true);
> +}
> +
>   void intel_gt_reset_unlock(struct intel_gt *gt, int tag)
>   __releases(&gt->reset.backoff_srcu)
>   {
> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.h b/drivers/gpu/drm/i915/gt/intel_reset.h
> index adc734e673870..7f863726eb6a2 100644
> --- a/drivers/gpu/drm/i915/gt/intel_reset.h
> +++ b/drivers/gpu/drm/i915/gt/intel_reset.h
> @@ -38,6 +38,7 @@ int __intel_engine_reset_bh(struct intel_engine_cs *engine,
>   
>   void __i915_request_reset(struct i915_request *rq, bool guilty);
>   
> +int __must_check intel_gt_reset_trylock_noretry(struct intel_gt *gt, int *srcu);
>   int __must_check intel_gt_reset_trylock(struct intel_gt *gt, int *srcu);
>   void intel_gt_reset_unlock(struct intel_gt *gt, int tag);
>   
> diff --git a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> index 941613be3b9dd..1fa1bc7dde3df 100644
> --- a/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> +++ b/drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c
> @@ -1401,9 +1401,11 @@ static void guc_timestamp_ping(struct work_struct *wrk)
>   
>   	/*
>   	 * Synchronize with gt reset to make sure the worker does not
> -	 * corrupt the engine/guc stats.
> +	 * corrupt the engine/guc stats. NB: can't actually block waiting
> +	 * for a reset to complete as the reset requires flushing out
> +	 * any running worker thread. So waiting would deadlock.
>   	 */
> -	ret = intel_gt_reset_trylock(gt, &srcu);
> +	ret = intel_gt_reset_trylock_noretry(gt, &srcu);
>   	if (ret)
>   		return;
>   

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Intel-gfx] [PATCH 2/2] drm/i915/guc: Don't deadlock busyness stats vs reset
  2022-10-31 10:09   ` Tvrtko Ursulin
@ 2022-10-31 12:51     ` Tvrtko Ursulin
  2022-10-31 18:30       ` John Harrison
  0 siblings, 1 reply; 14+ messages in thread
From: Tvrtko Ursulin @ 2022-10-31 12:51 UTC (permalink / raw)
  To: John.C.Harrison, Intel-GFX; +Cc: DRI-Devel


On 31/10/2022 10:09, Tvrtko Ursulin wrote:
> 
> On 28/10/2022 20:46, John.C.Harrison@Intel.com wrote:
>> From: John Harrison <John.C.Harrison@Intel.com>
>>
>> The engine busyness stats has a worker function to do things like
>> 64bit extend the 32bit hardware counters. The GuC's reset prepare
>> function flushes out this worker function to ensure no corruption
>> happens during the reset. Unforunately, the worker function has an
>> infinite wait for active resets to finish before doing its work. Thus
>> a deadlock would occur if the worker function had actually started
>> just as the reset starts.
>>
>> Update the worker to abort if a reset is in progress rather than
>> waiting for it to complete. It will still acquire the reset lock in
>> the case where a reset was not already in progress. So the processing
>> is still safe from corruption, but the deadlock can no longer occur.
>>
>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>> ---
>>   drivers/gpu/drm/i915/gt/intel_reset.c             | 15 ++++++++++++++-
>>   drivers/gpu/drm/i915/gt/intel_reset.h             |  1 +
>>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c |  6 ++++--
>>   3 files changed, 19 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
>> b/drivers/gpu/drm/i915/gt/intel_reset.c
>> index 3159df6cdd492..2f48c6e4420ea 100644
>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>> @@ -1407,7 +1407,7 @@ void intel_gt_handle_error(struct intel_gt *gt,
>>       intel_runtime_pm_put(gt->uncore->rpm, wakeref);
>>   }
>> -int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
>> +static int _intel_gt_reset_trylock(struct intel_gt *gt, int *srcu, 
>> bool retry)
>>   {
>>       might_lock(&gt->reset.backoff_srcu);
>>       might_sleep();
>> @@ -1416,6 +1416,9 @@ int intel_gt_reset_trylock(struct intel_gt *gt, 
>> int *srcu)
>>       while (test_bit(I915_RESET_BACKOFF, &gt->reset.flags)) {
>>           rcu_read_unlock();
>> +        if (!retry)
>> +            return -EBUSY;
>> +
>>           if (wait_event_interruptible(gt->reset.queue,
>>                            !test_bit(I915_RESET_BACKOFF,
>>                                  &gt->reset.flags)))
> 
> Would it be more obvious to rename the existing semantics to 
> intel_gt_reset_interruptible(), while the flavour you add in this patch 
> truly is trylock? I am not sure, since it's all a bit special, but 
> trylock sure feels confusing if it can sleep forever...

Oh and might_sleep() shouldn't be there with the trylock version - I 
mean any flavour of the real trylock.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Intel-gfx] [PATCH 2/2] drm/i915/guc: Don't deadlock busyness stats vs reset
  2022-10-31 12:51     ` Tvrtko Ursulin
@ 2022-10-31 18:30       ` John Harrison
  2022-11-01  9:58         ` Tvrtko Ursulin
  0 siblings, 1 reply; 14+ messages in thread
From: John Harrison @ 2022-10-31 18:30 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-GFX; +Cc: DRI-Devel

On 10/31/2022 05:51, Tvrtko Ursulin wrote:
> On 31/10/2022 10:09, Tvrtko Ursulin wrote:
>> On 28/10/2022 20:46, John.C.Harrison@Intel.com wrote:
>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>
>>> The engine busyness stats has a worker function to do things like
>>> 64bit extend the 32bit hardware counters. The GuC's reset prepare
>>> function flushes out this worker function to ensure no corruption
>>> happens during the reset. Unforunately, the worker function has an
>>> infinite wait for active resets to finish before doing its work. Thus
>>> a deadlock would occur if the worker function had actually started
>>> just as the reset starts.
>>>
>>> Update the worker to abort if a reset is in progress rather than
>>> waiting for it to complete. It will still acquire the reset lock in
>>> the case where a reset was not already in progress. So the processing
>>> is still safe from corruption, but the deadlock can no longer occur.
>>>
>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>> ---
>>>   drivers/gpu/drm/i915/gt/intel_reset.c             | 15 
>>> ++++++++++++++-
>>>   drivers/gpu/drm/i915/gt/intel_reset.h             |  1 +
>>>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c |  6 ++++--
>>>   3 files changed, 19 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
>>> b/drivers/gpu/drm/i915/gt/intel_reset.c
>>> index 3159df6cdd492..2f48c6e4420ea 100644
>>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>>> @@ -1407,7 +1407,7 @@ void intel_gt_handle_error(struct intel_gt *gt,
>>>       intel_runtime_pm_put(gt->uncore->rpm, wakeref);
>>>   }
>>> -int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
>>> +static int _intel_gt_reset_trylock(struct intel_gt *gt, int *srcu, 
>>> bool retry)
>>>   {
>>>       might_lock(&gt->reset.backoff_srcu);
>>>       might_sleep();
>>> @@ -1416,6 +1416,9 @@ int intel_gt_reset_trylock(struct intel_gt 
>>> *gt, int *srcu)
>>>       while (test_bit(I915_RESET_BACKOFF, &gt->reset.flags)) {
>>>           rcu_read_unlock();
>>> +        if (!retry)
>>> +            return -EBUSY;
>>> +
>>>           if (wait_event_interruptible(gt->reset.queue,
>>>                            !test_bit(I915_RESET_BACKOFF,
>>>                                  &gt->reset.flags)))
>>
>> Would it be more obvious to rename the existing semantics to 
>> intel_gt_reset_interruptible(), while the flavour you add in this 
>> patch truly is trylock? I am not sure, since it's all a bit special, 
>> but trylock sure feels confusing if it can sleep forever...
To me, it would seem totally more obvious to have a function called 
'trylock' not wait forever until it can manage to acquire the lock. 
However, according to '2caffbf1176256 drm/i915: Revoke mmaps and prevent 
access to fence registers across reset', the current behaviour is 
exactly how the code was originally written and intended. It hasn't just 
mutated into some confused evolution a thousand patches later. So I 
figure there is some subtle but important reason why it was named how it 
is named and yet does what it does. Therefore it seemed safest to not 
change it unnecessarily.

>
> Oh and might_sleep() shouldn't be there with the trylock version - I 
> mean any flavour of the real trylock.
You mean if the code is split into two completely separate functions? Or 
do you just mean to wrap the might_sleep() call with 'if(!retry)'?

And just to be totally clear, the unconditional call to rcu_read_lock() 
is not something that can sleep? One doesn't need a might_sleep() before 
doing that lock?

John.


>
> Regards,
>
> Tvrtko


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Intel-gfx] [PATCH 2/2] drm/i915/guc: Don't deadlock busyness stats vs reset
  2022-10-31 18:30       ` John Harrison
@ 2022-11-01  9:58         ` Tvrtko Ursulin
  2022-11-01 16:56           ` John Harrison
  0 siblings, 1 reply; 14+ messages in thread
From: Tvrtko Ursulin @ 2022-11-01  9:58 UTC (permalink / raw)
  To: John Harrison, Intel-GFX; +Cc: DRI-Devel


On 31/10/2022 18:30, John Harrison wrote:
> On 10/31/2022 05:51, Tvrtko Ursulin wrote:
>> On 31/10/2022 10:09, Tvrtko Ursulin wrote:
>>> On 28/10/2022 20:46, John.C.Harrison@Intel.com wrote:
>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>
>>>> The engine busyness stats has a worker function to do things like
>>>> 64bit extend the 32bit hardware counters. The GuC's reset prepare
>>>> function flushes out this worker function to ensure no corruption
>>>> happens during the reset. Unforunately, the worker function has an
>>>> infinite wait for active resets to finish before doing its work. Thus
>>>> a deadlock would occur if the worker function had actually started
>>>> just as the reset starts.
>>>>
>>>> Update the worker to abort if a reset is in progress rather than
>>>> waiting for it to complete. It will still acquire the reset lock in
>>>> the case where a reset was not already in progress. So the processing
>>>> is still safe from corruption, but the deadlock can no longer occur.
>>>>
>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>> ---
>>>>   drivers/gpu/drm/i915/gt/intel_reset.c             | 15 
>>>> ++++++++++++++-
>>>>   drivers/gpu/drm/i915/gt/intel_reset.h             |  1 +
>>>>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c |  6 ++++--
>>>>   3 files changed, 19 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
>>>> b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>> index 3159df6cdd492..2f48c6e4420ea 100644
>>>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>>>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>> @@ -1407,7 +1407,7 @@ void intel_gt_handle_error(struct intel_gt *gt,
>>>>       intel_runtime_pm_put(gt->uncore->rpm, wakeref);
>>>>   }
>>>> -int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
>>>> +static int _intel_gt_reset_trylock(struct intel_gt *gt, int *srcu, 
>>>> bool retry)
>>>>   {
>>>>       might_lock(&gt->reset.backoff_srcu);
>>>>       might_sleep();
>>>> @@ -1416,6 +1416,9 @@ int intel_gt_reset_trylock(struct intel_gt 
>>>> *gt, int *srcu)
>>>>       while (test_bit(I915_RESET_BACKOFF, &gt->reset.flags)) {
>>>>           rcu_read_unlock();
>>>> +        if (!retry)
>>>> +            return -EBUSY;
>>>> +
>>>>           if (wait_event_interruptible(gt->reset.queue,
>>>>                            !test_bit(I915_RESET_BACKOFF,
>>>>                                  &gt->reset.flags)))
>>>
>>> Would it be more obvious to rename the existing semantics to 
>>> intel_gt_reset_interruptible(), while the flavour you add in this 
>>> patch truly is trylock? I am not sure, since it's all a bit special, 
>>> but trylock sure feels confusing if it can sleep forever...
> To me, it would seem totally more obvious to have a function called 
> 'trylock' not wait forever until it can manage to acquire the lock. 
> However, according to '2caffbf1176256 drm/i915: Revoke mmaps and prevent 
> access to fence registers across reset', the current behaviour is 
> exactly how the code was originally written and intended. It hasn't just 
> mutated into some confused evolution a thousand patches later. So I 
> figure there is some subtle but important reason why it was named how it 
> is named and yet does what it does. Therefore it seemed safest to not 
> change it unnecessarily.

Yeah I looked at that but honestly I don't see the trylock semantics 
anywhere. The only failure to lock path comes from 
wait_event_interruptible. It could have easily been just a naming mishap.

And I find adding a retry parameter to something called trylock makes 
this even more non-intuitive and would personally rather rename it all. 
Proof in the pudding is that the trylock naming did bite during 
development and review of the code this patch is now fixing.

I do however understand your point about a degree of uncertainty but my 
feeling is to rather err on the side of obvious naming. Shall we ask for 
a third opinion?

>> Oh and might_sleep() shouldn't be there with the trylock version - I 
>> mean any flavour of the real trylock.
> You mean if the code is split into two completely separate functions? Or 
> do you just mean to wrap the might_sleep() call with 'if(!retry)'?
> 
> And just to be totally clear, the unconditional call to rcu_read_lock() 
> is not something that can sleep? One doesn't need a might_sleep() before 
> doing that lock?

Corrrect, rcu_read_lock() can not sleep - it just disables preemption. 
So leaving the unconditional might_sleep() would have opportunity for 
false positives.

Regards,

Tvrtko

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Intel-gfx] [PATCH 2/2] drm/i915/guc: Don't deadlock busyness stats vs reset
  2022-11-01  9:58         ` Tvrtko Ursulin
@ 2022-11-01 16:56           ` John Harrison
  2022-11-02  8:17             ` Tvrtko Ursulin
  0 siblings, 1 reply; 14+ messages in thread
From: John Harrison @ 2022-11-01 16:56 UTC (permalink / raw)
  To: Tvrtko Ursulin, Intel-GFX; +Cc: DRI-Devel

On 11/1/2022 02:58, Tvrtko Ursulin wrote:
> On 31/10/2022 18:30, John Harrison wrote:
>> On 10/31/2022 05:51, Tvrtko Ursulin wrote:
>>> On 31/10/2022 10:09, Tvrtko Ursulin wrote:
>>>> On 28/10/2022 20:46, John.C.Harrison@Intel.com wrote:
>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>
>>>>> The engine busyness stats has a worker function to do things like
>>>>> 64bit extend the 32bit hardware counters. The GuC's reset prepare
>>>>> function flushes out this worker function to ensure no corruption
>>>>> happens during the reset. Unforunately, the worker function has an
>>>>> infinite wait for active resets to finish before doing its work. Thus
>>>>> a deadlock would occur if the worker function had actually started
>>>>> just as the reset starts.
>>>>>
>>>>> Update the worker to abort if a reset is in progress rather than
>>>>> waiting for it to complete. It will still acquire the reset lock in
>>>>> the case where a reset was not already in progress. So the processing
>>>>> is still safe from corruption, but the deadlock can no longer occur.
>>>>>
>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>> ---
>>>>>   drivers/gpu/drm/i915/gt/intel_reset.c             | 15 
>>>>> ++++++++++++++-
>>>>>   drivers/gpu/drm/i915/gt/intel_reset.h             |  1 +
>>>>>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c |  6 ++++--
>>>>>   3 files changed, 19 insertions(+), 3 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
>>>>> b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>> index 3159df6cdd492..2f48c6e4420ea 100644
>>>>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>> @@ -1407,7 +1407,7 @@ void intel_gt_handle_error(struct intel_gt *gt,
>>>>>       intel_runtime_pm_put(gt->uncore->rpm, wakeref);
>>>>>   }
>>>>> -int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
>>>>> +static int _intel_gt_reset_trylock(struct intel_gt *gt, int 
>>>>> *srcu, bool retry)
>>>>>   {
>>>>>       might_lock(&gt->reset.backoff_srcu);
>>>>>       might_sleep();
>>>>> @@ -1416,6 +1416,9 @@ int intel_gt_reset_trylock(struct intel_gt 
>>>>> *gt, int *srcu)
>>>>>       while (test_bit(I915_RESET_BACKOFF, &gt->reset.flags)) {
>>>>>           rcu_read_unlock();
>>>>> +        if (!retry)
>>>>> +            return -EBUSY;
>>>>> +
>>>>>           if (wait_event_interruptible(gt->reset.queue,
>>>>>                            !test_bit(I915_RESET_BACKOFF,
>>>>> &gt->reset.flags)))
>>>>
>>>> Would it be more obvious to rename the existing semantics to 
>>>> intel_gt_reset_interruptible(), while the flavour you add in this 
>>>> patch truly is trylock? I am not sure, since it's all a bit 
>>>> special, but trylock sure feels confusing if it can sleep forever...
>> To me, it would seem totally more obvious to have a function called 
>> 'trylock' not wait forever until it can manage to acquire the lock. 
>> However, according to '2caffbf1176256 drm/i915: Revoke mmaps and 
>> prevent access to fence registers across reset', the current 
>> behaviour is exactly how the code was originally written and 
>> intended. It hasn't just mutated into some confused evolution a 
>> thousand patches later. So I figure there is some subtle but 
>> important reason why it was named how it is named and yet does what 
>> it does. Therefore it seemed safest to not change it unnecessarily.
>
> Yeah I looked at that but honestly I don't see the trylock semantics 
> anywhere. The only failure to lock path comes from 
> wait_event_interruptible. It could have easily been just a naming mishap.
>
> And I find adding a retry parameter to something called trylock makes 
> this even more non-intuitive and would personally rather rename it 
> all. Proof in the pudding is that the trylock naming did bite during 
> development and review of the code this patch is now fixing.
>
> I do however understand your point about a degree of uncertainty but 
> my feeling is to rather err on the side of obvious naming. Shall we 
> ask for a third opinion?
Umesh had commented (internally) that the naming seems wrong and would 
be good to change it. So we already have a third :).

To be clear, you are thinking to keep the wrappers but rename to 
intel_gt_reset_trylock() [retry = false] and 
intel_gt_reset_interruptible() [retry = true]? Which will obviously 
involve updating all but one existing user to use the interruptible name 
as the existing name will change behaviour in a backwards breaking manner.

John.

>
>>> Oh and might_sleep() shouldn't be there with the trylock version - I 
>>> mean any flavour of the real trylock.
>> You mean if the code is split into two completely separate functions? 
>> Or do you just mean to wrap the might_sleep() call with 'if(!retry)'?
>>
>> And just to be totally clear, the unconditional call to 
>> rcu_read_lock() is not something that can sleep? One doesn't need a 
>> might_sleep() before doing that lock?
>
> Corrrect, rcu_read_lock() can not sleep - it just disables preemption. 
> So leaving the unconditional might_sleep() would have opportunity for 
> false positives.
>
> Regards,
>
> Tvrtko


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [Intel-gfx] [PATCH 2/2] drm/i915/guc: Don't deadlock busyness stats vs reset
  2022-11-01 16:56           ` John Harrison
@ 2022-11-02  8:17             ` Tvrtko Ursulin
  0 siblings, 0 replies; 14+ messages in thread
From: Tvrtko Ursulin @ 2022-11-02  8:17 UTC (permalink / raw)
  To: John Harrison, Intel-GFX; +Cc: DRI-Devel


On 01/11/2022 16:56, John Harrison wrote:
> On 11/1/2022 02:58, Tvrtko Ursulin wrote:
>> On 31/10/2022 18:30, John Harrison wrote:
>>> On 10/31/2022 05:51, Tvrtko Ursulin wrote:
>>>> On 31/10/2022 10:09, Tvrtko Ursulin wrote:
>>>>> On 28/10/2022 20:46, John.C.Harrison@Intel.com wrote:
>>>>>> From: John Harrison <John.C.Harrison@Intel.com>
>>>>>>
>>>>>> The engine busyness stats has a worker function to do things like
>>>>>> 64bit extend the 32bit hardware counters. The GuC's reset prepare
>>>>>> function flushes out this worker function to ensure no corruption
>>>>>> happens during the reset. Unforunately, the worker function has an
>>>>>> infinite wait for active resets to finish before doing its work. Thus
>>>>>> a deadlock would occur if the worker function had actually started
>>>>>> just as the reset starts.
>>>>>>
>>>>>> Update the worker to abort if a reset is in progress rather than
>>>>>> waiting for it to complete. It will still acquire the reset lock in
>>>>>> the case where a reset was not already in progress. So the processing
>>>>>> is still safe from corruption, but the deadlock can no longer occur.
>>>>>>
>>>>>> Signed-off-by: John Harrison <John.C.Harrison@Intel.com>
>>>>>> ---
>>>>>>   drivers/gpu/drm/i915/gt/intel_reset.c             | 15 
>>>>>> ++++++++++++++-
>>>>>>   drivers/gpu/drm/i915/gt/intel_reset.h             |  1 +
>>>>>>   drivers/gpu/drm/i915/gt/uc/intel_guc_submission.c |  6 ++++--
>>>>>>   3 files changed, 19 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/i915/gt/intel_reset.c 
>>>>>> b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>> index 3159df6cdd492..2f48c6e4420ea 100644
>>>>>> --- a/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>> +++ b/drivers/gpu/drm/i915/gt/intel_reset.c
>>>>>> @@ -1407,7 +1407,7 @@ void intel_gt_handle_error(struct intel_gt *gt,
>>>>>>       intel_runtime_pm_put(gt->uncore->rpm, wakeref);
>>>>>>   }
>>>>>> -int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
>>>>>> +static int _intel_gt_reset_trylock(struct intel_gt *gt, int 
>>>>>> *srcu, bool retry)
>>>>>>   {
>>>>>>       might_lock(&gt->reset.backoff_srcu);
>>>>>>       might_sleep();
>>>>>> @@ -1416,6 +1416,9 @@ int intel_gt_reset_trylock(struct intel_gt 
>>>>>> *gt, int *srcu)
>>>>>>       while (test_bit(I915_RESET_BACKOFF, &gt->reset.flags)) {
>>>>>>           rcu_read_unlock();
>>>>>> +        if (!retry)
>>>>>> +            return -EBUSY;
>>>>>> +
>>>>>>           if (wait_event_interruptible(gt->reset.queue,
>>>>>>                            !test_bit(I915_RESET_BACKOFF,
>>>>>> &gt->reset.flags)))
>>>>>
>>>>> Would it be more obvious to rename the existing semantics to 
>>>>> intel_gt_reset_interruptible(), while the flavour you add in this 
>>>>> patch truly is trylock? I am not sure, since it's all a bit 
>>>>> special, but trylock sure feels confusing if it can sleep forever...
>>> To me, it would seem totally more obvious to have a function called 
>>> 'trylock' not wait forever until it can manage to acquire the lock. 
>>> However, according to '2caffbf1176256 drm/i915: Revoke mmaps and 
>>> prevent access to fence registers across reset', the current 
>>> behaviour is exactly how the code was originally written and 
>>> intended. It hasn't just mutated into some confused evolution a 
>>> thousand patches later. So I figure there is some subtle but 
>>> important reason why it was named how it is named and yet does what 
>>> it does. Therefore it seemed safest to not change it unnecessarily.
>>
>> Yeah I looked at that but honestly I don't see the trylock semantics 
>> anywhere. The only failure to lock path comes from 
>> wait_event_interruptible. It could have easily been just a naming mishap.
>>
>> And I find adding a retry parameter to something called trylock makes 
>> this even more non-intuitive and would personally rather rename it 
>> all. Proof in the pudding is that the trylock naming did bite during 
>> development and review of the code this patch is now fixing.
>>
>> I do however understand your point about a degree of uncertainty but 
>> my feeling is to rather err on the side of obvious naming. Shall we 
>> ask for a third opinion?
> Umesh had commented (internally) that the naming seems wrong and would 
> be good to change it. So we already have a third :).
> 
> To be clear, you are thinking to keep the wrappers but rename to 
> intel_gt_reset_trylock() [retry = false] and 
> intel_gt_reset_interruptible() [retry = true]? Which will obviously 
> involve updating all but one existing user to use the interruptible name 
> as the existing name will change behaviour in a backwards breaking manner.

Yes, intel_gt_reset_lock_interruptible and intel_gt_reset_trylock.

I don't get the behaviour breaking part? Only the name will change.

And amount of churn does not seem a problem:

$ grep intel_gt_reset_trylock -r .
./gem/i915_gem_mman.c:  ret = intel_gt_reset_trylock(ggtt->vm.gt, &srcu);
./gt/uc/intel_guc_submission.c: ret = intel_gt_reset_trylock(gt, &srcu);
./gt/intel_reset.c:int intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)
./gt/intel_reset.h:int __must_check intel_gt_reset_trylock(struct intel_gt *gt, int *srcu)

Regards,

Tvrtko

> 
> John.
> 
>>
>>>> Oh and might_sleep() shouldn't be there with the trylock version - I 
>>>> mean any flavour of the real trylock.
>>> You mean if the code is split into two completely separate functions? 
>>> Or do you just mean to wrap the might_sleep() call with 'if(!retry)'?
>>>
>>> And just to be totally clear, the unconditional call to 
>>> rcu_read_lock() is not something that can sleep? One doesn't need a 
>>> might_sleep() before doing that lock?
>>
>> Corrrect, rcu_read_lock() can not sleep - it just disables preemption. 
>> So leaving the unconditional might_sleep() would have opportunity for 
>> false positives.
>>
>> Regards,
>>
>> Tvrtko
> 

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-11-02  8:17 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-28 19:46 [PATCH 0/2] Fix for two GuC issues John.C.Harrison
2022-10-28 19:46 ` [Intel-gfx] " John.C.Harrison
2022-10-28 19:46 ` [PATCH 1/2] drm/i915/guc: Properly initialise kernel contexts John.C.Harrison
2022-10-28 19:46   ` [Intel-gfx] " John.C.Harrison
2022-10-28 19:46 ` [PATCH 2/2] drm/i915/guc: Don't deadlock busyness stats vs reset John.C.Harrison
2022-10-28 19:46   ` [Intel-gfx] " John.C.Harrison
2022-10-31 10:09   ` Tvrtko Ursulin
2022-10-31 12:51     ` Tvrtko Ursulin
2022-10-31 18:30       ` John Harrison
2022-11-01  9:58         ` Tvrtko Ursulin
2022-11-01 16:56           ` John Harrison
2022-11-02  8:17             ` Tvrtko Ursulin
2022-10-28 20:58 ` [Intel-gfx] ✗ Fi.CI.SPARSE: warning for Fix for two GuC issues Patchwork
2022-10-29  0:07 ` [Intel-gfx] ✗ Fi.CI.BAT: failure " Patchwork

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.